Introducing Oml, a Small OCaml Library for Numerical Computing

11 Aug 2015
We’d like to announce a library, Oml, that is a small attempt to fulfill a big need: numerical computing in OCaml. For a long time, people have been successfully using OCaml for numerical work, and many libraries have been written that either wrap existing functionality:
- Lacaml and SLAP for vectors, matrices, and linear algebra routines,
- L-BFGS for solving numerical optimizations,
- LibSVM for Support Vector Machines,
- Pareto for sampling from distributions.
Or provide a good OCaml solution to a specific problem:
- GPR for Gaussian Process Regression,
- FaCiLe for integer constraint solving,
- Libra (recently discovered) for discrete probabilistic networks.
But a simple unified library with the middle pieces has been missing.
To understand Oml, it might be helpful to know about its origin. Oml started as a collection of scripts and routines for exploratory data analysis that developers found themselves reusing between projects. Some have a personal preference for developing software and data projects in a REPL; it provides an iterative feedback loop that allows one to make lots of mistakes quickly and thereby learn. Unlike more traditional, interpreter-based systems that are not statically typed (e.g. R, Matlab, and Python), the OCaml REPL also lets you model the problem domain with types. So we can have a little more confidence that a dense expression is doing the “right” thing when the REPL confirms that its type is, say, `float array array`.
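The original snippet isn't shown here, but any matrix-building expression serves as an illustration; this stand-in (my own, not the original) is the kind of code whose inferred type gives that quick sanity check.

```ocaml
(* Illustrative stand-in: build a 3x3 matrix of floats.
   utop reports: val m : float array array *)
let m =
  Array.init 3 (fun i ->
    Array.init 3 (fun j -> float_of_int ((i * 3) + j) /. 2.0))
```

Seeing `float array array` in the REPL confirms the shape of the computation even before inspecting the values.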
We needed a library that would fit this workflow.
The second need arises when we start modeling the data by running regressions or creating classifiers: we don’t want to turn to a statistics textbook or Google to understand the procedures for these methods in detail. It is at this moment that a type system can be particularly helpful, because it explicitly encodes the different possible analyses and their required parameters.
Oml is now available on opam, so installing it is as simple as `opam install oml`.
Here is a quick tour of some of Oml’s capabilities. These examples were performed in the utop REPL (and edited for readability), so to get started, load Oml with `#require "oml";;`.
I also set a few utop printing directives to make the examples easier to read.
First, let’s demonstrate Oml’s capabilities by generating some data from a fat-tailed distribution, using a method that is easier to understand than resorting to a specific distribution: mix two normal distributions with different variances. In the data below, every 10th data point is from a normal distribution with a standard deviation of 10, while all the rest have a much smaller standard deviation.
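Oml's own sampling functions aren't reproduced here; as a from-scratch sketch of the same idea, the snippet below uses only the standard library and the Box-Muller transform (a common method, not necessarily Oml's), assuming a baseline standard deviation of 1 and a mean of 2.0 (the mean the post mentions later).

```ocaml
(* Sketch: fat-tailed data by mixing two normals. Every 10th draw
   uses standard deviation 10, the rest use 1; the mean is 2.0. *)
let pi = 4.0 *. atan 1.0

(* Box-Muller transform: two uniforms -> one standard normal. *)
let std_normal () =
  let u1 = Random.float 1.0 +. epsilon_float in
  let u2 = Random.float 1.0 in
  sqrt (-2.0 *. log u1) *. cos (2.0 *. pi *. u2)

let mixture n =
  Array.init n (fun i ->
    let sd = if i mod 10 = 9 then 10.0 else 1.0 in
    2.0 +. (sd *. std_normal ()))

let data = mixture 1000
```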
Let’s investigate the sample with Oml’s `unbiased_summary`, which deals with sample-size adjustments, such as dividing by n-1 for the standard deviation.
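The exact fields of `unbiased_summary` aren't shown in this post; as an illustration of what the n-1 adjustment means, here is the unbiased sample standard deviation computed from scratch (not Oml's implementation).

```ocaml
(* Sketch of the sample-size adjustment: the unbiased sample
   variance divides the sum of squared deviations by (n - 1)
   rather than n. *)
let mean a =
  Array.fold_left (+.) 0.0 a /. float_of_int (Array.length a)

let unbiased_std a =
  let m = mean a in
  let n = float_of_int (Array.length a) in
  let ss = Array.fold_left (fun acc x -> acc +. ((x -. m) ** 2.0)) 0.0 a in
  sqrt (ss /. (n -. 1.0))
```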
A kurtosis classification of `` `Fat `` should raise some concern for the keen data analyst, but it is by design.
A skew classification of `` `Negative `` might draw more questions, but repeating the analysis can shed light on this “balancing” problem; consider two helper functions, `data_skew` and `count_big_skew`.
`data_skew` uses the same technique to generate our data, but takes as an argument the frequency with which we sample from the wider distribution.
`count_big_skew` summarizes samples of data, counting those where the skew classification is large.
Finally, we compare two cases: one where we generate samples of data from the wider distribution only rarely, and one where we alternate between the two distributions.
Two sample runs demonstrate that skewed samples are much more frequent
when drawing from the wider distribution is rarer.
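The original `data_skew` and `count_big_skew` aren't shown; this self-contained sketch reruns the same experiment with stdlib-only code. The `|skew| > 1.0` threshold for a "big" skew is a hypothetical stand-in for the post's skew classification, and the Box-Muller generator is my own choice, not Oml's.

```ocaml
(* Sketch of the experiment: a data_skew-style generator
   parameterized by how often we draw from the wide distribution,
   and a count of samples whose sample skewness is large. *)
let pi = 4.0 *. atan 1.0

(* Box-Muller standard normal, using only the standard library. *)
let std_normal () =
  let u1 = Random.float 1.0 +. epsilon_float in
  let u2 = Random.float 1.0 in
  sqrt (-2.0 *. log u1) *. cos (2.0 *. pi *. u2)

(* Sample skewness: third central moment over variance^(3/2). *)
let skewness a =
  let n = float_of_int (Array.length a) in
  let m = Array.fold_left (+.) 0.0 a /. n in
  let mom k = Array.fold_left (fun s x -> s +. ((x -. m) ** k)) 0.0 a /. n in
  mom 3.0 /. (mom 2.0 ** 1.5)

(* Every freq-th point comes from the wider (sd = 10) normal. *)
let data_skew freq n =
  Array.init n (fun i ->
    let sd = if i mod freq = 0 then 10.0 else 1.0 in
    2.0 +. (sd *. std_normal ()))

let count_big_skew ~trials ~freq ~n =
  let count = ref 0 in
  for _ = 1 to trials do
    if abs_float (skewness (data_skew freq n)) > 1.0 then incr count
  done;
  !count
```

Comparing `count_big_skew ~freq:10` against `~freq:2` over many trials reproduces the effect described above: the rarer the wide draws, the more often a sample looks strongly skewed.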
Another way to think about this effect is that the mixture is so leptokurtic that its kurtosis dominates the other moments.
Since we know that the original data is supposed to have a mean of 2.0, we can ask for the probability of observing this data via a hypothesis test.
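The specific Oml call isn't preserved in the text; as an illustration of the underlying computation, here is a one-sample t statistic against the hypothesized mean of 2.0, written from scratch (the choice of a t test is my assumption, not necessarily the post's method).

```ocaml
(* Sketch: one-sample t statistic for H0: mean = mu0.
   t = (sample_mean - mu0) / (s / sqrt n), where s is the unbiased
   standard deviation. The p-value would come from a t distribution
   with n - 1 degrees of freedom (omitted here). *)
let t_statistic data mu0 =
  let n = float_of_int (Array.length data) in
  let m = Array.fold_left (+.) 0.0 data /. n in
  let ss =
    Array.fold_left (fun acc x -> acc +. ((x -. m) ** 2.0)) 0.0 data in
  let s = sqrt (ss /. (n -. 1.0)) in
  (m -. mu0) /. (s /. sqrt n)
```

A small |t| means the observed mean is consistent with 2.0, which is why we shouldn't be surprised by the data.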
So we shouldn’t be surprised by our data. Let’s demonstrate a simple regression.
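The original regression snippet is not shown; this sketch implements ordinary least squares for a single predictor from the closed-form solution, using only the standard library (it is not Oml's API).

```ocaml
(* Sketch: simple linear regression y = a + b * x by ordinary least
   squares. slope = cov(x, y) / var(x);
   intercept = mean y - slope * mean x. *)
let simple_regression xs ys =
  let n = float_of_int (Array.length xs) in
  let mx = Array.fold_left (+.) 0.0 xs /. n in
  let my = Array.fold_left (+.) 0.0 ys /. n in
  let sxy = ref 0.0 and sxx = ref 0.0 in
  Array.iteri (fun i x ->
    sxy := !sxy +. ((x -. mx) *. (ys.(i) -. my));
    sxx := !sxx +. ((x -. mx) ** 2.0)) xs;
  let slope = !sxy /. !sxx in
  let intercept = my -. (slope *. mx) in
  (intercept, slope)
```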
What if we throw some collinearity into the mix?
As we can see, regular regression gives us back some pretty outrageous coefficients: `[|-80174210710646.3594; -80174210710645.5156; 80174210710652.8594|]`.
But when we use ridge regression, even a small regularization penalty of 0.1 helps us recover more sensible coefficients: `[|1.8613; 2.7559; 4.6172|]`.
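The collinear design matrix from the post isn't shown. As a sketch of what ridge regression does, the snippet below solves the normal equations (XᵀX + λI)w = Xᵀy with a small Gaussian-elimination solver; this is a from-scratch illustration, not Oml's implementation (Oml delegates its linear algebra to Lacaml).

```ocaml
(* Solve the linear system a * w = b by Gaussian elimination with
   partial pivoting. Inputs are copied, not mutated. *)
let solve a b =
  let n = Array.length b in
  let a = Array.map Array.copy a and b = Array.copy b in
  for k = 0 to n - 1 do
    (* Pick the row with the largest pivot in column k. *)
    let piv = ref k in
    for i = k + 1 to n - 1 do
      if abs_float a.(i).(k) > abs_float a.(!piv).(k) then piv := i
    done;
    let tmp = a.(k) in a.(k) <- a.(!piv); a.(!piv) <- tmp;
    let tb = b.(k) in b.(k) <- b.(!piv); b.(!piv) <- tb;
    (* Eliminate column k from the rows below. *)
    for i = k + 1 to n - 1 do
      let f = a.(i).(k) /. a.(k).(k) in
      for j = k to n - 1 do
        a.(i).(j) <- a.(i).(j) -. (f *. a.(k).(j))
      done;
      b.(i) <- b.(i) -. (f *. b.(k))
    done
  done;
  (* Back substitution. *)
  let w = Array.make n 0.0 in
  for i = n - 1 downto 0 do
    let s = ref b.(i) in
    for j = i + 1 to n - 1 do s := !s -. (a.(i).(j) *. w.(j)) done;
    w.(i) <- !s /. a.(i).(i)
  done;
  w

(* Ridge regression: (X^T X + lambda * I) w = X^T y.
   lambda = 0.0 reduces to ordinary least squares. *)
let ridge x y lambda =
  let rows = Array.length x and cols = Array.length x.(0) in
  let xtx = Array.init cols (fun i ->
    Array.init cols (fun j ->
      let s = ref (if i = j then lambda else 0.0) in
      for r = 0 to rows - 1 do s := !s +. (x.(r).(i) *. x.(r).(j)) done;
      !s)) in
  let xty = Array.init cols (fun i ->
    let s = ref 0.0 in
    for r = 0 to rows - 1 do s := !s +. (x.(r).(i) *. y.(r)) done;
    !s) in
  solve xtx xty
```

The λI term bounds the smallest eigenvalue of the matrix being inverted away from zero, which is exactly why a nearly collinear design stops producing the astronomical coefficients shown above.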
Finally, let’s try a little classification. This example comes from Bayesian Reasoning and Machine Learning by David Barber.
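Barber's data and the Oml calls aren't reproduced here; as a sketch of the kind of classifier involved, here is a tiny Bernoulli naive Bayes over binary features with Laplace smoothing. The training data in the test is my own toy example, not the book's.

```ocaml
(* Sketch: Bernoulli naive Bayes over binary feature vectors.
   train returns, per label: (label, log prior, per-feature
   probability of true), with add-one (Laplace) smoothing. *)
let train examples =
  let by_label = Hashtbl.create 4 in
  List.iter (fun (l, fs) ->
    let cur = try Hashtbl.find by_label l with Not_found -> [] in
    Hashtbl.replace by_label l (fs :: cur)) examples;
  let total = float_of_int (List.length examples) in
  Hashtbl.fold (fun l rows acc ->
    let n = float_of_int (List.length rows) in
    let d = Array.length (List.hd rows) in
    let p = Array.init d (fun j ->
      let c = List.fold_left
          (fun a fs -> if fs.(j) then a +. 1.0 else a) 0.0 rows in
      (c +. 1.0) /. (n +. 2.0)) in
    (l, log (n /. total), p) :: acc) by_label []

(* Pick the label with the largest log posterior. *)
let classify model fs =
  let score (_, prior, p) =
    let (_, s) =
      Array.fold_left (fun (j, s) f ->
        (j + 1, s +. log (if f then p.(j) else 1.0 -. p.(j))))
        (0, 0.0) fs in
    prior +. s
  in
  match model with
  | [] -> invalid_arg "classify: empty model"
  | first :: rest ->
    let (l, _, _) =
      List.fold_left (fun best cand ->
        if score cand > score best then cand else best) first rest in
    l
```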
Oml uses Lacaml (for SVD, PCA, and multivariate regression) and L-BFGS (for a logistic regression classifier), so a big thank you to Markus Mottl and Christophe Troestler for their work; Oml would not be possible without them.
Furthermore, Carmelo Piccione provided valuable contributions and code reviews.
While we think of Oml as a work in progress, we want to release it to the general public to garner feedback. Writing a powerful and flexible library is going to take a community effort because developers have different interests and strengths. Furthermore, while one function name or signature may be obvious to a developer, it may be confusing to others, so we encourage feedback through the issue tracker. In the immediate future, we’d like to: