Introducing Oml, a Small OCaml Library for Numerical Computing
11 Aug 2015

We’d like to announce a library, Oml, that is a small attempt to fulfill a big need: numerical computing in OCaml. For a long time, people have been successfully using OCaml for numerical work, and many libraries have been written that either wrap existing functionality:
- Lacaml and SLAP for vectors, matrices, and linear algebra routines,
- L-BFGS for solving numerical optimizations,
- LibSVM for Support Vector Machines,
- Pareto for sampling from distributions.
Or provide a good OCaml solution to a specific problem:
- GPR for Gaussian Process Regression,
- FaCiLe for integer constraint solving,
- Libra (recently discovered) for discrete probabilistic networks,
- OCamlgraph for graph data structures and algorithms.
But a simple unified library with the middle pieces has been missing.
Origins
To understand Oml, it might be helpful to know about its origin. Oml started as a collection of scripts and routines for exploratory data analysis that developers found themselves reusing between projects. Some of us have a personal preference for developing software and data projects in a REPL; it provides an iterative feedback loop that lets one make lots of mistakes quickly and thereby learn. Unlike more traditional interpreter-based systems that are not statically typed (e.g. R, Matlab, and Python), the OCaml REPL also lets you model the problem domain with types. So we can have a little more confidence that crazy code is doing the “right” thing when the REPL confirms that the type is a float array array.
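For example, a throwaway session along these lines (the snippet is illustrative and doesn’t use Oml):

```ocaml
# let xs = [| 1.0; 2.0; 3.0 |] in
  Array.map (fun x -> Array.map (fun y -> x *. y) xs) xs;;
- : float array array =
[|[|1.; 2.; 3.|]; [|2.; 4.; 6.|]; [|3.; 6.; 9.|]|]
```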
We needed a library that would fit this workflow.
The second need arises when we start modeling the data: when running regressions or creating classifiers, we don’t want to turn to a statistics textbook or Google to understand the procedures for these methods in detail. It is at this moment that a type system can be particularly helpful, because it explicitly encodes the different possible analyses and their required parameters.
Installing
Oml is now available on opam, so installing it is as simple as:
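```sh
opam install oml
```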
Capabilities
Here is a quick tour of some of Oml’s capabilities. These examples were performed in the utop REPL (and edited for readability), so to get started:
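Assuming the findlib package shares the opam package name:

```ocaml
(* in utop *)
#require "oml";;
open Oml;;
```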
I also added these commands to make the examples easier to see:
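For instance, the standard toplevel directives that keep long printed values readable:

```ocaml
#print_length 20;;  (* print at most 20 elements of an array or list *)
#print_depth 4;;    (* limit how deeply nested values are printed *)
```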
First, let’s demonstrate Oml’s Sampling capabilities by generating some data from a fat-tailed distribution, using a method that is easier to understand than resorting to a specific distribution: mix two normal distributions with different variances. In the data below, every 10th data point is from a normal distribution with a standard deviation of 10, while all the rest have a standard deviation of 1.
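A sketch of the generation step. We assume here that Sampling.normal ~mean ~std () returns a generator of type unit -> float; check Oml’s documentation for the exact signature:

```ocaml
let data =
  let narrow = Sampling.normal ~mean:2.0 ~std:1.0 () in
  let wide   = Sampling.normal ~mean:2.0 ~std:10.0 () in
  (* every 10th draw comes from the wide (std = 10) distribution *)
  Array.init 1000 (fun i -> if i mod 10 = 0 then wide () else narrow ())
```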
Let’s investigate it.
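Assuming the summary function lives in a Descriptive module and returns a record of sample statistics (mean, standard deviation, skew, kurtosis, and their classifications):

```ocaml
# Descriptive.unbiased_summary data;;
```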
unbiased_summary deals with sample-size adjustments, such as dividing by n-1 for the standard deviation.
The `Fat kurtosis classification should raise some concern for the keen data analyst, but it is by design. The `Negative skew might draw more questions, but repeating the analysis can shed light on this “balancing” problem; consider the two helpers sketched below.
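Here is a reconstruction of the two helpers discussed next. The Oml calls are the same assumed interfaces as above, and the record layout (skew as a value paired with a classification) is a guess:

```ocaml
let data_skew frequency =
  let narrow = Sampling.normal ~mean:2.0 ~std:1.0 () in
  let wide   = Sampling.normal ~mean:2.0 ~std:10.0 () in
  Array.init 1000 (fun i -> if i mod frequency = 0 then wide () else narrow ())

let count_big_skew m frequency =
  let count = ref 0 in
  for _ = 1 to m do
    let s = Descriptive.unbiased_summary (data_skew frequency) in
    (* assumed: the skew field pairs the estimate with a classification *)
    match snd s.Descriptive.skew with
    | `Positive | `Negative -> incr count
    | _ -> ()
  done;
  !count
```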
The function data_skew uses the same technique to generate our data, but takes as an argument the frequency with which we sample from the wider (std = 10) distribution. Afterwards, count_big_skew summarizes samples of data, counting those where the skew classification is `Positive or `Negative.
Finally, we compare two cases: one where we generate samples of data from the wider distribution on every 10th draw, and one where we alternate.
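For example (the counts themselves will vary with the random state):

```ocaml
# count_big_skew 100 10;;  (* wide distribution on every 10th draw *)
# count_big_skew 100 2;;   (* wide distribution on every other draw *)
```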
Two sample runs demonstrate that skewed samples are much more frequent when drawing from the wider distribution is rarer.
Another way to think about this effect is that the mixture is so leptokurtic that the kurtosis dominates the other moments.
Since we know that the original data is supposed to have a mean of 2.0, we can ask for the probability of observing this data via an Inference test.
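A sketch of the test; we assume Inference exposes a one-sample t-test shaped roughly like this, with a hypothesized mean and a two-sided alternative, though the actual names may differ:

```ocaml
# Inference.(mean_t_test 2.0 Two_sided data);;
```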
So we shouldn’t be surprised by our data. Let’s demonstrate a simple regression.
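A univariate sketch under assumed names (a Regression.Univariate module with regress and coefficients); we fit y = 3 + 2x plus noise and expect coefficients near [|3.; 2.|]:

```ocaml
let noise = Sampling.normal ~mean:0.0 ~std:1.0 () in
let xs = Array.init 100 (fun i -> float i) in
let ys = Array.map (fun x -> 3.0 +. 2.0 *. x +. noise ()) xs in
let fit = Regression.Univariate.regress xs ys in
Regression.Univariate.coefficients fit
```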
What if we throw some collinearity into the mix?
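A setup that produces the pathology below might look like this; the multivariate interface is again assumed, the third predictor is (nearly) the sum of the first two, and the response coefficients are illustrative rather than those behind the quoted output:

```ocaml
let pred =
  Array.init 100 (fun _ ->
      let x1 = Random.float 10.0 and x2 = Random.float 10.0 in
      (* collinear by construction, up to a sliver of noise *)
      [| x1; x2; x1 +. x2 +. Random.float 1e-3 |])
let resp =
  Array.map (fun r -> 2.0 *. r.(0) +. 3.0 *. r.(1) +. 4.0 *. r.(2)) pred
let ols = Regression.Multivariate.regress pred resp
```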
As we can see, regular regression gives us back some pretty outrageous coefficients: [|-80174210710646.3594; -80174210710645.5156; 80174210710652.8594|]. But when we use ridge regression, even a small lambda of 0.1 helps us recover more sensible coefficients: [|1.8613; 2.7559; 4.6172|].
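The ridge variant might look like this; we assume regress accepts an L2 penalty, here spelled ~lambda, though the real option in Oml may be named differently:

```ocaml
let ridge = Regression.Multivariate.regress ~lambda:0.1 pred resp in
Regression.Multivariate.coefficients ridge
```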
Finally, let’s try a little classification. This example comes from Bayesian Reasoning and Machine Learning by David Barber.
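The call into Oml’s classifier is not reproduced here; as a stand-in, this is a from-scratch naive Bayes over binary features with Laplace smoothing, which is the technique the example exercises. All names below are ours, not Oml’s:

```ocaml
(* Train: per label, count samples and how often each binary feature is true.
   samples : (bool array * 'label) array, assumed non-empty *)
let train samples =
  let n_features = Array.length (fst samples.(0)) in
  let counts = Hashtbl.create 2 in
  Array.iter (fun (fs, lbl) ->
      let total, ones =
        match Hashtbl.find_opt counts lbl with
        | Some c -> c
        | None -> (0, Array.make n_features 0)
      in
      Array.iteri (fun i f -> if f then ones.(i) <- ones.(i) + 1) fs;
      Hashtbl.replace counts lbl (total + 1, ones))
    samples;
  counts

(* Classify by maximizing log prior plus summed log feature likelihoods. *)
let classify counts fs =
  let n = Hashtbl.fold (fun _ (t, _) acc -> acc + t) counts 0 in
  let best = ref (neg_infinity, None) in
  Hashtbl.iter (fun lbl (total, ones) ->
      let score = ref (log (float total /. float n)) in
      Array.iteri (fun i f ->
          (* Laplace-smoothed P(feature i is true | label) *)
          let p_true = (float ones.(i) +. 1.) /. (float total +. 2.) in
          score := !score +. log (if f then p_true else 1. -. p_true))
        fs;
      if !score > fst !best then best := (!score, Some lbl))
    counts;
  snd !best

(* e.g. classify (train [| ([|true; false|], `A); ([|false; true|], `B) |])
        [|true; true|] *)
```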
Thanks
Oml uses Lacaml (for SVD, PCA, and multivariate regression) and L-BFGS (for a logistic regression classifier), so a big thank you to Markus Mottl and Christophe Troestler for their work; Oml would not be possible without them.
Furthermore, Carmelo Piccione provided valuable contributions and code reviews.
Future Work
While we think of Oml as a work in progress, we want to release it to the general public to garner feedback. Writing a powerful and flexible library is going to take a community effort because developers have different interests and strengths. Furthermore, while one function name or signature may be obvious to a developer, it may be confusing to others, so we encourage feedback through the issue tracker. In the immediate future, we’d like to:
- Refactor the testing framework to have uniform float comparison and perhaps formalize the probabilistic properties.
- Refine some of the interfaces to use Functors.
- Develop a uniform representation of multidimensional data that can handle the awkward inputs to some of these algorithms gracefully.