Introducing Oml, a Small OCaml Library for Numerical Computing
11 Aug 2015

We’d like to announce a library, Oml, that is a small attempt to fulfill a big need: numerical computing in OCaml. For a long time, people have been successfully using OCaml for numerical work, and many libraries have been written that either wrap existing functionality:
- Lacaml and SLAP for vectors, matrices, and linear algebra routines,
- L-BFGS for solving numerical optimizations,
- LibSVM for Support Vector Machines,
- Pareto for sampling from distributions.
Or provide a good OCaml solution to a specific problem:
- GPR for Gaussian Process Regression,
- FaCiLe for integer constraint solving,
- Libra (recently discovered) for discrete probabilistic networks,
- OCamlgraph for graph data structures and algorithms.
But a simple unified library with the middle pieces has been missing.
Origins
To understand Oml, it might be helpful to know about its origin. Oml started as a collection of scripts and routines for exploratory data analysis that developers found themselves reusing between projects. Some of us have a personal preference for developing software and data projects in a REPL; it provides an iterative feedback loop that lets one make lots of mistakes quickly and thereby learn. Unlike more traditional interpreter-based systems that are not statically typed (e.g. R, Matlab, and Python), the OCaml REPL also lets you model the problem domain with types. So we can have a little more confidence that crazy code is doing the “right” thing when the REPL confirms that the type is a float array array.
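For example, a throwaway session along these lines (the snippet is illustrative and doesn’t use Oml):

```ocaml
# let xs = [| 1.0; 2.0; 3.0 |] in
  Array.map (fun x -> Array.map (fun y -> x *. y) xs) xs;;
- : float array array =
[|[|1.; 2.; 3.|]; [|2.; 4.; 6.|]; [|3.; 6.; 9.|]|]
```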
We needed a library that would fit this workflow.
The second need arises when we start modeling the data: when running regressions or creating classifiers, we don’t want to turn to a statistics textbook or Google to understand the procedures for these methods in detail. It is at this moment that a type system can be particularly helpful, because it explicitly encodes the different possible analyses and their required parameters.
Installing
Oml is now available on opam, so installing it is as simple as:
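```sh
opam install oml
```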
Capabilities
Here is a quick tour of some of Oml’s capabilities. These examples were performed in the utop REPL (and edited for readability), so to get started:
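Assuming the findlib package shares the opam package name:

```ocaml
(* in utop *)
#require "oml";;
open Oml;;
```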
I also added these commands to make the examples easier to see:
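For instance, the standard toplevel directives that keep long printed values readable:

```ocaml
#print_length 20;;  (* print at most 20 elements of an array or list *)
#print_depth 4;;    (* limit how deeply nested values are printed *)
```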
First, let’s demonstrate Oml’s Sampling capabilities by generating some data from a fat-tailed distribution, using a method that is easier to understand than resorting to a specific distribution: mix two normal distributions with different variances. In the data below, every 10th data point is from a normal distribution with a standard deviation of 10, while all the rest have a standard deviation of 1.
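A sketch of the generation step. We assume here that Sampling.normal ~mean ~std () returns a generator of type unit -> float; check Oml’s documentation for the exact signature:

```ocaml
let data =
  let narrow = Sampling.normal ~mean:2.0 ~std:1.0 () in
  let wide   = Sampling.normal ~mean:2.0 ~std:10.0 () in
  (* every 10th draw comes from the wide (std = 10) distribution *)
  Array.init 1000 (fun i -> if i mod 10 = 0 then wide () else narrow ())
```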
Let’s investigate it.
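Assuming the summary function lives in a Descriptive module and returns a record of sample statistics (mean, standard deviation, skew, kurtosis, and their classifications):

```ocaml
# Descriptive.unbiased_summary data;;
```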
unbiased_summary deals with sample-size adjustments, such as dividing by n-1 for the standard deviation.
The `Fat kurtosis classification should raise some concern for the keen data analyst, but it is by design. The `Negative skew might draw more questions, but repeating the analysis can shed light on this “balancing” problem; consider the two helpers sketched below.
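Here is a reconstruction of the two helpers discussed next. The Oml calls are the same assumed interfaces as above, and the record layout (skew as a value paired with a classification) is a guess:

```ocaml
let data_skew frequency =
  let narrow = Sampling.normal ~mean:2.0 ~std:1.0 () in
  let wide   = Sampling.normal ~mean:2.0 ~std:10.0 () in
  Array.init 1000 (fun i -> if i mod frequency = 0 then wide () else narrow ())

let count_big_skew m frequency =
  let count = ref 0 in
  for _ = 1 to m do
    let s = Descriptive.unbiased_summary (data_skew frequency) in
    (* assumed: the skew field pairs the estimate with a classification *)
    match snd s.Descriptive.skew with
    | `Positive | `Negative -> incr count
    | _ -> ()
  done;
  !count
```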
The function data_skew uses the same technique to generate our data, but takes as an argument the frequency with which we sample from the wider (std = 10) distribution. Afterwards, count_big_skew summarizes samples of data, counting those where the skew classification is `Positive or `Negative.
Finally, we compare two cases: one where we generate samples of data from the wider distribution on every 10th draw, and one where we alternate.
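For example (the counts themselves will vary with the random state):

```ocaml
# count_big_skew 100 10;;  (* wide distribution on every 10th draw *)
# count_big_skew 100 2;;   (* wide distribution on every other draw *)
```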
Two sample runs demonstrate that skewed samples are much more frequent when drawing from the wider distribution is rarer.
Another way to think about this effect is that the mixture is so leptokurtic that the kurtosis dominates the other moments.
Since we know that the original data is supposed to have a mean of 2.0, we can ask for the probability of observing this data via an Inference test.
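A sketch of the test; we assume Inference exposes a one-sample t-test shaped roughly like this, with a hypothesized mean and a two-sided alternative, though the actual names may differ:

```ocaml
# Inference.(mean_t_test 2.0 Two_sided data);;
```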
So we shouldn’t be surprised by our data. Let’s demonstrate a simple regression.
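A univariate sketch under assumed names (a Regression.Univariate module with regress and coefficients); we fit y = 3 + 2x plus noise and expect coefficients near [|3.; 2.|]:

```ocaml
let noise = Sampling.normal ~mean:0.0 ~std:1.0 () in
let xs = Array.init 100 (fun i -> float i) in
let ys = Array.map (fun x -> 3.0 +. 2.0 *. x +. noise ()) xs in
let fit = Regression.Univariate.regress xs ys in
Regression.Univariate.coefficients fit
```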
What if we throw some collinearity into the mix?
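A setup that produces the pathology below might look like this; the multivariate interface is again assumed, the third predictor is (nearly) the sum of the first two, and the response coefficients are illustrative rather than those behind the quoted output:

```ocaml
let pred =
  Array.init 100 (fun _ ->
      let x1 = Random.float 10.0 and x2 = Random.float 10.0 in
      (* collinear by construction, up to a sliver of noise *)
      [| x1; x2; x1 +. x2 +. Random.float 1e-3 |])
let resp =
  Array.map (fun r -> 2.0 *. r.(0) +. 3.0 *. r.(1) +. 4.0 *. r.(2)) pred
let ols = Regression.Multivariate.regress pred resp
```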
As we can see, regular regression gives us back some pretty outrageous coefficients: [|-80174210710646.3594; -80174210710645.5156; 80174210710652.8594|]. But when we use ridge regression, even a small lambda of 0.1 helps us recover more sensible coefficients: [|1.8613; 2.7559; 4.6172|].
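The ridge variant might look like this; we assume regress accepts an L2 penalty, here spelled ~lambda, though the real option in Oml may be named differently:

```ocaml
let ridge = Regression.Multivariate.regress ~lambda:0.1 pred resp in
Regression.Multivariate.coefficients ridge
```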
Finally, let’s try a little classification. This example comes from Bayesian Reasoning and Machine Learning by David Barber.
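The call into Oml’s classifier is not reproduced here; as a stand-in, this is a from-scratch naive Bayes over binary features with Laplace smoothing, which is the technique the example exercises. All names below are ours, not Oml’s:

```ocaml
(* Train: per label, count samples and how often each binary feature is true.
   samples : (bool array * 'label) array, assumed non-empty *)
let train samples =
  let n_features = Array.length (fst samples.(0)) in
  let counts = Hashtbl.create 2 in
  Array.iter (fun (fs, lbl) ->
      let total, ones =
        match Hashtbl.find_opt counts lbl with
        | Some c -> c
        | None -> (0, Array.make n_features 0)
      in
      Array.iteri (fun i f -> if f then ones.(i) <- ones.(i) + 1) fs;
      Hashtbl.replace counts lbl (total + 1, ones))
    samples;
  counts

(* Classify by maximizing log prior plus summed log feature likelihoods. *)
let classify counts fs =
  let n = Hashtbl.fold (fun _ (t, _) acc -> acc + t) counts 0 in
  let best = ref (neg_infinity, None) in
  Hashtbl.iter (fun lbl (total, ones) ->
      let score = ref (log (float total /. float n)) in
      Array.iteri (fun i f ->
          (* Laplace-smoothed P(feature i is true | label) *)
          let p_true = (float ones.(i) +. 1.) /. (float total +. 2.) in
          score := !score +. log (if f then p_true else 1. -. p_true))
        fs;
      if !score > fst !best then best := (!score, Some lbl))
    counts;
  snd !best

(* e.g. classify (train [| ([|true; false|], `A); ([|false; true|], `B) |])
        [|true; true|] *)
```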
Thanks
Oml uses Lacaml (for SVD, PCA, and multivariate regression) and L-BFGS (for a logistic regression classifier), so a big thank you to Markus Mottl and Christophe Troestler for their work; Oml would not be possible without them.
Furthermore, Carmelo Piccione provided valuable contributions and code reviews.
Future Work
While we think of Oml as a work in progress, we want to release it to the general public to garner feedback. Writing a powerful and flexible library is going to take a community effort because developers have different interests and strengths. Furthermore, while one function name or signature may be obvious to a developer, it may be confusing to others, so we encourage feedback through the issue tracker. In the immediate future, we’d like to:
- Refactor the testing framework to have uniform float comparison and perhaps formalize the probabilistic properties.
- Refine some of the interfaces to use Functors.
- Develop a uniform representation of multidimensional data that can handle the awkward inputs to some of these algorithms gracefully.