# Introducing Oml, a Small OCaml Library for Numerical Computing

11 Aug 2015

We’d like to announce Oml, a library that is a small attempt to fulfill a big need: numerical computing in OCaml. For a long time, people have been successfully using OCaml for numerical work, and many libraries have been written that either wrap existing functionality:

- Lacaml and SLAP for vectors, matrices, and linear algebra routines,
- L-BFGS for solving numerical optimizations,
- LibSVM for Support Vector Machines,
- Pareto for sampling from distributions.

Or provide a good OCaml solution to a specific problem:

- GPR for Gaussian Process Regression,
- FaCiLe for integer constraint solving,
- Libra (recently discovered) for discrete probabilistic networks,
- OCamlgraph for graph data structures and algorithms.

But a simple unified library with the middle pieces has been missing.

## Origins

To understand Oml, it might be helpful to know about its origin.
Oml started as a collection of scripts and routines for exploratory data analysis
that developers found themselves reusing between projects.
Some have a personal preference for developing software and data projects by using a REPL;
it provides an iterative feedback loop that allows one to make lots of mistakes quickly
and thereby learn.
Unlike more traditional, interpreter-based systems that are not statically typed
(e.g. R, Matlab, and Python), the OCaml REPL
also lets you model the problem domain with types.
So we can have a little more confidence that *crazy* code is doing the
“right” thing when the REPL confirms that its type is a `float array array`.
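As a hypothetical illustration (not the post’s original snippet), a utop session along these lines shows the REPL reporting such a type:

```ocaml
(* Build a small matrix as nested arrays; utop infers and prints the type. *)
let m = Array.init 3 (fun i -> Array.init 4 (fun j -> float_of_int (i * j)));;
(* utop replies with:  val m : float array array = ... *)
```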

We needed a library that would fit this workflow.

The second need arises when we start modeling the data by running regressions or creating classifiers: we don’t want to turn to a statistics textbook or Google to understand the procedures for these methods in detail. It is at this moment that a type system can be particularly helpful, because it specifically encodes the different possible analyses and their required parameters.

## Installing

Oml is now available on opam, so installing it is straightforward.
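For instance, assuming the package is published under the name `oml`:

```shell
opam install oml
```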

## Capabilities

Here is a quick tour of some of Oml’s capabilities. These examples were performed in the utop REPL (and edited for readability).
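To get started, something like the following should work in utop (assuming Oml’s findlib package is also named `oml`):

```ocaml
(* Load the library in utop; the findlib package name is an assumption. *)
#require "oml";;
```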

I also set a few toplevel options to make the examples easier to see.
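For example, the stock toplevel’s printer directives can tame long outputs (the exact values here are arbitrary, not the post’s):

```ocaml
#print_length 20;;  (* print at most 20 elements per collection *)
#print_depth 5;;    (* limit the nesting depth of printed values *)
```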

First let’s demonstrate Oml’s `Sampling` capabilities by generating some data
from a fat-tailed distribution, using a method that is easier to understand
than resorting to a specific distribution: mix two normal distributions with
different variances. In the `data` below, every 10th data point is drawn from
a normal distribution with a standard deviation of `10`, while all the rest
have a standard deviation of `1`.
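As a sketch of the technique in plain OCaml (a Box-Muller helper stands in for Oml’s `Sampling` API, whose actual signatures may differ):

```ocaml
(* Draw from N(mean, std^2) via the Box-Muller transform. *)
let normal ~mean ~std () =
  let u1 = Random.float 1.0 and u2 = Random.float 1.0 in
  mean +. std *. sqrt (-2.0 *. log (1.0 -. u1)) *. cos (2.0 *. Float.pi *. u2)

(* Every 10th point comes from the wide (std = 10) distribution. *)
let data =
  Array.init 1_000 (fun i ->
      let std = if i mod 10 = 0 then 10.0 else 1.0 in
      normal ~mean:2.0 ~std ())
```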

Let’s investigate it with `unbiased_summary`, which
deals with sample-size adjustments, such as dividing by n - 1 for the standard deviation.
The `` `Fat `` kurtosis classification should raise some concern for the keen data analyst, but it is by design.
The `` `Negative `` skew might draw more questions, but repeating the analysis can shed light
on this “balancing” problem; consider the following experiment.

The function `data_skew` uses the same technique to generate our data, but
takes as an argument the frequency with which we sample from the wider
(`std=10`) distribution. Afterwards, `count_big_skew` summarizes samples of
data, counting those where the skew classification is `` `Positive `` or
`` `Negative ``. Finally, we compare two cases: one where we generate samples
of data from the wider distribution on every 10th draw, and one where we
alternate. Two sample runs demonstrate that skewed samples are much more
frequent when drawing from the wider distribution is rarer.
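A plain-OCaml sketch of this experiment (the names `data_skew` and `count_big_skew` follow the post, but the bodies below are illustrative stand-ins, not Oml code):

```ocaml
(* Standard-normal draw scaled by std, via Box-Muller. *)
let normal ~std () =
  let u1 = Random.float 1.0 and u2 = Random.float 1.0 in
  std *. sqrt (-2.0 *. log (1.0 -. u1)) *. cos (2.0 *. Float.pi *. u2)

(* Generate n points, drawing from the std = 10 distribution
   on every [freq]th sample. *)
let data_skew ~freq n =
  Array.init n (fun i ->
      normal ~std:(if i mod freq = 0 then 10.0 else 1.0) ())

(* Sample skewness: third central moment over variance^(3/2). *)
let skewness a =
  let n = float_of_int (Array.length a) in
  let mean = Array.fold_left (+.) 0.0 a /. n in
  let m k = Array.fold_left (fun s x -> s +. (x -. mean) ** k) 0.0 a /. n in
  m 3.0 /. (m 2.0 ** 1.5)

(* Count, over [trials] generated samples, how often the skew is "big";
   the 0.5 threshold is an arbitrary illustrative cutoff. *)
let count_big_skew ~freq ~trials =
  let big = ref 0 in
  for _ = 1 to trials do
    if abs_float (skewness (data_skew ~freq 1_000)) > 0.5 then incr big
  done;
  !big

let rare  = count_big_skew ~freq:10 ~trials:100  (* wide draw is rare *)
let often = count_big_skew ~freq:2  ~trials:100  (* alternating draws *)
```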

Another way to think about this effect is that the distribution is so leptokurtic in relation to the other moments that it dominates.

Since we know that the original data is supposed to have a mean of 2.0, we can ask for the probability of observing this data via an `Inference` test.
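Such a test boils down to a one-sample statistic; a minimal sketch, assuming a t-test against the null mean of 2.0 (not necessarily the exact procedure behind Oml’s `Inference` module):

```ocaml
(* One-sample t statistic: (sample mean - null) / standard error. *)
let t_statistic ~null a =
  let n = float_of_int (Array.length a) in
  let mean = Array.fold_left (+.) 0.0 a /. n in
  let var =
    Array.fold_left (fun s x -> s +. (x -. mean) ** 2.0) 0.0 a /. (n -. 1.0)
  in
  (mean -. null) /. sqrt (var /. n)

(* Hypothetical data hovering around 2.0 yields a small statistic. *)
let t = t_statistic ~null:2.0 [| 1.9; 2.1; 2.0; 2.2; 1.8; 2.05 |]
```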

So we shouldn’t be surprised by our data. Let’s demonstrate a simple regression.
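For intuition, simple linear regression has a closed form; a plain-OCaml sketch (Oml’s regression interface may differ):

```ocaml
(* Fit y = a + b * x by ordinary least squares:
   b = cov(x, y) / var(x),  a = mean(y) - b * mean(x). *)
let fit xs ys =
  let n = float_of_int (Array.length xs) in
  let mean a = Array.fold_left (+.) 0.0 a /. n in
  let mx = mean xs and my = mean ys in
  let cov = ref 0.0 and var = ref 0.0 in
  Array.iteri (fun i x ->
      cov := !cov +. (x -. mx) *. (ys.(i) -. my);
      var := !var +. (x -. mx) ** 2.0) xs;
  let b = !cov /. !var in
  (my -. b *. mx, b)  (* (intercept, slope) *)

(* Noise-free toy data on the line y = 1 + 2x. *)
let intercept, slope = fit [| 1.; 2.; 3.; 4. |] [| 3.; 5.; 7.; 9. |]
```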

What if we throw some collinearity into the mix?

As we can see, regular regression gives us back some pretty outrageous
coefficients: `[|-80174210710646.3594; -80174210710645.5156; 80174210710652.8594|]`.
But when we use ridge regression, even a small `lambda` of `0.1` helps us
recover more sensible coefficients: `[|1.8613; 2.7559; 4.6172|]`.
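The penalty’s effect can be sketched via the normal equations (XᵀX + λI)w = Xᵀy; the solver below is illustrative plain OCaml, not Oml’s implementation:

```ocaml
(* Solve the linear system a * x = b in place, by Gaussian elimination
   with partial pivoting. *)
let solve a b =
  let n = Array.length b in
  for col = 0 to n - 1 do
    let piv = ref col in
    for r = col + 1 to n - 1 do
      if abs_float a.(r).(col) > abs_float a.(!piv).(col) then piv := r
    done;
    let swap arr i j = let t = arr.(i) in arr.(i) <- arr.(j); arr.(j) <- t in
    swap a col !piv; swap b col !piv;
    for r = col + 1 to n - 1 do
      let f = a.(r).(col) /. a.(col).(col) in
      for c = col to n - 1 do a.(r).(c) <- a.(r).(c) -. f *. a.(col).(c) done;
      b.(r) <- b.(r) -. f *. b.(col)
    done
  done;
  let x = Array.make n 0.0 in
  for r = n - 1 downto 0 do
    let s = ref b.(r) in
    for c = r + 1 to n - 1 do s := !s -. a.(r).(c) *. x.(c) done;
    x.(r) <- !s /. a.(r).(r)
  done;
  x

(* Ridge: form XtX with lambda on the diagonal, then solve for w. *)
let ridge ~lambda xs ys =
  let p = Array.length xs.(0) in
  let xtx =
    Array.init p (fun i ->
        Array.init p (fun j ->
            (if i = j then lambda else 0.0)
            +. Array.fold_left (fun s row -> s +. row.(i) *. row.(j)) 0.0 xs))
  in
  let xty =
    Array.init p (fun i ->
        let s = ref 0.0 in
        Array.iteri (fun k row -> s := !s +. row.(i) *. ys.(k)) xs;
        !s)
  in
  solve xtx xty

(* Toy design whose first two columns are nearly collinear. *)
let w =
  ridge ~lambda:0.1
    [| [| 1.0; 1.01; 0.0 |]; [| 2.0; 1.99; 1.0 |];
       [| 3.0; 3.02; 0.0 |]; [| 4.0; 3.98; 1.0 |] |]
    [| 3.0; 7.0; 9.0; 13.0 |]
```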

Finally, let’s try a little classification. This example comes from *Bayesian Reasoning and Machine Learning* by David Barber.
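Barber’s dataset isn’t reproduced here, but the flavor of such a classifier can be sketched as a tiny Bernoulli naive Bayes with Laplace smoothing (hypothetical toy data, not the book’s, and not Oml’s interface):

```ocaml
(* Train on (features, class) pairs: estimate the class prior and,
   per feature, a smoothed Bernoulli probability for each class. *)
let train data =
  let count cls =
    Array.fold_left (fun n (_, c) -> if c = cls then n + 1 else n) 0 data in
  let n_t = count true and n_f = count false in
  let p = Array.length (fst data.(0)) in
  let feat_prob cls n_cls =
    Array.init p (fun j ->
        let ones =
          Array.fold_left
            (fun s (x, c) -> if c = cls && x.(j) then s + 1 else s) 0 data
        in
        (* Laplace smoothing: add one success and one failure. *)
        (float_of_int ones +. 1.0) /. (float_of_int n_cls +. 2.0))
  in
  ( float_of_int n_t /. float_of_int (n_t + n_f),
    feat_prob true n_t,
    feat_prob false n_f )

(* Classify by comparing log-likelihoods under the two classes. *)
let classify (prior_t, pt, pf) x =
  let loglik prior probs =
    Array.to_list probs
    |> List.mapi (fun j pj -> log (if x.(j) then pj else 1.0 -. pj))
    |> List.fold_left (+.) (log prior)
  in
  loglik prior_t pt > loglik (1.0 -. prior_t) pf

(* Toy data: the [true] class tends to have its first feature set. *)
let model =
  train
    [| ([| true; true |], true); ([| true; false |], true);
       ([| false; false |], false); ([| false; true |], false) |]

let prediction = classify model [| true; true |]
```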

## Thanks

Oml uses Lacaml (for SVD, PCA, and multivariate regression) and L-BFGS (for a logistic regression classifier), so a big thank you to Markus Mottl and Christophe Troestler for their work; Oml would not be possible without them.

Furthermore, Carmelo Piccione provided valuable contributions and code reviews.

## Future Work

While we think of Oml as a work in progress, we want to release it to the general public to garner feedback. Writing a powerful and flexible library is going to take a community effort because developers have different interests and strengths. Furthermore, while one function name or signature may be obvious to a developer, it may be confusing to others, so we encourage feedback through the issue tracker. In the immediate future, we’d like to:

- Refactor the testing framework to have uniform float comparison and perhaps formalize the probabilistic properties.
- Refine some of the interfaces to use Functors.
- Develop a uniform representation of multidimensional data that can handle the awkward inputs to some of these algorithms gracefully.