Testing Oml

Here at the Hammer Lab we have a well-documented focus on testing as an important component of good software engineering. For our open source projects, we selfishly want to encourage test writing because it helps with adoption. Oml is no different, but it presented some novel challenges and experiences that we'd like to share.

Challenges

Writing a test suite for a general numerical library poses a set of unique questions.

How do you make your tests comprehensive?

One unit test matching a specific input to a specific output is fine, but writing a sufficient number of such cases will not scale. Furthermore, the actual input domain, which is usually the set of floats, is huge and contains such special cases as nan, infinity and neg_infinity.

A natural solution is to use property-based testing like QuickCheck. But choosing appropriate properties for algorithms that essentially encode non-linear functions is not trivial. Specifically, fixed points limit the range of values that you test. Furthermore, testing higher-order relationships between functions can require more complex algorithms than the functions to be tested, and hence introduce another source of error.

Within these tests one wants to avoid probabilistic failures due to floating point computations.

Example

To make the example concrete, consider testing a univariate linear regression algorithm. Regression takes as input two arrays of floats: the independent and dependent variables. We can write a specific unit test:

module U = Regression.Univariate ;;
let independent = [| 1.0; 2.0; 3.0 |] ;;
let dependent = [| 2.0; 4.0; 6.0 |] ;;
let r = U.regress None ~pred:independent ~resp:dependent ;;
assert (U.alpha r = 0. && U.beta r = 2.0) ;;

This is the solution that will not scale. We can think of specific independent/dependent array pairs where we know the generating linear model:

let p = 3.0 (* Some well behaved float that we know. *) ;;
let independent = [| 1.0; 2.0; 3.0 |] ;;
let dependent = Array.map (( *. ) p) independent ;;
let r = U.regress None ~pred:independent ~resp:dependent ;;
assert (U.alpha r = 0. && U.beta r = p) ;;

But this solution isn’t much better because it still exercises only a narrow slice of the possible use cases for these methods. We can try to test a “higher-order” relationship:

let independent = [| 1.0; 2.0; 3.0 |] ;;
let dependent = [| 2.0; 4.0; 6.0 |] ;;
let r = U.regress None ~pred:independent ~resp:dependent ;;
let t = 2.0 (* Some well behaved float that we know. *) ;;
let independent2 = Array.map (( *. ) t) independent ;;
let dependent2 = Array.map (( *. ) t) dependent ;;
let r2 = U.regress None ~pred:independent2 ~resp:dependent2 ;;
assert (U.alpha r = U.alpha r2 && U.beta r = U.beta r2) ;;

But finding these relationships isn’t always easy, and testing them isn’t necessarily better than trying to test the algorithm directly:

let independent = Array.init 100 (fun _ -> Random.float 1e3) ;;
let alpha = Random.float 1e3 ;;
let beta = Random.float 1e3 ;;
let dependent = Array.map (fun x -> beta *. x +. alpha) independent ;;
let r = U.regress None ~pred:independent ~resp:dependent ;;
assert (U.alpha r = alpha && U.beta r = beta) ;;

But this simple example will most probably not pass, because there is no guarantee that our inputs are “well behaved” or that the computation is free of rounding error. This leads to one of the main problems that our test suite has to deal with: imprecise floating point calculations.
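
One way to make such a test robust is to compare results only up to a tolerance. As a sketch (this particular helper definition is ours for illustration; a similar equal_floats shows up in the lessons below), the last assertion could be weakened to:

let equal_floats ~d x y = abs_float (x -. y) < d ;;
let r = U.regress None ~pred:independent ~resp:dependent ;;
assert (equal_floats ~d:1e-3 (U.alpha r) alpha && equal_floats ~d:1e-3 (U.beta r) beta) ;;

Of course, this only shifts the question to choosing d, which depends on the magnitude of the inputs and on the algorithm itself.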

How do you express floating point dependencies between functions?

How do you order the testing and convey that if a function has certain properties, then functions that depend on it will have derived properties? For example, it is well known that Newton’s method has quadratic convergence; it doubles the number of computed significant digits at each iteration. If we were to use this solver to compute a special irrational constant (for example the Golden Ratio, a root of x^2 - x - 1), the accuracy of that constant would depend quadratically on the number of iterations. This would change if we were to use a different method.
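
To make the convergence claim concrete, here is a throwaway Newton iteration (a hand-rolled sketch, not Oml’s solver) for the positive root of x^2 - x - 1; printing the error after each iteration shows the number of correct digits roughly doubling until float precision is exhausted:

(* One Newton step for f(x) = x^2 - x - 1, f'(x) = 2x - 1. *)
let newton_step x = x -. (x *. x -. x -. 1.0) /. (2.0 *. x -. 1.0) ;;
let rec iterate f n x = if n = 0 then x else iterate f (n - 1) (f x) ;;
let golden = (1.0 +. sqrt 5.0) /. 2.0 ;;
List.iter
  (fun n ->
    Printf.printf "%d iterations: error = %.3e\n" n
      (abs_float (iterate newton_step n 2.0 -. golden)))
  [1; 2; 3; 4] ;;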

Furthermore, what if the point of the test is to verify specific, refined behavior of an algorithm, such as faster convergence or more precision per operation? For example, the module that computes means and standard deviations in batch has different performance characteristics from the online versions.
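
For instance (a schematic comparison, not Oml’s code), an online mean updates its estimate one observation at a time and accumulates rounding error differently from the naive sum-then-divide version, so a test that demands bit-for-bit agreement between the two judges the wrong thing:

(* Naive mean: sum everything, then divide. *)
let batch_mean a =
  Array.fold_left (+.) 0.0 a /. float_of_int (Array.length a) ;;

(* Online mean: updated incrementally, one observation at a time. *)
let online_mean a =
  let m = ref 0.0 in
  Array.iteri (fun i x -> m := !m +. (x -. !m) /. float_of_int (i + 1)) a;
  !m ;;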

How do the tests fit into the codebase?

Although this question is common to all software projects, it posed interesting concerns here. We want the tests to be distributed with the source code, but obviously not built into the library. At the same time we want flexibility: a test could target any part of the code, in particular internal functions that are not exposed as part of the public interface but may be the crux of the algorithm we want to test.

Choosing a framework

We surveyed the available options for testing in OCaml and assembled some brief notes:

  • OUnit: assertion-based testing along the lines of JUnit.
    • Pro: Lots of different assert logic. There are included frameworks for logging, configurations, runners, and setting up and tearing down.
    • Pro: API Documentation.
    • Pro: Mature product, on to the 2.0 release and still being updated.
    • Con: No driver utility; tests have to be compiled into a separate executable.
    • Con: No property-based testing.
  • iTeML: formerly known as qtest, provides inline (inside comments) tests via a command line tool that extracts those tests.
    • Pro: Different pragmas to support different styles (assertion-based, property-based).
    • Pro: Used by the Batteries project.
    • Pro: Pretty good documentation.
    • Con: Code in comments. Our tests would be pretty long and this would impact readability.
    • Etc: The command line tool does some of the work of a driver, but one still has to compile the tests to a separate executable.
  • Kaputt: broad range of testing abilities.
    • Pro: Lightweight assert and property based testing.
    • Pro: Some driver capabilities via camlp4 to combine modules.
    • Pro: Best documentation.
    • Con: Camlp4 dependency.
    • Con: Hasn’t been touched in about 2-3 years.
  • Pa_test: inline testing macros.
    • Pro: Used in production.
    • Con: Setup? Is there a driver? No documentation.
  • QCheck: property-based testing library for OCaml.
    • Pro: Active development.
    • Con: No assert.
    • Con: No driver.
    • Con: Limited to testing the exposed API; what if you want your test to access internal functions?
  • There are other pieces of software available, but they either did not seem like full-fledged libraries that could satisfy our demands, or they were not available when we had to make a decision.

Property-based testing, as previously described, is a natural fit for many of the tests of this library’s functionality; this requirement eliminated Pa_test and OUnit. While powerful, coding with inline commented pragmas seemed an inflexible solution, so iTeML was eliminated. QCheck seemed like it would be simple to use, but the lack of documentation and examples was concerning. The blending of property-based checking and simple assert logic into one framework led us to choose Kaputt.

Setup

Setting up the testing proved to be interesting, to say the least; our solution is a local maximum constrained by ocamlbuild idiosyncrasies and developer patience.

The Kaputt documentation recommends keeping a separate foo.mlt file with the test code for foo.ml. The tests are afterwards concatenated to the original source code via a preprocessing step. This seems like a flexible way to write both the original and the testing code.

To control whether we’re building the tests or the regular library, we added a .test target to ocamlbuild via a plugin. The plugin makes sure that the .mlt test files are copied over to the ‘sanitized’ build directory of ocamlbuild.

We changed Kaputt’s default preprocessor to modify the original .ml file in place via joiner.ml so that our code coverage would pick up the right filename. The joiner also uses an OCaml line directive to separate the original source code from the testing source code: a line of the form # 1 "new_name.ml" tells the compiler that the following code comes from new_name.ml. This ensures that compilation errors are correctly traced back to the test .mlt file.

We also wrap the test with bisect ignore comments (*BISECT-IGNORE-BEGIN*) and (*BISECT-IGNORE-END*) to have accurate measures of code coverage.
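
Putting the pieces together, the joined file that actually gets compiled looks roughly like this (the file names and contents are illustrative):

(* foo.ml : the original library code, measured by bisect. *)
let internal_helper x = x *. 2.0
(*BISECT-IGNORE-BEGIN*)
# 1 "foo.mlt"
(* foo.mlt : the appended test code; the directive above makes the compiler
   report errors against foo.mlt, and bisect skips this region. *)
let () = assert (internal_helper 2.0 = 4.0)
(*BISECT-IGNORE-END*)

Because the tests end up inside the module itself, they can exercise internal functions that never appear in the public interface, which answers the flexibility concern raised above.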

Lessons learned

Effectively testing this library has been a learning experience.

  1. While most tests represent legitimate mathematical properties, the nature of floating point computation turns these deterministic properties into probabilistic ones; for example, when run 100 times, a test case like the regression one above occasionally (but routinely) fails because of floating point rounding. The nature of the problem is such that it is often not obvious which step of the test is responsible for the lost precision, and the workarounds (comparing means, or making multiple comparisons) might themselves be less precise than the algorithm you’re testing.

  2. Controlling these probabilistic failures has turned into a balancing act. On the input side we can (and do) restrict the size of the input values, while on the output side we can adjust how many significant digits we use when comparing floats. We have tried to be principled about this by explicitly refusing to generate float values without a developer-specified bound (a sketch of such a generator follows this list).

  3. These bounds imply an interesting characteristic of numerical algorithms; namely, a volume of acceptable input/output pairs. Consider a small list of these parameters:

    • The upper bound on the random float generators used as test input, e.g. bfloat 1e10
    • The number of digits to which we compare the outputs, e.g. 3 in equal_floats ~d:1e-3
    • The size of the input array, e.g. 100

    Treating these parameters as the sides of a box: if we compare two algorithms that do the “same” thing, and we fix a pass rate (e.g. 99%) for these boxes/tests, we want the one with the bigger volume. Numerically determining these values is an area of future work.

  4. Kaputt has an interesting Specification API; it allows a richer mapping between input and output. This is a powerful mechanism that we are just learning to use. Unfortunately, it is not clear how to specify a failure rate to address the first lesson, or how to specify a relationship between input sizes and output comparisons to address the second.

  5. You won’t see this in the code base or documentation, but property-based testing is highly effective at catching bugs in this domain.
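
As a postscript to lesson 2, here is a minimal sketch of the kind of bounded generator we have in mind (the definition is illustrative, not Oml’s):

(* There is no default bound: the caller must state how large a float is acceptable. *)
let bfloat bound () =
  (if Random.bool () then 1.0 else -1.0) *. Random.float bound ;;

let independent = Array.init 100 (fun _ -> bfloat 1e10 ()) ;;

Together with the comparison tolerance (equal_floats ~d:1e-3) and the input array size (100), the bound 1e10 forms one side of the box described in lesson 3.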