# Extending Gene Set Enrichment Analysis with Cancer Immunology Collections

Gene Set Enrichment Analysis (GSEA) is a well-known and widely-used method in Computational Biology and Bioinformatics. GSEA uses a sorted list of genes (obtained by comparing gene expression levels between groups of patients) and a database of gene sets as input, and it checks whether members of a particular gene set have a non-random ordering, biasing them towards the top or the bottom of the list (i.e. enrichment). We leave the details of the method out of this blog post, but the original paper describing GSEA is a good starting point for newcomers.

Gene sets often represent biologically meaningful groupings: for example, all genes that are active during controlled cell death. It is common to see these gene sets being referred as pathways and it is for this reason that GSEA is also known as pathway analysis. This name, unfortunately, leads to the following misconceptions regarding the whole approach: 1) Pathways are well-defined gene sets and they do not change; 2) There is no room for customization of the database of gene sets used in the enrichment analysis.

First, there is no clear consensus over the contents of a pathway across many of the databases, let alone the researchers, and therefore our knowledge on the list of genes that are members of a particular pathway is constantly evolving. This means that our database of gene sets should evolve with our knowledge, but in reality, it is far behind our current knowledge due to the slow nature of the literature-based curation process.

Second, the enrichment analysis does not necessarily depend on default data sets and is flexible enough to work against custom gene set definitions when needed. As long as a researcher is confident that the gene sets used in the enrichment analysis reflect our current knowledge regarding biological processes and they are uniformly curated, there is no need to worry about using custom data sets as gene sets.

## New Cancer Immunology Collections in GSEA

As part of one of our research projects focusing on immunotherapy in cancer, we needed to run GSEA on a gene expression data set to explore enrichments in patients with different outcomes (phenotypes). Our sample size was on the order of tens, which meant that we had low statistical power for the analysis. To account for this, we decided to stick to only one of the curated gene sets; but this collection was lacking gene sets that were highly interesting to us. Specifically we were interested in including tumor-associated antigens recognized by CD4+/8+ T-cells and gene signatures associated with immune cell infiltration to the tissue.

Our solution to this was to collect curated gene lists from relevant resources that are not part of the MSigDB, extract the official gene symbols and include them in our analysis as additional gene set collections. For the former, we used the up-to-date list of tumor-associated antigens from Cancer Immunity’s Peptide Database; and for the latter, we used gene sets inferred by Senbabaoglu et al that builds on the work of Bindea et al and infers gene expression signatures associated with immune cell filtration in kidney cancer:

## Creating GSEA-compatible gene set collections

You will see from the examples above that the default Gene Matrix Transposed (GMT) file format features a separate gene set on each row, two metadata columns for each gene set followed by an arbitrary number of gene identifiers that belong to the corresponding gene set. The file format, therefore, is an extension of tab-separated values (TSV), meaning that almost any text editor or other third party software that can handle TSV format can be used to curate these files.

One way to tackle this problem is to generate these files programmatically and often researchers are interested in adding new gene sets to the analysis based on lists of genes published as supplemental material to papers. A practical solution that works nicely for us is to import such gene sets into Google Spreadsheets (for collaborative editing, taking care to ensure that gene names aren’t converted to dates) and then reduce them to GMT files to be used in the analysis. You can find a Python-based solution to this common task in this notebook as an example that converts gene lists extracted from the papers we mentioned above and turns them into individual GMT files.

## Some words of caution about custom gene sets

Although adding one or more custom gene sets into the analysis is pretty straightforward, it still requires careful planning and some basic bioinformatics skills. It is, for example, really important to be consistent with the use of common identifiers across default and custom collections. Not all gene sets are curated with gene identifiers that are consistent across databases; therefore, gene identification mapping from one provider to another often becomes essential. Note that problems in such mappings can easily cause over- or under-representation of gene sets due to cross identification mismatches.

Furthermore, once these custom gene sets start accumulating, it becomes important to check the degree of overlap across multiple gene set collections to prevent redundant enrichment tests. Although this is an imperfect science, one way to check this would be to compare all pairs of gene sets across two GMT files and calculate an asymmetric score that represents to what extent the first collection is represented within the second collection.

Another thing to consider when using custom gene sets in the analysis is to run the analysis on the combined collection of gene sets (including both the default and the custom ones) and not individually to properly account for false discovery rate. With that, it often does not make sense to restrict the analysis to only the custom gene set, but it is good practice to combine it with another collection as a good background for the enrichment tests.

## Final words

It is crucial for the community to share such custom gene sets and explain the rationale behind them for encouraging others to build on such efforts. We wanted to share our experience and solutions that worked for us so that not everybody has to reinvent the wheel.

# Our Experiences with Flow

We’ve spent the past year building pileup.js, an interactive JavaScript genome browser. We’ve used Facebook’s Flow system throughout the development process to bring the benefits of static type analysis to our code. This post describes our experiences and provides some advice for getting the most benefit out of Flow.

## What are the benefits of using Flow?

The most often-cited rationales for type systems are that they:

1. Catch errors early
3. Facilitate tooling
4. Improve runtime performance

In the case of Flow, runtime performance is not in play. The other factors are the main motivators.

For catching errors early, Flow winds up being a great way to avoid round-trips through the “save and reload in the browser” cycle. If you use Flow throughout your code, you’ll see far fewer Syntax Errors or undefined is not a function messages pop up in your JavaScript console. Similarly, it’s a huge help in refactoring. If you rename a module or function, Flow will immediately find all the references to the old name.

That being said, this aspect of Flow has proven to be mostly a time saver, rather than an effective way to find bugs. Most egregious bugs would have been caught by a unit test or manual testing before they were ever committed to source control. The type system just lets us find them more quickly.

The primary long-term reason to use Flow (or TypeScript) is for the improved readability of code. Far too often in JavaScript (or Python), you’ll run into a function like this:

function updateUI(data) {
...
}


It was surely obvious to whomever wrote the function (perhaps you, six months ago!) what data was. You’d sure like to update the UI. But what exactly do you pass to this function? For simple code, you may be able to infer the parameter type by inspection. But for complex functions which call out to other functions in other modules, this quickly becomes difficult. Type annotations and type inference can both immediately tell you what’s expected.

For large codebases, this winds up being a huge aid. Complex UIs often use large, nested objects to represent their state. This is very common with React Components, for example. With type annotations, it’s clear what type of object each component expects. And, if you’re still confused, Flow will let you know. What’s more, unlike types which are documented in comments, you can be confident that type annotations are up-to-date and correct. If they weren’t, Flow would have complained.

## Why use Flow instead of TypeScript?

An aside: why did we use Flow instead of TypeScript? When we started work on pileup.js, TypeScript had poor support for both Node-style require statements and for React.js and its JSX syntax. These issues are both less relevant now: both Flow and TS support ES6 import statements, which is what all projects should be using going forward. TypeScript 1.6 added support for JSX, so that’s less relevant too.

Both Flow and TypeScript are good choices for a new project. Flow is a more sophisticated type system, with better support for patterns like type unions and nullable types. TypeScript is more established and has better support, e.g. more editor plugins and third-party type definitions.

## Practical experiences / gotchas

### Editor integration

While you can run flow check and flow status from the command line and as part of your continuous integration, you’ll be missing out on many of its benefits if this is all you do. Flow works best when it integrates with your editor. With the vim-flow plugin, for example, Flow will show any new errors inside of vim whenever you save a JS file. This is fast (or should be – see below!) and helps you catch errors without the context switch of going to your web browser.

### Flow is a server

Flow is intended to be run as a background server. When you run flow status, it starts up the server, reads and analyzes all your code and then waits for files to change. When you run flow status again, you’re just asking the server for the latest errors. Incremental changes are fast, but the server startup is slow. flow check runs a full check and is always slow. So use flow status, except in contexts which are not latency-sensitive, e.g. your continuous integration build.

Sometimes you’ll make a change that forces the Flow server to restart. These include changes to your .flowconfig file or changes to a type declaration (lib) file. There’s no way around this. You’ll just have to wait 20-30 seconds to get type errors again.

Incremental compiles should take under a second. You can check this time for your project by running:

echo '' >> some/js/file.js; time flow status


If this time gets too large (e.g. more than a second), it might be because Flow is wasting time reading everything under node_modules. You can check how many files Flow is tracking by running this sequence:

flow check --verbose 2> /tmp/flow-all.txt
grep -o '/node_modules.*' /tmp/flow-all.txt | perl -pe 's/:.*//; s/[\\]?".*//; s/ .*//' | sort | uniq | wc -l


If this is significantly larger than the number of JS files in your project, then Flow is wasting time. You should add as much of node_modules as you can to the [ignore] section of your .flowconfig. See this issue for details.

### Creeping any types

One of the pitches for Flow is that it performs type inference on your JavaScript. You don’t even have to provide type annotations, Flow will figure out your code! For a few reasons, however, this does not work in practice. First, Flow requires that you explicitly annotate all functions and classes which are exported from modules (this is for performance reasons). Secondly, there are many ways that any types can creep into Flow code:

• Third-party libraries If you import a third-party module (e.g. underscore), then the type of the entire module will be any. This means that any data that passes through functions in that module will come out with an any type. The solution to this is to use a type declaration file. While there is some movement towards providing a central registry of Flow type definitions, it’s nowhere near as mature as TypeScript’s DefinitelyTyped.
• Props and State If you use React, you need to explicitly specify types for your props and state. Otherwise, they default to any. You can get slightly better behavior by using propTypes, which Flow understands. But the best way is to explicitly define types for props and state using an ES6 class:

type Props = {
name: string;  // note: not null!
};

React.Component<void /* DefaultProps */,
Props,
void /* state -- none in this component */> {
...
}


This will check types for props, state and setState within your component. It will also check the props on AddressBookEntry when you create one using JSX syntax elsewhere in your code (i.e. <AddressBookEntry name="Dan" /> would be an error because address is missing). If you define props and state members, you will get the former but not the latter check.

• Missing @flow annotations If you forget to put @flow at the top of your source file, Flow will ignore it by making its type any. We added a lint check to ensure that all our JS files had this annotation.

You can run flow coverage path/to/file.js to see the fraction of expressions in a file whose type is any. You should try to keep this as low as possible. The :FlowType vim command shows you what Flow thinks the type of the expression under your cursor is. This can also help to find gaps in what Flow will catch.

### Surprising non-errors

JavaScript is a very permissive language. Flow models some elements of this permissiveness but disallows others. For example, '1' + 1 evaluates to '11'. Flow complains about this because it’s likely to be a mistake (and at the very least indicates unclear thinking).

JavaScript also allows you to pass any number of arguments to functions. For example, Math.pow(2, 4, 6, 8) == 16. Arguments after the second are simply ignored. This is also likely to be a mistake, but Flow is OK with passing too many arguments. Other examples include functions are objects.

We often find ourselves wondering “would Flow catch this?” It’s a good habit to introduce errors and typos to double-check that Flow is understanding your code.

### Misattributed / cryptic errors

Many students have a sinking realization during their Compilers classes that, while parsing a correctly-formed program is fun and easy, 95% of the challenge in writing a good parser is the tedious process of catching errors, determining where they are, and recovering from them.

Something similar seems to be at play with Flow. If Flow reports that there are no errors, then (modulo any types and non-errors, see above) you can be confident that there are none. If it reports a problem, then there probably is one. But the error message may or may not point you in the right direction. There are many open issues around error messages.

For example, we’ve learned from experience that intersection types can be a source of cryptic errors. For example:

var el = document.createElement('canvas');
var ctx = el.getContext('2d');
ctx.fillRect(0, 0, 200, 100);


This produces an odd-looking error:

2:     var ctx = el.getContext('2d');
^^^^^^^^^^^^^^^^^^^ call of method getContext. Function cannot be called on
2:     var ctx = el.getContext('2d');
^^^^^^^^^^^^^^^^^^^ HTMLCanvasElement


getContext is most certainly a method on HTMLCanvasElement. The real issue is on the third line, not the second. getContext is allowed to return null, which does not have a fillRect method. This error bubbles up to the point where these two cases diverge, resulting in the cryptic message. The solution is to add an if (ctx) { ... } check around the drawing code.

This example illustrates both the strengths and weaknesses of Flow. It’s impressive that Flow can track the way that functions return different types based on parameter values, e.g. HTMLCanvasElement from document.createElement('canvas'). This eliminates a common source of typecasts in other type systems. It also illustrates a case where null-checking is not helpful. In practice, getContext('2d') will not return null.

## Conclusions

Flow is a powerful tool for performing static type analysis on JavaScript codebases, bringing their maintainers many of the benefits that this yields. Flow is not without its quirks, though, some of which we’ve documented here. Understanding these will help you get the most benefit out of Flow.

Flow is developing rapidly, with dozens of commits being merged every week. This is one of the very best things about it. In the year that we’ve been using it, many new features have been added and many fundamental issues have been resolved. It’s likely (we hope!) that some of the gotchas in this post will no longer be gotchas by the time you read it.

# Ketrew Industries Two Hundred

We went to Vancouver to present Ketrew and its younger sibling Biokepi at the OCaml 2015 workshop (cf. video and slides). As for many other projects, the talk deadline triggered a new release for Ketrew a few days earlier: 2.0.0.

The main novelty is a WebUI, so we can be all webly and start by pasting a huge animated GIF:

The new graphical interface fully runs in the browser while sharing the same protocol as the Text-based UI (cf. protocol.mli). It was written in about two months thanks to js_of_ocaml. We used Tyxml_js for typed HTML5 tree manipulation within OCaml, together with the react library (the original one, not Facebook’s name-stealing “React”).

Overall, going on a programming spree with Tyxml_js + React has been a really fun and pleasant experience. It’s difficult to believe that some people in the world build reactive UIs without a decent type system.

This release comes with a few other miscellaneous changes:

• We made it easier to get started:
• We made the command ketrew init much more clever: it can create much more complex ready-to-use configuration files.
• We added the option to run the HTTP server without TLS.
• We wrote a more accessible “Getting Started” documentation.
• We improved the YARN backend (incl. talking a bit with its HTTP API).
• We took the chance to do some backwards incompatible changes to the EDSL API since other changes were already triggering a major version change.

The next release will be coming soon, with many backwards compatible changes.

• Ketrew, through Trakeva, will be able to use other database backends for the configuration file (currently PostgresQL is being stress-tested). One can also backup and synchronize between database backends.
• Much better scalability of the engine and WebUI.
• Parsable and searchable logs for admins.
• More configurability for the WebUI.

We have been using Ketrew 2.0.0 with Biokepi for a while now, and would appreciate help breaking it!

# SVG→Canvas, the pileup.js Journey

Modern web browsers provide two main APIs for graphics: SVG and canvas. The relative merits of these two technologies have been discussed at length. We’d like to add our experiences with pileup.js to the conversation. In this post, we’ll explain why we chose to migrate from SVG/D3 to canvas and look at the data-canvas library we built to ease the transition.

pileup.js is a library for visualizing and exploring genomes in the browser:

When we introduced it three months ago, we touted the merits of SVG:

By using SVG, we allow the tracks to be styled and positioned using CSS. This facilitates complex layouts and is more approachable for web developers than custom styling systems.

Styling with CSS is one advantage of SVG but there are others:

• It’s easy to test: you can use querySelector to find SVG elements and assert that they have certain properties.
• It works well with the browser’s event system, especially when used with D3’s event handlers.
• It’s inherently resolution independent, a big plus in the age of retina displays.

While we were generally happy using D3 and SVG, we weren’t altogether happy with the performance of pileup.js. Its frame rate while panning was noticeably worse than other genome browsers like Dalliance and IGV.

As an experiment, we rewrote a subset of the pileup track using canvas instead of D3/SVG. We didn’t try to do anything clever: on every frame we cleared the canvas and redrew the entire scene from scratch.

1. The canvas code was considerably simpler than the D3 equivalent. Whereas D3 required elaborate keying and joining to keep our data and visualization in sync, the canvas code was stateless. It simply took data and drew a scene.
2. Because the Flow type checker has canvas declarations, it was able to provide significantly better static analysis than it did for the D3 code, for which declarations do not exist.

The CPU profile showed a dramatic performance win. The D3/SVG pileup track had been the biggest CPU hog, using about 30% of cycles while panning. After migrating to canvas, it only used 2% of cycles. Migrating to canvas was a 15x speedup! This matches the experiences of other developers, who have found canvas to be between 10x and 200x faster than SVG.

In the face of a 15x performance win, our arguments for SVG no longer held much water: we needed to switch to canvas.

## Switching to Canvas

To ease this transition, we set out to find libraries that made up for canvas’s drawbacks, e.g. difficulty in styling, testing and click-tracking.

There are many libraries which wrap the canvas tag in something a bit more stateful. For example, fabric exposes Rect and Circle classes which get rendered onto the canvas.

Mapping our data objects into a permanent hierarchy of visual objects would require a different coding style, however. This is essentially what D3 does with its joins and selections. We wanted to preserve the simple character of the native canvas code.

## Data Canvas

“All problems in computer science can be solved by another level of indirection.”

We came up with a simple solution which we’re calling the Data Canvas. We augmented the browser’s built-in canvas rendering context with two methods:

DataCanvasRenderingContext2D extends CanvasRenderingContext2D {
pushObject(o: any): void;
popObject(): void;
}


These maintain a stack of objects which are currently being rendered. We call this the “data stack”. Here’s what the stack looks like for a simple drawing:

While each object is being rendered, it sits on the data stack. Since re-rendering whole scenes is cheap on a canvas, we can easily perform other useful functions in the data stack while it renders objects by swapping in custom “contexts”:

• A context which records all calls allows us to write tests which assert that particular objects or shapes were drawn to the canvas (rather than checking the color of individual pixels).
• A hit-testing context allows us to determine which objects were on the stack when a particular point was filled or stroked. This makes it possible to determine what was clicked.

Check out the data-canvas README for a tutorial and a more in-depth treatment of this approach.

## Results

After migrating all of our visualizations from D3/SVG to canvas, we were able to achieve a roughly a 5x improvement in frame rate while panning. We found the resulting code to be simpler than our D3 code, with less need to build temporary data structures and keys for use as D3 data.

While we’re happy with the improved speed of our visualization and clarity of our code, there are still some shortcomings of this approach:

As we continue developing data-canvas, we hope to find solutions to all of these.

If you’re using canvas in your web application, please give data-canvas a try. Please visit the github repo, file issues in its issue tracker and ask us questions on its gitter channel.

# Testing Oml

Here at the Hammer Lab we have a well documented focus on testing as an important component of good software engineering. For our open source projects, we selfishly want to encourage test writing because it helps with adoption. Oml is no different but presented some novel challenges and experiences that we’d like to share.

## Challenges

Writing a test suite for a general numerical library poses a set of unique questions.

### How do you make your tests comprehensive?

One unit test matching a specific input to a specific output is fine, but writing a sufficient number of such cases will not scale. Furthermore, the actual input domain, which is usually the set of floats, is huge and contains such special cases as nan, infinity and neg_infinity.

A natural solution is to use property-based testing like QuickCheck. But choosing appropriate properties for algorithms that essentially encode non-linear functions is not trivial. Specifically, fixed points limit the range of values that you test. Furthermore, testing higher-order relationships between functions can require more complex algorithms than the functions to be tested, and hence introduce another source of error.

Within these tests one wants to avoid probabilistic failures due to floating point computations.

#### Example

To make the example concrete, consider testing a univariate linear regression algorithm. Regression takes as input two arrays of floats: the independent and dependent variables. We can write a specific unit test:

module U = Regression.Univarite;;
let independent = [| 1.0; 2.0; 3.0 |] ;;
let dependent = [| 2.0; 4.0; 6.0 |] ;;
let r = U.regress None ~pred:independent ~resp:dependent ;;
assert (U.alpha r = 0. && U.beta r = 2.0) ;;


This is the solution that will not scale. We can think of specific independent/dependent array pairs where we know the generating linear model:

let p = 3.0 (* Some well behaved float that we know. *) ;;
let independent = [| 1.0; 2.0; 3.0 |] ;;
let dependent = Array.map (( *. ) p) independent ;;
let r = U.regress None ~pred:independent ~resp:dependent ;;
assert (U.alpha r = 0. && U.beta r = p) ;;


But this solution isn’t much better because it doesn’t challenge the realm of possible use cases for these methods. We can try and test a “higher-order” relationship:

let independent = [| 1.0; 2.0; 3.0 |] ;;
let dependent = [| 2.0; 4.0; 6.0 |] ;;
let r = U.regress None ~pred:independent ~resp:dependent ;;
let t = 2.0 (* Some well behaved float that we know. *) ;;
let independent2 = Array.map (( *. ) t)  [| 1.0; 2.0; 3.0 |] ;;
let dependent2 = Array.map (( *. ) t) [| 2.0; 4.0; 6.0 |] ;;
let r2 = U.regress None ~pred:independent ~resp:dependent ;;
assert (U.alpha r = U.alpha r2 && U.beta r = U.beta r2) ;;


But finding these relationships isn’t always obvious and isn’t necessarily better than trying to test the algorithm directly:

let independent = Array.init 100 (fun _ -> Random.float 1e3) ;;
let alpha = Random.float 1e3 ;;
let beta = Random.float 1e3 ;;
let dependent = Array.map (fun x -> beta *. x +. alpha) independent ;;
let r = U.regress None ~pred:independent ~resp:dependent ;;
assert (U.alpha r = alpha && U.beta r = beta) ;;


But this simple example will most probably not pass because there is no guarantee that our inputs are “well behaved” and there are no rounding errors. This leads to one of the main problems that our test suite will deal with: imprecise floating point calculations.

### How do you express floating point dependencies between functions?

How do you order the testing and convey that if a function has certain properties, then functions that depend on it will have derived properties? For example, it is well known that Newton’s method has quadratic convergence; it doubles the number of computed significant digits at each iteration. If we were to use this solver to compute a special irrational constant (for example the Golden Ratio, a root of x^2 - x - 1), that accuracy of the constant would dependent quadratically on the number of iterations. This would change if we were to use a different method.

Furthermore, what if the point of the algorithm is to test specific, refined behavior such as faster convergence or more precision per operation? For example, the module that measures means and standard deviations has different performance characteristics than the online versions.

### How do the tests fit into the codebase?

Although this question is common to all software projects, it posed interesting concerns. We want the tests to be distributed with the source code, but obviously not built with the library. At the same time we want flexibility. A test could target any part of the code, specifically, internal private functions that may not be exposed as a part of the public interface, but might be the crux of the algorithm that we want to test.

## Choosing a framework

We surveyed the available options for testing in OCaml and assembled some brief notes:

• OUnit: assertion-based testing along the lines of JUnit.
• Pro: Lots of different assert logic. There are included frameworks for logging, configurations, runners, and setting up and tearing down.
• Pro: API Documentation.
• Pro: Mature product, on to the 2.0 release and still being updated.
• Con: No driver utility; tests have to be compiled into separate executable.
• Con: No property-based testing.
• iTeML: formerly known as qtest, provides inline (inside comments) tests via a command line tool that extracts those tests.
• Pro: Different pragmas to support different styles (assertion-based, property-based).
• Pro: Used by the Batteries project.
• Pro: Pretty good documentation.
• Con: Code in comments. Our tests would be pretty long and this would impact readability.
• Etc: The command line tool does some of the work of a driver, but one still has to compile the tests to a separate executable.
• Kaputt: broad range of testing abilities.
• Pro: Lightweight assert and property based testing.
• Pro: Some driver capabilities via camlp4 to combine modules.
• Pro: Best documentation.
• Con: Camlp4 dependency.
• Con: Hasn’t been touched in about 2-3 years.
• Pa_test: inline testing macros.
• Pro: Used in production.
• Con: Setup? Is there a driver? No documentation.
• Qcheck: property-based testing library for OCaml.
• Pro: Active development.
• Con: No assert.
• Con: No driver.
• Con: Limited to testing the exposed API; what if you want your test to access internal functions?
• There are other pieces of software available but they either did not seem like full-fledged libraries that could satisfy our demands, or they were not available when we had to make a decision:

Property-based testing, as previously described, is a natural fit for many of the tests of the functionality of this library; this requirement eliminated Pa_test and OUnit. While powerful, coding with inline commented pragmas seemed an inflexible solution, so iTeml was eliminated. QCheck seemed like it would be simple to use but the lack of documentation and examples was concerning. The blending of both property-based checking with simple assert logic into one framework led us to choose Kaputt.

## Setup

Setting up the testing proved to be interesting to say the least; our solution is a local maxima constrained by ocamlbuild idiosyncrasies and developer patience.

The Kaputt documentation recommends that a separate foo.mlt file is kept with test code for foo.ml. The test is afterwards concatenated to the original source code via a preprocessing step. This seems like a flexible solution to writing both original and testing code.

To control whether we’re building a test or the regular library we added a .test target to ocamlbuild via a plugin. The plugin makes sure that the .mlt test files are copied over to the ‘sanitized’ build directory of ocamlbuild.

We changed Kaputt’s default preprocessor to modify the original .ml file in place via joiner.ml so that our code coverage would pick up the right filename. The joiner also utilizes an OCaml directive to separate the original source code from the testing source code: starting a line with # 1 "new_name.ml" changes the lexing buffer to new_name.ml. This ensures that compilation errors are correctly traced back to the test .mlt file.

We also wrap the test with bisect ignore comments (*BISECT-IGNORE-BEGIN*) and (*BISECT-IGNORE-END*) to have accurate measures of code coverage.

## Lessons learned

Effectively testing this library has been a learning experience.

1. While most tests represent legitimate mathematical properties, the nature of floating point computation turns these deterministic properties into probabilistic ones; for example, when run 100 times, this test case occasionally (but routinely) fails because of floating point rounding. The nature of the problem is such that it is often not obvious which step of the test is causing the lack of precision. Using means or multiple comparisons might be less precise than the algorithm you’re testing.

2. Controlling these probabilistic failures has turned into a balancing act. On the input side we can (and do) restrict the size of input values, while on the output side we can fiddle with how many significant digits we use to compare floats. We have tried to be principled about this approach by explicitly failing to generate float values without a developer-specified bound.

3. These bounds imply an interesting characteristic of numerical algorithms; namely a volume for acceptable input/output pairs. Consider a small list of these parameters:

• The upper bound on random float generators taken as input to the test, e.g, bfloat 1e10
• The number of digits we can compare of the outputs, e.g. 3 in equal_floats ~d:1e-3
• The size of input array, e.g. 100

Treating these parameters as sides of a box: if we compare two algorithms that do the “same” thing, and we fix a pass rate (e.g. 99%) for these boxes/tests, we want the one with the bigger volumes. Numerically determining these values is an area of future work.

4. Kaputt has an interesting Specification API; it allows a richer mapping between input to output. This is a powerful mechanism that we are just learning to use. Unfortunately, it is not clear how to specify a failure rate to address the first lesson or to specify a relationship between input sizes and output comparisons from the second lesson.

5. You won’t see this in the code base or documentation, but property-based testing is highly effective at catching bugs in this domain.