Introducing Cohorts

Cohorts is a Python library for managing, analyzing and plotting clinical and genetic data from patient cohorts. It was spun out of our work analyzing data from clinical trials of checkpoint blockade in melanoma, lung, and bladder cancer. The sponsors of these trials typically collect a wide range of data for each patient in order to discover and evaluate potential biomarkers for association with benefit from therapy. While some aspects of Cohorts are specific to that context (e.g. the integration with neoantigen prediction), other components are more generally usable.

A prototypical use case for Cohorts is a Mann-Whitney comparison, and associated box plot, of the number of missense SNVs found in patients who benefited from therapy (where benefit is defined by the user, e.g. progression-free survival greater than 6 months) versus those who did not.

In its simplest form, this looks like:
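
# A minimal sketch, assuming a cohort has already been constructed (see below):
# Mann-Whitney p-value and box plot of missense SNV counts, split by benefit.
cohort.plot_benefit(on=missense_snv_count)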



But where does this cohort come from, what is this missense_snv_count, and what else can we plot? Let’s explore how Cohorts works!

Note: all code in this post is available in a Jupyter notebook.

Creating a Cohort

A cohort must be populated before analysis and plotting can proceed. Within Cohorts, each Cohort consists of a set of Patients, each of whom may have tumor and normal Samples.

Samples hold paths to associated BAM files as well as Cufflinks or kallisto output. Patients contain key clinical variables (clinical benefit status, overall survival, progression-free survival, deceased status, progression status), tumor and normal Samples, paths to tumor-normal VCFs, HLA alleles, as well as any additional relevant data. Cohorts contain all Patients as well as a variety of cohort-level settings.
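
As a minimal sketch (the keyword arguments here are assumptions based on the description above, not a verbatim API reference), building a small cohort might look like:

from cohorts import Cohort, Patient, Sample

# Tumor and normal samples, pointing at their BAM files (paths are illustrative).
tumor_sample = Sample(is_tumor=True, bam_path_dna="patient-1-tumor.bam")
normal_sample = Sample(is_tumor=False, bam_path_dna="patient-1-normal.bam")

patient = Patient(id="patient-1",
                  benefit=True,           # clinical benefit status
                  os=360,                 # overall survival
                  pfs=180,                # progression-free survival
                  deceased=True,
                  progressed=True,
                  snv_vcf_paths=["patient-1-somatic.vcf"],
                  hla_alleles=["HLA-A*02:01", "HLA-B*07:02"],
                  tumor_sample=tumor_sample,
                  normal_sample=normal_sample)

cohort = Cohort(patients=[patient], cache_dir="cache")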

Patients do not need to contain Samples, and they won’t when used for purely clinical data analyses. If you have data that isn’t easily modeled as described, but would like to make use of Cohorts, please file an issue on GitHub!

Analyzing a Cohort

The analysis and plotting functions of Cohorts generally rely on Pandas DataFrames. The simplest way to begin analysis is via direct access to the DataFrames that Cohorts produces. For example:

df = cohort.as_dataframe()

To go further, Cohorts will enhance user-provided data by calling out to external libraries like Topiary, Varcode and Isovar. These results are then exposed as potential biomarker datapoints for analysis.

For example, the function missense_snv_count, which requires paths to SNV VCF(s) to be specified on the Patients, uses Varcode to filter those variants down to missense mutations. neoantigen_count uses Topiary to compute predicted neoantigens from those SNVs.

These functions can be plugged into Cohorts:

# Return a DataFrame with "neoantigen_count" as a column,
# in addition to user-specified patient data.
df = cohort.as_dataframe(on=neoantigen_count)

# Return a DataFrame with "Neoantigen Count" and
# "Missense SNV Count" as additional columns.
df = cohort.as_dataframe(on={"Neoantigen Count": neoantigen_count,
                             "Missense SNV Count": missense_snv_count})

While a number of these data-enhancing functions are available out of the box, users can always write their own custom functions to support their analyses of interest.

Below is a simple custom function example:

def smoker_or_senior(row):
    return row["smoker"] == True or row["age"] >= 65

This can be used just like the functions above:

df = cohort.as_dataframe(on=smoker_or_senior)


In addition to plot_benefit, Cohorts comes with various analysis and plotting functions. To name a few: plot_survival, plot_correlation, coxph, and bootstrap_auc. plot_survival uses Lifelines to plot two survival curves split by a specified variable. The following code generates a survival curve for patients with more neoantigens than the median and a survival curve for patients with fewer, based on Topiary’s neoantigen prediction:
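
# A sketch of the call described above; the threshold keyword (splitting the
# cohort at the median neoantigen count) is an assumption.
cohort.plot_survival(on=neoantigen_count, threshold="median")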



Analyses and plots based only on data supplied directly to the Patient, as opposed to features computed by data-enhancing functions, are also fair game:

# note: resultant image not shown here
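# A sketch of such a call; "age" is an illustrative clinical column,
# standing in for any variable supplied directly on the Patient.
cohort.plot_benefit(on="age")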

For binary features, Fisher’s exact test is used along with an associated bar plot:
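
# For instance, using the smoker_or_senior function defined earlier (a sketch;
# the resulting bar plot is not shown here).
cohort.plot_benefit(on=smoker_or_senior)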



All calls to plot_benefit result in a p-value significance indicator at the top of the plot, thresholded at 0.05 by default (* for significant and ns for not significant).

Results from other data locations can be loaded into an analysis using join_with:

# note: resultant images not shown here
cohort.plot_benefit(on="PD-L1 Expression", join_with="pdl1")
cohort.plot_correlation(on={"Missense SNV Count": missense_snv_count,
                            "PD-L1 Expression": "PD-L1 Expression"},
                        join_with="pdl1")

The above uses a DataFrameLoader, which is defined when a Cohort is created. It associates "pdl1" with a given DataFrame and specifies how to join that DataFrame with the Cohort. We typically load results from IHC analysis, tumor clonality estimates, mutational signature deconvolution, and TCR-Seq in this way. While extra data sources can also be added to the Cohort data itself to avoid joining (see Patient.additional_data), keeping certain data sources separate is helpful when a data source has a one-to-many relationship with the Cohort or is slow to load.
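
A rough sketch of wiring up such a loader when the Cohort is created follows; the argument names here are assumptions rather than a verbatim API reference:

import pandas as pd
from cohorts import DataFrameLoader

pdl1_loader = DataFrameLoader(
    name="pdl1",                                       # the name used by join_with
    load_dataframe=lambda: pd.read_csv("pdl1_ihc.csv"),
    join_on="patient_id")                              # how to join with the Cohort

cohort = Cohort(patients=patients,
                cache_dir="cache",
                df_loaders=[pdl1_loader])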

Consistency and Reproducibility

Many of the features in Cohorts have a goal of promoting consistency and reproducibility across various analyses. Examples include:

Variant Filtering

A user can define custom variant filtering functions, and filter variants accordingly. For example:

from cohorts.variant_stats import variant_stats_from_variant

# Filter variants to those with at least 3 reads in the tumor sample.
def tumor_read_filter(filterable_variant):
    somatic_stats = variant_stats_from_variant(
        filterable_variant.variant,
        filterable_variant.variant_metadata)
    return somatic_stats.tumor_stats.depth >= 3

# Use a default filter function for the Cohort...
# note: resultant images not shown here
cohort.filter_fn = tumor_read_filter
cohort.plot_benefit(on=missense_snv_count)

# ...or specify a filter function ad-hoc.
cohort.plot_benefit(on=missense_snv_count, filter_fn=tumor_read_filter)

Data Caching and Provenance

Cohorts caches many results, such as variant effects and predicted neoantigens, to mitigate the computational overhead frequently involved. A file-based approach enables easy sharing and manipulation of cached objects.

Cohorts also keeps track of the context in which the cache was generated by tracking the versions of software used to generate the cache. This context, as well as a hash of the base (unenhanced) data, is available for printing and optional version checking.

Other Resources

Ready to take Cohorts for a spin? The following notebooks should help you get started:

We’d love to hear from you if you have any feedback, suggestions, or bug reports!

Graph visualizations of MHC alleles

The MHC gene family is well known for its polymorphism. The IPD-IMGT/HLA database provides the official reference sequences for the alleles found in this region. In addition, the Anthony Nolan HLA informatics group provides a reference alignment of these alleles.

Here is a pared-down clip (we insert ellipses to skip lines) of the output for the genomic DNA of HLA-A:

This format textually encodes the differences between the reference sequence (the first one, A*01:01:01:01) and the other, alternate, allele sequences, along the file columns. Using special characters, the reader can tell whether the alternate nucleotide is unknown (‘*’), identical (‘-’), different (an explicit ‘A’, ‘C’, ‘G’, or ‘T’), or a gap (‘.’), relative to the reference nucleotide in the same column.

While this format has its place, it isn’t useful for quickly visualizing the diversity of this region. We’ve created a small utility to modestly improve the visualization of these alignments.


As a graph

Our first intuition for this task was to compress the redundant sequence information. Grouping shared sequence segments into nodes led us to think of the entire alignment as a graph, and from this starting premise some properties followed naturally:

  • Alleles are represented by edges, such that one can trace the sequence of the allele by following edges.
  • Alignment information is preserved by adding a numerical label of alignment position. Different nucleotides at the same position are represented by different nodes: “336A” vs “336G”.
  • Gaps are encoded by edges that point to nodes whose position is farther along than the “end” of the source node (its start position plus the length of its sequence). In the example above, the “337GATGGAGCCG” node ends at position 347, since len(GATGGAGCCG) = 10. The edge to “367CGGGC” therefore indicates a 20-nucleotide gap, present in all of the alleles except “A*68:18N”.

Under the hood, mhc2gpdf first parses the alignment file, then constructs the graph representation by creating the nodes and edges described above, and finally writes a dot file and calls Graphviz to render it to PDF.


With these PDFs we can quickly determine allelic differences:


Appreciate the dense polymorphism of these genes:


And see long gaps in the Class II DRB nucleic sequences:


We’ve used our tool on the latest IMGT release (2016-10-19) and provide the dot files and rendered PDFs: happy browsing!

Reading the graph

We’ve made some stylistic choices to make the graph easier to understand:

  • We run-length encode the leaf nodes of the tree that represents the allele edges: instead of “A*01:01:02,A*01:01:04,A*01:01:05…A*01:01:37”, we’ll write “A*01:01:02,04-37”.
  • If an edge represents more than half of the allele set, we’ll describe its complement and prefix the string with a “C. of”, short for “Complement of”.
  • We add “Start” nodes, prefixed with an “S”, to indicate where the alignment starts having sequence data for each allele. Their labels are encoded in the same way as the edges.
  • We add “End” nodes for the same reason. These just have the final position.
  • We add “Boundary” nodes, represented by “|” in the alignment file, for exon boundaries.


We hope that the genetics community finds this tool useful and we look forward to your feedback on GitHub.

Extending Gene Set Enrichment Analysis with Cancer Immunology Collections

Gene Set Enrichment Analysis (GSEA) is a well-known and widely used method in computational biology and bioinformatics. GSEA takes as input a sorted list of genes (obtained by comparing gene expression levels between groups of patients) and a database of gene sets, and it checks whether the members of a particular gene set show a non-random ordering, clustering towards the top or the bottom of the list (i.e. enrichment). We leave the details of the method out of this blog post, but the original paper describing GSEA is a good starting point for newcomers.

Gene sets often represent biologically meaningful groupings: for example, all genes that are active during controlled cell death. It is common to see these gene sets referred to as pathways, and it is for this reason that GSEA is also known as pathway analysis. This name, unfortunately, leads to the following misconceptions about the whole approach: 1) pathways are well-defined gene sets that do not change; 2) there is no room for customizing the database of gene sets used in the enrichment analysis.

First, there is no clear consensus on the contents of a pathway across databases, let alone among researchers, and therefore our knowledge of which genes belong to a particular pathway is constantly evolving. This means that our database of gene sets should evolve with our knowledge; in reality, it lags far behind due to the slow nature of the literature-based curation process.

Second, the enrichment analysis does not depend on any default data set and is flexible enough to work against custom gene set definitions when needed. As long as a researcher is confident that the gene sets used in the enrichment analysis reflect our current knowledge of the relevant biological processes and that they are uniformly curated, there is no reason to worry about using custom gene set collections.

New Cancer Immunology Collections in GSEA

As part of one of our research projects focusing on immunotherapy in cancer, we needed to run GSEA on a gene expression data set to explore enrichments in patients with different outcomes (phenotypes). Our sample size was on the order of tens, which meant that we had low statistical power for the analysis. To account for this, we decided to stick to only one of the curated gene set collections; but this collection lacked gene sets that were of high interest to us. Specifically, we were interested in including tumor-associated antigens recognized by CD4+/CD8+ T cells and gene signatures associated with immune cell infiltration into the tissue.

Our solution to this was to collect curated gene lists from relevant resources that are not part of the MSigDB, extract the official gene symbols, and include them in our analysis as additional gene set collections. For the former, we used the up-to-date list of tumor-associated antigens from Cancer Immunity’s Peptide Database; for the latter, we used gene sets inferred by Senbabaoglu et al., which builds on the work of Bindea et al. and derives gene expression signatures associated with immune cell infiltration in kidney cancer.

Creating GSEA-compatible gene set collections

You will see from the examples above that the Gene Matrix Transposed (GMT) file format features one gene set per row: two metadata columns for each gene set (a name and a description), followed by an arbitrary number of gene identifiers that belong to that gene set. The format is therefore an extension of tab-separated values (TSV), meaning that almost any text editor or other third-party software that can handle TSV can be used to curate these files.

Researchers often want to add new gene sets to the analysis based on lists of genes published as supplementary material to papers, and one way to tackle this is to generate the GMT files programmatically. A practical solution that works nicely for us is to import such gene sets into Google Spreadsheets (for collaborative editing, taking care to ensure that gene names aren’t converted to dates) and then reduce them to GMT files for use in the analysis. As an example, this notebook contains a Python-based solution that converts gene lists extracted from the papers mentioned above into individual GMT files.
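
The core of that conversion is small enough to sketch here (this is an illustration of the idea, not the notebook’s exact code; the gene set names and gene symbols below are purely illustrative):

# Illustrative custom collections; real gene lists would come from the papers above.
gene_sets = {
    "TUMOR_ASSOCIATED_ANTIGENS": ["MAGEA1", "CTAG1B", "PRAME"],
    "T_CELL_INFILTRATION_SIGNATURE": ["CD8A", "GZMB", "PRF1"],
}

with open("custom_collections.gmt", "w") as f:
    for name, genes in gene_sets.items():
        # One gene set per row: name, description, then the gene symbols,
        # all separated by tabs.
        f.write("\t".join([name, "custom_collection"] + genes) + "\n")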

Some words of caution about custom gene sets

Although adding one or more custom gene sets to the analysis is pretty straightforward, it still requires careful planning and some basic bioinformatics skills. It is, for example, really important to be consistent in the use of gene identifiers across default and custom collections. Not all gene sets are curated with identifiers that are consistent across databases; therefore, mapping gene identifiers from one provider to another often becomes essential. Note that problems in such mappings can easily cause over- or under-representation of gene sets due to cross-identifier mismatches.

Furthermore, once these custom gene sets start accumulating, it becomes important to check the degree of overlap across collections to prevent redundant enrichment tests. Although this is an imperfect science, one way to check is to compare all pairs of gene sets across two GMT files and calculate an asymmetric score representing the extent to which the first collection is contained within the second.
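
As a sketch of one such score (our own illustration, not a standard definition): for each set in the first collection, take its best containment against any set in the second collection, then average over the first collection.

def read_gmt(path):
    """Parse a GMT file into a {set_name: set_of_genes} dictionary."""
    gene_sets = {}
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            name, genes = fields[0], fields[2:]   # fields[1] is the description
            gene_sets[name] = set(genes)
    return gene_sets

def containment(collection_a, collection_b):
    """Asymmetric overlap: how well is collection_a represented in collection_b?"""
    scores = []
    for genes_a in collection_a.values():
        best = max(len(genes_a & genes_b) / float(len(genes_a))
                   for genes_b in collection_b.values())
        scores.append(best)
    return sum(scores) / len(scores)

# File names are illustrative.
custom = read_gmt("custom_collections.gmt")
curated = read_gmt("curated_collection.gmt")
print(containment(custom, curated))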

Another thing to consider when using custom gene sets in the analysis is to run the analysis on the combined collection of gene sets (including both the default and the custom ones) and not individually to properly account for false discovery rate. With that, it often does not make sense to restrict the analysis to only the custom gene set, but it is good practice to combine it with another collection as a good background for the enrichment tests.

Final words

It is crucial for the community to share such custom gene sets and explain the rationale behind them, encouraging others to build on these efforts. We wanted to share our experience and the solutions that worked for us so that not everybody has to reinvent the wheel.

Our Experiences with Flow

We’ve spent the past year building pileup.js, an interactive JavaScript genome browser. We’ve used Facebook’s Flow system throughout the development process to bring the benefits of static type analysis to our code. This post describes our experiences and provides some advice for getting the most benefit out of Flow.

What are the benefits of using Flow?

The most often-cited rationales for type systems are that they:

  1. Catch errors early
  2. Improve readability of code
  3. Facilitate tooling
  4. Improve runtime performance

In the case of Flow, runtime performance is not in play. The other factors are the main motivators.

For catching errors early, Flow winds up being a great way to avoid round-trips through the “save and reload in the browser” cycle. If you use Flow throughout your code, you’ll see far fewer Syntax Errors or undefined is not a function messages pop up in your JavaScript console. Similarly, it’s a huge help in refactoring. If you rename a module or function, Flow will immediately find all the references to the old name.

That being said, this aspect of Flow has proven to be mostly a time saver, rather than an effective way to find bugs. Most egregious bugs would have been caught by a unit test or manual testing before they were ever committed to source control. The type system just lets us find them more quickly.

The primary long-term reason to use Flow (or TypeScript) is for the improved readability of code. Far too often in JavaScript (or Python), you’ll run into a function like this:

function updateUI(data) { ... }

It was surely obvious to whoever wrote the function (perhaps you, six months ago!) what data was. You’d sure like to update the UI. But what exactly do you pass to this function? For simple code, you may be able to infer the parameter type by inspection. But for complex functions which call out to other functions in other modules, this quickly becomes difficult. Type annotations and type inference can both immediately tell you what’s expected.

For large codebases, this winds up being a huge aid. Complex UIs often use large, nested objects to represent their state. This is very common with React Components, for example. With type annotations, it’s clear what type of object each component expects. And, if you’re still confused, Flow will let you know. What’s more, unlike types which are documented in comments, you can be confident that type annotations are up-to-date and correct. If they weren’t, Flow would have complained.

Why use Flow instead of TypeScript?

An aside: why did we use Flow instead of TypeScript? When we started work on pileup.js, TypeScript had poor support both for Node-style require statements and for React.js and its JSX syntax. These issues are less relevant now: both Flow and TypeScript support ES6 import statements, which is what all projects should be using going forward, and TypeScript 1.6 added support for JSX.

Both Flow and TypeScript are good choices for a new project. Flow is a more sophisticated type system, with better support for patterns like type unions and nullable types. TypeScript is more established and has better support, e.g. more editor plugins and third-party type definitions.

Practical experiences / gotchas

Editor integration

While you can run flow check and flow status from the command line and as part of your continuous integration, you’ll be missing out on many of its benefits if this is all you do. Flow works best when it integrates with your editor. With the vim-flow plugin, for example, Flow will show any new errors inside of vim whenever you save a JS file. This is fast (or should be – see below!) and helps you catch errors without the context switch of going to your web browser.

Flow is a server

Flow is intended to be run as a background server. The first time you run flow status, it starts up the server, which reads and analyzes all your code and then waits for files to change. When you run flow status again, you’re just asking the server for the latest errors. Incremental changes are fast, but the server startup is slow. flow check runs a full check and is always slow, so use flow status except in contexts that are not latency-sensitive, e.g. your continuous integration build.

Sometimes you’ll make a change that forces the Flow server to restart. These include changes to your .flowconfig file or changes to a type declaration (lib) file. There’s no way around this. You’ll just have to wait 20-30 seconds to get type errors again.

Flow reads your entire node_modules directory

Incremental compiles should take under a second. You can check this time for your project by running:

echo '' >> some/js/file.js; time flow status

If this time gets too large (e.g. more than a second), it might be because Flow is wasting time reading everything under node_modules. You can check how many files Flow is tracking by running this sequence:

flow check --verbose 2> /tmp/flow-all.txt
grep -o '/node_modules.*' /tmp/flow-all.txt | perl -pe 's/:.*//; s/[\\]?".*//; s/ .*//' | sort | uniq | wc -l

If this is significantly larger than the number of JS files in your project, then Flow is wasting time. You should add as much of node_modules as you can to the [ignore] section of your .flowconfig. See this issue for details.

Creeping any types

One of the pitches for Flow is that it performs type inference on your JavaScript. You don’t even have to provide type annotations: Flow will figure out your code! For a few reasons, however, this does not work out in practice. First, Flow requires that you explicitly annotate all functions and classes which are exported from modules (this is for performance reasons). Second, there are many ways that any types can creep into Flow code:

  • Third-party libraries If you import a third-party module (e.g. underscore), then the type of the entire module will be any. This means that any data that passes through functions in that module will come out with an any type. The solution to this is to use a type declaration file. While there is some movement towards providing a central registry of Flow type definitions, it’s nowhere near as mature as TypeScript’s DefinitelyTyped.
  • Props and State If you use React, you need to explicitly specify types for your props and state. Otherwise, they default to any. You can get slightly better behavior by using propTypes, which Flow understands. But the best way is to explicitly define types for props and state using an ES6 class:

    type Props = {
      name: string;  // note: not null!
      address: string;
    };

    class AddressBookEntry extends
        React.Component<void /* DefaultProps */,
                        Props,
                        void /* state -- none in this component */> {
      // ...
    }

    This will check types for props, state and setState within your component. It will also check the props on AddressBookEntry when you create one using JSX syntax elsewhere in your code (i.e. <AddressBookEntry name="Dan" /> would be an error because address is missing). If you define props and state members, you will get the former but not the latter check.

  • Missing @flow annotations If you forget to put @flow at the top of your source file, Flow will ignore the file entirely, treating everything it exports as any. We added a lint check to ensure that all our JS files had this annotation.

You can run flow coverage path/to/file.js to see the fraction of expressions in a file whose type is any. You should try to keep this as low as possible. The :FlowType vim command shows you what Flow thinks the type of the expression under your cursor is. This can also help to find gaps in what Flow will catch.

Surprising non-errors

JavaScript is a very permissive language. Flow models some elements of this permissiveness but disallows others. For example, '1' + 1 evaluates to '11'. Flow complains about this because it’s likely to be a mistake (and at the very least indicates unclear thinking).

JavaScript also allows you to pass any number of arguments to functions. For example, Math.pow(2, 4, 6, 8) == 16; the arguments after the second are simply ignored. This is also likely to be a mistake, but Flow is OK with passing too many arguments. Other examples include the fact that functions are objects.

We often find ourselves wondering “would Flow catch this?” It’s a good habit to introduce errors and typos to double-check that Flow is understanding your code.

Misattributed / cryptic errors

Many students have a sinking realization during their Compilers classes that, while parsing a correctly-formed program is fun and easy, 95% of the challenge in writing a good parser is the tedious process of catching errors, determining where they are, and recovering from them.

Something similar seems to be at play with Flow. If Flow reports that there are no errors, then (modulo any types and non-errors, see above) you can be confident that there are none. If it reports a problem, then there probably is one. But the error message may or may not point you in the right direction. There are many open issues around error messages.

For example, we’ve learned from experience that intersection types can be a source of cryptic errors:

var el = document.createElement('canvas');
var ctx = el.getContext('2d');
ctx.fillRect(0, 0, 200, 100);

This produces an odd-looking error:

2:     var ctx = el.getContext('2d');
                 ^^^^^^^^^^^^^^^^^^^ call of method `getContext`. Function cannot be called on
2:     var ctx = el.getContext('2d');
                 ^^^^^^^^^^^^^^^^^^^ HTMLCanvasElement

getContext is most certainly a method on HTMLCanvasElement. The real issue is on the third line, not the second. getContext is allowed to return null, which does not have a fillRect method. This error bubbles up to the point where these two cases diverge, resulting in the cryptic message. The solution is to add an if (ctx) { ... } check around the drawing code.

This example illustrates both the strengths and weaknesses of Flow. It’s impressive that Flow can track the way that functions return different types based on parameter values, e.g. HTMLCanvasElement from document.createElement('canvas'). This eliminates a common source of typecasts in other type systems. It also illustrates a case where null-checking is not helpful. In practice, getContext('2d') will not return null.


Flow is a powerful tool for performing static type analysis on JavaScript codebases, bringing their maintainers many of the benefits that this yields. Flow is not without its quirks, though, some of which we’ve documented here. Understanding these will help you get the most benefit out of Flow.

Flow is developing rapidly, with dozens of commits being merged every week. This is one of the very best things about it. In the year that we’ve been using it, many new features have been added and many fundamental issues have been resolved. It’s likely (we hope!) that some of the gotchas in this post will no longer be gotchas by the time you read it.

Ketrew Industries Two Hundred

We went to Vancouver to present Ketrew and its younger sibling Biokepi at the OCaml 2015 workshop (cf. video and slides). As for many other projects, the talk deadline triggered a new release for Ketrew a few days earlier: 2.0.0.

The main novelty is a WebUI, so we can be all webly and start by pasting a huge animated GIF:

Ketrew's WebUI In Action

The new graphical interface fully runs in the browser while sharing the same protocol as the Text-based UI (cf. protocol.mli). It was written in about two months thanks to js_of_ocaml. We used Tyxml_js for typed HTML5 tree manipulation within OCaml, together with the react library (the original one, not Facebook’s name-stealing “React”).

Overall, going on a programming spree with Tyxml_js + React has been a really fun and pleasant experience. It’s difficult to believe that some people in the world build reactive UIs without a decent type system.

This release comes with a few other miscellaneous changes:

  • We made it easier to get started:
    • We made the command ketrew init much more clever: it can create much more complex ready-to-use configuration files.
    • We added the option to run the HTTP server without TLS.
    • We wrote a more accessible “Getting Started” documentation.
  • We improved the YARN backend (incl. talking a bit with its HTTP API).
  • We took the chance to do some backwards incompatible changes to the EDSL API since other changes were already triggering a major version change.

The next release will be coming soon, with many backwards compatible changes.

  • Ketrew, through Trakeva, will be able to use other database backends for the configuration file (currently PostgreSQL is being stress-tested). One will also be able to back up and synchronize between database backends.
  • Much better scalability of the engine and WebUI.
  • Parsable and searchable logs for admins.
  • More configurability for the WebUI.

We have been using Ketrew 2.0.0 with Biokepi for a while now, and would appreciate help breaking it!