Streaming from HDFS with igv-httpfs

05 Dec 2014

Genomic data sets can be quite large, with BAM (binary alignment) files easily weighing in at hundreds of gigabytes or terabytes for a whole genome at high read depth. Distributed file systems like HDFS are a natural fit for storing these beasts.

This presents interactive genomic exploration tools like IGV with a challenge: how can you interact with a terabyte data file that may be on a remote server? Downloading the entire file could take days, so this isn’t an option.

IGV and other tools create interactive experiences using a few tricks:

They require a BAI (BAM Index) file, which lets them quickly determine the locations in the BAM file which contain reads in a particular range of interest.
They set the rarely-used Range: bytes HTTP header to request small portions of the BAM file from the remote web server.

In practice, this means that IGV can display reads from any location in a 100 GB BAM file while only transferring ~100 KB of data over the network.
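To make the second trick concrete, here is a minimal sketch of how a client might build a byte-range request and read back the server's answer. The helper names range_header and parse_content_range are made up for this illustration; they are not part of IGV or igv-httpfs.

```python
def range_header(offset, length):
    """Build the header asking for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def parse_content_range(value):
    """Parse a 'bytes first-last/total' Content-Range response value.

    Returns (first, last, total); total is None if the server sent '*'.
    """
    unit, _, rest = value.partition(" ")
    if unit != "bytes":
        raise ValueError(f"unsupported range unit: {unit!r}")
    span, _, total = rest.partition("/")
    first, _, last = span.partition("-")
    return int(first), int(last), None if total == "*" else int(total)
```

A client would pass range_header(offset, length) as the headers of an ordinary GET request. A server that honors it replies with status 206 and a Content-Range header describing the slice it returned.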

Popular web servers like Apache and nginx support the Range: bytes header, but WebHDFS, the standard HTTP server for content on HDFS, does not. WebHDFS can return a range of bytes from a file, but it does so through its own URL parameters (offset and length) rather than through the header. IGV does not know about these parameters, so it can’t speak directly to WebHDFS.
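The mismatch is small enough to sketch. The code below turns a Range header value into an offset and a length, then builds a WebHDFS OPEN URL from them. It is an illustration, not the igv-httpfs source: the function names, the host name and the port are assumptions, and multi-range requests are simply rejected.

```python
import re
from urllib.parse import quote, urlencode

_RANGE_RE = re.compile(r"^bytes=(\d*)-(\d*)$")


def parse_range(header, file_size):
    """Convert a single-range 'bytes=...' header into (offset, length)."""
    m = _RANGE_RE.match(header.strip())
    if not m or m.groups() == ("", ""):
        raise ValueError(f"unsupported Range header: {header!r}")
    first, last = m.groups()
    if first == "":
        # Suffix form 'bytes=-N': the last N bytes of the file.
        length = min(int(last), file_size)
        return file_size - length, length
    start = int(first)
    # Open-ended form 'bytes=N-' runs to the end of the file.
    end = min(int(last), file_size - 1) if last else file_size - 1
    if start > end:
        raise ValueError("range starts past the end of the file")
    return start, end - start + 1


def webhdfs_open_url(base_url, hdfs_path, offset, length):
    """Build a WebHDFS OPEN URL; base_url is a hypothetical endpoint."""
    query = urlencode({"op": "OPEN", "offset": offset, "length": length})
    return f"{base_url}/webhdfs/v1{quote(hdfs_path)}?{query}"
```

For example, a request for bytes=100-199 of /data/sample.bam becomes a WebHDFS request with offset=100 and length=100. The file size is needed to resolve open-ended and suffix ranges, so an adapter would have to look it up first.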

We can solve this problem by building an adapter. And that’s exactly what igv-httpfs is.

igv-httpfs exports a WSGI application which understands Range: bytes HTTP headers and translates them into the corresponding HTTP requests to WebHDFS. This allows IGV to load files from HDFS.
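The shape of such an adapter can be sketched in a few lines of WSGI. This is not the igv-httpfs implementation: make_range_app is a hypothetical name, and the two callables it takes (fetch and size_of) stand in for whatever code actually talks to WebHDFS. Only the simple bytes=start-end and bytes=start- forms are handled.

```python
import re


def make_range_app(fetch, size_of):
    """Return a WSGI app that serves byte ranges.

    fetch(path, offset, length) -> bytes and size_of(path) -> int are
    hypothetical stand-ins for the calls made to the backing store.
    """
    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        total = size_of(path)
        m = re.match(r"^bytes=(\d+)-(\d*)$", environ.get("HTTP_RANGE", ""))
        if not m:
            # No usable Range header: fall back to sending the whole file.
            body = fetch(path, 0, total)
            start_response("200 OK", [
                ("Content-Length", str(len(body))),
                ("Accept-Ranges", "bytes"),
            ])
            return [body]
        start = int(m.group(1))
        end = min(int(m.group(2)), total - 1) if m.group(2) else total - 1
        if start > end:
            start_response("416 Range Not Satisfiable",
                           [("Content-Range", f"bytes */{total}")])
            return [b""]
        # Inclusive range -> (offset, length) for the backing store.
        body = fetch(path, start, end - start + 1)
        start_response("206 Partial Content", [
            ("Content-Range", f"bytes {start}-{end}/{total}"),
            ("Content-Length", str(len(body))),
            ("Accept-Ranges", "bytes"),
        ])
        return [body]
    return app
```

Keeping the storage calls behind two plain functions means the range handling can be exercised against in-memory data before anything is pointed at a real cluster.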

igv-httpfs also sets appropriate CORS headers so that web-based genome visualizations such as BioDalliance, which make cross-origin XHRs, can enjoy the same benefits.
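As a rough illustration of what setting CORS headers involves for range requests, here is a small WSGI middleware. The name with_cors and the exact header values are assumptions chosen for the sketch, not a copy of what igv-httpfs sends. Because a browser may send an OPTIONS preflight before a cross-origin request that carries a Range header, the wrapper answers that request itself.

```python
def with_cors(app, allow_origin="*"):
    """Wrap a WSGI app so that its responses carry CORS headers."""
    cors = [
        ("Access-Control-Allow-Origin", allow_origin),
        ("Access-Control-Allow-Methods", "GET, HEAD, OPTIONS"),
        # Let cross-origin scripts send Range ...
        ("Access-Control-Allow-Headers", "Range"),
        # ... and read the headers that describe the returned slice.
        ("Access-Control-Expose-Headers",
         "Content-Range, Content-Length, Accept-Ranges"),
    ]

    def wrapped(environ, start_response):
        if environ.get("REQUEST_METHOD") == "OPTIONS":
            # Answer the preflight directly; the wrapped app never sees it.
            start_response("204 No Content", list(cors))
            return []

        def cors_start_response(status, headers, exc_info=None):
            return start_response(status, list(headers) + cors, exc_info)

        return app(environ, cors_start_response)

    return wrapped
```

In a real deployment you would likely replace the wildcard origin with the specific origins that host your visualizations.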

As with any WSGI app, we recommend running igv-httpfs behind a WSGI HTTP server (we use gunicorn) and an HTTP server (we use nginx). This improves performance and lets you enable gzip compression, an essential optimization for serving genomic data over the network.

Update (2015-12-14): Another approach to serving HDFS content over HTTP is to create an NFS mount for the HDFS file system, say at /hdfs. With this in place, you can serve HDFS files using a standard HTTP server like nginx or Apache. Some care is required to correctly support CORS, caching and bioinformatics MIME types, but support for range requests comes built-in. You can use our nginx configuration as a template.