Bioinformatics Workflows on a Hadoop Cluster

Quick experience report on our adventures running old-school batch-processing-style computational workflows on a YARN-managed Hadoop cluster.

Ketrew on YARN

We run a whole lot of bioinformatics software in the form of long pipelines of finely tuned analysis steps. The project hammerlab/biokepi is our repository of bioinformatics workflows; it uses Ketrew as a workflow engine and exposes a high-level, Lego-style pipeline-building API in OCaml.

Biokepi is quite agnostic of the computing environment: any Ketrew backend with a shared filesystem can be used by creating a Run_environment.Machine.t. In house, we want to use our Hadoop cluster. Nowadays the Hadoop ecosystem uses the YARN scheduler; Ketrew gained support for YARN a few months ago, and this support made it into the 1.0.0 release that recently hit the shelves.

Ketrew has to be able to run both Hadoop applications (which have built-in knowledge of YARN resources) and arbitrary programs (as on classical batch-scheduling clusters). The implementation provides both. Running YARN-enabled Hadoop applications is just a matter of daemonizing their “master process” on one of the cluster’s login nodes; this process then requests resources from YARN.
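The daemonization half of this can be sketched in a few lines of Python (a toy stand-in: the real backend shells out with nohup/setsid, and the echo command below is a placeholder for an actual YARN application launcher):

```python
import subprocess

# Toy sketch of daemonizing a "master process" on a login node.
# start_new_session=True detaches the child into its own session,
# much like wrapping the command in `nohup setsid ... &`.
# The echo command is a stand-in for a real YARN-enabled launcher.
with open("master.log", "w") as log:
    proc = subprocess.Popen(
        ["sh", "-c", "echo application master started"],
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,
    )

proc.wait()  # a real daemonizer would not wait; we do, to read the log
with open("master.log") as log:
    print(log.read().strip())  # → application master started
```

Once detached this way, the master process survives the SSH session that launched it and keeps negotiating with YARN on its own.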

To run arbitrary programs we had to wrap them within the cutely named org.apache.hadoop.yarn.applications.distributedshell.Client class; it runs the “application master” that requests (part of) a cluster node from YARN and then launches the given shell script.
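Concretely, an invocation looks roughly like the following (a sketch only: the jar path, script name, and resource numbers are illustrative and vary by distribution):

```shell
# Ask YARN for one container with 1024 MB and run a shell script in it.
# The jar location is distribution-specific; adjust for your install.
JAR=/usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar
yarn jar "$JAR" org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar "$JAR" \
  -shell_script ./my_pipeline_step.sh \
  -num_containers 1 \
  -container_memory 1024
```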

Once this was implemented, there were still a few obstacles to kick through.

Actually Making It Work

A Hadoop cluster is built around a non-POSIX distributed filesystem: HDFS. Most bioinformatics tools don’t know about HDFS and assume good old POSIX filesystem access. Luckily, we have a decent NFS mount configured.

The gods of distributed computing won’t let you use NFS without cursing you; this time it was the user IDs.

YARN’s default behavior is to run “containers” as the yarn Unix user (a container is a unit of computational resources, e.g. a core of a node in the cluster), hence all the POSIX files created by bioinformatics tools would be owned and manipulated by the yarn user.
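The ownership problem is plain POSIX behavior: a file belongs to the effective UID of the process that creates it, so everything a container writes belongs to whatever user the container runs as. A small Python check illustrates this:

```python
import os
import tempfile

# Any file a process creates is owned by that process's effective UID --
# so output written from a container running as "yarn" belongs to yarn.
fd, path = tempfile.mkstemp()
os.close(fd)
print(os.stat(path).st_uid == os.geteuid())  # → True
os.unlink(path)
```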

The first hacky attempt was to clumsily chmod 777 everything, set a permissive umask, and hope that the yarn user would be able to read and write.
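For reference, this is what the permissive-umask hack relies on: with a umask of 000 nothing is masked out, so files keep the full mode bits requested at creation (a sketch of the mechanism, not an endorsement):

```python
import os
import stat
import tempfile

old_umask = os.umask(0o000)  # fully permissive umask: mask out nothing
try:
    path = os.path.join(tempfile.mkdtemp(), "shared_output")
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
    os.close(fd)
    # with umask 000, the file keeps the full rw-rw-rw- mode
    print(oct(stat.S_IMODE(os.stat(path).st_mode)))  # → 0o666
finally:
    os.umask(old_umask)  # restore the previous umask
```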

Of course, NFS is based on numeric UIDs, not usernames, and the yarn user has different UIDs on different nodes of the cluster (which can be seen as a misconfiguration bug). A given step of the pipeline writes files as user 42, and a later one, on a different node, tries to write as user 51… Even with a lot of black-magic chmod/setfacl incantations, we ended up with pipelines failing after a few hours and leaving terabytes of unusable files on our now-inconsistent NFS mount.
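The failure mode is easy to see from any single node: the filesystem stores only a numeric UID, and the username is resolved against the local user database on each host, so two nodes with different passwd entries can map the same number to different users (a small illustration; the actual cross-node mismatch of course needs two hosts):

```python
import os
import pwd
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
st = os.stat(path)
# NFS (and the inode) records only the number...
print(st.st_uid)
# ...while the name is looked up in the *local* user database, so another
# node with a different /etc/passwd may resolve the same UID differently.
print(pwd.getpwuid(st.st_uid).pw_name)
os.unlink(path)
```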

To save our poor souls, there just so happens to be something called “YARN Security”: the scheduler can be configured to run containers as the Unix user that started the application, just like other schedulers have been doing for the past 30 years (cf. the documentation).
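In NodeManager configuration terms this amounts to switching to the LinuxContainerExecutor, roughly like the following yarn-site.xml fragment (property names from the Hadoop documentation; a fully secure cluster needs more setup than shown here):

```xml
<!-- Run containers as the submitting user instead of the "yarn" user. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <!-- In non-secure mode, do not remap all jobs to a single local user. -->
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>false</value>
</property>
```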

After updating to the latest Cloudera distribution to enjoy a few bug fixes, it finally works! We’ve been running hundreds of CPU-hours of bioinformatics pipelines, generating terabytes of data, and posting results to our instance of hammerlab/cycledash.

Next Steps

Always outnumbered, never outgunned, we’re now working on:

  • Getting more run-time information from YARN (issue hammerlab/ketrew#174): so far we get basic “application status information” from YARN, as well as post-mortem application logs. We want to show more real-time information in Ketrew’s UI, especially the logs of the cluster node running the YARN container.
  • Improving Biokepi into a “production-ready” pipeline framework: adding more tools and improving configurability and scalability; cf. hammerlab/biokepi/issues.

As always, help is warmly welcome!