Bioinformatics Workflows on a Hadoop Cluster

11 Jun 2015
Quick experience report on our adventures running old-school batch-processing-style computational workflows on a YARN-managed Hadoop cluster.
Ketrew on YARN
We run a whole lot of bioinformatics software artifacts in the form of long
pipelines of finely tuned analysis steps. Biokepi is our repository of
bioinformatics workflows; it uses Ketrew as a workflow engine and exposes a
high-level Lego-style pipeline-building API. Biokepi is quite agnostic of the
computing environment: any Ketrew backend with a shared filesystem can be used.
In house, we want to use our Hadoop cluster. Nowadays the Hadoop ecosystem is
built around the YARN scheduler. Ketrew gained support for YARN a few months
ago, and this made it into the 1.0.0 release that recently hit the shelves.
Ketrew has to be able to run both Hadoop applications (which have built-in
knowledge of YARN resources) and arbitrary programs, as classical batch
schedulers do.
Running YARN-enabled Hadoop applications is just a matter of daemonizing their
“master process” on one of the cluster’s login nodes; this process then
requests resources from YARN.
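Concretely, the “daemonize on a login node” part can be sketched with plain Unix tools. In this sketch `sleep 30` stands in for the real Hadoop application, and the `/tmp` paths are illustrative, not what Ketrew actually uses:

```shell
# Hypothetical sketch: detach an application's "master process" from the
# login shell; `sleep 30` stands in for the real Hadoop application.
nohup sleep 30 >/tmp/app-master.log 2>&1 &
echo $! >/tmp/app-master.pid
# Keeping the PID around lets the workflow engine poll the master later:
kill -0 "$(cat /tmp/app-master.pid)" && echo "master is running"
```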
To run arbitrary programs, we had to wrap them with Hadoop’s cutely named
Distributed Shell application: it runs an “application master” that requests
(part of) a cluster node from YARN, and then launches the given shell script.
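For reference, stock Hadoop ships a sample distributed-shell client, and the wrapping boils down to an invocation along these lines. The jar location and option names here are assumptions based on stock Hadoop 2.x (verify against your distribution), and the sketch only assembles and prints the command, since actually submitting it needs a live cluster:

```shell
# Sketch only: build the kind of command used to wrap an arbitrary script
# with Hadoop's distributed-shell application. Jar path and option names
# are assumptions based on stock Hadoop 2.x; check your distribution.
DSHELL_JAR="${HADOOP_HOME:-/opt/hadoop}/share/hadoop/yarn/hadoop-yarn-applications-distributedshell.jar"
CMD="yarn jar $DSHELL_JAR \
  org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar $DSHELL_JAR \
  -shell_script ./pipeline_step.sh \
  -num_containers 1"
echo "$CMD"
```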
Once this was implemented, there were a few more obstacles to kick through.
Actually Making It Work
A Hadoop cluster is built around a non-POSIX distributed filesystem: HDFS. Most bioinformatics tools don’t know about HDFS and assume good old POSIX file access. Luckily, we have a decent NFS mount configured.
The gods of distributed computing won’t let you use NFS without cursing you; this time it was the user IDs.
YARN’s default behavior is to run “containers” as the yarn Unix user (a
container is a unit of computational resources, like a core of a node in the
cluster), hence all the POSIX files created by bioinformatics tools would be
owned and manipulated by the yarn user.
The first hacky attempt was to clumsily chmod 777 everything, with a
permissive umask, and hope that the yarn user would be able to read and write
everywhere.
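The hack amounted to roughly the following (paths are illustrative):

```shell
# Roughly what the hack amounted to: open everything wide up and hope.
umask 0000                          # new files default to rw-rw-rw-
mkdir -p /tmp/pipeline-output
chmod -R 777 /tmp/pipeline-output   # open up the existing tree
touch /tmp/pipeline-output/sample.bam
stat -c '%a %n' /tmp/pipeline-output/sample.bam   # prints "666 ..."
```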
Of course, NFS is based on UIDs, not usernames, and the yarn user had
different UIDs on different nodes of the cluster (which can be seen as a
misconfiguration). A given step of the pipeline would write files as UID
42, and a later one, on a different node, would try to write as UID
51 … even with a lot of black-magic chmod/setfacl incantations, we ended up
with pipelines failing after a few hours and leaving terabytes of unusable files on
our then-inconsistent NFS mount.
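A one-machine illustration of why this fails: NFS records the numeric UID, not the username, so which user a file “belongs to” depends entirely on each node’s local UID mapping:

```shell
# NFS stores the numeric UID, not the username. On a cluster where the
# same username maps to different UIDs per node, file ownership is
# effectively scrambled. Single-machine illustration:
touch /tmp/uid-demo
stat -c 'owner: UID %u (known locally as %U)' /tmp/uid-demo
# On a node where this numeric UID belongs to someone else, the file is theirs.
```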
To save our poor souls, there just so happens to be something called “YARN Security.” The scheduler can be configured to run the containers as the Unix user that started the application, just like other schedulers have been doing for the past 30 years (cf. documentation).
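In Hadoop 2.x terms, this roughly means switching the NodeManager to the LinuxContainerExecutor. A yarn-site.xml sketch follows; the property names come from stock Hadoop documentation, but exact knobs vary by distribution, so treat this as an assumption to verify rather than a drop-in config:

```xml
<!-- yarn-site.xml sketch (Hadoop 2.x property names; verify against your
     distribution's documentation before deploying). -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <!-- In non-secure (no-Kerberos) mode, run containers as the submitting
       user instead of funneling every job through one local user. -->
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>false</value>
</property>
```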
After updating to the latest Cloudera distribution to enjoy a few bug fixes,
it all finally worked. We’ve been running hundreds of CPU-hours of
bioinformatics pipelines, generating terabytes of data, and posting results to
our internal services.
Always outnumbered, never outgunned, we’re now working on:
- Getting more run-time information from YARN (issue
hammerlab/ketrew#174): so far we get basic “application status” information from YARN, as well as post-mortem application logs. We want Ketrew’s UI to show more real-time information, especially the logs of the cluster node running the YARN container.
- Improving Biokepi to become a “production-ready” pipeline framework: this
means adding more tools and improving configurability and scalability.
As always, help is warmly welcome!