Introducing Ketrew 0.0.0

We just released Ketrew 0.0.0 in opam’s main repository.

Ketrew means Keep Track of Experimental Workflows: it is a workflow engine based on an Embedded Domain Specific Language.

  • The EDSL is a simple OCaml library providing combinators to:
    • define complex workflows (interdependent steps/programs using a lot of data, with many parameter variations, running on different hosts with various schedulers)
    • submit them to the engine.
  • The engine orchestrates those workflows while keeping track of everything that succeeds, fails, or gets lost.
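To give a feel for what "combinators to define workflows, plus an engine that keeps track of outcomes" can look like, here is a minimal, self-contained sketch in plain OCaml. It is not the Ketrew API: the `target` and `submit` functions, the `status` type, and the step names are all hypothetical, and this toy engine runs everything locally and sequentially.

```ocaml
(* Toy workflow EDSL. NOT the Ketrew API: every name here is hypothetical. *)
type status = Succeeded | Failed | Skipped

type target = {
  name : string;               (* assumed unique within a workflow *)
  run : unit -> bool;          (* returns true on success *)
  depends_on : target list;
}

(* Combinator that defines one step of a workflow. *)
let target ?(depends_on = []) name run = { name; run; depends_on }

(* Naive engine: run dependencies first and record the outcome of every
   target. A target is skipped when any of its dependencies did not succeed. *)
let submit root =
  let log : (string, status) Hashtbl.t = Hashtbl.create 16 in
  let rec go t =
    match Hashtbl.find_opt log t.name with
    | Some s -> s                               (* already visited *)
    | None ->
      let deps = List.map go t.depends_on in
      let s =
        if List.exists (fun d -> d <> Succeeded) deps then Skipped
        else if t.run () then Succeeded
        else Failed
      in
      Hashtbl.replace log t.name s;
      s
  in
  ignore (go root);
  log

(* Usage: a three-step chain whose middle step fails. *)
let align = target "align" (fun () -> true)
let call_variants = target "call-variants" ~depends_on:[align] (fun () -> false)
let report = target "report" ~depends_on:[call_variants] (fun () -> true)

let log = submit report
```

After `submit`, the table `log` holds one status per target, so a failure in the middle of a chain stays visible instead of being silently lost. A real engine also has to deal with remote hosts, schedulers, and persistence, none of which this sketch attempts.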

Ketrew can be run as a standalone application (i.e. run the engine from the command line), or using a client-server architecture (i.e. run a proper server, and connect through HTTPS).

What’s in 0.0.0?

  • An EDSL that can be used to build complex workflows. We wanted to keep it usable by OCaml beginners (read: it is not a bunch of GADTs carrying first-class modules).
  • Client-server (HTTPS + JSON) and standalone modes.
  • An implementation of the engine which is naive/slow but testable & understandable.
  • A command-line client that can explore the state of workflows and pilot the engine interactively.
  • A persistence layer based on an “explorable” (but slow) git database.
  • Access through SSH to LSF and PBS-based computing clusters, and also to Unix machines that do not have a batch scheduler.
  • A plugin infrastructure to add backends, either linked with the library or loaded using Dynlink and/or Findlib (cf. documentation).

What We Are Doing With It

Ketrew has been used for very diverse things like running backups or building documentation websites (see smondet/build-docs-workflow). But we actually develop and use Ketrew to run bioinformatics pipelines on very large amounts of data:

  • Biomedical computational pipelines involve various long-running computational steps; for each of them, we want to run many parameter variations.
  • At the same time, bioinformatics tools are infamous for being quite poorly engineered; they tend to fail in mysterious ways, for seemingly valid inputs.
  • The computing infrastructure is also very diverse and gives little control to the end user.

Ketrew is designed with those adverse conditions in mind.

By the way, we just started open-sourcing our Ketrew pipelines for genomics experiments; see hammerlab/biokepi. It is a target library (i.e. a set of functions to create Ketrew.Target.t values which wrap the installation and running of bioinformatics tools), but it also contains a cool GADT-based pipeline description module. We’re on the road to concise and well-typed bioinformatics pipelines!
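As an illustration of what a GADT-based pipeline description can buy, here is a small self-contained sketch. It is not biokepi's actual module: the type names (`fastq`, `bam`, `vcf`), the constructors, and the `describe` function are hypothetical, chosen only to show how the type index prevents plugging a step into the wrong kind of input.

```ocaml
(* Hypothetical sketch, not the biokepi API. *)

(* Phantom types naming the kind of data each stage produces. *)
type fastq
type bam
type vcf

(* The type parameter records what a pipeline produces, so the compiler
   rejects, for example, calling variants directly on a FASTQ file. *)
type _ pipeline =
  | Fastq_file : string -> fastq pipeline
  | Align : fastq pipeline -> bam pipeline
  | Call_variants : bam pipeline -> vcf pipeline

(* Render a pipeline as a readable expression. *)
let rec describe : type a. a pipeline -> string = function
  | Fastq_file path -> path
  | Align p -> "align(" ^ describe p ^ ")"
  | Call_variants p -> "call_variants(" ^ describe p ^ ")"

let example = Call_variants (Align (Fastq_file "sample.fastq"))
let description = describe example
```

Writing `Call_variants (Fastq_file "sample.fastq")` would be a compile-time type error in this sketch, which is the kind of guarantee "well-typed pipelines" refers to. Turning such a description into actual targets is a separate step that is not shown here.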

Limitations & Future Work

Performance

We are switching to faster database backends: see the smondet/trakeva library and PR #98.

We had to slow down Ketrew quite artificially because submitting too many targets (≥ 100-ish) that all check on files/conditions on the same SSH host would cause connection errors. We will work on less naive approaches for accessing remote hosts (see issue #69).
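One simple way to picture this kind of throttling is to split the pending checks into bounded batches, so that a single host never sees more than a fixed number of them at once. The sketch below is only an illustration under that assumption: it is not how Ketrew throttles, and the `batches` function and the batch size are hypothetical.

```ocaml
(* Hypothetical helper, not part of Ketrew.
   Split [items] into consecutive batches of at most [size] elements. *)
let batches ~size items =
  if size <= 0 then invalid_arg "batches: size must be positive";
  let close current acc =
    match current with
    | [] -> acc
    | _ -> List.rev current :: acc
  in
  let rec go current count acc = function
    | [] -> List.rev (close current acc)
    | x :: rest ->
      if count = size then go [x] 1 (close current acc) rest
      else go (x :: current) (count + 1) acc rest
  in
  go [] 0 [] items

(* Usage: five pending checks, at most two handled together. *)
let pending = ["t1"; "t2"; "t3"; "t4"; "t5"]
let grouped = batches ~size:2 pending
```

Each inner list could then be processed before the next one starts, which trades latency for fewer simultaneous connections.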

We are also working on the engine itself; see issue #73.

(G)UI

We routinely run workflows that contain more than 1000 steps each, but we need better user interfaces to go much further (both web- and command-line-based, even emacs/vim-friendly).

Deployment

Because of ocp-build, Ketrew 0.0.0 requires OCaml 4.01.0; we are switching to an ocamlbuild/oasis-based build while waiting for Assemblage to be ready.

Closing Remarks

We hope that you will find Ketrew useful for any of the various kinds of workflows you may want to track! Feel free to file issues or reach out at hammerlab/ketrew if you have any questions.

If you are in New York City by the end of January, we will be at the Compose conference to present Ketrew.