
Motivation

spark-bam improves on hadoop-bam in three ways:

Parallelization

hadoop-bam computes splits sequentially on one node. Depending on the storage backend, this can take many minutes for modest-sized (10-100GB) BAMs, leaving a large cluster idling while the driver bottlenecks on an eminently-parallelizable task.

For example, on Google Cloud Storage (GCS), two factors contribute to high split-computation latency:

  • high GCS round-trip latency
  • file-seek/-access patterns that nullify buffering in GCS NIO/HDFS adapters

spark-bam identifies record-boundaries in each underlying file-split in the same Spark job that streams through the records, eliminating the driver-only bottleneck and maximally parallelizing split-computation.
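For illustration, here is a minimal loading sketch in the spirit of spark-bam's published examples, run from something like a spark-shell where sc is the SparkContext; the sc.loadReads extension method and hammerlab.path.Path import are assumed here, and exact imports/signatures may differ between versions:

```scala
import spark_bam._
import hammerlab.path._

// Any Hadoop-filesystem-compatible URL should work (file://, gs://, …); bucket/path here is a placeholder
val path = Path("gs://bucket/sample.bam")

// Record-boundary detection runs inside the same Spark job that parses the reads,
// so no driver-only split computation precedes the parallel work
val reads = sc.loadReads(path)

reads.count
```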

Correctness

An important impetus for the creation of spark-bam was the discovery of two TCGA lung-cancer BAMs for which hadoop-bam produces invalid splits:

HTSJDK threw an error when trying to parse reads from essentially random data fed to it by hadoop-bam:

MRNM should not be set for unpaired read

These BAMs were rendered unusable, and questions remain around whether such invalid splits could silently corrupt analyses.

Improved record-boundary-detection robustness

spark-bam fixes these record-boundary-detection “false-positives” by adding additional checks:

Validation check                                    spark-bam   hadoop-bam
Negative reference-contig idx                       ✅          ✅
Reference-contig idx too large                      ✅          ✅
Negative locus                                      ✅          ✅
Locus too large                                     ✅          🚫
Read-name ends with \0                              ✅          ✅
Read-name non-empty                                 ✅          🚫
Invalid read-name chars                             ✅          🚫
Record length consistent w/ #{bases, cigar ops}     ✅          ✅
Cigar ops valid                                     ✅          🌓*
Subsequent reads valid                              ✅          ✅
Non-empty cigar/seq in mapped reads                 ✅          🚫
Cigar consistent w/ seq len                         🚫          🚫

* Cigar-op validity is not verified for the “record” that anchors a record-boundary candidate BAM position, but it is verified for the subsequent records that hadoop-bam checks
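To make the kinds of checks in the table concrete, here is an illustrative sketch of field-level plausibility tests a checker can apply to a candidate record header; the names, types, and bounds below are invented for the example and are not spark-bam's actual implementation:

```scala
// Illustrative only: hypothetical field-level checks in the spirit of the table above.
case class RecordHeader(
  refIdx: Int,                 // reference-contig index (-1 = unmapped)
  locus: Int,                  // 0-based alignment start (-1 = unmapped)
  readNameBytes: Array[Byte]   // read name, including trailing NUL
)

def plausible(h: RecordHeader, numContigs: Int, contigLen: Int): Boolean =
  h.refIdx >= -1 && h.refIdx < numContigs &&              // contig idx neither negative nor too large
  h.locus  >= -1 && h.locus  <= contigLen &&              // locus neither negative nor too large
  h.readNameBytes.length > 1 &&                           // read-name non-empty…
  h.readNameBytes.last == 0 &&                            // …and ends with \0
  h.readNameBytes.init.forall(b => b >= '!' && b <= '~')  // no invalid read-name chars
```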

Checking correctness

spark-bam detects BAM-record boundaries using the pluggable Checker interface.

Four implementations are provided:

eager

Default/production-worthy record-boundary-detection algorithm.

full

Debugging-oriented Checker.

seqdoop

Checker that mimics hadoop-bam’s BAMSplitGuesser as closely as possible.

indexed

This Checker simply reads from a .records file (as output by index-records) and reflects the read-positions listed there.

It can serve as a “ground truth” against which to check either the eager or seqdoop checkers (using the -s or -u flag to check-bam, respectively).
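For orientation, here is a hypothetical sketch of what a pluggable boundary-checker interface can look like; the names are invented for illustration, and spark-bam's actual Checker trait differs in detail:

```scala
// Hypothetical sketch; not spark-bam's actual Checker trait.
trait BoundaryChecker {
  /** True if the candidate file position `pos` plausibly starts a BAM record. */
  def isRecordStart(pos: Long): Boolean
}

// A tool like check-bam can then be parameterized over which implementation
// (eager / full / seqdoop / indexed) it instantiates, e.g. to scan a range of candidates:
def firstRecordStartAtOrAfter(checker: BoundaryChecker, from: Long, to: Long): Option[Long] =
  (from until to).find(checker.isRecordStart)
```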

Future-proofing

hadoop-bam is poorly suited to handling increasingly-long reads from e.g. PacBio and Oxford Nanopore sequencers.

For example, a 100kbp-long read is likely to span multiple BGZF blocks, causing hadoop-bam to reject it as invalid.

spark-bam is robust to such situations because it is agnostic to buffer sizes and to where reads fall relative to BGZF-block boundaries.
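A back-of-the-envelope sketch of why long reads straddle blocks: BGZF blocks hold at most 64 KiB of uncompressed data, while the BAM record for a 100kbp read (packed sequence, qualities, name, cigar, tags) runs well past that. The field sizes below are approximations for illustration, not spark-bam code:

```scala
// Constants and field sizes are approximations for illustration only.
val bgzfBlockMax = 64 * 1024                  // max uncompressed bytes per BGZF block

def minBlocksSpanned(readLenBp: Int, nameLen: Int = 32, cigarOps: Int = 100): Int = {
  val fixedFields = 32                        // fixed-width portion of a BAM alignment record
  val seqBytes    = (readLenBp + 1) / 2       // sequence is packed at 4 bits per base
  val qualBytes   = readLenBp                 // base qualities: 1 byte per base
  val recordBytes = fixedFields + nameLen + 1 + 4 * cigarOps + seqBytes + qualBytes
  (recordBytes + bgzfBlockMax - 1) / bgzfBlockMax   // lower bound on blocks the record touches
}

minBlocksSpanned(100000)                      // ≈ 3: far too big for a single 64 KiB block
```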

Algorithm/API clarity

Analyzing hadoop-bam’s correctness (as discussed above) proved quite difficult due to subtleties in its implementation.

Its record-boundary-detection is sensitive, in terms of both output and runtime, to:

  • position within a BGZF block
  • arbitrary (256KB) buffer size
  • JVM heap size (!!! 😱)

spark-bam’s accuracy is dramatically easier to reason about:

  • buffer sizes are irrelevant
  • OOMs are neither expected nor depended on for correctness
  • file-positions are evaluated hermetically

This allows for greater confidence in the correctness of computed splits and downstream analyses.

Case study: counting on OOMs

While evaluating hadoop-bam’s correctness, BAM positions were discovered that BAMSplitGuesser would correctly deem as invalid iff the JVM heap size was below a certain threshold; larger heaps would avoid an OOM and mark an invalid position as valid.

An overview of this failure mode:

This resulted in positions that hadoop-bam correctly ruled out in sufficiently memory-constrained test contexts but falsely accepted in more generously provisioned settings, an undesirable coupling of split correctness to available heap.