API

With spark-bam on the classpath, a SparkContext can be “enriched” with relevant methods for loading BAM files by importing:

import spark_bam._

loadReads

The primary method exposed is loadReads, which will load an RDD of HTSJDK SAMRecords from a .sam, .bam, or .cram file:

sc.loadReads(path)
// RDD[SAMRecord]

Arguments:

  • path (required)
  • bgzfBlocksToCheck: optional (default: 5)
  • readsToCheck:
    • optional (default: 10)
    • number of consecutive reads to verify when determining a record/split boundary
  • maxReadSize:
    • optional (default: 10000000)
    • throw an exception if a record boundary is not found in this many (uncompressed) positions
  • splitSize:
    • optional (default: taken from underlying Hadoop filesystem APIs)
    • shorthands accepted, e.g. 16m, 32MB
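As a sketch of how the optional arguments above might be supplied (the named-argument style and the numeric types here are assumptions about the API's shape, not taken from the source):

```scala
import spark_bam._

// Hypothetical sketch: tune split computation when loading reads.
// Parameter names follow the list above.
val reads =
  sc.loadReads(
    path,
    bgzfBlocksToCheck = 5,   // BGZF blocks to verify per candidate block boundary
    readsToCheck = 10,       // consecutive reads to verify per candidate record boundary
    maxReadSize = 10000000   // fail if no record boundary is found within this many uncompressed positions
  )
// reads: RDD[SAMRecord]
```

splitSize can likewise be overridden; per the list above it also accepts shorthand strings such as "16m" or "32MB".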

loadBamIntervals

When the path is known to be an indexed .bam file, reads can be loaded from only the specified genomic-loci regions:

import org.hammerlab.genomics.loci.parsing.ParsedLoci
import org.hammerlab.genomics.loci.set.LociSet
import org.hammerlab.bam.header.ContigLengths
import org.hammerlab.hadoop.Configuration

implicit val conf: Configuration = sc.hadoopConfiguration

val parsedLoci = ParsedLoci("1:11000-12000,1:60000-")
val contigLengths = ContigLengths(path)

// "Join" `parsedLoci` with `contigLengths` to e.g. resolve open-ended intervals
val loci = LociSet(parsedLoci, contigLengths)

sc.loadBamIntervals(
	path, 
	loci
)
// RDD[SAMRecord] with only reads overlapping [11000,12000) and [60000,∞) on chromosome 1

Arguments:

  • path (required)
  • loci (required): LociSet indicating genomic intervals to load
  • splitSize: optional (default: taken from underlying Hadoop filesystem APIs)
  • estimatedCompressionRatio
    • optional (default: 3.0)
    • minor parameter used for approximately balancing Spark partitions; shouldn’t be necessary to change
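A sketch of overriding the optional arguments above (the named-argument style is an assumption about the API's shape):

```scala
// Hypothetical sketch: load intervals with an explicit split size and
// compression-ratio estimate; parameter names follow the list above.
sc.loadBamIntervals(
  path,
  loci,
  splitSize = 16000000,            // target split size in bytes
  estimatedCompressionRatio = 3.0  // used to approximately balance partition sizes
)
```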

loadReadsAndPositions

Underlying implementation of loadReads: takes the same arguments, but returns SAMRecords keyed by their BGZF position (Pos).

Primarily useful for analyzing split computations, e.g. in the compute-splits command.
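For example, a sketch of inspecting where records begin in the compressed stream (the `RDD[(Pos, SAMRecord)]` shape is inferred from the description above):

```scala
// Hypothetical sketch: load records keyed by their BGZF position.
val positionedReads = sc.loadReadsAndPositions(path)
// positionedReads: RDD[(Pos, SAMRecord)]

// Look at the first few record-start positions, e.g. when debugging
// where computed splits land relative to actual record boundaries.
positionedReads
  .keys
  .take(10)
  .foreach(println)
```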

loadSplitsAndReads

Similar to loadReads, but also returns computed Splits alongside the RDD[SAMRecord].

Primarily useful for analyzing split computations, e.g. in the compute-splits command.
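A sketch of using the two return values together (the `(Splits, RDD[SAMRecord])` pair shape is an assumption based on the description above):

```scala
// Hypothetical sketch: compare the computed splits against the records
// actually loaded from them.
val (splits, reads) = sc.loadSplitsAndReads(path)

splits.foreach(println)          // the computed Splits
println(s"records: ${reads.count}")  // total records loaded across those splits
```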