API

With spark-bam on the classpath, a SparkContext can be “enriched” with relevant methods for loading BAM files by importing:

import spark_bam._

loadReads

The primary method exposed is loadReads, which will load an RDD of HTSJDK SAMRecords from a .sam, .bam, or .cram file:

sc.loadReads(path)
// RDD[SAMRecord]

Arguments:

  • path (required)
  • bgzfBlocksToCheck: optional (default: 5)
  • readsToCheck:
    • optional (default: 10)
    • number of consecutive reads to verify when determining a record/split boundary
  • maxReadSize:
    • optional (default: 10000000)
    • throw an exception if a record boundary is not found in this many (uncompressed) positions
  • splitSize:
    • optional (default: taken from underlying Hadoop filesystem APIs)
    • shorthands accepted, e.g. 16m, 32MB
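As a sketch of how the optional arguments above might be supplied (the named-argument style and the numeric types here are assumptions about the API's shape, not taken from the source):

```scala
import spark_bam._

// Hypothetical sketch: tune split computation when loading reads.
// Parameter names follow the list above.
val reads =
  sc.loadReads(
    path,
    bgzfBlocksToCheck = 5,   // BGZF blocks to verify per candidate block boundary
    readsToCheck = 10,       // consecutive reads to verify per candidate record boundary
    maxReadSize = 10000000   // fail if no record boundary is found within this many uncompressed positions
  )
// reads: RDD[SAMRecord]
```

splitSize can likewise be overridden; per the list above it also accepts shorthand strings such as "16m" or "32MB".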

loadBamIntervals

When the path is known to be an indexed .bam file, reads can be loaded from only the specified genomic-loci regions:

import org.hammerlab.genomics.loci.parsing.ParsedLoci
import org.hammerlab.genomics.loci.set.LociSet
import org.hammerlab.bam.header.ContigLengths
import org.hammerlab.hadoop.Configuration

implicit val conf: Configuration = sc.hadoopConfiguration

val parsedLoci = ParsedLoci("1:11000-12000,1:60000-")
val contigLengths = ContigLengths(path)

// "Join" `parsedLoci` with `contigLengths` to e.g. resolve open-ended intervals
val loci = LociSet(parsedLoci, contigLengths)

sc.loadBamIntervals(
	path, 
	loci
)
// RDD[SAMRecord] with only reads overlapping [11000,12000) and [60000,∞) on chromosome 1

Arguments:

  • path (required)
  • loci (required): LociSet indicating genomic intervals to load
  • splitSize: optional (default: taken from underlying Hadoop filesystem APIs)
  • estimatedCompressionRatio
    • optional (default: 3.0)
    • minor parameter used for approximately balancing Spark partitions; shouldn’t be necessary to change
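A sketch of overriding the optional arguments above (the named-argument style is an assumption about the API's shape):

```scala
// Hypothetical sketch: load intervals with an explicit split size and
// compression-ratio estimate; parameter names follow the list above.
sc.loadBamIntervals(
  path,
  loci,
  splitSize = 16000000,            // target split size in bytes
  estimatedCompressionRatio = 3.0  // used to approximately balance partition sizes
)
```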

loadReadsAndPositions

Underlying implementation of loadReads: takes the same arguments, but returns SAMRecords keyed by their BGZF position (Pos).

Primarily useful for analyzing split computations, e.g. in the compute-splits command.
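For example, a sketch of inspecting where records begin in the compressed stream (the `RDD[(Pos, SAMRecord)]` shape is inferred from the description above):

```scala
// Hypothetical sketch: load records keyed by their BGZF position.
val positionedReads = sc.loadReadsAndPositions(path)
// positionedReads: RDD[(Pos, SAMRecord)]

// Look at the first few record-start positions, e.g. when debugging
// where computed splits land relative to actual record boundaries.
positionedReads
  .keys
  .take(10)
  .foreach(println)
```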

loadSplitsAndReads

Similar to loadReads, but also returns computed Splits alongside the RDD[SAMRecord].

Primarily useful for analyzing split computations, e.g. in the compute-splits command.
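A sketch of using the two return values together (the `(Splits, RDD[SAMRecord])` pair shape is an assumption based on the description above):

```scala
// Hypothetical sketch: compare the computed splits against the records
// actually loaded from them.
val (splits, reads) = sc.loadSplitsAndReads(path)

splits.foreach(println)          // the computed Splits
println(s"records: ${reads.count}")  // total records loaded across those splits
```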