spark-bam

Process BAM files using Apache Spark and HTSJDK; inspired by hadoop-bam.

$ spark-shell --packages=org.hammerlab.bam:load_2.11:1.1.0
import spark_bam._, hammerlab.path._

val path = Path("test_bams/src/main/resources/2.bam")

// Load an RDD[SAMRecord] from `path`; supports .bam, .sam, and .cram
val reads = sc.loadReads(path)
// RDD[SAMRecord]

reads.count
// 2500

import hammerlab.bytes._

// Configure maximum split size
sc.loadReads(path, splitSize = 16 MB)
// RDD[SAMRecord]
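
Since loadReads returns an ordinary RDD[SAMRecord], downstream transformations can use standard HTSJDK accessors directly. A minimal sketch, continuing the session above (getReadUnmappedFlag and getReferenceName are stock htsjdk.samtools.SAMRecord methods):

// Keep only reads mapped to the reference
val mapped = reads.filter(!_.getReadUnmappedFlag)
mapped.count

// Tally reads per contig; unmapped reads land under "*"
reads.map(_.getReferenceName).countByValue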

Linking

SBT

libraryDependencies += "org.hammerlab.bam" %% "load" % "1.1.0"

Maven

<dependency>
  <groupId>org.hammerlab.bam</groupId>
  <artifactId>load_2.11</artifactId>
  <version>1.1.0</version>
</dependency>

From spark-shell

spark-shell --packages=org.hammerlab.bam:load_2.11:1.1.0

On Google Cloud

spark-bam reads files through Java NIO APIs; reading from Google Cloud Storage (gs:// URLs) additionally requires the google-cloud-nio connector on the classpath.

Download a shaded google-cloud-nio JAR:

GOOGLE_CLOUD_NIO_JAR=google-cloud-nio-0.20.0-alpha-shaded.jar
wget https://oss.sonatype.org/content/repositories/releases/com/google/cloud/google-cloud-nio/0.20.0-alpha/$GOOGLE_CLOUD_NIO_JAR
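
google-cloud-nio authenticates using Application Default Credentials; when not running on GCP infrastructure that provides them automatically, one common setup is to point them at a service-account key file (the path below is a placeholder):

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json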

Then include it in your --jars list when running spark-shell or spark-submit:

spark-shell --jars $GOOGLE_CLOUD_NIO_JAR --packages=org.hammerlab.bam:load_2.11:1.1.0
…
import spark_bam._, hammerlab.path._

val reads = sc.loadBam(Path("gs://bucket/my.bam"))
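
The loaders behave the same against gs:// paths as local ones; for example, the splitSize option shown earlier should apply unchanged (gs://bucket/my.bam is a placeholder, and this sketch assumes the same session as above):

import hammerlab.bytes._

sc.loadReads(Path("gs://bucket/my.bam"), splitSize = 32 MB)
// RDD[SAMRecord]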