Methods and Coming Advances in Guacamole

12 Feb 2015

Last week we introduced our Spark-based variant calling platform, Guacamole. By parallelizing over large clusters, Guacamole can identify mutations in a human genome in minutes, while comparable programs typically take hours.

Guacamole provides a set of primitives for distributed processing of DNA sequencing data. It also includes a few proof of concept variant callers implemented using these primitives. The GermlineStandard variant caller identifies differences in a sample with respect to the reference genome, commonly referred to as germline variants. Germline variant calling can discover mutations that might cause a congenital disorder. The SomaticStandard caller is for cancer studies and identifies mutations that occur in a patient’s cancer cells but not in his or her healthy cells.

SomaticStandard

Similar to many other variant callers, we use a simple likelihood model to find possible variants, then filter these to remove common sources of error.

The input to a variant caller is a collection of overlapping reads, where each read has been mapped to a position on the reference genome. In addition, the sequencer provides base qualities, which represent the machine’s estimate of a read error. For example:

chr1 724118  AATAAGCATTAAACCA @B@FBF4BAD?FFA

where AATAAGCATTAAACCA is positioned at chr1, position 724118, and the sequencer provided the error probabilities @B@FBF4BAD?FFA. Each letter represents a Phred-scaled probability of the sequencer having correctly identified the corresponding base.

This read and others that overlap position 724118, collectively referred to as a “pileup,” can be used to determine the genotype at that position. A genotype is the pair of bases, one from each parent, that an individual has at a particular location in the genome. We can compute the posterior probability of each possible genotype, given the observed reads, using Bayes’ Rule. For example, if R is the collection of reads that overlap position p and AT is the genotype we want to evaluate:

$P (AT \vert R ) \propto P ( R \vert AT) * P(AT)$

We compute the likelihood $P ( R \vert AT)$ (i.e. how likely is it that we would see this collection of reads if the patient truly had one chromosome with an A and one with a T at this position) using the base qualities. We assume that each read is generated independently:

$P ( R \vert AT) = \prod_{i} P( R_i \vert AT)$

Then, if $\epsilon_i$ is the probability of a sequencing error on read i at this position:

$P( R_i \vert AT) = \begin{cases} 1 - \epsilon_i &\mbox{if } R_i = A/T \\ \epsilon_i & \mbox{if } R_i \neq A/T. \end{cases}$

We consider the tumor and normal samples independently at each position and compute the likelihood that the tumor has a different genotype than the normal tissue, indicating a somatic variant.

This likelihood model is simple and highly sensitive, but it ignores important complicating factors that contribute to false positive variants. For example, Illumina sequencing machines have well-documented issues toward the beginning and end of reads and sometimes sample one DNA strand more frequently than the other. Additionally, reads in high-complexity regions may be misaligned by the aligner. A number of filters are applied to remove erroneous variants due to these factors.

Improvements

Errors in variant callers tend to come from three main areas:

sequencing error
alignment error
cell variability

Sequencing Error

Guacamole uses base quality scores generated from the sequencer, but these quality scores often have their own biases. Many strategies exist to correct these biases, either through recalibration or by removing sequence segments that occur less frequently.

Guacamole will correct sequencing errors through the examination of unlikely sequence segments and the recalibration of base quality scores.

Alignment Error

Alignment error is usually the most prominent source of error due to the difficulty of mapping a read to the human reference genome. Guacamole is currently designed to work with short reads of length 100-300 bases. Each of these reads is mapped to the reference genome to determine whether it contains a variant, but a read may not map unambiguously to a single position. Also, true mutations, unique to the individual, can make mapping a read difficult, especially when the mutation involves many added or removed bases. Paired-end reads, in which each read has a “mate” read within a certain genomic distance, can help aligners resolve ambiguity.

The alignment step can be avoided altogether by assembling the reads—connecting the reads together to generate a longer sequence. Ideally these would be a full chromosome, but for a human-scale genome this tends to result in many small fragments instead. Doing this in a narrow region can be useful, however, to help improve alignment around complex mutations.

Guacamole’s insertion- and deletion-calling will be improved by reassembling reads in complex regions.

Multiple Sequencing Technologies

Complementary sequencing technologies can be used to remove alignment and sequencing error as well. While many aligners use paired-end read information to resolve alignments, additional information to resolve the proximity of reads can be useful. For example, Hi-C data can generate “short reads” that are close in physical proximity (even across chromosomes). This information can be used to order and arrange fragmented assemblies into full chromosomes.

Other sequencing devices can be used as well. PacBio SMRT sequencing is one of a few new technologies that generate much longer (~10k bases) reads. While PacBio data is more expensive to generate and misidentifies single bases more often, it can be used in tandem with short-read data to identify variants more accurately. Short-read data can be used to correct the more error-prone long reads, and longer reads can be used to connect the many fragments generated when assembling short reads. Additionally, PacBio reads provide a complementary base error profile.

Guacamole will soon support alternative sequencing technologies to improve variant calling.

Cell Variability

Another source of error we hope to address is cell variability. While this can be a problem in germline variant calling (due to mosaicism) it arises most often in somatic variant calling and cancer genomes. Tumors typically contain many cell populations (subclones) each with a differing set of mutations. When the cells are sequenced in bulk, it can be difficult to assess whether an apparent variant is a sequencing or alignment artifact or a true mutation present in only a small fraction of cells.

Guacamole will improve somatic variant calling by improving sensitivity to variants with low allele frequency.

Roadmap

To summarize, the next few months of development in Guacamole will be dedicated to the following:

Improved sequencing error correction
Local assembly for somatic variant calling
Supporting reads from multiple sequencing technologies
Robust handling of low allele frequency mutations

As always, you can follow our progress, ask questions, or raise issues on GitHub.