Graph visualizations of MHC alleles
01 Nov 2016 | by Leo Rozenberg
The MHC gene family is well known for its polymorphism. The IPD-IMGT/HLA database provides the official reference sequences for the alleles found in this region. In addition, the Anthony Nolan HLA informatics group provides a reference alignment of these alleles.
Here is a pared-down clip of the alignment for the genomic DNA of HLA-A (we insert ellipses to skip lines):
This format textually encodes, column by column, the differences between the reference sequence (A*01:01:01:01) and the other, alternate, allele sequences. Using special characters, the reader can tell whether the alternate nucleotide is unknown (‘*’), the same (‘-’), different (‘A’, ‘C’, etc.), or a gap (‘.’) relative to the known reference nucleotide in the same column.
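To make the encoding concrete, here is a minimal Python sketch that expands one alternate row against the reference row. It is only an illustration, not the parser used by our utility: the function name `resolve_allele` is hypothetical, and it assumes both rows have already been stripped of whitespace and boundary markers so that they line up column for column.

```python
def resolve_allele(reference: str, alternate: str) -> str:
    """Expand an alternate allele row into explicit symbols, column by column.

    Assumption: both rows are already aligned and have the same length.
    """
    if len(reference) != len(alternate):
        raise ValueError("rows must cover the same number of columns")
    resolved = []
    for ref_char, alt_char in zip(reference, alternate):
        if alt_char == "-":
            # '-' means "same as the reference in this column"
            resolved.append(ref_char)
        else:
            # '*' (unknown), '.' (gap) or an explicit nucleotide: keep as written
            resolved.append(alt_char)
    return "".join(resolved)


if __name__ == "__main__":
    # Toy rows, not taken from the real HLA-A alignment.
    print(resolve_allele("ACGT.A", "-T*-.G"))
```

Unknown and gap symbols are deliberately passed through unchanged, since later steps need to tell them apart from real nucleotides.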
While this format has its place, it isn’t useful for quickly visualizing the diversity of this region. We’ve created a small utility that makes these alignments somewhat easier to visualize.
Our first intuition for this task was to compress the redundant sequence information. Grouping shared sequence segments as nodes led us to think about the entire alignment as a graph, and from this starting premise some properties followed naturally:
- Alleles are represented by edges, such that one can trace the sequence of the allele by following edges.
- Alignment information is preserved by adding a numerical label of alignment position. Different nucleotides at the same position are represented by different nodes: “336A” vs “336G”.
- Gaps are encoded by edges that point to a node whose position is farther than the “end” of the source node (its start label plus the length of its sequence). In the example above, the “337GATGGAGCCG” node ends at position 347, since len(GATGGAGCCG) = 10. Therefore there is a 20-nucleotide gap (367 - 347) in all of the alleles except “A*68:18N”, as indicated by the edge to “367CGGGC”.
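The gap arithmetic in the last point can be checked with a short Python sketch. The helper names (`parse_node`, `node_end`, `gap_length`) are hypothetical, and the sketch assumes a node label is simply a non-negative start position followed by the node's sequence, as in the labels quoted above.

```python
import re
from typing import Tuple


def parse_node(label: str) -> Tuple[int, str]:
    """Split a node label such as '337GATGGAGCCG' into (start, sequence)."""
    match = re.fullmatch(r"(\d+)([A-Za-z]+)", label)
    if match is None:
        raise ValueError(f"unrecognized node label: {label!r}")
    return int(match.group(1)), match.group(2)


def node_end(label: str) -> int:
    """Position just past the node: start label plus sequence length."""
    start, sequence = parse_node(label)
    return start + len(sequence)


def gap_length(source: str, target: str) -> int:
    """Number of alignment positions skipped by an edge source -> target."""
    target_start, _ = parse_node(target)
    return target_start - node_end(source)


if __name__ == "__main__":
    print(node_end("337GATGGAGCCG"))                # 347
    print(gap_length("337GATGGAGCCG", "367CGGGC"))  # 20
```

An edge whose `gap_length` is zero simply continues the sequence; any positive value is a gap of that many nucleotides.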
Under the hood, mhc2gpdf first parses the alignment file, then constructs the graph representation by creating the desired nodes and edges, and finally writes a dot file and calls Graphviz to render it to PDF.
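The last step of that pipeline might look roughly like the following Python sketch. This is not the mhc2gpdf source: `to_dot`, `render_pdf`, the graph name and the left-to-right layout are all assumptions made for illustration. The only external piece is Graphviz's `dot` command with its `-Tpdf` and `-o` options, and the sketch skips rendering if `dot` is not installed.

```python
import shutil
import subprocess
from typing import Iterable, Tuple

# (source node label, target node label, allele label on the edge)
Edge = Tuple[str, str, str]


def to_dot(edges: Iterable[Edge]) -> str:
    """Serialize labelled edges as a Graphviz digraph in the DOT language."""
    lines = ["digraph alignment {", "  rankdir=LR;"]  # layout choice is an assumption
    for source, target, alleles in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{alleles}"];')
    lines.append("}")
    return "\n".join(lines) + "\n"


def render_pdf(dot_text: str, pdf_path: str) -> bool:
    """Render DOT text to a PDF with Graphviz; return False if `dot` is missing."""
    if shutil.which("dot") is None:
        return False
    # `dot` reads the graph from stdin when no input file is given.
    subprocess.run(["dot", "-Tpdf", "-o", pdf_path],
                   input=dot_text, text=True, check=True)
    return True


if __name__ == "__main__":
    # Illustrative edge built from the node labels quoted in this post.
    dot_text = to_dot([("337GATGGAGCCG", "367CGGGC", "C. of A*68:18N")])
    print(dot_text)
    render_pdf(dot_text, "example.pdf")
```

Keeping the DOT serialization separate from the rendering call makes it easy to inspect the intermediate dot file when a graph looks wrong.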
With these PDFs we can quickly determine allelic differences:
Appreciate the dense polymorphism of these genes:
And see long gaps in the Class II DRB nucleic sequences:
Reading the graph
We’ve made some stylistic choices to make the graph easier to understand:
- We run-length encode the leaf nodes of the tree that represents the allele edges: instead of “A*01:01:02,A*01:01:04,A*01:01:05…A*01:01:37”, we’ll write “A*01:01:02,04-37”.
- If an edge represents more than half of the allele set, we’ll describe its complement and prefix the string with “C. of”, short for “Complement of”.
- We add “Start” nodes, prefixed with an “S”, to indicate where the alignment has sequence data for each allele. Their labels are encoded in the same way as the edge labels.
- We add “End” nodes for the same reason. These just have the final position.
- We add “Boundary” nodes for the exon boundaries, which are represented by “|” in the alignment file.
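The first two labelling choices in the list above can be approximated with a short Python sketch. It is a simplification of what the tool does: rather than working on a tree of allele names, it assumes every allele in a set shares the same prefix up to its last “:” and that the final field is purely numeric. The function names are hypothetical.

```python
from typing import List, Sequence


def compress_fields(fields: Sequence[str]) -> str:
    """Collapse consecutive numeric fields, e.g. 02,04,05,06 -> '02,04-06'."""
    numbers = sorted(set(fields), key=int)
    runs: List[str] = []
    start = prev = numbers[0]
    for current in numbers[1:]:
        if int(current) == int(prev) + 1:
            prev = current  # extend the current run
            continue
        runs.append(start if start == prev else f"{start}-{prev}")
        start = prev = current
    runs.append(start if start == prev else f"{start}-{prev}")
    return ",".join(runs)


def compress_alleles(alleles: Sequence[str]) -> str:
    """Compress names that differ only in their last ':'-separated field."""
    prefixes = {name.rsplit(":", 1)[0] for name in alleles}
    if len(prefixes) != 1:
        # Also raised for an empty set; the sketch does not handle that case.
        raise ValueError("sketch only handles one shared, non-empty prefix")
    fields = [name.rsplit(":", 1)[1] for name in alleles]
    return f"{prefixes.pop()}:{compress_fields(fields)}"


def edge_label(edge_alleles: Sequence[str], all_alleles: Sequence[str]) -> str:
    """Describe the complement when the edge carries more than half the set."""
    edge_set = set(edge_alleles)
    if len(edge_set) > len(all_alleles) / 2:
        complement = [name for name in all_alleles if name not in edge_set]
        return "C. of " + compress_alleles(complement)
    return compress_alleles(edge_alleles)


if __name__ == "__main__":
    names = ["A*01:01:02"] + [f"A*01:01:{i:02d}" for i in range(4, 38)]
    print(compress_alleles(names))  # A*01:01:02,04-37
```

Describing the complement keeps labels short on the edges that most alleles share, which are the ones that would otherwise carry the longest lists.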
We hope that the genetics community finds this tool useful and we look forward to your feedback on GitHub.