1) We explore the performance of four methods for isolating genes or expression profiles that are unique to immune cell types:
2) We also download 390 enriched cell type samples and do EDA to understand how much expression variance exists within and across cell types, especially when comparing genes that were or were not considered to be representative of a particular cell type.
3) We compare deconvolution performance in several synthetic mixtures, with and without noise.
We observe a lot of variance in the expression of marker genes for a particular cell type (CD8 T cells) across samples of that cell type from different datasets. The reference expression values chosen by Cibersort do not appear to be the means of these distributions. This could be due to mixed tumor vs. blood origins or varying tissues of origin.
For the genes selected for a given cell type: how much variance in expression is there within that cell type for these genes? Is it worth incorporating variance into our model?
I took many CD8 T cell samples from across papers and normalized them all (log base 2). Then I took the genes that Bindea had flagged as CD8 T cell-specific. I plotted a histogram of each gene's expression values and overlaid Cibersort LM22's chosen value for that gene. (notebook)
Note that there are sometimes multiple plots per gene, because each plot is actually for a particular probeset (there can be multiple probesets for a single gene). However, Cibersort's matrix has gene-level specificity, not probeset-level.
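As a rough sketch of this plotting step (all variable names and values here are illustrative placeholders, not the notebook's actual data):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder inputs -- in the real notebook these come from the downloaded
# CD8 T cell samples, Bindea's marker list, and the LM22 matrix.
rng = np.random.default_rng(0)
cd8_samples = pd.DataFrame(rng.normal(8, 2, size=(2, 40)),      # log2 expression
                           index=['probeset_1', 'probeset_2'])  # probesets x samples
probeset_to_gene = {'probeset_1': 'CD8A', 'probeset_2': 'CD8A'}
lm22_values = {'CD8A': 9.5}  # hypothetical LM22 reference value for this gene

# One histogram per probeset (so a gene with several probesets gets several
# plots), with LM22's single point estimate overlaid as a vertical line.
for probeset, row in cd8_samples.iterrows():
    gene = probeset_to_gene[probeset]
    fig, ax = plt.subplots()
    ax.hist(row, bins=20)
    ax.axvline(lm22_values[gene], color='red', linestyle='--', label='LM22 value')
    ax.set_title(f'{gene} ({probeset})')
    ax.set_xlabel('log2 expression')
    ax.legend()
plt.show()
```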
There is a lot of variance in the expression of genes of interest, and the point estimates that Cibersort uses do not seem to be the means of the expression distributions of all samples. Thus, better means could be found, and variance information might improve prediction.
Are we going to see that unselected genes are fairly uniform and meaningless? Are they all low-expression throughout? Are these methods selecting only very highly expressed genes? "Selected" could mean the genes Cibersort includes in its expression profiles, or the ones flagged by Bindea (probably the better choice).
When looking at CD8 T cell genes (as flagged by Bindea) in samples from other cell types, we see wider expression distributions that look more Gaussian. This may be due to the much larger sample size in these histograms. (notebook)
When looking at non-CD8 T cell genes in CD8 T cell samples, we see that many other genes also have high expression. Perhaps they are not unique enough, or not expressed highly enough, to stand out as outliers. (notebook)
An example of how the processing methods extract reference profile signal.
Here is a pairwise Pearson correlation matrix from the raw data of Abbas 2009's first experiment. Notice the scale -- poor absolute differentiation, though there seems to be signal hidden in the data.
Here is a similar matrix from the basis matrix Cibersort made out of this. Much better differentiation, in terms of the absolute scale.
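A minimal sketch of how such a matrix can be computed and drawn (the data here is a random placeholder; the real inputs are the Abbas raw data and the derived basis matrix):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Placeholder: log2 expression, rows = genes, columns = samples or cell types.
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.normal(8, 2, size=(5000, 12)),
                    columns=[f'sample_{i}' for i in range(12)])

# Pairwise Pearson correlation between columns; on raw data the off-diagonal
# values sit in a narrow band, which is the "poor absolute differentiation".
corr = expr.corr(method='pearson')
sns.heatmap(corr, vmin=corr.values.min(), vmax=1.0, square=True)
```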
Biological pathways of individual cell types and connections between similar cell types are recovered -- even when combining the outputs of different methods from different datasets. Some methods seem to be noisier than others. Marker gene lists are different but include the same pathways.
The marker gene lists contain mostly different genes but represent many of the same pathways. IRIS is noisier, though -- e.g., it includes many cell-division pathways. See section 2.1.1 of the writeup for the GO tables.
Pairwise Pearson correlation and hierarchical clustering on the basis matrices recover biological similarities between certain cell types. (There is one exception: gamma delta T cells -- however these have been flagged as problematic and may be ignored.) These patterns are generally preserved across methods and across datasets.
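A sketch of the clustering step, under the same placeholder assumptions as above (`sns.clustermap` handles both the correlation heatmap and the hierarchical dendrograms):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Placeholder basis matrix: rows = marker genes, columns = cell types.
rng = np.random.default_rng(0)
basis = pd.DataFrame(rng.lognormal(3, 1, size=(500, 22)),
                     columns=[f'cell_type_{i}' for i in range(22)])

# Correlate cell-type columns, then hierarchically cluster the correlation
# matrix; similar cell types should end up in the same subtree.
corr = basis.corr(method='pearson')
sns.clustermap(corr, method='average')
```

(The actual LM22 heatmap handled below was made in R; this is just a Python stand-in.)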
# convert the LM22 heatmap made in R to a PNG and rotate it
!convert -density 300 data_engineering/plots/lm22.pdf -resize 50% -rotate 90 data_engineering/plots/lm22.png
Overall, the condition number of the basis matrices is low, but there are some cell types that are very similar and hard to differentiate.
The condition number of the basis matrices is low: LM22's is 11.38 (1 is the minimum possible). When one cell type from each of the two most similar pairs is removed, the condition number drops to 9.30.
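A sketch of the computation with `numpy` (the matrix here is random placeholder data; LM22 itself is 547 genes x 22 cell types):

```python
import numpy as np

# Placeholder standing in for a basis matrix (genes x cell types).
rng = np.random.default_rng(0)
B = rng.lognormal(3, 1, size=(547, 22))

kappa = np.linalg.cond(B)  # 2-norm condition number; 1 is the minimum possible
print(f'condition number: {kappa:.2f}')

# Dropping one column from each of the two most similar pairs should lower it;
# here we just drop the last two columns as a stand-in.
print(f'after removing two columns: {np.linalg.cond(B[:, :-2]):.2f}')
```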
The most similar cell types are:
We expect there to be several pathological cases:
It will be interesting to see what the diagnostics tell us in both cases.
First, using Cibersort, I reproduced deconvolution of example mixtures that come with Cibersort. I got $R^2 > 0.99$. (notebook)
Next, I deconvolved weighted sums of two quite different columns: naive B cells and Tregs. I either added no noise, or added one of two types of noise:
Results (notebook):
For mixtures of reference profiles (rows are the three noise conditions; values are estimated minus true fractions):

| Noise condition | B cells naive | T cells regulatory (Tregs) |
|---|---|---|
| 0 | -0.001063 | -0.000304 |
| 1 | -0.001573 | -0.000498 |
| 2 | -0.013319 | -0.032256 |

For mixtures of raw cell lines:

| Noise condition | B cells naive | T cells regulatory (Tregs) |
|---|---|---|
| 0 | -0.106105 | -0.010682 |
| 1 | -0.107512 | -0.011141 |
| 2 | -0.089686 | -0.031552 |
With the reference-profile mixtures, we get pretty good deconvolution throughout, with p-values of 0. But with the raw-cell-line mixtures, naive B cells are consistently 10% off, yet Cibersort still reports a p-value of 0. $R^2$ is still high, so maybe this is not so bad.
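The mixtures themselves are simple weighted sums of two basis columns. Here is a minimal sketch of how such a mixture can be generated; the additive Gaussian noise model and all variable names are my illustrative assumptions, since the exact noise types aren't spelled out here:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mixture(profile_a, profile_b, frac_a, noise_sd=0.0):
    """Weighted sum of two expression profiles with optional additive noise.

    Gaussian noise is an assumption for illustration; the two actual noise
    types used in the notebook may differ.
    """
    mix = frac_a * profile_a + (1.0 - frac_a) * profile_b
    if noise_sd > 0:
        mix = mix + rng.normal(0.0, noise_sd, size=mix.shape)
    return mix

# Placeholder profiles standing in for the 'B cells naive' and 'Tregs' columns.
naive_b = rng.lognormal(3, 1, size=547)  # LM22 has 547 genes
tregs = rng.lognormal(3, 1, size=547)
mixture = make_mixture(naive_b, tregs, frac_a=0.3, noise_sd=0.5)
# The tables above report (estimated - true) fractions for each cell type.
```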
I deconvolved weighted sums of two very similar cell types: naive and memory B cells. Again, I used reference profiles and raw cell lines with three noise conditions, as above.
Results (notebook):
For mixtures of reference profiles (rows are the three noise conditions; values are estimated minus true fractions):

| Noise condition | B cells naive | B cells memory |
|---|---|---|
| 0 | 0.000715 | -0.001720 |
| 1 | 0.000361 | -0.001691 |
| 2 | -0.007141 | 0.000923 |

For mixtures of raw cell lines:

| Noise condition | B cells naive | B cells memory |
|---|---|---|
| 0 | -0.106118 | 0.035980 |
| 1 | -0.104301 | 0.036218 |
| 2 | -0.090453 | 0.005164 |
This particular naive B cell line is always underestimated by about 10%, which is odd.
Even when $p = 0$ and $R^2$ is high, we can get bad deconvolution -- errors of 10% or more.
I retrained Cibersort on some superclasses from the same input data: B cells, T cells, and myeloblasts. I created a new basis matrix exactly in the way that LM22 was created (which I had verified earlier).
The basis consisted of:
Then I used the new basis matrix with Cibersort to deconvolve the same mixtures as above:
Here is the plot of absolute differences:
I downloaded and processed 63 samples of purified T cells and B cells (see behavior_of_rnaseq_data).
The specific subtypes:
According to the paper that describes the voom transform, RNA-seq counts-per-million data can be modeled just like microarray data after a log transformation, with specific weights to control for heteroscedasticity. I have only applied the log transformation so far -- I have not examined the weights. Also, I used log2-TPM, not log2-CPM. Within-sample normalization may play a role, so I should re-examine this.
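So far the processing is just the log step; a minimal sketch, assuming a genes x samples TPM matrix and a pseudocount of 1 (the pseudocount choice is mine for illustration):

```python
import numpy as np
import pandas as pd

# Placeholder TPM matrix, rows = genes, columns = samples.
rng = np.random.default_rng(0)
tpm = pd.DataFrame(rng.lognormal(1, 2, size=(1000, 8)))

# Log transform with a pseudocount to avoid log2(0). Note this is log2-TPM;
# voom itself works on log2-CPM and additionally estimates precision weights.
log2_tpm = np.log2(tpm + 1)
```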
First, as a baseline, what do microarray lines from the same cell type look like when we take the absolute difference between the two? The difference is hard to notice:
This is from two raw B cell lines. On the left are all genes. On the right, I've filtered to the genes that Bindea called B cell marker genes.
Now, as another baseline, what do microarray lines from different cell types look like when compared? Now, the difference is noticeable (see the horizontal axis), especially when looking at B cell marker genes in particular. (We're comparing a neutrophil raw line to a B cell raw line.) There's noise in raw lines, so the filtering is important here.
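A sketch of the computation underlying these difference plots, with placeholder lines and a placeholder marker list:

```python
import numpy as np
import pandas as pd

# Placeholder raw lines (log2 expression Series indexed by gene) and a
# placeholder stand-in for Bindea's B cell marker gene list.
rng = np.random.default_rng(0)
genes = [f'gene_{i}' for i in range(5000)]
neutrophil_line = pd.Series(rng.normal(8, 2, size=5000), index=genes)
b_cell_line = pd.Series(rng.normal(8, 2, size=5000), index=genes)
bindea_b_cell_genes = genes[:20]

# Gene-wise absolute difference, over all genes and over markers only;
# the marker subset is where the cross-cell-type signal stands out.
abs_diff = (neutrophil_line - b_cell_line).abs()
print(abs_diff.describe())
print(abs_diff.loc[bindea_b_cell_genes].describe())
```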
Let's now look at the same comparisons for RNA-seq data. What do RNA-seq lines from the same cell type (B cells) look like?
Like with the analogous microarray analysis, even without filtering, you can easily tell that they're pretty similar.
When comparing RNA-seq lines from different cell types -- in this case, a Treg line vs a B cell line -- the story becomes more interesting. Without filtering, it's hard to tell that there's a difference. With filtering, you can tell by the scale:
We've demonstrated that you can tell apart RNA-seq lines from different cell types, but not lines from the same cell type. Phenotype distinctions exist and are readily visible if you filter to the right genes! The same is known to be true of microarray data, and we saw it above.
Now let's compare microarray lines to RNA-seq lines directly. When the lines are from the same cell type, there's some, but not all that much of a difference. Filtering helps slightly. Note that there is more noise in this comparison than before.
But when you compare a neutrophil microarray raw line to a B cell RNA-seq raw line, you see a difference. However, in these cross-platform comparisons it's much less clear whether the cell types are different or the same. The maximum mark on the scale is the only giveaway, and it changes by only ~1; in microarray vs. microarray or RNA-seq vs. RNA-seq comparisons, the difference was around 3. This is all on the log scale, so there is some signal there, but RNA-seq vs. microarray seems a little more unruly.
Yes, there's a constant offset, but they track quite well (see the correlation scores). On the right are B cell marker genes only; on the left are all genes -- there's some bizarre intersection there, but overall a similar constant absolute difference throughout.
How does the ratio between most highly and second most highly expressed genes compare between microarray and RNA-seq data?
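One way to compute that ratio per sample, sketched with placeholder data (linear-scale expression assumed, since a ratio of log values would mean something different):

```python
import numpy as np
import pandas as pd

# Placeholder linear-scale expression, rows = genes, columns = samples.
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.lognormal(3, 2, size=(1000, 6)),
                    columns=[f'sample_{i}' for i in range(6)])

# Ratio of the most highly to the second most highly expressed gene, per sample.
for sample in expr.columns:
    top2 = expr[sample].nlargest(2)
    print(sample, top2.iloc[0] / top2.iloc[1])
```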
We saw above that superclasses can be classified with low error. We can build a hierarchical classification model using a local binary classifier at each node in the hierarchy. There are several key questions:
For example: how should we handle inconsistent local predictions, such as {Class 1: false; Class 1.1: true}?
I made a proof of concept classifying naive B cells vs. memory B cells vs. Tregs raw lines -- purified raw lines only, for now. I compare a flat classifier to one that models the hierarchy root → {B cells, T cells}, B cells → {naive, memory}, T cells → {Tregs}.
I train on LM22 reference profiles: one binary classifier per class for whether a sample is of that type. For training, positive examples are the class and its descendants, and negative examples are everything except the class, its descendants, and its ancestors. Then I test on raw lines from all the expression data (not just from LM22). I compute the product of probabilities along the path to each leaf class and return the leaf with the highest product; see the sketch below.
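A minimal sketch of this scheme (logistic regression as the local classifier and random placeholder profiles are my assumptions; the hierarchy matches the proof of concept above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hierarchy: root -> {B cells, T cells}; B cells -> {naive, memory};
# T cells -> {Tregs}. Leaves are the final labels.
parents = {'B cells': 'root', 'T cells': 'root',
           'naive': 'B cells', 'memory': 'B cells', 'Tregs': 'T cells'}
leaves = ['naive', 'memory', 'Tregs']

def path_to_root(node):
    """The node and its ancestors, excluding the root."""
    out = []
    while node != 'root':
        out.append(node)
        node = parents[node]
    return out

def descendants_and_self(node):
    kids = [c for c, p in parents.items() if p == node]
    return {node} | {d for c in kids for d in descendants_and_self(c)}

# Placeholder training data: 20-dim profiles per leaf class, standing in
# for the LM22 reference profiles.
X = {leaf: rng.normal(i, 1.0, size=(30, 20)) for i, leaf in enumerate(leaves)}

# One binary classifier per node: positives = the class and its descendants;
# negatives = everything except the class, its descendants, and its ancestors.
classifiers = {}
for node in parents:
    pos_leaves = [l for l in leaves if l in descendants_and_self(node)]
    excluded = descendants_and_self(node) | set(path_to_root(node))
    neg_leaves = [l for l in leaves if l not in excluded]
    Xp = np.vstack([X[l] for l in pos_leaves])
    Xn = np.vstack([X[l] for l in neg_leaves])
    y = np.concatenate([np.ones(len(Xp)), np.zeros(len(Xn))])
    classifiers[node] = LogisticRegression().fit(np.vstack([Xp, Xn]), y)

def classify(sample):
    """Multiply local P(positive) along each leaf's path; pick the best leaf."""
    score = lambda leaf: np.prod([classifiers[n].predict_proba(sample[None, :])[0, 1]
                                  for n in path_to_root(leaf)])
    return max(leaves, key=score)

print(classify(rng.normal(0, 1.0, size=20)))  # favors 'naive' on this toy data
```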
Here are the confusion matrices of each classifier -- there is clearly some work to be done to improve the classification.