- Open Access
Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts
Genome Biologyvolume 17, Article number: 112 (2016)
Current approaches to single-cell transcriptomic analysis are computationally intensive and require assay-specific modeling, which limits their scope and generality. We propose a novel method that compares and clusters cells based on their transcript-compatibility read counts rather than on the transcript or gene quantifications used in standard analysis pipelines. In the reanalysis of two landmark yet disparate single-cell RNA-seq datasets, we show that our method is up to two orders of magnitude faster than previous approaches, provides accurate and in some cases improved results, and is directly applicable to data from a wide variety of assays.
Single-cell RNA-seq (scRNA-seq) has proved to be a powerful tool for probing cell states [1–5], defining cell types [6–9], and describing cell lineages [10–13]. These applications of scRNA-seq all rely on two computational steps: quantification of gene or transcript abundances in each cell and clustering of the data in the resulting abundance × cell expression matrix [14, 15]. There are a number of challenges in both of these steps that are specific to scRNA-seq analysis. While methods for transcript/gene abundance estimation from bulk RNA-seq have been extensively tested and benchmarked , the wide variety of assay types in scRNA-seq [17–25] have required a plethora of customized solutions [2, 6, 7, 9, 11–13, 24, 26–37] that are difficult to compare to each other. Furthermore, the quantification methods used all rely on read alignment to transcriptomes or genomes, a time-consuming step that will not scale well with the increasing numbers of reads predicted for scRNA-seq [15, 38]. Clustering based on scRNA-seq expression matrices can also require domain-specific information, e.g., temporal information  or functional constraints , so that in some cases hand curation of clusters is performed after unsupervised clustering .
In  a method of collapsing bulk read alignments into “equivalence classes” of reads was introduced for the purpose of estimating alternative splicing isoform frequencies from bulk RNA-seq data. Each equivalence class consists of all the reads that are compatible with the same set of transcripts (See Fig. 1 for an example). The collapsing of reads into equivalence classes was initially introduced to allow for significant speedup of the E-step in the expectation-maximization (EM) algorithm used in some RNA-seq quantification programs [40, 41], as the read counts in the equivalence classes, or transcript-compatibility counts (TCCs), correspond to the sufficient statistics for a standard RNA-seq model . In other words, the use of transcript-compatibility counts was an intermediate computation step towards quantifying transcript abundances. In this paper we instead consider the direct use of such counts for the comparison and clustering of scRNA-seq cells. Figure 2 shows an outline of a method we have developed for clustering and analyzing scRNA-seq data; the key idea is to base clustering not on the quantification of transcripts or genes but on the transcript-compatibility counts for each cell. We note that equivalence classes have also been used in [43, 44] to define similarity scores between de novo assembled transcripts.
To better understand the relevance of transcript-compatibility counts, consider their relationship to the “gene-level” counts used in many RNA-seq analyses. In the same way that “genes” represent groupings of transcripts , equivalence classes as introduced by  are also groups of transcripts. However, while the former is a biologically motivated construction, the latter is technical, consisting of groupings that capture the extent of ambiguous multiple mappings among reads. The lack of direct biological interpretation of equivalence classes makes transcript-compatibility counts less intuitive. However, as we will show, there are two significant advantages to working with them: (1) unlike transcript or gene-level quantifications, transcript-compatibility counts can be computed without a read-generating model, and hence a single clustering pipeline based on transcript-compatibility counts can be used across a wide range of scRNA-seq assays; (2) transcript-compatibility counts can be computed by pseudoalignment, a process that does not require read alignment and can be done extremely efficiently .
To demonstrate both the general applicability of our method as well as its accuracy, we reanalyzed data from the topics of two recently published scRNA-seq papers: the pseudotemporal ordering of primary human myoblasts  and the cell classification in the mouse cortex and hippocampus . We show that not only are we able to recapitulate the analyses of the papers two orders of magnitude faster than previously possible, but we also provide a refinement of the published results, suggesting that our approach is both fast and accurate. The speedup of our method makes single-cell RNA-seq analysis interactive for the first time: sensitivity of results to parameters and annotations can be easily explored and analyses can be easily reproduced by individuals without access to significant computing resources. Furthermore, the efficiency of our methods will take on increasing significance as single-cell RNA sequencing scales to experiments with hundreds of thousands of cells and improved technologies make the acquisition of single-cell data easier and faster (see, for example, ). We also illustrate the advantages of the broad applicability of our approach via its suitability to a multitude of assays. Existing pipelines must be tailored to specific assays, making it difficult to perform meta-analyses and to compare results across experiments.
Transcript-compatibility counts from pseudoalignments
To demonstrate the effectiveness of transcript-compatibility counts for scRNA-seq analysis, we first examined how efficiently they can be computed. While transcript-compatibility counts can be extracted from read alignments (e.g., in SAM/BAM format), they do not require the full information contained in alignments. Instead, we examined the speedup possible with pseudoalignment , which obtains for each read the set of transcripts it is compatible with and therefore can be directly used to obtain transcript-compatibility counts.
Figure 3 shows the speed of obtaining transcript-compatibility counts via pseudoalignment in comparison to the time required to quantify RNA-seq data with other approaches. The key result relevant for single-cell analysis is the scalability of pseudoalignment for obtaining transcript-compatability counts (Fig. 3 and Additional file 1: Figure S1). The fixed extra cost for aligning (rather than pseudoaligning) reads for each cell is small, but when extrapolated to hundreds of thousands of cells it becomes a significant (computational) cost.
Pseudotime for differentiating human myoblasts
The recently published Monocle software  that builds on the Cufflinks program  is rapidly becoming a standard tool for scRNA-seq analysis. We therefore sought to compare our approach to Monocle, and in order to do so began with a reanalysis of the data in . Figure 4 shows the temporal ordering of differentiating primary human myoblasts using transcript-compatibility counts clustering based on the Jensen-Shannon metric and the affinity propagation algorithm (see Methods). We note that unlike Cufflinks, which consists of an explicit model of RNA-seq suitable for the data in  but not necessarily for other assays, our transcript-compatibility counts make no assumption about the nature of the data. Furthermore, while the reanalysis appears to match that of , affinity propagation with different parameters provided a more refined clustering, possibly capturing seven stages of myoblast differentiation (see also Additional file 1: Figure S2).
A central idea in the pseudo-temporal ordering of cells relies upon the construction of a minimum spanning tree (MST) over the pairwise distances of their corresponding gene expression vectors . This attempts to capture the trajectory of a hypothetical cell that gradually “moves” through different cellular states or differentiation stages in a high-dimensional gene expression space. Our results show that the same concept can be applied to transcript-compatibility counts. A key step in Monocle is to first reduce the dimensionality of the data by independent component analysis (ICA) and then compute the MST based on Euclidean distances on the plane. Here we take a different approach (see Fig. 4 a–d) and compute the MST on “cluster centers” in high dimensions (see Methods). Both approaches aim to battle the biological and technical noise that is inevitably introduced in scRNA-seq experiments. Even though we could have used Monocle directly on transcript-compatibility counts, the design and comparison of specialized tools is beyond the scope of this paper.
Figure 4 d validates the three primary clusters and the pseudo-temporal ordering obtained by our method based on three key myoblast differentiation markers, MYOG, CDK1, and PDGFRA (see Additional file 1: Figure S3 for an additional set of genes taken from ). Interestingly, the expression of these genes gradually evolves over the pseudo-temporally ordered clusters, capturing both the underlying differentiation trajectory of proliferating cells to myoblasts and the corresponding branching towards mesenchymal cells, as was observed in .
Finally, we should point out that although the three primary clusters of  are evident in our results, they are not identical. This naturally raises the question of whether clustering on (high-dimensional) transcript-compatibility counts could possibly lead to cell misclassification. Our results show that this is not the case. In Fig. 5 we investigated one cell that seemed to have been severely misclassified by our method as a differentiating myoblast while it was identified as a proliferating cell by Monocle. However, an analysis of the expression levels of 12 marker genes obtained from  shows that this cell displays more similarity to differentiating myoblasts than proliferating cells. Overall our results seem to suggest that transcript-compatibility counts, being directly obtained from sequenced reads, might constitute a less noisy representation of the “transcriptomic state” of a cell compared to the one obtained by quantifying its gene expression.
Cell classification in the mouse cortex and hippocampus
The reanalysis of  shows that clustering of transcript-compatibility counts can be useful on a single dataset, but we believe that the true power of our approach lies in its broad applicability to multiple single-cell assays. In contrast to the standard quantification pipeline, obtaining transcript-compatibility counts does not require a read-generating model; our method can be directly applied to a wide range of scRNA-seq datasets, and transcript-compatibility counts can be used to analyze sequenced reads without any assay-specific information. To make this point, we reanalyzed a recent large scRNA-seq experiment published earlier this year  that uses an assay based on unique molecular identifiers (UMIs). In contrast to , where paired-end reads were sampled from fragments covering the entire length of the transcripts,  used single-end reads that were only obtained from the 3’-end of the transcripts.
Zeisel et al.  examined a very diverse population of 3005 cells obtained from the cortical and hippocampal regions of the mouse brain. In order to analyze this complex dataset, the authors developed a state-of-the-art hierarchical biclustering method called BackSPIN (based on SPIN ) and were able to identify 47 distinct subpopulations of cells within nine major brain cell types. This fine-grained analysis also revealed a previously unknown post-mitotic oligodendrocyte subclass, referred to as Oligo1 in .
Figure 6 shows the clusters obtained by applying our method to the above dataset and compares our method’s clustering accuracy to various quantification-based methods. In order to systematically assess the clustering accuracy, we iteratively subsampled cells from two different cell types at random and evaluated the ability of each method to distinguish between these types. Since the development of specialized clustering algorithms is orthogonal to our paper, we compared based on the same clustering algorithm throughout (see Methods). Our results indicate that transcript-compatibility counts can be more accurate than standard model-based RNA-seq quantification tools (such as eXpress) that try to estimate the underlying read-generating model from the data. Our transcript-compatibility counts based method is in fact able to achieve similar accuracy with the assay-specific quantification approach used in  (which explicitly takes into account the significant 3’-end bias in this dataset). Clustering transcript abundance quantifications output by kallisto results in lower accuracy due to the mismatch between kallisto’s read-generating model and this dataset, further emphasizing the importance of using transcript-compatibility counts, which are computed without using any such model.
Quite remarkably, our method (via affinity propagation on all cells) was further able to recover the Oligo1 cluster of cells, showing that transcript-compatibility counts can indeed capture distinct cell signatures without actually quantifying their gene expression (Fig. 6, Methods). Overall, in our experiments we observed that unsupervised clustering of transcript-compatibility counts typically yielded more than 47 clusters, which was also the case in . Some of our clusters were very small, probably capturing outlier cells, while others seemed to be further splitting the 47 cell subtypes identified in .
To further investigate this, we focused on another oligodendrocyte subpopulation, referred to as Oligo3 in . As reported in , Oligo3 cells were almost exclusively observed in the somatosensory cortex and were identified by the authors as being in an intermediate stage of maturation — in between premyelinating and myelinating oligodendrocytes. Even though the Oligo3 cells appear to be well clustered together, as visualized by t-distributed stochastic neighbor embedding (t-SNE) (Fig. 7 a), affinity propagation on transcript-compatibility counts with various parameters consistently separated them into two subclusters. Our results in Fig. 7 b seem to suggest that a subpopulation of Oligo3 cells (captured by one of our subclusters) expresses an unusual signature of endothelial/vascular genes on top of the expected myelin-related genes. Interestingly, similar findings have been reported recently in , suggesting a possible (experimental) contamination of several oligodendrocyte cells in the dataset at hand.
In this paper we introduced a novel method that uses transcript-compatibility counts — instead of gene expression profiles — as distinct cell signatures for clustering single-cell data. Note, however, that the main focus of our method is not about how to cluster (i.e., the particular choice of clustering algorithms) but rather what to cluster on. To emphasize this point, we considered simple, “off-the-shelf” clustering methods that directly use the corresponding TCCs as their input. Interestingly, while these methods may not be able to recover accurate clusters when applied to gene expression vectors (see Additional file 1: Figure S4 or (, Figure S3), for example), our results showed that TCCs maintain all the necessary information to recover the analyses of  and .
Even though clustering alone can reveal important information about a single-cell RNA-seq experiment, further biological interpretation of the results (marker gene identification or differential expression) requires some form of quantification of expression profiles within and in between clusters. So it is natural for one to think that eventually the quantification bottleneck will still manifest itself in single-cell analysis. A key observation, however, is that given an accurate clustering of the cells, each and every individual cell’s gene expression profile is no longer needed; one can extract an accurate statistical representation of the gene expression within each cluster — without having to quantify all cells separately. In particular, one can quantify the aggregate gene expression in each cluster (cluster centers) by pooling single-cell TCCs together and further estimate the corresponding gene variability by subsampling and quantifying only a few cells per cluster. For example, in Additional file 1: Figure S5 we used kallisto to quantify subsampled cells and the corresponding cluster centers (after clustering on TCCs) for Trapnell et al.’s dataset and generated results that are very similar to the ones obtained in Fig. 4 (where the corresponding gene expression profiles were obtained from ). Our method can therefore be used to effectively reverse the quantification and clustering steps in the conventional pipeline and potentially provide further end-to-end processing gains, depending on the needs/goals of each scRNA-seq experiment. Overall, we believe that clustering before quantifying is a promising future direction for scRNA-seq analysis which may lead to more robust and accurate quantification algorithms.
The extraordinary developments in single-cell RNA-seq technology over the past few years have demonstrated that “single-cell resolution” is not just a gimmick but an unprecedented tool for probing transcriptomes that can reveal the inner workings of developmental programs and their resulting tissues. However, the computational challenges of scRNA-seq analysis, already very high due to the large number of cells to analyze, have been further exacerbated by the smorgasbord of assays, each of which introduces unique technical challenges.
The new method we have proposed and evaluated in this paper, namely analysis of scRNA-seq based on transcript-compatibility counts, offers an efficient, accurate, and broadly applicable solution for extracting information from scRNA-seq experiments. In the same way that single-cell analysis can be viewed as the ultimate resolution for transcriptomics, transcript-compatibility counts are the most direct way to “count” reads. While we have focused on clustering of cells in this paper, we believe that transcript-compatibility counts may have applications in many other sequencing-based assays, and that further development of methods based on such counts offers a fruitful avenue of exploration.
The ability to obtain transcript-compatibility counts by pseudoalignment is a benefit that has its own implications and applications. For example, the speed of pseudoalignment facilitated quick experimentation with our method, and in assessing our accuracy on different datasets, one discovery was that much less sampling than is currently performed is necessary to cluster cells. In the reanalysis of , we found that the main results, namely the clustering of cells and identification of cell types, were achievable with only 1 % of the data (see Fig. 8 and Additional file 1: Figure S6). This observation has significant implications for scRNA-seq as it suggests that, for clustering of cells, low-coverage sequencing may be sufficient, thus allowing for larger experiments with more cells. Moreover, this low-coverage clustering performance can be achieved using our method, which is not tailored to the specific scRNA-seq assay.
The code used to generate the results presented in this paper is available online on GitHub . The Mus musculus transcriptome assembly used was GRCm38. The Homo sapiens transcriptome assembly used was GRCh38. The reference genome used for HISAT was build 10 of the mouse genome (mm10) from the UCSC genome browser. The genome annotation used in the analysis is the release 83 of Ensembl.
Computation of transcript-compatibility counts
In our implementation of the method, we use kallisto to compute transcript-compatibility counts via pseudoalignment (avoiding the quantification step that is usually performed when running kallisto altogether). In particular, we utilized the “pseudo” option of the kallisto RNA-seq program, which computes equivalence classes of reads after pseudoalignment. We used kallisto version 0.42.3 with k set to kallisto’s default value of 31. Even though kallisto pseudoalignment is a natural approach to obtain transcript-compatibility counts, one can in principle extract the same information from exact read alignments. To explore this alternative, we used HISAT (with the no-spliced-alignment option enabled) to align reads on the mouse transcriptome (GRCm38) in the case of Zeisel’s dataset and the human transcriptome (GRCh38) in the case of Trapnell’s dataset. Then, we generated the corresponding “alignment-based TCCs” by directly counting the number of multi-mapped reads aligned to each set of transcripts, and evaluated their performance in Additional file 1: Figure S7.
Transcript-compatibility counts based on UMI information
The dataset of  has reads with unique molecular identifiers (UMIs). UMIs are typically used in scRNA-seq to correct for PCR bias; biological copies of a transcript (distinct molecules) can be identified based on their UMIs. This information can be utilized in generating the transcript-compatibility counts from equivalence classes. Instead of counting all the reads in each equivalence class, we only count the reads with distinct UMIs. Transcript-compatibility counts with UMIs are shown in Figs. 6 b and 8 a (represented as “TCC with UMI” in the figures).
On obtaining the transcript-compatibility counts for each cell, we normalize by the total number of mapped reads to obtain a probability distribution called the transcript-compatibility count distribution or TCC distribution. We then compute the square root of the Jensen-Shannon divergence  between the TCC distributions for each pair of cells. As a distance metric which satisfies the triangle  inequality, the square root of the Jensen-Shannon divergence is a natural choice for computing pairwise distances between two probability distributions. However, the results obtained here are not contingent on using the square root of Jensen-Shannon divergences as the measure of distances, and quite similar results are obtained when we use other distances between probability distributions, such as the ℓ 1 distance to compute the pairwise distance matrix (see Additional file 1: Figure S6). The ℓ 1 distance (which is just twice the total-variation distance) in fact seems to perform better than the Jensen-Shannon distance for low coverage (Additional file 1: Figure S6b). In contrast, Euclidean distance (ℓ 2 distance) seems to perform much worse (see Additional file 1: Figure S6). The fact that Euclidean distance is not a good distance metric to measure distances between probability distributions is widely documented (see, for instance, ).
All clustering carried out in this paper was done using off-the-shelf clustering methods.
We used spectral clustering using the pairwise distance matrices when we knew the number of clusters in the data. This includes Figs. 6 b, 8 a, and Additional file 1: Figure S6b with two clusters for the pairwise distance matrix (from TCC distributions) obtained for the data from .
The clustering method used when the number of clusters is not known is affinity propagation . This is an unsupervised clustering algorithm based on message passing, which needs a pairwise similarity matrix as input. The pairwise similarity matrix is computed as the negative of the pairwise distance matrix that was computed.
To evaluate the clustering accuracy of our method in Fig. 6 b, we performed binary classification tests using the labels reported in  as the ground truth. In particular, we randomly subsampled two different types of cells and evaluated the ability of each pipeline to separate them into two clusters via spectral clustering. We performed these binary classification tests between (1) the subclasses Oligo1 (45 cells) and Oligo4 (106 cells), (2) the cell types Astrocytes (198 cells) and Interneurons (290 cells), and (3) the more general cell types neurons (1628 cells) and non-neurons (1377 cells). The error rates for each test were obtained by randomly sampling 22, 99, and 200 cells from each of the two labels respectively, averaged over 10 Monte Carlo iterations.
For clustering the dataset of , we used affinity propagation with the default parameters which set the preference value equal to the median of the similarity scores and the damping parameter equal to 0.5. On doing this, we obtained 89 clusters. Of the 89 clusters obtained, cluster number 22 had the largest match with the set of cells the authors labeled as Oligo1 (which was the new type of cells discovered in ). A total of 24 out of the 28 cells in the cluster were labeled Oligo1 by . There were a total of 45 cells labeled Oligo1 in  out of the total of 3005 considered. This is investigated in Fig. 6 c.
Also, affinity propagation with different parameters seems to split the class labeled Oligo3 in  into two classes. This is investigated in Fig. 7, where the two classes considered were classes obtained with parameters set as before.
For clustering the dataset of , we used affinity propagation with preference parameter set to 1.3 and damping parameter set to 0.95 to obtain three clusters in Fig. 4. To obtain eight clusters on the dataset of , we used affinity propagation with preference parameter set to 0.6 and damping parameter set to 0.95 to obtain eight clusters. Then after collapsing any cluster with less than five cells into the cluster closest to it, we obtain the seven clusters investigated in Fig. 5. More details regarding our parameter choices for affinity propagation in this dataset are provided in Additional file 1: Figure S8.
Partial order on clusters
On the  dataset, for the seven clusters obtained, we first find the centroid TCC distribution of each cluster as the mean TCC distribution of all cells in the cluster. Then, we compute the pairwise Jensen-Shannon distances between the centroid TCC distributions (cluster centers). We then run a minimum weight spanning tree on the complete graph between the cluster centers with weights given by the computed pairwise distances. This gives us a partial order on the clusters, which is investigated in Fig. 4 and Additional file 1: Figure S2.
In this paper, we used kallisto and eXpress as representative methods for model-based quantification as they demonstrate similar accuracy to other available quantification tools . In particular, for Zeisel et al.’s dataset (Figs. 6 b and 8 a) we used these tools as a “negative control” to demonstrate the importance of the read-generating model when using quantified transcript abundances to obtain gene expression profiles and cluster single-cell data. Although the significant mismatch between the assumed model (i.e., full transcript length coverage) and the 3’-end bias in this dataset is expected to affect the quantification accuracy for each individual cell, our results show that such a mismatch further impacts the accuracy of clustering and cell-type classification. To evaluate the cell-type classification performance of kallisto and eXpress, we took the minimum of the error rates obtained with bias modeling turned on and off.
For the Trapnell et al. dataset, we used kallisto to quantify transcript abundances and obtain the gene expression profiles within the clusters obtained from TCCs. Note that the read-generating model in this dataset is similar to the standard RNA-seq model that kallisto uses for quantification. More specifically, in Additional file 1: Figure S5a we quantified the corresponding cluster centers by running kallisto’s EM algorithm on the pooled TCCs of each cluster. Using kallisto in this setting resembles bulk RNA-seq quantification applied to the pooled reads coming from each individual cluster (instead of the entire population of cells). In Additional file 1: Figure S5b we further quantified randomly subsampled cells (20 cells per cluster) to obtain an accurate estimate of the gene expression variability within each cluster.
Visualization of cells and clusters
Availability of data and materials
The code used to generate the results presented in this paper is available online on GitHub . All sequencing reads for the Zeisel et al. dataset  are available through Gene Expression Omnibus [GEO:GSE60361] and for the Trapnell et al. dataset  through [GEO:GSE52529]. The method is publically available on GitHub (https://github.com/govinda-kamath/clustering_on_transcript_compatibility_counts) under the MIT license.
No ethics approval was required for this study.
Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, Louis DN, Rozenblatt-Rosen O, Suvà ML, Regev A, Bernstein BE. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014; 344(6190):1396–1401. doi:10.1126/science.1254257.
Pollen AA, Nowakowski TJ, Chen J, Retallack H, Sandoval-Espinosa C, Nicholas CR, Shuga J, Liu SJ, Oldham MC, Diaz A, Lim DA, Leyrat AA, West JA, Kriegstein AR. Molecular identity of human outer radial glia during cortical development. Cell; 163(1):55–67. doi:10.1016/j.cell.2015.09.004.
Gaublomme JT, Yosef N, Lee Y, Gertner RS, Yang LV, Wu C, Pandolfi PP, Mak T, Satija R, Shalek AK, Kuchroo VK, Park H, Regev A. Single-cell genomics unveils critical regulators of Th17 cell pathogenicity. Cell; 163(6):1400–12. doi:10.1016/j.cell.2015.11.009.
Kowalczyk MS, Tirosh I, Heckl D, Rao TN, Dixit A, Haas BJ, Schneider RK, Wagers AJ, Ebert BL, Regev A. Single-cell RNA-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome Res. 2015; 25(12):1860–1872. doi:10.1101/gr.192237.115. http://genome.cshlp.org/content/25/12/1860.full.pdf+html.
Lande-Diner L, Stewart-Ornstein J, Weitz CJ, Lahav G. Single-cell analysis of circadian dynamics in tissue explants. Mol Biol Cell. 2015; 26(22):3940–945. doi:10.1091/mbc.E15-06-0403.
Usoskin D, Furlan A, Islam S, Abdo H, Lonnerberg P, Lou D, Hjerling-Leffler J, Haeggstrom J, Kharchenko O, Kharchenko PV, Linnarsson S, Ernfors P. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat Neurosci. 2015; 18(1):145–53. doi:10.1038/nn.3881.
Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Jurèus A, Marques S, Munguba H, He L, Betsholtz C, Rolny C, Castelo-Branco G, Hjerling-Leffler J, Linnarsson S. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015; 347(6226):1138–42. doi:10.1126/science.aaa1934. http://www.sciencemag.org/content/347/6226/1138.full.pdf.
Burns JC, Kelly MC, Hoa M, Morell RJ, Kelley MW. Single-cell RNA-Seq resolves cellular complexity in sensory organs from the neonatal inner ear. Nat Commun. 2015; 6. doi:10.1038/ncomms9557.
Grün D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H, van Oudenaarden A. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015; 525(7568):251–5.
Kafri R, Levy J, Ginzberg MB, Oh S, Lahav G, Kirschner MW. Dynamics extracted from fixed cells reveal feedback linking cell growth to cell cycle. Nature. 2013; 494(7438):480–3. doi:10.1038/nature11897.
Bendall SC, Davis KL, Amir E-aD, Tadmor MD, Simonds EF, Chen TJ, Shenfeld DK, Nolan GP, Pe’er D. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell. 2014; 157(3):714–25. doi:10.1016/j.cell.2014.04.005.
Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotech. 2014; 32(4):381–6. doi:10.1038/nbt.2859.
Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotech. 2015; 33(2):155–60. doi:10.1038/nbt.3102.
Shapiro E, Biezuner T, Linnarsson S. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat Rev Genet. 2013; 14(9):618–30. doi:10.1038/nrg3542.
Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015; 16(3):133–45. doi:10.1038/nrg3833.
Oshlack A, Robinson M, Young M. From RNA-seq reads to differential expression results. Genome Biol. 2010; 11(12):220. doi:10.1186/gb-2010-11-12-220.
Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch BB, Siddiqui A, Lao K, Surani MA. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Meth. 2009; 6(5):377–82. doi:10.1038/nmeth.1315.
Islam S, Kjällquist U, Moliner A, Zajac P, Fan JB, Lönnerberg P, Linnarsson S. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res. 2011; 21(7):1160–1167. doi:10.1101/gr.110882.110.
Ramsköld D, Luo S, Wang YC, Li R, Deng Q, Faridani OR, Daniels GA, Khrebtukova I, Loring JF, Laurent LC, Schroth GP, Sandberg R. Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells. Nat Biotech. 2012; 30(8):777–82. doi:10.1038/nbt.2282.
Hashimshony T, Wagner F, Sher N, Yanai I. CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification. Cell Rep. 2012; 2(3):666–73. doi:10.1016/j.celrep.2012.08.003.
Picelli S, Bjorklund AK, Faridani OR, Sagasser S, Winberg G, Sandberg R. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat Meth. 2013; 10(11):1096–1098. doi:10.1038/nmeth.2639.
Sasagawa Y, Nikaido I, Hayashi T, Danno H, Uno K, Imai T, Ueda H. Quartz-Seq: a highly reproducible and sensitive single-cell RNA sequencing method, reveals non-genetic gene-expression heterogeneity. Genome Biol. 2013; 14(4):31. doi:10.1186/gb-2013-14-4-r31.
Jaitin DA, Kenigsberg E, Keren-Shaul H, Elefant N, Paul F, Zaretsky I, Mildner A, Cohen N, Jung S, Tanay A, Amit I. Massively parallel single-cell RNA-Seq for marker-free decomposition of tissues into cell types. Science. 2014; 343(6172):776–9. doi:10.1126/science.1247651.
Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015; 161(5):1202–1214. doi:10.1016/j.cell.2015.05.002.
Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell; 161(5):1187–201. doi:10.1016/j.cell.2015.04.044.
Amir E-aD, Davis KL, Tadmor MD, Simonds EF, Levine JH, Bendall SC, Shenfeld DK, Krishnaswamy S, Nolan GP, Pe’er D. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat Biotech. 2013; 31(6):545–52. doi:10.1038/nbt.2594.
Mahfouz A, van de Giessen M, van der Maaten L, Huisman S, Reinders M, Hawrylycz MJ, Lelieveldt BPF. Visualizing the spatial gene expression organization in the brain through non-linear similarity embeddings. Methods. 2015; 73:79–89. doi:10.1016/j.ymeth.2014.10.004.
Shekhar K, Brodin P, Davis MM, Chakraborty AK. Automatic Classification of Cellular Expression by Nonlinear Stochastic Embedding (ACCENSE). Proc Natl Acad Sci U S A. 2014; 111(1):202–7. doi:10.1073/pnas.1321405111.
Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, Lonnerberg P, Linnarsson S. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Meth. 2014; 11(2):163–6. doi:10.1038/nmeth.2772.
Tsafrir D, Tsafrir I, Ein-Dor L, Zuk O, Notterman DA, Domany E. Sorting points into neighborhoods (SPIN): data analysis and visualization by ordering distance matrices. Bioinformatics. 2005; 21(10):2301–308. doi:10.1093/bioinformatics.
Qiu P, Simonds EF, Bendall SC, Gibbs Jr KD, Bruggner RV, Linderman MD, Sachs K, Nolan GP, Plevritis SK. Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE. Nat Biotech. 2011; 29(10):886–91. doi:10.1038/nbt.1991.
Levine JH, Simonds EF, Bendall SC, Davis KL, Amir E-aD, Tadmor MD, Litvin O, Fienberg HG, Jager A, Zunder ER, Finck R, Gedman AL, Radtke I, Downing JR, Pe’er D, Nolan GP. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell. 2015; 162(1):184–97. doi:10.1016/j.cell.2015.05.047.
Marco E, Karp RL, Guo G, Robson P, Hart AH, Trippa L, Yuan GC. Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc Natl Acad Sci. 2014; 111(52):5643–650. doi:10.1073/pnas.1408993111.
Shin J, Berg DA, Zhu Y, Shin JY, Song J, Bonaguidi MA, Enikolopov G, Nauen DW, Christian KM, Ming G-l, Song H. Single-cell RNA-seq with Waterfall reveals molecular cascades underlying adult Neurogenesis. Cell Stem Cell; 17(3):360–72. doi:10.1016/j.stem.2015.07.013.
Haghverdi L, Buettner F, Theis FJ. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics. 2015. doi:10.1093/bioinformatics.
Moignard V, Woodhouse S, Haghverdi L, Lilly AJ, Tanaka Y, Wilkinson AC, Buettner F, Macaulay IC, Jawaid W, Diamanti E, Nishikawa SI, Piterman N, Kouskoff V, Theis FJ, Fisher J, Gottgens B. Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nat Biotech. 2015; 33(3):269–76. doi:10.1038/nbt.3154.
Fan J, Salathia N, Liu R, Kaeser G, Yung Y, Herman JL, Kaper F, Fan JB, Zhang K, Chun J, Kharchenko P. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. bioRxiv. 2015. doi:10.1101/026948.
Saliba AE, Westermann AJ, Gorski SA, Vogel J. Single-cell RNA-seq: advances and future challenges. Nucleic Acids Res. 2014. doi:10.1093/nar.
Nicolae M, Mangul S, Mandoiu II, Zelikovsky A. Estimation of alternative splicing isoform frequencies from RNA-seq data. Algorithm Mol Biol. 2011; 6(1):9.
Patro R, Mount SM, Kingsford C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol. 2014; 32(5):462–4.
Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotech. 2016; advance online publication. doi:10.1038/nbt.3519.
Pachter L. Models for transcript quantification from RNA-Seq. arXiv preprint arXiv:1104.3889. 2011. https://arxiv.org/pdf/1104.3889v2.pdf.
Davidson NM, Oshlack A. Corset: enabling differential gene expression analysis for de novo assembled transcriptomes. Genome Biol. 2014; 15(7):410.
Srivastava A, Sarkar H, Patro R. RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes. bioRxiv. 2015. 029652. http://www.biorxiv.org/content/biorxiv/early/2015/10/22/029652.full.pdf.
Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M. What is a gene, post-ENCODE? History and updated definition. Genome Res. 2007; 17(6):669–81. http://genome.cshlp.org/content/17/6/669.abstract.
10x Genomics to unveil new single-cell genetic analysis product. 2016. http://www.10xgenomics.com/news/10x-genomics-to-unveil-new-single-cell-genetic-analysis-product. Accessed 16 Feb 2016.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, Van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010; 28(5):511–5.
Magwene PM, Lizardi P, Kim J. Reconstructing the temporal ordering of biological samples using microarray data. Bioinformatics. 2003; 19(7):842–50. doi:10.1093/bioinformatics.
Clustering on transcript compatibility counts. 2016. https://github.com/govinda-kamath/clustering_on_transcript_compatibility_counts. Github repository (2016). Accessed 15 May 2016.
Lin J. Divergence measures based on the Shannon entropy. IEEE Trans Inf Theory. 1991; 37(1):145–51.
Fuglede B, Topsoe F. Jensen-Shannon divergence and Hilbert space embedding. In: IEEE International Symposium on Information Theory. Chicago: ISIT: 2004. p. 31–1.
Batu T, Fortnow L, Rubinfeld R, Smith WD, White P. Testing that distributions are close. In: Proceedings 41st Annual Symposium on Foundations of Computer Science. Redondo Beach: IEEE: 2000. p. 259–69.
Frey BJ, Dueck D. Clustering by passing messages between data points. science. 2007; 315(5814):972–6.
Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008; 9(2579–2605):85.
Mühlbacher P. A python implementation of the diffusion maps algorithm introduced by Lafon. GitHub. 2015.
Lafon SS. Diffusion maps and geometric harmonics. Ph.D. thesis: Yale University; 2004.
We thank Páll Melsted for implementing the pseudo command in kallisto. This is the command that allows for direct output of transcript-compatibility counts via pseudoalignment. We would also like to thank Bo Li, Allon Wagner, and Nir Yosef for useful discussions about single-cell RNA-seq assays and their biases.
The authors declare that they have no competing interests.
VN, GMK, and JZ conceived the idea of clustering without quantification, performed analyses of data, analyzed and interpreted results, and wrote the manuscript. DNT and LP interpreted results, supervised the project, and wrote the manuscript. All authors read and approved the final manuscript.
GMK and JZ are supported by the Center for Science of Information, an NSF Science and Technology Center, under grant agreement CCF-0939370. VN is supported in part by the Center for Science of Information and in part by a gift from Qualcomm Inc. LP is supported in part by the National Human Genome Research Institute of the National Institutes of Health under award number R01HG006129. DNT is supported in part by the Center of Science of Information and in part by the National Human Genome Research Institute of the National Institutes of Health under award number R01HG008164.
Supplementary Figures. A PDF file that contains eight supplementary figures. (PDF 22937 kb)