SCATE model for a single cell
SCATE begins with compiling a list of candidate CREs and grouping co-activated CREs into clusters. Currently, most scATAC-seq data are generated from human and mouse. For user’s convenience, for these two species we have constructed a Bulk DNase-seq Database (BDDB) consisting of normalized DNase-seq samples from diverse cell types generated by the ENCODE project. For each species, we compiled putative CREs using BDDB and clustered these CREs based on their co-activation patterns across BDDB samples. Users may augment these precompiled CRE lists by using SCATE-provided functions to (1) add and normalize their own bulk and pseudo-bulk (obtained by pooling single cells) DNase-seq or ATAC-seq samples to BDDB and then (2) re-detect and cluster CREs using the updated BDDB. These functions can also be used to create CRE database for other species. For human and mouse, saturation analyses show that BDDB covers most CREs one would discover in a new DNase-seq or ATAC-seq dataset. On average, a new sample only contributes <0.2% new CREs to our precompiled CRE lists (Additional file 2: Fig. S1). Thus, in order to save time and computation for CRE detection and clustering, users may directly use the precompiled CRE lists in BDDB without significant loss. In this article, our analyses using SCATE are all carried out using these precompiled CREs as the input unless otherwise specified.
Given a list of CREs, their clustering structure, and scATAC-seq data from a single cell, SCATE will estimate the activity of each CRE. Let yi,j denote the observed read count for CRE i (i=1,…,I) in cell j, and let μi,j denote the unobserved true activity. The goal is to infer the unobserved μi,j from the observed data yi,j. We assume the following data generative model:
$$\begin{aligned} &\quad y_{i,j} \sim Poisson (L_{j} \mu^{sc}_{i,j}) \\ &\quad \log(\mu^{sc}_{i,j}) = h_{j}(\log(\mu_{i,j}))\\ &\quad \log(\mu_{i,j})=m_{i} + s_{i} \delta_{i,j} \\ &\quad \boldsymbol{\delta}_{j} = \mathbf{X} \boldsymbol{\beta}_{j} \\ \end{aligned} $$
This model is illustrated in Fig. 2a. Below, we will explain the model and key components of the SCATE workflow in detail.
(1) Modeling a CRE’s cell-independent but CRE-specific baseline behavior using publicly available bulk regulome data. By analyzing large amounts of ENCODE DNase-seq data, we found that these bulk data contain invaluable information not captured by the sparse single-cell data. In particular, our recent analysis of DNase-seq data from diverse cell types shows that different CREs have different baseline activities [25]. Some CREs tend to have higher activity levels than others regardless of cell type (Fig. 1e: compare two CREs in blue boxes). As a result, the mean DNase-seq profile across diverse cell types can explain a substantial proportion of data variation in the regulome profile of each individual cell type. In fact, 55.7% of the total data variance in BDDB human DNase-seq samples is explained by the mean human DNase-seq profile, and 60.1% of the total data variance in BDDB mouse DNase-seq samples is explained by the mean mouse DNase-seq profile (Methods). The Pearson correlation coefficient between the mean DNase-seq profile and each individual DNase-seq sample in the BDDB is bigger than 0.5 for most of the samples, and the median correlation is 0.78 for human and 0.81 for mouse (Additional file 2: Fig. S2). In other words, the mean DNase-seq profile to a large extent predicts the DNase-seq profile in each individual cell type, even though the mean DNase-seq profile cannot capture cell-type-specific CRE activities. In [25], we found that the mean DNase-seq profile correlates well with independently measured TF binding activities, indicating that differences in the baseline activity among different CREs captured by the mean DNase-seq profile are real biological signals rather than technical artifacts. These highly reproducible CRE-specific baseline activities cannot be captured by the sparse data in a single cell or by pooling a small number of cells (Fig.. 1b, c, and e). Thus, in order to better reconstruct activities of each individual CRE from scATAC-seq, SCATE explicitly models these cell-type-invariant but CRE-specific baseline behaviors by fitting a statistical model to the large compendium of bulk DNase-seq data in BDDB. This allows us to estimate the baseline mean activity (mi) and variability (si) of each CRE i.
(2) Modeling a CRE’s cell-dependent activity by borrowing information from similar CREs. We model the activity of CRE i in cell j, denoted by μi,j, by decomposing it into two components: a cell-type invariant component that models the baseline behavior (mi and si), and a cell-dependent component δi,j for modeling the CRE’s cell-specific activity. In other words, log(μi,j)=mi+siδi,j. The cell-type invariant component is learned from BDDB as described above. The cell-dependent component is learned using scATAC-seq data in each cell. To do so, we leverage CREs’ clustering structure. Recall that co-activated CREs are grouped into clusters. We assume that CREs in the same cluster have the same δi,j. Thus, information is shared across multiple co-activated CREs. Mathematically, this amounts to assuming δj=Xβj where δj is the vector of all CREs’ activities in cell j, X is a binary matrix that encodes CREs’ cluster membership, and βj is a vector containing the activities of CRE clusters (see Methods). Unlike other methods, we only share information through δi,j rather than assuming that μi,j is the same across similar CREs. In our approach, two CREs in the same cluster have the same δ, but they can have different activities (i.e., different μs) because of the difference in their CRE-specific baseline behaviors.
(3) Bulk and single-cell data normalization. Since CREs’ baseline characteristics are learned from bulk DNase-seq data but our goal is to model scATAC-seq data, we need to reconcile differences between these two technologies. To do so, we assume that μi,j is the unobserved true activity of CRE i in cell j one would obtain if one could measure a bulk DNase-seq sample consisting of cells identical to cell j. In scATAC-seq data, μi,j is distorted to become \(\mu _{i,j}^{sc}\) due to technical biases in scATAC-seq compared to bulk DNase-seq. These unknown technical biases are modeled using a cell-specific monotone function hj(.) such that \(\log (\mu _{i,j}^{sc}) = h_{j}(\log (\mu _{i,j}))\). The observed scATAC-seq read count data are then modeled using Poisson distributions with mean \(L_{j}\mu _{i,j}^{sc}\) where Lj is cell j’s library size. The technical bias function hj(.) normalizes scATAC-seq and bulk DNase-seq data. We developed a method to estimate this unknown function by using CREs whose activities are nearly constant across diverse cell types in BDDB. Once hj(.) is estimated, CRE activities δi,j and μi,j can be inferred by fitting the SCATE model to the observed read count data.
(4) Adaptively optimizing the analysis resolution based on available data. In order to examine the activity of each individual CRE, ideally one would hope to pool as few CREs as possible. However, when data are sparse, pooling too few CREs will lack the power to robustly distinguish biological signals from noise. Thus, the optimal analysis should carefully balance these two competing needs. All existing methods reviewed in category 1 pool CREs based on fixed and predefined pathways (e.g., all motif sites of a TF binding motif). They do not adaptively tune the analysis resolution based on the amount of available information. In SCATE, co-activated CREs are grouped into K clusters. Information is shared among CREs in the same cluster. We uniquely treat K as a tuning parameter and developed a cross-validation procedure to adaptively choose the optimal K based on the available data. When the data is highly sparse, SCATE will choose a small K so that each cluster contains a large number of CREs. As a result, the activity of a CRE will be estimated by borrowing information from many other CREs. This sacrifices some CRE-specific information in exchange for higher estimation precision (i.e., lower estimation variance). When the data is less sparse and more CREs have non-zero read counts, SCATE will choose a large K so that each cluster will contain a small number of CREs. As a result, the CRE activity estimation will borrow information from only a few most similar CREs, and more CRE-specific information will be retained.
(5) Postprocessing. After estimating CRE activities, we will further process all genomic regions outside the input CRE list. SCATE will transform read counts at these remaining regions to bring them to a scale normalized with the reconstructed CRE activities. The transformed data can then be used for downstream analyses such as peak calling, TF binding site prediction, or other whole-genome analyses.
SCATE for a cell population consisting of multiple cells
For a homogeneous cell population with multiple cells, we will pool reads from all cells together to create a pseudo-cell. We will then treat the pseudo-cell as a single cell and apply SCATE to reconstruct CRE activities. Similar to Dr.seq2, this approach combines similar cells to estimate CRE activities. Unlike Dr.seq2, we also combine information from co-activated CREs and public bulk regulome data as described above. Moreover, SCATE adaptively tunes the resolution for combining CREs (i.e., the CRE cluster number K) which is lacking in other methods. As the cell number in the population increases, the sparsity of the pseudo-cell will decrease and the optimal analysis resolution chosen by SCATE typically will increase.
For a heterogeneous cell population, we first group similar cells into clusters. SCATE is then applied to each cell cluster to reconstruct CRE activities by treating the cluster as a homogeneous cell population (Fig.. 2b). By default, SCATE uses model-based clustering [26] to cluster cells, and the cluster number is automatically chosen by the Bayesian Information Criterion (BIC). Since one clustering method is unlikely to be optimal for all applications, we also provide users with the option to adjust the cluster number or provide their own cell clustering. SCATE can be run using user-specified cluster number or clustering results. For example, if users believe that the default clustering does not sufficiently capture the heterogeneity, they could increase the cluster number. In the most extreme case, if one sets the cluster number equal to the cell number, each cluster will become a single cell.
We note that pooling cells in each cluster to create a pseudobulk sample does not mean that the value of single-cell analysis is lost or that scATAC-seq can be replaced by bulk ATAC-seq or DNase-seq. This is because bulk ATAC-seq or DNase-seq analysis of a heterogeneous sample cannot separate different cell subpopulations or discover new cell types. Even if one could use cell sorting to separate cells in a sample by cell type and then apply bulk analysis to each cell type, the sorting relies on known cell type markers and therefore cannot discover new cell types. By contrast, a scATAC-seq experiment coupled with SCATE can identify and characterize different cell populations including potentially new cell types in a heterogeneous sample.
Benchmark data
We compiled three benchmark datasets for method comparison. Dataset 1 consists of human scATAC-seq data from two different cell lines GM12878 (220 cells) and K562 (157 cells) generated by [4]. For this dataset, ENCODE bulk DNase-seq data for GM12878 and K562 were used as the gold standard to evaluate signal reconstruction accuracy. Dataset 2 contains scATAC-seq data from human common myeloid progenitor (CMP) cells (637 cells) and monocytes (83 cells) obtained from [27, 28]. We also obtained bulk ATAC-seq data from human CMP and monocytes generated by [28] and used them as gold standard. Dataset 3 consists of mouse scATAC-seq data from brain (3321 cells) and thymus (7775 cells) generated by [29]. For evaluation, the ENCODE bulk DNase-seq data for mouse brain and thymus were used as gold standard. In all evaluations, we removed the test cell types from the BDDB before running SCATE in order to avoid using the same bulk regulome data in both SCATE model fitting and performance evaluation.
For method evaluation, ideally one would like to have a gold standard for each single cell. However, single-cell resolution gold standard is difficult to obtain. For this reason, our method evaluation primarily relied on comparing scATAC-seq signals reconstructed from a single cell or by pooling multiple cells to bulk DNase-seq or ATAC-seq signals. In this regard, one may view single cells as random samples from a cell population, and the bulk signal characterizes cells’ mean behavior in the cell population. Although each cell is not exactly the same as the population mean, its behavior should fluctuate around the mean. Moreover, one should expect that the pseudobulk signal obtained by pooling an increasing number of cells should become increasingly more similar to the true bulk signal.
Analysis of a homogeneous cell population - a demonstration
We first demonstrate SCATE analysis of a homogeneous cell population using the GM12878 and K562 data (Dataset 1) as an example. It should be pointed out that “homogeneous” is a relative concept rather than an absolute one since one can always define cell subtypes in a cell population by computationally grouping cells into clusters and subclusters at different granularity levels. In this study, “homogeneous” is technically defined as the finest granularity level for which we were able to obtain the corresponding bulk gold standard regulome data for method evaluation. We use this technical definition for two reasons. First, even if a test cell type may potentially be decomposed further into multiple cell subtypes, we could not conduct the benchmark analysis at the cell subtype level if the gold standard bulk regulome data for those cell subtypes are unavailable and the true subtype label of each cell is unknown. Second, the primary goal of our analysis of a homogeneous cell population is to serve as a bridge to help readers understand how SCATE would analyze each cell cluster in a heterogeneous cell population. Our working definition of “homogeneous” is sufficient to meet this need.
In this section, we applied SCATE to GM12878 and K562 separately. For each cell type, we randomly sampled n (n= 1, 5, 10, 25, 50, 100, etc.) cells and pooled their sequence reads together to run SCATE. CRE activities reconstructed by SCATE were compared with their activities measured by bulk DNase-seq in the corresponding cell type.
Figure 3 shows the normalization function hj(.) learned by SCATE for normalizing scATAC-seq and the BDDB bulk DNase-seq data. Here, each scatter plot corresponds to a pooled scATAC-seq sample. Different plots represent different cell numbers or cell types. In these plots, each data point is a low-variability CRE with nearly constant activity across BDDB samples. For each CRE, the read count in the pooled scATAC-seq sample or the bulk ATAC-seq sample (Y-axis) versus the CRE’s baseline mean activity in BDDB DNase-seq data (X-axis) is shown. The red curve is the SCATE-fitted function (\(\phantom {\dot {i}\!}e^{h_{j}(.)}\)) for modeling technical biases in scATAC-seq. Overall, the scATAC-seq read counts were positively correlated with CREs’ baseline activities at these low-variability CREs, and the SCATE-fitted normalization functions were able to capture the systematic relationship (i.e., technical biases) between the scATAC-seq and bulk DNase-seq data. Besides scATAC-seq, we also tested SCATE’s normalization algorithm in bulk data. Additional file 2: Figure S3 shows the SCATE-fitted function (\(\phantom {\dot {i}\!}e^{h_{j}(.)}\)) for normalizing bulk ATAC-seq data in three different cell types to the BDDB bulk DNase-seq data. The normalization functions fitted by SCATE were again able to capture the systematic relationship in the observed data, further demonstrating the effectiveness of our approach for modeling technical biases.
Figure 4 shows the number of CRE clusters adaptively chosen by SCATE. For each cell type, there are four plots corresponding to SCATE analyses by pooling different number of cells, with the cell number n shown on top of each plot. For each n, n cells were randomly sampled from the scATAC-seq dataset and pooled. SCATE was applied to the pooled data to automatically choose the CRE cluster number. This procedure was repeated ten times. The histogram shows the empirical distribution of the cluster number chosen by SCATE in these ten independent cell samplings without using any information from the gold standard bulk DNase-seq. As a benchmark, we also ran SCATE by manually setting the CRE cluster number K to different values. For each K, we computed the Pearson correlation between the SCATE-estimated CRE activities in scATAC-seq and the gold standard CRE activities in bulk DNase-seq. The dots in each plot show the correlation coefficients for different Ks, also averaged across the ten independent cell samplings. The dot with the largest correlation coefficient corresponds to the true optimal cluster number. In real applications, this true optimal cluster number would be unknown because one would not have the bulk DNase-seq as the gold standard to help with choosing K.
Figure 4 shows that the CRE cluster number automatically chosen by SCATE (histogram) typically was close to the true optimal cluster number (the dot with the highest correlation). For instance, for analyzing a single GM12878 cell, the cluster number chosen by SCATE had its mode at 1250, and the true optimal cluster number was 2500. For analyzing 220 GM12878 cells, the cluster number chosen by SCATE had its mode at 521820, and the true optimal cluster number was also 521820.
Figure 4 also shows that, as the cell number increases, both the true optimal CRE cluster number and the cluster number chosen by SCATE also increase. Increasing the number of CRE clusters implies that the average number of CREs in each CRE cluster will decrease because the total number of input CREs is fixed. Thus, SCATE adaptively changes analysis resolution: as more data are available for each CRE, SCATE gradually decreases the number of CREs in each cluster for information sharing. This allows SCATE to maximally retain CRE-specific information.
Figure 5 compares SCATE-reconstructed scATAC-seq signal with bulk DNase-seq signal in GM12878 and K562 in an example genomic region. The figure has six columns corresponding to different cell types and different pooled cell numbers. For benchmark purpose, the figure also compares SCATE with a number of other methods. The data are displayed at a 200 bp non-overlapping genomic window resolution. Here “Raw reads” displays the scATAC-seq read count pooled across cells for each 200 bp genomic window. “Binary” converts the 200 bp window read counts in each cell to a binary accessibility vector and then adds up the binary accessibility vectors across cells. Note that the raw read count approach is also used by Dr.seq2 and scABC to characterize CRE activities in single cells, but only in peak regions detected from the scATAC-seq data. The binary approach is also used by Cicero, Scasat, cisTopic, Destin, and PRISM to characterize CRE activities in peak regions. Since different implementations of a method may lead to variable method performance [30], we also displayed the signals obtained using these existing methods except for PRISM for which we were not able to modify its code to export the binary accessibility matrix (PRISM does not report binary accessibility as it is only used as an intermediate step to compute cell distances). Unlike these existing methods, the “Raw reads” and “Binary” methods implemented by us processed all genomic windows rather than only peak regions. ChromVAR, SCRAT, and BROCKMAN only analyze and report aggregated CRE pathway activities rather than activities of individual CREs. Thus, they cannot be compared here. However, for our previously developed SCRAT, we were able to modify the codes to estimate CRE activities by directly using pathway activities. This results in three methods, “SCRAT 500 CRE cluster,” “SCRAT 1000 CRE cluster,” and “SCRAT 2000 CRE cluster,” shown in the figure. Here, CREs were clustered into 500, 1000, or 2000 clusters as in SCRAT using the bulk DNase-seq data in BDDB. For each CRE cluster, the average normalized scATAC-seq read count across all CREs in the cluster was calculated. It was then assigned back to each CRE in the cluster to represent the estimated CRE activity. The “Raw reads” method may be viewed as a special case of the “SCRAT CRE cluster” method when each genomic window is viewed as a CRE and each CRE forms a CRE cluster by itself (i.e., the number of CRE clusters is equal to the total number of CREs, and each cluster only contains one CRE). “Average DNase-seq” shows the average normalized read count profile of bulk DNase-seq samples in BDDB. It reflects CRE’s baseline mean activity.
Figure 5 shows that SCATE-reconstructed scATAC-seq signals accurately captured the variation of CRE activities in bulk DNase-seq across different genomic loci and different cell types, whereas CRE activities estimated using raw read counts, binarized chromatin accessibility, or SCRAT CRE cluster methods all failed to accurately capture the bulk DNase-seq landscape. Interestingly, SCATE was able to use scATAC-seq data from one single cell to accurately estimate CRE activities in bulk DNase-seq. By contrast, the raw read count and binary accessibility methods both failed, likely due to data sparsity (e.g., see regions in blue boxes). The SCRAT CRE cluster method also failed, likely because (1) it assigns the same activity to all CREs in the same CRE cluster and ignores CRE-specific behaviors, and (2) it does not adaptively tune the analysis resolution as in SCATE to maximally retain CRE-specific signals. While it is also possible that signals in a single cell do not necessarily need to look like the bulk signal due to cell heterogeneity and hence explaining why signals generated by Raw reads, Binary, and CRE cluster methods in a single cell were different from the bulk signal, Fig. 5 shows that SCATE also outperformed these methods when pooling multiple cells into pseudobulk samples (e.g., pooling 25 and 100 cells), suggesting that the better performance of SCATE is real. The “Average DNase-seq” approach produced relatively continuous signals and captured some variation across genomic loci in the GM12878 and K562 bulk DNase-seq data. However, it was unable to capture cell-type-specific signals, such as those shown in the blue boxes.
Analysis of a homogeneous cell population - a systematic evaluation
Next, we systematically evaluated SCATE and the other methods in all three benchmark datasets by treating the six test cell types as six homogeneous cell populations. The evaluation was based on the correlation with gold standard bulk regulome data, peak calling performance using reconstructed signals, and ability to predict transcription factor binding sites (TFBSs).
In the first evaluation, we computed the Pearson correlation between the scATAC-seq signals reconstructed by each method and the gold standard bulk signals across all CREs. As one example, Fig. 6a and Additional file 2: Figure S4A show the results based on pooling scATAC-seq data from 10 GM12878 cells. There are multiple methods that use the raw read counts and binary methods. For clarity of display, in this and all other analyses below, only the “Raw reads” and “Binary” methods implemented by us are shown in the main figures (e.g., Fig. 6), and the results from the other raw counts and binary methods are shown in supplementary figures (e.g., Additional file 2: Fig. S4A). Among all methods, SCATE showed the highest correlation with the bulk gold standard. We performed the same analysis on all six test cell types by pooling different cell numbers. For each cell number, we repeated the analysis ten times using ten independent cell samplings. The median performance of the ten analyses was then compared. Figure 6b and Additional file 2: Figure S4B show that SCATE consistently outperformed all the other methods and showed the strongest correlation with the bulk gold standards in all test data. When the pooled cell number was small, the improvement of SCATE over many methods was substantial. For instance, for the analysis of one single monocyte cell, the correlation was 0.01–0.22 for the different implementations of raw reads and binary methods. It was 0.57, 0.57, and 0.57 for SCRAT 500, 1000, and 2000 CRE cluster methods, respectively. For SCATE, it was 0.67, representing an improvement of 18 ∼6700% over the other methods. Of note, the Average DNase-seq method performed relatively well in this evaluation when the cell number was small. However, as we will show later, the average DNase-seq profile cannot predict changes in CRE activity between different cell types, but SCATE can.
In the second evaluation, we performed peak calling using scATAC-seq signals reconstucted by SCATE and other methods. Peak calling is a common task in DNase-seq or ATAC-seq data analyses. Its objective is to find genomic regions with significantly enriched signals. We implemented a peak calling algorithm using a moving average approach (see Methods) and applied it to signals reconstructed by SCATE, Raw reads, Binary, SCRAT CRE cluster, and Average DNase-seq. For the other existing raw reads and binary methods, we used their default peak calling methods to call peaks. In addition, we also performed peak calling by applying MACS2 [19] to the pseudobulk sample we obtained by pooling cells. The peak calling performance of each method was evaluated using the sensitivity versus false discovery rate (FDR) curve, where the “truth” was defined by the peaks called from the bulk gold standard data. Here, sensitivity is the proportion of true bulk peaks discovered by scATAC-seq, and FDR is the proportion of scATAC-seq peaks that are false (i.e., not found in bulk peaks). As one example, Fig. 7a and Additional file 2: Figure S5A compare the sensitivity-FDR curves of different methods when they were applied to the pooled scATAC-seq data from 25 GM12878 cells. For each curve, we computed the area under the curve (AUC). Fig. 7b and Additional file 2: Figure S5B systematically compare the AUCs of all methods in all six test cell types. In each plot, the analyses were run by pooling different numbers of cells, and the median AUC from 10 independent cell samplings was plotted as a function of the cell number. Once again, SCATE showed the best overall peak calling performance. When the cell number was small, the improvement was substantial. For analyzing one monocyte cell, for example, the AUC of SCATE was 0.4, whereas the AUCs for the other methods (except for Average DNase-seq) were all below 0.21. Thus, SCATE improved over these methods by 90% or more.
In the third evaluation, we used signals reconstructed by each method to predict TFBSs. We evaluated 28 TFs in GM12878 and 29 TFs in K562 (Additional file 3: Table S2). As gold standard, we collected ChIP-seq peaks for these TFs from the ENCODE [15]. For the other cell types, we did not find TF ChIP-seq data suitable for evaluation. Therefore, our TFBS prediction analysis was focused on GM12878 and K562. To predict TFBSs of a TF, we mapped its motif sites in the genome using CisGenome [31]. Genomic windows overlapping with motif sites were sorted based on their reconstructed scATAC-seq signals. Windows with the highest signals were labeled as predicted TFBSs (Fig. 8a). Motif-containing windows that overlap with TF ChIP-seq peaks were viewed as gold standard true TFBSs. Based on this, we generated the sensitivity-FDR curve for each TF by gradually relaxing the TFBS calling cutoff. As one example, Fig. 8b and Additional file 2: Figure S6B show the sensitivity-FDR curves of different methods for predicting ELF1 binding sites by pooling scATAC-seq data from 25 GM12878 cells. For each TF and cell type, we performed this analysis using different cell numbers. For each cell number, the median area under the sensitivity-FDR curve (AUC) of 10 independent cell samplings was computed. As two examples, Fig. 8c and Additional file 2: Figure S6C show the AUCs for different methods as a function of pooled cell number for two TFs: ELF1 in GM12878 and JUND in K562. Finally, Fig. 8d and Additional file 2: Figure S6D show the average performance of all 28 TFs in GM12878 and 29 TFs in K562. In all these analyses, SCATE robustly outperformed all the other methods. The overall improvement was substantial (e.g., see K562 in Fig. 8d and Additional file 2: Fig. S6D).
Analysis of a heterogeneous cell population—demonstration and systematic evaluation
The analyses of homogeneous cell populations provide a demonstration of the basic building block of SCATE. In reality, however, scATAC-seq is usually used to analyze a heterogeneous cell population consisting of multiple cell types where the cell type labels are unknown. To analyze such a heterogeneous cell population, one usually will first computationally cluster cells into relatively homogeneous subpopulations and then analyze each cell cluster as a homogeneous population. Due to inevitable noises, each cell cluster obtained in this way may not be pure. For example, while the majority of cells in a cell cluster may be of one cell type, the cluster may also contain cells from other cell types. As data analysts do not know cells’ true cell type labels, they can only treat all cells in the same cluster as if they were one cell type.
In order to see how SCATE tunes the analysis resolution when a cell cluster contains noise, we mixed K562 and GM12878 cells with different ratios (K562:GM12878 = 100%:0%, 80%:20%, 60%:40%) to mimic a cell cluster dominated by K562 cells but with different levels of noises introduced by GM12878 cells. The cross-validation procedure of SCATE was used to select the CRE cluster number as in Fig. 4. The analysis was repeated by setting the total cell number to 10 and 100 respectively. Additional file 2: Figure S7 shows that as the number of cells increased, the number of CRE clusters chosen by SCATE also increased regardless of the noise level. This indicates that when more reads are available by pooling more cells, SCATE will increase the analysis resolution. In most cases, the optimal CRE cluster numbers chosen by SCATE were largely consistent with the true optimal CRE cluster number determined by comparing scATAC-seq with bulk K562 DNase-seq. The only exception is when the noise level was high (K562:GM12878 = 60%:40% for cell number=100, where the optimal cluster number chosen by SCATE was bigger than the optimal cluster number based on bulk DNase-seq). In that case, however, one can argue that K562 bulk DNase-seq data may not reflect the chromatin profile of a mixture of K562 and GM12878 cells, whereas SCATE attempts to optimize the signal reconstruction for the cell cluster which is a mixture of K562 and GM12878 cells with almost equal proportion. Therefore, they try to measure different things and one should not expect that the optimal cluster number determined by K562 bulk DNase-seq will be consistent with the cluster number chosen by SCATE. This is different from Fig. 4 where the cell type measured by bulk gold standard is consistent with the cells analyzed by SCATE.
Next, we demonstrate how a heterogeneous cell population would be analyzed in practice. We mixed GM12878 and K562 cells from Dataset 1 with different ratios to create synthetic samples with different heterogeneity levels. Each synthetic sample had 100 cells representing a mixture of GM12878 and K562 cells. The percentage of GM12878 cells was set to x= 10%, 30%, and 50%, respectively. For each percentage x, ten synthetic samples were created using independently sampled cells. The median performance of each method on the ten analyses was compared.
Each synthetic sample was analyzed by first clustering cells using the default cell clustering algorithm in SCATE. SCATE and other methods were then used to estimate CRE activities for each cell cluster. In all these analyses, we pretended that cells’ true cell type labels were unknown and did not use them. The number of cell clusters automatically determined by SCATE in these samples ranged from 2 to 5 (Fig. 9a). Figure 9b shows one example in which cells were grouped into 2 clusters.
After running all methods, in order to evaluate whether the analysis can discover the true biology, we annotated each cell cluster using cells’ true cell type labels. Each cluster was annotated based on its dominant cell type. A cell cluster was labeled as “predicted GM12878” if over 70% of cells in the cluster were indeed GM12878 cells. Similarly, a cell cluster with ≥ 70% K562 cells was labeled as “predicted K562”. All other clusters were labeled as “ambiguous.” For a given sample, if at least one cell cluster was labeled as “predicted cell type X” (X = GM12878 or K562), we say that cell type X was detected. Based on this definition, both GM12878 and K562 can be detected in all samples (Fig. 9c). Note that one cell type may be identified by multiple cell clusters. Given the cell type annotation, we then compared the regulome of each cell type reconstructed by SCATE and other methods. Since all methods used the same cell clustering results, the comparison of their signal reconstruction ability is a fair comparison. We conducted four types of comparisons.
First, we asked whether the regulome reconstructed by each method for each predicted cell type can accurately recover the cell type’s true regulome measured by the gold standard bulk data. Take GM12878 as an example. For each cell cluster predicted as GM12878, the Pearson correlation between the cluster’s reconstructed scATAC-seq signal and the gold standard bulk GM12878 DNase-seq data was computed. If a sample had two or more cell clusters predicted as GM12878, each cluster was analyzed separately. The median correlation of all such clusters in ten independent synthetic samples is shown in Fig. 9d and Additional file 2: Figure S8A. SCATE again performed the best. When the proportion of GM12878 cells in a sample was small, the improvement by SCATE was larger. Figure 9e and Additional file 2: Figure S8B show the same analysis for K562, but the performance was shown as a function of GM12878 cell proportion. Figure 9f and Additional file 2: Figure S8C show the combined results. Here, at each cell mixing proportion, the median scATAC-bulk correlation of all cell clusters predicted either as GM12878 or K562 was shown. In all these analyses, SCATE consistently performed the best.
Second, we conducted peak calling and evaluated each method’s ability to recover true peaks in each cell type. Here, the truth was defined as peaks called from the gold standard bulk data, and the evaluation was conducted similar to Fig. 7. Figure 9g and Additional file 2: Figure S8D show the median AUC of all cell clusters predicted either as GM12878 or K562 as a function of cell mixing proportion. SCATE robustly outperformed the other methods.
Third, we compared different methods in terms of their ability to predict TFBSs. TFBS prediction and evaluation were performed similar to Fig. 8. The results are shown in Fig. 9h and Additional file 2: Figure S8E, in which the median AUC for each method is plotted as a function of cell mixing proportion. SCATE produced the best prediction accuracy.
Last but not least, we applied different methods to predict differential CRE activities between different cell types, which is crucial for characterizing the regulatory landscape of a heterogeneous sample. Note that if one views scATAC-seq as a tool for studying cell heterogeneity, then a good analysis method should have the ability to accurately capture differences among cells. Importantly, since differences between cell types are a special case of cell heterogeneity, a good method should be able to keep cell type differences when comparing two cells or two pseudobulk samples from two different cell types. Here, we collected all pairs of cell clusters that were predicted as two different cell types (i.e., one cluster was “predicted GM12878” and the other cluster was “predicted K562”; ambiguous cell clusters were excluded). For each such pair, we computed the difference of reconstructed CRE activities between the two cell clusters. We then compared this predicted difference with the true differential CRE activities derived from the gold standard bulk DNase-seq data for GM12878 and K562. The Pearson correlation between the predicted and true differential signals was calculated. As one example, Fig. 10 and Additional file 2: Figure S9 show the results for a cell cluster pair in a synthetic sample in which 30% of cells was GM12878. SCATE best recovered the differential CRE activities (correlation = 0.43). Figure 9i and Additional file 2: Figure S8F show the median correlation across ten independent synthetic samples at each cell mixing proportion. Once again, SCATE performed the best.
In the above analyses, the Average DNase-seq method completely failed for predicting differential signals between two cell types (correlation = 0) (Figs. 9i, and 10), even though it performed relatively well for estimating CRE activities within one cell type, and peak calling and TFBS prediction in one cell type (Figs. 6, 7, 8, 9f–h). Similarly, each of the other methods may perform well in some datasets or analyses but not in others. SCATE is the only method that robustly performed the best in all our analyses.
Similar to GM12878 and K562 (Dataset 1), we also constructed heterogeneous cell populations using the other two datasets (Datasets 2 and 3) and used them to evaluate different methods. The results are shown in Figure 9j–o and Additional file 2: Figures S8G-L and S10. For these two datasets, we did not perform TFBS prediction due to lack of gold standard ChIP-seq data. For estimating CRE activities (Fig. 9j, m, Additional file 2: Fig. S8G,J), peak calling (Fig. 9k, n, Additional file 2: Fig. S8H,K), and predicting differential CRE activities (Fig. 9l, o, Additional file 2: Fig. S8I,L), SCATE again outperformed all the other methods. In many cases, the improvement was substantial (e.g., Fig. 9k, l, n, o, Additional file 2: Fig. S8).
Example 1: Analysis of scATAC-seq data from human hematopoietic differentiation
To further demonstrate and evaluate SCATE in a more realistic setting, we analyzed a scATAC-seq dataset generated by [27] which consists of 1920 cells from 8 human hematopoietic cell types for which corresponding bulk ATAC-seq data are available. These cell types include hematopoietic stem cell (HSC), multipotent progenitor (MPP), lymphoid-primed multipotent progenitor (LMPP), common myeloid progenitor (CMP), common lymphoid progenitor (CLP), granulocyte-macrophage progenitor (GMP), megakaryocyte-erythrocyte progenitor (MEP), and monocyte (Mono). In this dataset, the true cell type label of each cell was known since cells were obtained by cell sorting. However, they were not used in our SCATE analyses so that our results reflect how data would be analyzed in reality. The true cell type labels were only used after the analysis to evaluate methods. Figure 11a shows the tSNE [32] plot of all cells color-coded by their true cell types. In the plot, different cell types were distributed along three major differentiation lineages (myeloid: HSC →MPP→(CMP or LMPP) →GMP→Mono; erythroid: HSC →MPP→CMP→MEP; lymphoid: HSC →MPP→LMPP→CLP), which are consistent with known biology. For method evaluation, we analyzed all cells together as a heterogeneous cell population and pretended that the cell type labels were unknown. We also downloaded and processed bulk ATAC-seq data for these 8 cell types from [28] and used them as the gold standard to assess regulome reconstruction accuracy.
Using its default cell clustering method, SCATE identified 14 cell clusters. To evaluate the performance of this unsupervised analysis for recovering true biology, we first assigned a cell type label for each cluster. A cluster was annotated as “predicted cell type X” if the cluster contained at least two cells and the true cell type label of ≥ 70% cells from the cluster was cell type X. Clusters that cannot be annotated using this criterion were labeled as ambiguous. In this way, we were able to unambiguously annotate 9 clusters. Since multiple clusters may be annotated with the same cell type, these 9 annotated clusters corresponded to a total of 6 cell types (Fig. 11b). For these 9 clusters, one can evaluate signal reconstruction accuracy because the bulk ATAC-seq data for the annotated cell type was available. Each cluster was treated as a homogeneous cell population by SCATE and other methods in our analysis (as one would do in real applications), even though the cluster actually may not be pure and may contain cells from more than one cell types. Figure 11d and Additional file 2: Figure S11A compare the Pearson correlation between the gold standard bulk signal and the CRE activities reconstructed from scATAC-seq by different methods. Each boxplot contains 9 data points corresponding to the 9 cell clusters. Figure 11e and Additional file 2: Figure S11B compare the peak calling performance (AUC under the sensitivity-FDR curve). Figure 11f and Additional file 2: Figure S11C compare the accuracy for predicting differential CRE activities between different cell types. Here, each data point in the boxplot is a pair of cell clusters annotated with two different cell types. The Pearson correlation between the gold standard bulk differential signal and differential signal reconstructed from scATAC-seq was computed and compared. In all these analyses, SCATE outperformed the other methods. Figure 11j shows an example genomic region in a HSC cell cluster. SCATE most accurately reconstructed the bulk ATAC-seq signal in HSC.
SCATE provides users with the flexibility to specify their own cell cluster number or use their own cell clustering results. The software can reconstruct signals based on user-provided cell cluster number or clustering structure. For instance, suppose one is not satisfied with the default cell clustering and wants to increase the granularity of clustering to make each cluster smaller and more homogeneous, one can manually adjust the cluster number. To demonstrate, we increased the cell cluster number to 38. After increasing the cell cluster number, each cell cluster contained approximately 50 cells on average. After rerunning SCATE, 24 of the 38 cell clusters can be unambiguously annotated, identifying a total of 7 cell types (Fig. 11c). As a comparison, the default analysis only unambiguously identified 6 cell types. For the unambiguously annotated cell clusters, Fig. 11g–i and Additional file 2: Figure S11D-F compare the performance of different methods for reconstructing CRE activities, peak calling, and estimating differential CRE activities between different cell types. SCATE still delivered the best performance. Since the average cell cluster size became smaller, the performance of some methods decreased substantially in some analyses (e.g., the CRE reconstruction and peak calling accuracy for Raw reads and Binary in Fig. 11g, h and Additional file 2: Figure S11D, E). In these cases, the benefit from SCATE was even more obvious.
Example 2: Analysis of 10x Genomics scATAC-seq data from human peripheral blood mononuclear cells (PBMC)
We also analyzed a scATAC-seq dataset generated from 10x Genomics platform by [33] which consists of 10,027 human peripheral blood mononuclear cells. These cells were sorted into 5 cell types using magnetic-activated cell sorting: B cells, CD4+ T cells, CD8+ T cells, monocytes, and natural killer (NK) cells. Figure 12a visualizes the data with true cell type labels color-coded. Consistent with known biology, CD4+ T cells, CD8+ T cells, and NK cells are closer to each other, whereas B cells and monocytes form more distinct clusters. Again, in order to illustrate how scATAC-seq data would be analyzed in reality, we pretended that cells’ true cell type labels are unknown when running SCATE and other methods. We downloaded and processed bulk ATAC-seq data for these cell types from [28] and used them as the gold standard.
Using its default cell clustering method, SCATE identified 15 cell clusters. To evaluate how this unsupervised analysis recovered true biology, we first computationally assigned a cell type label to each cell cluster using the same protocol as in Example 1’s hematopoietic analysis. In this way, we were able to unambiguously annotate 11 clusters representing all 5 cell types (Fig. 12b). For these 11 clusters, we evaluated signal reconstruction accuracy using the bulk ATAC-seq data of the annotated cell type. SCATE again outperformed the other methods in terms of Pearson correlation between the gold standard bulk signal and the CRE activities reconstructed from scATAC-seq (Fig. 12d, Additional file 2: Fig.S12A), peak calling performance (Fig. 12e, Additional file 2: Fig. S12B), and accuracy for predicting differential CRE activities between different cell types (Fig. 12f, Additional file 2: Fig. S12C). Figure 12j shows an example genomic region in a B cell cluster. SCATE most accurately reconstructed the bulk ATAC-seq signal in B cells.
In the default analysis, the average number of cells in a cluster for the 11 annotated clusters was 852. Similar to Example 1, we also rerun the analysis by manually setting the cell cluster number to 100 to increase the granularity of clustering. This reduced the average number of cells in a cluster to 100. After running SCATE, we were able to unambiguously annotate 86 clusters corresponding to the 5 cell types (Fig. 12c). Figure 12g–i and Additional file 2: Fig. S12D-F show that SCATE still delivered the best overall performance.