Overview of method
CellWalker resolves cell types and differentially accessible regions in scATAC-seq data by integrating information from scRNA-seq and bulk data (see Additional File 1: Fig. S1 for full pipeline). This integration relies on building a combined network featuring nodes representing cells in scATAC data and nodes for external labeling data, e.g., cell types derived from scRNA-seq data (Fig. 1a). Briefly, cells from scATAC-seq are nodes in the network, and edges between them encode information about cell similarity. A second set of nodes represents labeling datasets connected to cell nodes by edges that encode the similarity between a label and a cell.
Using a graph diffusion implemented via a random walk with restarts, CellWalker computes a global influence matrix that relates every cell and label to every other cell and label based on information flow between them in the network (Fig. 1b). In this matrix, each column describes where walks that begin at a given node end. Different portions of this matrix can be used to map information between and within domains: cell-to-cell for clustering cells, label-to-label for exploring label similarity, label-to-cell for cell type labeling, and cell-to-label for distributing bulk signatures to labels. The user can set a single label edge weight parameter (s) defining the ratio of label-to-cell edges versus cell-to-cell edges (Fig. 2a). This parameter represents a trade-off between cell similarity information in the scATAC-seq data versus in the external labels. When s is low, the output is similar to de novo cell clustering using only scATAC-seq, and when it is high, the output converges towards directly assigning labels to each cell. Thus, s can be chosen to reflect user preference across these strategies, or it can be tuned as CellWalker is run to optimize a criterion that assesses the quality of the resulting cell clusters. We developed a measure called cell homogeneity for this purpose. It can be computed directly from the influence matrix as the median ratio of information between cells within the same cell type to information between cells of different cell types. A higher cell homogeneity score indicates a greater ability to differentiate between different cell types.
Method validation and evaluation
To assess the ability of CellWalker to distribute labeling information across cells, we first tested it on a compendium of simulated datasets. Using cell homogeneity to quantify performance, we found that as few as 10% of cells being labeled is sufficient for CellWalker to improve cell labeling, and that there is no further improvement after ~ 30% of cells are labeled (Fig. 2b). As expected, cell homogeneity degrades as more cells are initially mislabeled, and improves when cells of different types are more distinct from each other in the scATAC-seq data (Fig. 2c and Additional File 1: Fig. S2a). Furthermore, CellWalker performs well with noisy data, even when up to 50% of reads are dropped or random reads are added (Additional File 1: Fig. S2b). Finally, we observed that CellWalker is able to distribute labels to novel cell populations (Additional File 1: Fig. S2c). These results establish the network diffusion strategy implemented in CellWalker as a robust approach to integrate scATAC-seq with scRNA-seq or other labeling data.
In all cases, we observed that tuning the label edge weight parameter was very important for assigning accurate labels to cells. In particular, a setting of s near 0, where labels are assigned to de novo clusters, was never the optimal setting of s. Similarly, very high settings of s, where labels are directly assigned to cells without considering cell-to-cell similarity, were also not among the most accurate. This indicates that blending these strategies is beneficial for accurately labeling cells in scATAC-seq data.
Importantly, CellWalker is efficient enough in terms of both compute time and memory usage to be practical for analysis of current single-cell data set sizes. Running the method beginning to end on 10,000 cells requires only 8 min on a single core on a personal computer (Additional File 1: Fig. S3). On a high-performance cluster, we estimate that 100,000 cells could be analyzed in ~ 80 h of total clock time.
Next, we tested CellWalker on adult mouse cortex SNARE-seq data which includes both scRNA-seq and scATAC-seq reads for each cell [20]. We analyzed the scATAC-seq portion of the data with CellWalker. For cell type labeling, we integrated the scATAC-seq data with differentially expressed marker genes previously derived from clustering the scRNA-seq portion of the SNARE-seq data. Performance was evaluated using the held-out scRNA-seq label for each cell that was identified in the original publication. We tuned the edge weight parameter s to optimize cell homogeneity (as in our simulations) and observed that this closely mirrors optimization of the fraction of cells labeled correctly, validating cell homogeneity as a measure of how well cell types are resolved (Additional File 1: Fig. 4a and b). We compared CellWalker to label transfer, as implemented in Seurat [15], and found that CellWalker labels more cells correctly (Fig. 2d). This advantage is greater when considering only rare cell types and very rare cell types (Fig. 2d, bottom). We saw similar results on a second set of SNARE-seq data derived from developing mouse cortex, as well as for a 10x Single Cell Multiome ATAC + Gene Exp chip of healthy human brain tissue (Additional File 1: Fig. 4c and d) [20, 21]. Taken together, these analyses of multiome data indicate that CellWalker’s integration of label data provides a substantial advantage towards resolving cell types in scATAC-seq data, particularly for analysis datasets with poorer data quality and for identifying rarer cell types.
Identification of cell types in the developing brain
Given the ability of CellWalker to identify rare cell types in brain SNARE-seq data, we next applied it to a scATAC-seq study of the human telencephalon with multiple biological replicates spanning mid-gestation [19]. Previous work generated a cell type atlas in similar samples based on extensive analysis of scRNA-seq data [17]. Using this atlas as external labeling data, we used CellWalker to compute a full influence matrix across all labels and 30,000 scATAC-seq cells. First, using the label-to-label portion of the influence matrix, we hierarchically clustered all labels and observed high agreement with the scRNA-seq clustering from the previously published results from Nowakowski et al. [17] (Fig. 3a). In terms of broad cell types, we observed that all radial glia and all interneurons group together. Other more local similarities were reflected as well, such as early newborn excitatory neurons (nEN-early1) being more similar to prefrontal cortex excitatory neurons (EN-PFC3) than to other newborn excitatory neuron types.
Next, we scored each cell using label-to-cell influence. This produces a “fuzzy” labeling of cells, representing the fact that a scATAC-seq cell may be strongly connected through the network to multiple cell types. For most cells in the scATAC-seq data (27,270 out of 30,000), CellWalker assigned a label without much ambiguity. Thus, most transcriptional states observed in scRNA-seq are associated with a distinct open chromatin signature in scATAC-seq. While previous work assigning cell types based on clustering of this scATAC-seq data only allowed for identification of broad cell types found in scRNA-seq (see Ziffra et al. [19]), CellWalker was able to identify cells of all specific types, notably distinguishing between different subtypes of radial glia and separating the two major types of medial ganglionic eminence (MGE)-derived inhibitory interneurons.
In a few cases, we observed cells with multiple nearly equally scoring labels, indicating intermediate membership in multiple cell types and revealing transcriptional states that correspond to highly similar open chromatin profiles (Fig. 3b). Some of these relationships, such as visual cortex (V1) and prefrontal cortex (PFC) excitatory neurons, represent similar types of maturing neurons that are present in two brain regions (Fig. 3b, group 1). Others correspond to progressions of neuronal development. For example, the newborn interneuron and caudal ganglionic eminence (CGE) cortical interneuron cell types have shared network influence on a large group of scATAC-seq cells (Fig. 3b, group 2). Some transcriptional states are difficult to distinguish from each other, because cells receive influence from multiple different scRNA-seq cell types. Specifically, we found groups of cells that scored highly as two or more of the following types: intermediate progenitor cells, early newborn excitatory neurons, late newborn excitatory neurons, and maturing excitatory neurons (Fig. 3b, group 3).
To explore whether these indeterminate cell types represent limitations of scATAC-seq data, failures of the CellWalker model, or cases where transcription changes without large changes in open chromatin, we took a closer look at early and late newborn excitatory neurons (nEN-early and nEN-late respectively). These are fairly abundant, identifiable cell types in scRNA-seq [17]. However, 92% of nEN-early ATAC peaks are also found in nEN-late cells. We assigned each scATAC-seq cell an excitatory neuron progression score based on the difference between the influence of the early and late newborn excitatory neuron labels, such that a higher excitatory neuron progression score indicates a later newborn excitatory neuron. Using this score, we observed that while there is a small distinct set of early newborn excitatory neurons, the majority of newborn excitatory neurons fall evenly between the two types with many scores near zero (Additional File 1: Fig. S5a). This indicates that there is a continuous gradient of changes in chromatin accessibility rather than large-scale difference between transitioning cell types. However, this observed difference between the dynamics of gene expression versus open chromatin during developmental transitions may not hold up with higher coverage scATAC-seq data, which could elucidate distinct chromatin profiles.
Cell type-specific annotation of loci
We next sought to determine if cell type annotations could be used to characterize the biology of loci based on chromatin accessibility at distal regulatory elements. Because most distal regulation occurs within Topologically Associated Domains (TADs) [22], we asked if the transition from early to late excitatory neurons could be attributed to differences in TAD accessibility between cell types. It is generally believed that Hi-C contact maps derived from bulk data represent the average of a mixture of cells [23]. We correlated the distal accessibility (defined as outside a gene body or promoter) of TADs derived from the germinal zone (GZ) of the mid-gestation developing human cerebral cortex [22] with excitatory newborn progression score and found that the distribution of correlations is significantly bimodal (empirical p value = 0.021, Additional File 1: Fig. 5b and 5c). This means that the accessibility of GZ TADs distinctly either correlates or anti-correlates with cell state progression from early to late excitatory neuron. As a control, we find that the median distance of peaks to genes and the number of peaks per TAD do not correlate with excitatory neuron progression (Additional File 1: Fig. 5d and 5e). We therefore classified GZ TADs as early or late depending on their correlation with excitatory neuron progression. As a validation of the classification of these TADs, we find that the expression of genes in early TADs negatively correlates with excitatory neuron progression score, while the expression of genes in late TADs correlates positively (median correlations of − 0.62 and 0.22 respectively). Thus, subtle changes in chromatin accessibility between early and late newborn excitatory neurons may be associated with cell type-specific TAD activity. The ability to separate TADs by cell type enables a greater understanding of gene regulation in complex tissues such as the human brain. A similar strategy could be applied to other annotations of loci, such as linkage disequilibrium (LD) blocks or expression quantitative trait loci (eQTLs).
Several key genes involved in neuronal development lie in early or late TADs, indicating their expression may be distally regulated. Notably, the neurogenic differentiation gene NEUROD1 lies in a late TAD with higher levels of accessibility late than early throughout the TAD but similar accessibility in the gene body and promoter (Fig. 3c). Correspondingly, NEUROD1 has two-fold higher mean transcripts in late than early newborn excitatory neurons (mean 73 TPM early vs 131 late in scRNA-seq data [17]). This indicates that the gene expression differences of NEUROD1 are potentially driven by distal enhancers. Conversely, TENM4, which is involved in establishing neuronal connectivity during development [24], lies in an early TAD (Additional File 1: Fig. S5f) and is less expressed in late newborn excitatory neurons (mean 350 TPM early vs 249 late in scRNA-seq data [17]). Deciphering the cell type-specific regulation of these genes is an important step towards understanding how differences in genotype lead to their misexpression and linked diseases.
Cell type-specific annotation of regulatory elements
It is generally believed that many enhancers involved in brain development function in a cell type-specific manner [19]. CellWalker provides a way to explore this idea. We mapped pREs derived from bulk ATAC-seq on microdissected tissue across the mid-gestation human telencephalon [18] to cell types based on cell-to-label influence (Fig. 4a). As expected, we found that the many pREs specific to the ganglionic eminence map to intermediate progenitor cell types, while pREs in other regions primarily map to types of excitatory neurons [17]. As further validation, we also observed that pREs from cell types labeled to specific regions such as the MGE map to regions sampled from the ganglionic eminence (Additional File 1: Fig. S6a). These findings demonstrate that cell types resolved in scATAC-seq data with CellWalker can be used to annotate regulatory elements discovered in bulk ATAC-seq. This strategy combines the benefits of high coverage in bulk data with the cell-type information in scATAC-seq.
We next examined whether cell type-specific pREs are associated with disease genes. First, we considered sets of genes near significant variants detected in a collection of Genome-Wide Association Studies (GWAS) [25]. Testing for associations with pREs active in four broad cell types, among these gene sets, we found significant enrichment for pREs near genes associated with a collection of neurological diseases as well as many measures for developmental delay (significant at FDR < 0.1, Additional File 2: Table S1). We therefore decided to take a closer look at curated lists of genes linked to autism spectrum disorders (ASD) and developmental delay [26]. We found that pREs specific to radial glia are significantly associated with genes linked to developmental delay (Fig. 4b, right), among which pREs specific to early radial glia are significant (Fig. 4c, right). pREs specific to inhibitory interneurons are significantly associated with genes linked to ASD, in agreement with previous studies [27,28,29] (Fig. 4b, left). Among these, enrichment was significant for newborn interneurons (Fig. 4c, left). Recently, an enhancer of SLC6A1 which is accessible in cells in the ganglionic eminence was linked to ASD [18]. We found that this enhancer maps to both intermediate progenitor cells located in the ganglionic eminence as well as to newborn interneurons and is accessible in cells predicted to belong to these cell types (Fig. 4d). As validation, we found that this peak is accessible in intermediate progenitors and interneurons but not excitatory neurons in ATAC-seq data on FACS sorted cells [30] (Additional File 1: Fig. S6c). Interestingly, while the enhancer is most strongly linked to early stages of interneuron development, expression of SLC6A1 increases throughout the course of interneuron development (Additional File 1: Fig. S6b). It is possible therefore that de novo mutations observed in this enhancer in ASD individuals contribute to changes in the initiation of SLC6A1 expression, which could influence the timing of interneuron development. This strategy for determining cell type-specific effects can be applied to other loci to better understand their potential roles in disease and cell differentiation.