 Method
 Open Access
 Published:
scDALI: modeling allelic heterogeneity in single cells reveals contextspecific genetic regulation
Genome Biology volume 23, Article number: 8 (2022)
Abstract
While it is established that the functional impact of genetic variation can vary across cell types and states, capturing this diversity remains challenging. Current studies using bulk sequencing either ignore this heterogeneity or use sorted cell populations, reducing discovery and explanatory power. Here, we develop scDALI, a versatile computational framework that integrates information on cellular states with allelic quantifications of singlecell sequencing data to characterize cellstatespecific genetic effects. We apply scDALI to scATACseq profiles from developing F1 Drosophila embryos and scRNAseq from differentiating human iPSCs, uncovering heterogeneous genetic effects in specific lineages, developmental stages, or cell types.
Background
The functional impact of genetic variants on molecular traits such as gene expression can be influenced by the cell type or cell state. Particularly noncoding variants in enhancer elements can impact a gene’s expression in one tissue and not in others. Populationscale genetics studies, using bulk sequencing across individuals, have identified many such tissuespecific [1,2,3] and developmental stagespecific [4] effects, which often involve rare genetic variants. However, even carefully dissected tissues are composed of heterogeneous cell types, thus motivating the application of singlecell sequencing to reveal cellstate dependencies of genetic effects. Recent singlecell RNAseq studies in in vitro models revealed changing genetic dependencies across different cellular transitions [5,6,7].
However, most existing analysis strategies for singlecell genetic studies have been based on computational methods originally developed for bulksequencing data [5, 6, 8], requiring the discretization of cellular states and thus potentially failing to detect more finegrained differences in regulation. Computational strategies that allow for the unbiased identification of cellstatespecific effects are only beginning to emerge [9, 10] and currently rely on profiling a large number of genetically diverse individuals, which is particularly limiting for in vivo analyses and nonhuman model systems. The latter could be addressed by measuring allelespecific signals, i.e., quantifying molecular traits separately for each haplotype [9, 11,12,13,14], which in principle allows for identifying genetic effects even in a single individual. The combination of allelespecific quantifications coupled with the use of singlecell technologies could be a powerful strategy to dissect the functional impact of genetic variants both within and across multiple cell types contained in a complex tissue. Prior studies have quantified allelespecific properties at a singlecell level to characterize transcriptional bursting and stochasticity in gene expression [15, 16]. However, the analysis of allelespecific patterns to unravel allelic regulation at the singlecell level are only beginning to emerge [8, 17], and principled computational methods for this task are not established.
To address the aforementioned challenges, we developed a versatile computational model and analysis framework, scDALI (singlecell differential allelic imbalance). scDALI leverages allelespecific quantifications in single cells to identify and comprehensively test for different types of allelic effects, including homogeneous effects that are shared across all cell states or heterogeneous effects that are specific to some cell states. Intuitively, our model is similar in spirit to differential expression testing but aimed at identifying loci that exhibit heterogeneous allelic imbalance rather than variation in total expression. Critically, the model does not require the definition of cell states or clusters a priori and can cope with both discrete cellular states or continuous transitions. Additionally, scDALI enables the estimation of allelic imbalance from sparse sequencing data in individual cells, thereby facilitating the visualization and downstream interpretation of allelic regulation. scDALI is applicable to singlecell datasets obtained from different modalities and sequencing technologies.
We applied scDALI to study allelespecific variation in singlecell chromatin accessibility data (sciATACseq) in developing F1 embryos of Drosophila melanogaster, where we identified hundreds of regulatory regions with allelic imbalances in specific cell types or developmental stages. Among these effects, we identify putative enhancer regions with opposing allelic imbalance in different cell lineages, which are missed by bulk assay profiling. We then leveraged scDALI to finemap the celltype specificity of known expression quantitative trait loci (eQTL) in a population cohort of human induced pluripotent stem cells (iPSC), by assessing allelic regulation of singlecell transcriptomes. scDALI offered increased detection power compared to previous methods and uncovered how subtle differences in cell states can substantially affect allelic regulation. scDALI is therefore applicable to diverse species and data types, and leverages singlecell technologies to avoid cell sorting, thereby providing the means to discover and quantify the functional impact of cell statespecific genetic effects in a systematic and unbiased manner.
Results and discussion
scDALI enables the analysis of contextspecific allelic regulation from singlecell sequencing data, either generated from outbred individuals or F1 crosses of inbred wildisolates. Key to our approach is the integration of two independent signals that can be obtained from the same singlecell sequencing experiment: total counts and allelespecific quantifications. These signals can be derived from singlecell RNAsequencing, singlecell ATACsequencing, and a range of other epigenetic assays. scDALI first uses total counts, quantified at individual features, to define a manifold of cell types and cell states, similar to established workflows for the inference of state clusters [18] or pseudotemporal orderings [19]. Second, from the same dataset, allelespecific counts from matched cells are extracted, which allow for quantifying allelic imbalances and therefore genetic effects (Fig. 1a).
scDALI is a probabilistic model that can dissect dependencies between both of these signals, while aggregating evidence across cells to mitigate the sparsity of singlecell data. Briefly, our method can be cast as a generalized linear mixed model (GLMM) with a BetaBinomial likelihood that accounts for count noise and residual overdispersion due to unmodeled variability in the data (Methods). This formulation extends the classical BetaBinomial framework, which has previously been used for allelespecific analyses of bulk sequencing data [9, 12, 13, 20]. The model is also conceptually related to random effect models that have been proposed to study genotypeenvironment interactions in populationscale studies [10, 21]. scDALI captures both homogeneous deviation from a specified base allelic rate (e.g., 0.5 for autosomes in diploid organisms), as well as heterogeneity in allelic rates across cells using a kernel matrix that explains cell states, which is estimated from total counts [22, 23]. Our framework can capture a variety of cell state effects, including discrete cell clusters as well as continuous developmental trajectories. It is also possible to incorporate additional known covariates, such as batch or sample structure as fixed effects. Within the scDALI framework, we formulate computationally efficient score tests [24, 25] that allow us to identify sites that exhibit different types of allelic imbalance. In particular, scDALI implements tests specific to global homogeneous (pervasive) imbalance (scDALIHom), heterogeneous genetic effects that vary across cell types and cell states (scDALIHet) or either of these effects (scDALIJoint) (Fig. 1b). scDALI also allows for estimating the fraction of the total allelespecific variance that can be explained by cell state effects, and the model can be used to estimate and visualize allelic imbalances across cell states (Methods). Our framework is designed for the analysis of singlecell sequencing data from a small number of genetically distinct individuals, and hence the focus is not the discovery of novel quantitative trait loci but rather to leverage allelic imbalance to characterize the cellstate specificity of genetic factors.
Model validation using simulated data
Initially, we validated our approach using simulated data, which was designed to mimic real count data as expected from a very heterogeneous sample (singlecell data from whole embryos), by adapting key parameters from empirical sciATACseq profiles from whole Drosophila melanogaster embryos (continuous and discretized cell states, overdispersion parameters; Additional file 1: Fig. S1a, 2, Methods). First, we assessed the calibration of all three scDALI tests by simulating from the corresponding null models, confirming uniformly distributed Pvalues (Additional file 1: Fig. S1b, c). Notably, a variant of scDALIHet using a Binomial rather than BetaBinomial observation model was not calibrated at overdispersion levels estimated from real data (Additional file 1: Fig. S1c). We also considered two alternative tests for modeling empirical allelic rates (maternal counts divided by total counts): a oneway ANOVA, testing for differences between discrete clusters and a multipledegreesoffreedom likelihood ratio test based on an ordinary least squares regression model (OLS, Methods). While the ANOVA model was calibrated, the OLS model led to inflated Pvalues when testing highdimensional cell states relative to the sample size (Additional file 1: Fig. S1 d). Next, we simulated allelic counts from the scDALI model, varying the proportion of homogeneous versus heterogeneous allelic imbalance (Fig. 1c, Additional file 1: Fig. S2, 3a). As expected, scDALIJoint identified effects of both classes, generalizing the individual tests scDALIHet and scDALIHom. We then went on to simulate allelic counts either assuming continuous states, discrete cell state clusters derived from these states, or weighted combinations thereof (Fig. 1d, Additional file 1: Fig. S2, 3b, Methods). We compared scDALI to an ANOVA test based on the discretized cell state representation, finding that scDALIHet offered substantial advantages in the presence of additional continuous variation, whereas ANOVA is most suitable to detect purely discrete effects. We also considered a range of additional settings, varying the levels of overdispersion and kernel variance (Additional file 1: Fig. S3), finding that scDALI was robust to a range of different parameters. scDALI is implemented as computationally efficient opensource software, scaling to the analysis of large datasets with up to tens of thousands of cells (Additional file 1: Fig. S4).
scDALI identifies heterogeneous allelic imbalance in scATACseq from developing Drosophila embryos
Having validated the model, we applied scDALI to open chromatin regions during embryonic development in F1 hybrid embryos of Drosophila melanogaster. We profiled singlecell chromatin accessibility by sciATACseq in F1 embryos obtained by mating the same mother to four genetically distinct fathers [20]. To ensure that we captured regulatory variation associated with major developmental events, we collected embryos from four F1 crosses at three key stages of embryonic development (Fig. 2a, 2–4 h, 6–8 h, and 10–12 h after egg laying), which correspond to stages when the majority of cells are multipotent, or are undergoing lineage commitment, and or tissue differentiation, respectively. Sequencing the resulting 12 sciATACseq libraries generated a dataset of 35,485 single cells (between 8000 and 10,000 cells per cross) that passed stringent quality metrics (Additional file 1: Fig. S5; Methods). Overall, our dataset features all the hallmarks of highquality sciATACseq, including the appropriate nucleosomal banding pattern (Additional file 1: Fig. S5a), and a high concordance to previously identified peaks from a timematched sciATACseq dataset in a reference strain [26] (Additional file 1: Fig. S6).
To infer a common cell state representation for all time points and crosses, we adapted a variational autoencoder (VAE) [27] that was previously developed for scRNAseq data [28] to scATACseq. Briefly, a VAE is a neural network with a probabilistic bottleneck layer that learns the distribution of the data by compressing highdimensional observations into a lower dimensional latent space. Our implementation (Additional file 1: Fig. S7a) incorporates a sizefactor adjusted Bernoulli likelihood model tailored to the binary nature of scATACseq data. Furthermore, the model not only integrates measurements across datasets and batches but also allows to explicitly model information about different sampling times for developmental datasets (Additional file 1: Fig. S7d, e). This extension enables our model to infer continuous temporal ordering of cells by coupling the VAE objective function with a regression problem to predict sampling time from the latent cell state representation (Methods). We trained the model using the top 25,000 most accessible peaks across all crosses and time points. The VAE yielded a wellaligned latent space for all F1 crosses (Additional file 1: Fig. S7c) that captured progressive changes across developmental time (Fig. 2b, Additional file 1: Fig. S7d, e). We used the VAE latent space to define the cell state covariance for scDALI (see below). For annotating cell types, cells were clustered based on this lowerdimensional representation using the Leiden algorithm [29, 30] (28 clusters, Additional file 1: Fig. S7b), followed by an assignment of tissue identities based on the enrichment for enhancers with validated in vivo spatiotemporal activity in specific tissues during embryogenesis and genes with known tissuespecific expression [26] (Methods). Four smaller clusters with ambiguous annotations that likely correspond to barcode collisions were excluded from further analysis. This annotation process resolved seven cell populations that are representative of major embryonic lineages, including muscle, nervous system, and ectoderm (Fig. 2b).
Next, we quantified chromatin accessibility on an allelespecific level. We applied WASP [13] to avoid allelic mapping artifacts, filtering between 7 and 8% of mapped reads (Additional file 1: Fig. S8a, Methods). Allelespecific chromatin accessibility was quantified within 1 kb regions centered on ATAC peaks, requiring that each read overlapped at least one heterozygous variant. This resulted in a haplotype assignment for 20% of the reads (based on 5–6 variants per region on average, Additional file 1: Fig. S8b, c). After discarding peaks with low allelic coverage (mean count of reads that could be assigned to either allele < 0.1), we obtained between 8040 and 12,861 openchromatin peaks per cross for further analysis resulting in a combined set of 39,530 peaks to be tested (Additional file 1: Fig. S9d, e).
We applied scDALI to test for homogeneous or heterogeneous allelic imbalance at each of the 39,530 peaks, jointly considering cells across all developmental stages for each cross (Fig. 2c, d, Additional file 1: Fig. S9; using the VAE latent space coordinates to define the cell state kernel; Methods). scDALIJoint identified 7823 (~ 20%) ATAC peaks with evidence for allelic imbalance (FDR < 0.1, BenjaminiHochberg adjusted). Notably, the majority of these peaks were also identified by scDALIHom (83%), indicating that homogeneous imbalance is prevalent. However, scDALIHet identified 415 sites with evidence for cell statespecific allelic imbalance, 200 of which were missed by scDALIHom. This indicates that strong heterogeneity can preclude the identification of allelic imbalance by bulk sequencing or analysis strategies that assume exclusive homogenous effects. For instance, a peak (region chr3R:2031005620311056) in cross F1DGRP307 was identified with high significance by scDALIJoint (P = 1.93 × 10^{−8}) and scDALIHet (P = 5.45 × 10^{−8}), but was globally consistent with a model that assumes no allelic imbalance (scDALIHom P = 0.81, Fig. 2d). We also assessed whether heterogeneous allelic imbalance at peaks identified by scDALIHet could be explained by variation in total accessibility of the corresponding peak, finding no evidence for such a relationship for the vast majority of peaks (Additional file 1: Fig. S10 c, d). To evaluate to what extent scDALI is affected by the specific choice of cellstate representations, we also considered two alternative methods to define a cell state kernel, latent semantic indexing [26] (LSI), and cisTopic [31] (Methods). This comparison indicated that peaks with significant heterogenous imbalance were robustly identified across all three cellstate representations (Additional file 1: Fig. S10 a, b).
Properties of regions with heterogeneous allelic imbalance
We applied scDALI to estimate allelespecific accessibility in individual cells for 415 peaks with significant heterogeneous allelic imbalance. We considered two alternative strategies for annotating cellstatespecific effects. First, we aggregated estimated allelic rates for each of the 7 annotated lineages and compared the rate distribution and mean allelic rates to identify lineages with pronounced differential allelic imbalance. Second, we estimated transcription factor (TF) activity scores for each cell based on the total accessibility of a curated set of 65 transcription factor motives [4] (using chromVAR [32]) and ranked TFs based on the correlation between their activity and estimated allelic rates (Methods). Notably, the latter approach avoids the definition of discrete cell clusters and thus can be used to identify specific regulatory programs associated with allelic imbalance.
We find several cases in which allelic imbalance affects known lineagespecific regulatory elements. For example, region chr3R:2287748922878489 (scDALIHet P = 2.7 × 10^{−5}) has been previously identified as a neuronalspecific DNase Hypersensitive Site (DHS) [33] and has been demonstrated to function as a nervous system enhancer in vivo (CAD4 database [26]). Accordingly, this region is identified as predominantly accessible in the nervous system (Fig. 3a). In addition, while cells from other lineages show no appreciable allelic imbalance, accessibility in the nervous system is strongly biased for the paternal allele (Fig. 3b, Additional file 1: Fig. S11b). Ordering cells by their estimated allelic rate and computing the difference between the top and bottom 10% quantiles (Qdiff10), we define a measure of the effect size of heterogeneous allelespecific imbalances, which captures the variation in allelic rates between the most extreme populations (Additional file 1: Fig. S11a). For this specific example, we obtain a Qdiff10 of 0.24 despite the overall mean allelic rate being close to 0.5 (Additional file 1: Fig. S11b). In accordance with the allelic imbalance identified by scDALI at this locus, the assessment of TFs associated with heterogeneity in allelic effects identified known nervous system regulators, such as Tramtrack (ttk) and Hairy (h) (Fig. 3c, Additional file 1: Fig. S11e).
Interestingly, we found a number of regulatory regions that show opposing allelic imbalances in different lineages. For example, region chr2R:1367570713676707 has only a small maternal bias (estimated overall mean rate 0.61) when considering the global allelic rate but is identified as a site with pronounced allelic heterogeneity by scDALI (scDALIHet P = 1.5 × 10^{−8}, Fig. 3e). This region has previously been identified as a neuronal and musclespecific DHS [33] and accordingly shows increased accessibility in the nervous system and muscle in our data. However, accessibility is biased for the maternal allele in the muscle and the paternal allele in the nervous system (Qdiff10 = 0.29, Fig. 3f, Additional file 1: Fig. S11c). This pattern of opposing allelic imbalance is also reflected in the correlation with the activity of TFs active in these tissues. For example, known muscle regulators, such as Twist (twi) and Tinman (tin) are correlated with the maternal allelic rate, while factors active in the nervous system, for example, Tramtrack (ttk), Disconnected (disco), and Kruppel (Kr), are correlated with the paternal rate (Fig. 3g, Additional file 1: Fig. S11f).
Another example is chr3R:2031005620311056 (scDALIHet P = 5.45 × 10^{−8}), a region spanning an intron of the gene CG42668. The total accessibility of this region largely coincides with the known tissuespecific gene expression of CG42668 in the cells of the midgut and visceral muscle. Our allelespecific analysis revealed differential allelespecific effects in both tissues, suggesting distinct regulatory programs orchestrating the tissuespecific activity of CG42668 (Fig. 3h, i). Furthermore, muscle cells showed additional intralineage variation, resulting in a bimodal distribution of allelic rates (Additional file 1: Fig. S11d). Despite the presence of strong inter and intralineage variation (quantile difference 0.39), this effect is obscured in a bulklevel analysis (scDALIHom P = 0.81). The activity score of GATAe, a known midgut TF, is highly correlated (Pearson r > 0.5) with the maternal rate, while Zelda (vfl), which has a role in zygotic genome activation and early developmental patterning in the embryo primordium, with the paternal rate, consistent with the allelic bias observed in these cell populations (Fig. 3j, Additional file 1: Fig. S11g). The temporal intralineage variation within the muscle population is also reflected in the correlation with the activity of known early and late muscle TFs. Twist (twi) and Tinman (tin) are active in the early muscle primordium (mesoderm) where they direct the specification of the muscle lineages, and concordantly their activity scores are correlated with the paternal allelic rate observed in the early muscle cells. TF Lameduck (lmd) is instead correlated with the maternal rate, as it is required during later stages of muscle formation for the proper specification of the somatic and visceral muscle (Fig. 3j, Additional file 1: Fig. S11g).
More globally, allelespecific effects are stronger at distal regulatory elements (potential enhancers) compared to promoterproximal regions, both for peaks with heterogeneous (onesided MannWhitney U test, P = 5.6 × 10^{−5}) as well as homogeneous (onesided MannWhitney U test, P = 2.17 × 10^{−26}) imbalance (Fig. 3k). Furthermore, imbalances are significantly more common at distal versus proximal regions (Fig. 3l), similar to what has been observed in bulk ATACseq data at timematched developmental stages [20]. These differences between distal and proximal sites are less pronounced when considering discoveries from scDALIHom (twosided Binomial test P = 0.02), with about 61% of significant regions being found at proximal regions compared to 62% of all tested peaks. Interestingly, however, we find this effect to be markedly more prominent for heterogeneously imbalanced regions (twosided Binomial test P = 2.15 × 10^{−10}), with only 47% of peaks discovered by scDALIHet being located near gene promoters.
To further characterize heterogeneous imbalances, we used scDALI to assess differential lineage effects, testing for differences in mean allelic rates between each lineage and all remaining cells (Methods). Briefly, this test can be formulated under the scDALIHet framework, replacing the continuous cell state kernel with a blockdiagonal matrix to indicate lineage membership. Unsurprisingly, the frequency of significant imbalances by lineage (FDR < 0.1) largely resembled the overall read count distribution, which influences the detection power for allelic imbalance (Fig. 3m, Additional file 1: Fig. S12a). For the majority of peaks, allelespecific variation was attributable to one or two differentially imbalanced lineages (72%); however, 11% of peaks showed differences between three of four lineages (Fig. 3n). Interestingly, for 17% of scDALIHet discoveries, allelespecific effects do not differentiate any single lineage, indicating the presence of significant intralineage variation, for example due to variation in developmental time.
Identification of sites with heterogeneous allelic imbalance linked to developmental time
Developmental time is a major driver of variation in our dataset and therefore a promising predictor of allelespecific changes within lineages. We applied scDALI to test for timespecific allelic imbalances within muscle, the lineage with the largest number of cells, using the pseudotemporal ordering estimated by the VAE model as a cell state representation (Fig. 4d). Leveraging the scDALI framework, we design a kernel capturing both linear and nonlinear (polynomial) temporal dependencies (Methods). Out of 363 peaks with significant heterogeneous allelic imbalance that are accessible in muscle (mean total allelic count within lineage < 0.1), scDALI identified 69 (19%) peaks with significant timespecific effects (FDR < 0.1; Fig. 4a, b). Notably, 27% of these peaks with timespecific allelic imbalance did not show any lineagespecific effects (Fig. 4c). As an example, region chr2R:1367570713676707 discussed above (Fig. 3f) does indeed exhibit strong timespecific imbalances (Fig. 4e, Fig. 4f), consistent with the observed intralineage variation specifically in muscle cells.
Application of scDALI to identify celltypespecific effects of eQTL
To demonstrate that scDALI is also applicable to singlecell RNAseq, we considered a recently published multidonor singlecell RNAseq dataset of human induced pluripotent stem cells (iPSCs) differentiating towards definitive endoderm [5]. Samples were profiled using a fulllength sequencing protocol (Smartseq2, [34]), allowing for the quantification of gene expression in haplotyperesolved manner and thus providing the basis for an analysis using scDALI. Briefly, this study spans 34,254 cells (after basic filtering, Methods) from 125 donors at four time points during cell differentiation. We considered 3966 eQTL (SNPgene pairs) that were identified in the primary analysis and applied scDALIHet to assess the evidence for heterogeneous allelic gene expression. Briefly, for each of these eQTL, we aggregated allelespecific counts across all cells from donors with a heterozygous query eQTL variant relative to this variant (using haplotype phasing, c.f. Methods and [5]). We then used the first 20 principal components from total expression counts to construct a cell state kernel for the scDALI analysis (Methods).
While allelic rates are generally less susceptible to confounding variables such as batch effects, donorspecific read mapping biases as well as differences in the representation of cell types and cell states can lead to spurious signals of heterogeneous allelic variation. Indeed, we confirmed the need to account for the donor identities (donorID) in this analysis to retain calibrated test statistics (10 PCs, Fig. 5a, b; assessed using permuted cell coordinates; Methods).
We assessed the number of eQTL with heterogeneous imbalance discovered by scDALIHet when varying the number of principal components used to construct the cellstate kernel, finding that more complex kernels yielded a larger number of discoveries, which however saturated for five or more components (Fig. 5c). For example, a model using the first PC to define a cell state kernel (which primarily captures differentiation, Additional file 1: Fig. S13) identified 611 eQTL with heterogeneous allelic imbalance compared to 812 eQTL when using 10 components (Fig. 5d, FDR < 0.1). This indicates that although variation in gene expression in this data is predominantly explained by the differentiation state (Fig. 5e), the remaining sources of variation drive a substantial fraction of distinct genetic regulation. One example of such an effect is an eQTL with heterogeneous ASE for CPNE1 (P = 3 × 10^{−9}, scDALIHet). CPNE1 has been shown to play a role in neuronal progenitor cell differentiation [35]. Intriguingly, the pattern of allelic imbalance is confined to a distinct subpopulation of iPS cells, which is marked by expression of UTF1. Notably, this UTF1positive iPS subpopulation has recently been associated with differentiation efficiency towards a midbrain neural fate [6] (Fig. 5g).
Conclusion
The majority of disease associated variants impact noncoding regions, disrupting the function of regulatory elements such as enhancers and promoters. As enhancers regulate when and where genes are expressed, genetic variation within enhancers naturally has cell typespecific effects. However, capturing and understanding these genetic effects is an enormous challenge. Resolving these effects to specific cell types using classical quantitative trait loci (QTL) mapping would require FACS sorting different cell types from a heterogeneous tissue across a large panel of individuals, a huge task that is often impossible as specific markers for cell isolation are not available for many cell types and transitions.
To address this, we developed scDALI, a computational framework to characterize the celltype specificity of genetic effects from singlecell sequencing data in an unbiased fashion. Our model provides a principled strategy for exploiting two independent signals that can be obtained from the same sequencing experiment, whether that is gene expression or epigenetic data: (1) total counts, which we use to derive cell types and states, and (2) allelespecific quantifications of genetic effects within genomic features such as genes or ATAC peaks of accessibility. Combining these two measurements allowed us to test for both pervasive, homogeneous imbalance and cellstatespecific heterogeneous effects, without the need to define cell types or cell states a priori.
We applied scDALI to newly generated scATACseq profiles from an F1 cross design, assaying dynamic and discrete changes in allelespecific chromatin accessibility of developing Drosophila melanogaster embryos, a naturally very heterogeneous sample. Our model discovers thousands of imbalanced regions, hundreds of which show distinct cell statespecific effects. About half of the regulatory regions with allelic imbalance in specific cell types are not detectable in a pseudobulk analysis, as opposing effects cancel out across the cell state space. Although the total number of discoveries with heterogeneous effects is relatively modest, we expect this to increase dramatically as the number of profiled cells increases. Even with the numbers profiled here, our analysis identified genetic effects at a number of characterized tissuespecific developmental enhancers. scDALI estimates allelespecific effects in individual cells, which allows dissecting this heterogeneity at different resolutions. We have shown how this map can be used to identify the underlying regulatory programs by associating differential allelic imbalance with pathway or transcription factor activity scores. Alternatively, it is possible to aggregate allelic rates at the level of known (discrete) clusters, thereby assessing the distribution of estimated allelic activity both between and within lineages or cell types. We find that developmental time is an important contributor to intralineage variation of allelic imbalance, pinpointing developmental stagespecific enhancers. Furthermore, our analysis revealed that allelespecific effects are significantly stronger and more common at distal elements (putative enhancers) compared to promoterproximal regions. Notably, these differences are markedly more pronounced among peaks with heterogeneous (tissuespecific) imbalances compared to homogeneous effects, confirming and extending previous results on bulksequencing data [20]. We then applied scDALI to a published scRNAseq dataset from 125 human iPS cell lines and demonstrated how our model can be used to discover contextspecific genetic effects of known eQTL and characterize the associated cellular subpopulations.
While our approach uncovers many novel putative enhancers, it also has its limitations. The focus of this work lies on the characterization of cellstatespecific effects for known quantitative trait loci and the mapping of genetic effects from few available individuals or even a single sample. In particular, we do not test for interactions between cell states and the presence of genetic variants, which prevents our model from discovering potential causal loci associated with cellstatespecific allelic imbalance. While in principle, it is possible to combine allelic analyses with genotype data to identify causal variants [9, 11,12,13], this requires larger numbers of unique genotypes. Furthermore, even for populationscale studies using singlecell sequencing, the primary interest is the characterization of known loci and not discovery of novel effects. These considerations are motivated by differences in power to detect eQTL in bulk versus singlecell data [36] and the sample sizes that can currently be profiled using singlecell readouts. The required multiindividual singlecell sequencing studies are only beginning to emerge and scDALI could be extended to leverage such variation.
Understanding to what degree allelespecific effects replicate at different molecular layers remains another important direction of future research. In this study, we have demonstrated that scDALI can be flexibly applied to both singlecell RNAseq and ATACseq data. However, new multiomics methods can obtain both DNA accessibility and RNA measurements from the same single cell [37]. The integration of these different dimensions of allelic imbalance across both modalities will be an important area for future work that may help to relate the functional impact of genetic variation in enhancers to their target gene’s expression.
Methods
scDALI model
scDALI extends the frequently used BetaBinomial observation model for allelespecific read counts in bulksequencing data [9, 12, 13, 20], by accounting for cellstatespecific effects. For a given genomic region of interest and cells i = 1, …, n let a_{i} be the number of reads mapping to the the maternal haplotype and d_{i} be the total number of reads. Furthermore, let K denote a n × n cellstate kernel matrix, capturing celltocell covariances across the cell state space. Throughout our analyses, we use K = EE^{T} where E is a lowdimensional representation based on total read counts. For example, E can be obtained by reducing the dimensionality of the total counts matrix using principal component analysis (PCA) or a variational autoencoder model (see below) or by constructing a (onehotencoded) cell clustering. scDALI captures cellstatespecific allelic variation on a logit scale using a latent ndimensional Gaussian variable:
Here, the scalar α denotes global or homogeneous allelic imbalance, affecting all cells equally and independent of the cell state, while σ^{2} modulates the strength of cellstatespecific or heterogeneous effects. To couple u to the mean of a BetaBinomial observation model for allelic read counts, scDALI then uses a logit link function g(x) = log (x/(1 − x)):
The parameter θ captures residual, extrabinomial variance (overdispersion) due to unmodeled technical and biological sources of variation. Note that by using a linear kernel function K = EE^{T}, we effectively cast our model as a generalized linear mixed model [38] (GLMM). However, the model can in principle be extended to the nonlinear case using common kernel functions from the Gaussian process literature [39] or using nonlinear transformations of individual cellstate dimensions.
To systematically assess homogeneous or heterogeneous allelic imbalance, scDALI implements three score tests:

scDALIHet H_{0}^{Het} : σ^{2} = 0 vs. H_{1}^{Het} : σ^{2} > 0 (heterogeneous imbalance)

scDALIHom H_{0}^{Hom} : α = 0 vs. H_{1}^{Hom} : α ≠ 0 (homogeneous imbalance)

scDALIJoint H_{0}^{Joint} : α = 0, σ^{2} = 0 vs. H_{1}^{Joint} : α ≠ 0 or σ^{2} > 0 (general imbalance)
By leveraging a scorebased framework, scDALI avoids fitting the full GLMM model under the alternative hypotheses when evaluating the test statistics, which is computationally expensive for large data sets. The associated null models can be fitted efficiently using the Fisher scoring/NewtonRaphson. For a full derivation of the scDALI score tests, see Additional file 2: Supplementary Methods.
Allelic rate interpolation and downstream analysis
Once a set of genomic regions with significant heterogeneous allelic imbalance has been identified, scDALI can be used to estimate the landscape of allelic imbalance across the cell state space for the purpose of visualization and downstream analysis. For computational reasons, we approximate the scDALI model described above, replacing the BetaBinomial observation model (3) with a Gaussian likelihood model for empirical rates r_{i} = a_{i}/d_{i}. Both model parameters (equation (1)) and posterior approximations for allelic rates can be fitted efficiently using sparse variational inference [40, 41].
While the estimate for α provides a measure of pervasive, homogeneous allelic imbalance, we can use the posterior mean of the latent variable u as an estimate for cellspecific allelic rates. In particular, we define a measure of effect size or statistical dispersion for heterogeneous effects, Qdiff10, as the difference between the 90% and 10% quantiles of the estimated posterior mean for u.
Guidelines for the cell state definition
An appropriate cellstate definition depends on the data as well as the research question. If differences in allelic rates between discrete cell types are of primary interest, using a onehot encoding of the cell type clusters will maximize detection power for these effects. However, such a representation will ignore continuous, e.g., intracell type variation and results will depend on whether or not the cell clustering represents a biologically meaningful discretization. In some cases, a continuous representation is a natural choice, e.g., when studying differentiation or developmental time courses. As a generalpurpose approach, we propose to use a lowerdimensional embedding of the total counts matrix for the detection of both continuous and discrete effects. Similar to the standard analysis of singlecell sequencing data, the choice of a particular dimensionality reduction method should be informed by a variety of factors (dataset size, need for interpretability, specific characteristics of the data modality, etc.).
Guidelines for the control of confounding effects
In many cases, both alleles are thought to be affected similarly by technical confounders and batch effects, and consequently, these effects will cancel out when quantifying allelic rates. However, all factors that may affect rates rather than allelic counts need to be accounted for, e.g., possible individualspecific reference mapping biases in a populationscale analysis (see section Analysis of allelic imbalance in populationscale iPSC data). Nevertheless, we advise to adjust for common technical confounders (batch effects, size factors) when constructing the cellstate representation, which is typically based on total read counts.
Cell state variational autoencoder
To infer a lowerdimensional embedding from chromatin accessibility profiles of temporally resolved scATACseq data, we implement a variational autoencoder model [27] (VAE). Variations of VAE models have been widely applied to model singlecell transcriptome measurements [28, 42, 43] and more recently been extended to model chromatin accessibility data [44]. Our model is most closely related to scVI [28], a VAE capable of integrating scRNAseq data across different individuals or batches while accounting for librarysize variation. However, our model differs from scVI in two notable aspects. First, we use a likelihood model tailored to the nearbinary nature of singlecell ATACseq data. Second, we integrate sampling times for developmental datasets to estimate a continuous pseudotemporal ordering from few available time points. The model can be decomposed into three submodules (Additional file 1: Fig. S1a): the decoder network, representing the generative process for observed accessibility profiles, the temporal classifier, and the encoder network for inferring the posterior distribution over latent variables.
Let x_{i} ∈ {0, 1}^{m} be the binarized accessibility vector for m peaks in cell i and c_{i} be the batch / individual identity. The probabilistic generative model underlying the decoder module is as follows:
Here, z are the latent, lowdimensional cell state representations, and l_{i} is a cellspecific sizefactor variable capturing variation in sequencing depth [28]. The prior parameters μ_{i}(c_{i}) and σ_{l}^{2}(c_{i}) are chosen to be maximumlikelihood estimates based on the total number of reads per cell in each cross. Cell states along with observed batch ids are mapped to mdimensional peak activities ρ_{i} ∈ [0, 1]^{m}, ∑_{j}ρ_{ij}, representing the relative “openness” of each peak in cell i. The mapping is realized by a neural network f_{ρ} with trainable parameters. By providing both f_{ρ} and the encoder network (see section below) with c_{i}, the model is encouraged to disentangle batchspecific effects and cell state representations [28, 45] (Additional file 1: Fig. S1c). The full distribution over observed accessibility profiles is obtained by applying the scaling factor to the peak activities. If l_{i} were the true (discrete) number of reads per cell, \( 1{\left(1{\rho}_{ij}\right)}^{l_i} \) would correspond to the probability of observing at least one read in peak j. However, to simplify the inference process, we place a continuous lognormal prior on l_{i}.
For the Drosophila melanogaster data considered in this paper, coarse temporal information in the form of embryo collection windows is available (Additional file 1: Fig. S1d). We integrate these time stamps with the observed ATACseq data to inform the latentspace inference, correct time measurement errors, and learn a continuous ordering of cells from few available labels. Assume the time label y_{i} for cell i takes on one of t ordered values. If t is small, it is difficult to accurately estimate the temporal scale at which cell state changes take place. Instead, we model the relative order of cells as a function of the cell state z_{i}, using an ordinal likelihood model. Formally, we assume y_{i} ∈ {1, 2, …, t} and define [46]
where Φ denotes the cumulative distribution function of the standard normal distribution, w_{0} = − ∞, w_{t} = ∞ and w_{1}, …, w_{T − 1} are trainable parameters such that w_{i} < w_{i + 1}. The function f_{y} maps cell states to pseudotemporal values along the real axis and is chosen to be a simple linear model. Intuitively, p(y_{i}  z_{i}) corresponds to the probability of sampling a value from the interval \( \left({w}_{y_i1},{w}_{y_i}\right) \) under a normal distribution with mean f_{y}(z_{i}) and unit variance. By allowing for Gaussian noise around the latent time f_{y}(z_{i}), we can account for measurement errors in the labeling process. Note that the w_{i} form a contiguous segmentation of the real line which enforces ordinal constraints. Guided by both observed time stamps and cell state proximities, the model infers a highresolution pseudotemporal trajectory f_{y}, allowing us to order cells according to their developmental progression.
We optimize all parameters jointly using amortized variational inference [27], incentivizing the model to learn a cell state representation that is informed by and supports the observed time labels (Fig. 2b, Additional file 1: Fig. S1e). For a full description of the mathematical details of the variational approximation and practical implementation details, we refer the reader to the Additional file 2: Supplementary Methods.
Generation and sequencing of Drosophila melanogaster F1 embryos
We generated Drosophila melanogaster F1 hybrids by crossing females from a common maternal virginizer line with males from four different inbred lines from the Drosophila melanogaster genetic reference panel [20, 47] (DGRP). Embryos were collected in 2 h windows (2–4 h, 6–8 h, and 10–12 h after egg laying) as previously described [20].
Hyperactive Tn5 transposase was purified by the EMBL Protein Expression and Purification facility as previously described [48] and stored at − 20 °C in storage buffer (25 mM Tris pH 7.5, 800 mM NaCl, 0.1 mM EDTA, 1 mM DTT, 50% glycerol) until use. Uniquely indexed oligonucleotides from Cusanovich et al. [26] were annealed to common pMENTs oligos 95 °C 5 min, cooling to 65 °C (0.1 °C/s), 65 °C 5 min, cooling to 4 °C (0.1 °C/s)) to generate indexed transposons that were then loaded onto purified Tn5 by incubation at 23 °C with constant shaking at 350 rpm for 30 min. The loaded Tn5 transposomes were diluted 1:10 (final 0.02 mg/ml) in nucleasefree water and used immediately for tagmentation.
Embryo dissociation and nuclear isolation were performed as described previously [26]. Nuclei were flash frozen in liquid nitrogen and stored at − 80 °C until use. Generation of sciATACseq libraries was performed largely as previously described [26], with minor modifications. The tagmentation reaction was performed by adding 2 μL of each of the 96 custom and uniquely indexed Tn5 transposomes and by incubating at 55 °C for 1 h. After reversecrosslinking, 5 μL of forward and reverse indexed primers (from Cusanovich et al. [26]), 7.5 μL KAPA HiFi DNA Polymerase ReadyMix (Roche) and 0.25 μL Bst3.0 (NEB) were added to each well. Tagmented DNA was then PCR amplified with the following cycling conditions: 72 °C 5 min, 98 °C 30 s; 98 °C 10 s, 63 °C 30 s, 19–22 cycles; 72 °C 1 min, hold at 10 °C. The optimal number of cycles for each library was determined beforehand by monitoring amplification on a qPCR machine for a set of test wells. Libraries were sequenced on an Illumina NextSeq 500 sequencer High Capacity 150 PE kit as previously described [26].
Processing of raw sciATAC sequencing data
Raw sequencing data was processed based on the pipeline (https://github.com/shendurelab/flyatac/) developed by Cusanovich et al. [26]. BCL files were converted to fastq files using bcl2fastq v.2.16 (Illumina). To correct for sequencing or PCR amplification errors, read barcodes were matched against all possible barcodes. In case of an approximate match (Levenshtein distance < 3 and distance to next best match > 2), the corresponding barcodes were fixed to their presumptive match; all other barcodes were classified as ambiguous or unknown. Barcode correction was followed by adapter trimming [49] and read alignment to the dm6 reference genome using bowtie [50] (with options X 2000 3 1). After the removal of PCR duplicates, we classified barcodes corresponding to genuine cells from the background by fitting a twocomponent Gaussian mixture model to the logtransformed read counts per barcode. A cutoff for cell barcodes was determined by requiring that the posterior probability of belonging to the higher readdepth mixing component was greater than 0.95 (Additional file 1: Fig. S6b, c).
Chromatin accessibility was quantified in a set of 53,133 peaks of accessibility previously identified from a timematched sciATACseq dataset [26] and lifted to the dm6 reference genome (https://github.com/FlyBase/bulkfilescripts). We compared the Pearson correlation between pseudobulk aggregates for each collection window both within our dataset as well as between our data and the published reference (Additional file 1: Fig. S7). Next, we restricted our analysis to autosomes in order to remove sexspecific biases [26]. For each cross and collection window, we determined the 10% and 99% quantiles of the cellcount distribution and only kept cells whose counts were within those limits, resulting in 35,485 cells in total. Finally, we selected the 25,000 top most accessible peaks for further analysis. We trained the cell state variational autoencoder on the whole dataset for 30 epochs. Additionally, we generated alternative representations for data from cross F1DGRP712 using LSI [26] (leading components 2 to 20; the first component was excluded due to correlation with total counts per cell) and cisTopic [31] (50 topics). These embeddings were only used to assess the robustness of the scDALI workflow to different cell state representations (Additional file 1: Fig. S10).
We used the Scanpy implementation of the Leiden algorithm [29, 30] with a resolution of 1.2 and identified 28 cell clusters in the joint VAE latent space. For each cluster, we computed differentially accessible peaks using logistic regression by predicting cluster labels from the relative peak activity profiles obtained from the VAE model [30, 51]. We then performed an enrichment analysis for known tissue specific enhancer elements (CAD4 database [26] and genes (tissuespecific expression of the nearest gene based on in situ hybridization data from the Berkeley Drosophila Genome Project (http://insitu.fruitfly.org/cgibin/ex/insitu.pl) and FlyBase gene expression annotations (http://flybase.org/) using a Fisher’s exact test. Based on these enrichments, each cluster was assigned to one of seven major lineages (Fig. 2b). Four clusters with a total of 1432 cells could not be annotated unambiguously and were removed from further analysis (Fig. 1b), resulting in a final set of 35,485 cells (Additional file 1: Fig. S1d).
Processing of allelespecific sciATAC counts
We used existing genotyping information [20] for the parental strains to create crossspecific VCF files, filtering for genetic variants that were heterozygous in the F1 generation. To eliminate reference mapping biases, we applied the WASP pipeline [13] (https://github.com/bmvdgeijn/WASP/tree/master/mapping) and created filtered BAM files, removing between 7 and 8% of mapped reads (Additional file 1: Fig. S8a) from the original alignment. For each of the 35,485 cells, we then quantified allelespecific accessibility by adapting the original WASP code for count generation to singlecell sequencing (https://github.com/tohein/scai_utils). As features, we chose 1 kb windows centered on each of the 53,133 peaks to mitigate the inherent sparsity of the data. Reads aligned within these windows were assigned to either allele, requiring that each read overlapped at least one heterozygous singlenucleotide variant. If a read overlapped multiple variants, one was chosen at random to determine the allele of origin. As it can be challenging to accurately estimate the allelic base rate for sex chromosomes (that is, the overall proportion of female embryos), we excluded these from our analysis. Finally, windows in each cross were filtered by requiring that the mean allelic total count (that is, the sum of reads that could be assigned to either allele) across cells was no smaller than 0.1 (Additional file 1: Fig. S8d). This resulted in between 8040 and 12,861 peaks per cross and a combined set of 39,530 peaks to be tested for allelic imbalance (Additional file 1: Fig. S8e).
scDALI analysis of Drosophila melanogaster sciATAC data
We applied scDALI to all of the 39,530 peaks to test for heterogeneous (scDALIHet), homogeneous (scDALIHom) and either kind of allelic imbalance (scDALIJoint). Both scDALIHet and scDALIJoint used the 8dimensional VAE latent space embedding as a cell state representation and a linear kernel function. Pvalues from each test were BenjaminiHochberg adjusted to control the false discovery rate [52] (FDR). For each of the 415 sites with evidence for cell statespecific allelic imbalance (scDALIHet P < 0.1 FDR), we estimated allelic rates using scDALI. Depending on the number of covered cells for each peak, all models were trained with a maximum of 1000 inducing points.
To compute transcription factor activity scores, deviations in accessibility were calculated with chromVAR v1.10.0 [32] for a set of 65 curated Drosophila motifs from [4]. The Zscore corrected deviations were used to calculate the Pearson correlation with the estimated allelic rates.
To identify variable lineages for each of the 415 peaks with heterogeneous imbalance, we used scDALIHet with lineagespecific cell state kernels. Specifically, for each lineage we used a block diagonal kernel matrix, where entry i, j was set to 1 if both cells i and j were associated with that lineage and 0 otherwise. This allowed us to test for differences in the mean allelic rate for a particular lineage compared to the mean of all other cells. The combined Pvalues for all lineages were adjusted for multiple testing using the BenjaminiHochberg correction [52].
Lastly, we employed scDALIHet to test for changes in allelic imbalance across development in the muscle lineage using the temporal ordering inferred by the VAE model. To capture nonlinear effects, we applied a polynomial basis transform. Specifically, we constructed a matrix E_{Time} with entries (E_{Time})_{ij} = t_{i}^{j − 1}, j ∈ {1, 2, 3}, where t_{i} ∈ [0, 1] denotes the estimated time for cell i in the muscle lineage. We then applied scDALI assuming a linear cellstate kernel K = E_{Time}E_{Time}^{T}. Pvalues were adjusted using the BenjaminiHochberg correction [52].
Evaluation of scDALI on simulated data
We simulated allelespecific counts from the scDALI model (Eq. (1–3)) using observed allelic total counts and inferred cell state representations (VAE embedding and Leiden clusters derived from the VAE embedding) from real sciATACseq data of developing Drosophila melanogaster embryos (cross F1DGRP712). All simulations kernels were linear, that is K_{VAE} = E_{VAE}E_{VAE}^{T} and K_{Cluster} = E_{Cluster}E_{Cluster}^{T} where E_{VAE} and E_{Cluster} denote the 8dimensional VAE embedding and onehot encoding of 24 Leiden clusters, respectively.
We assessed the degree of extrabinomial variation present in the data, by fitting a basic BetaBinomial model to the observed allelespecific counts (10,220 cells and 12,861 peaks) using no additional cell state information (Additional file 1: Fig. S1a). Based on the histogram of estimated values, we ran all simulations at two different levels of overdispersion θ ∈ {2, 5}.
We first assessed the calibration of scDALI when testing for heterogeneous effects (scDALIHet). As scDALI is intended to leverage multidimensional cellstate representations, we analyzed the effect of testing an increasing number of cellstate dimensions for different numbers of cells. We considered two baseline candidates: a oneway ANOVA test, comparing allelic rates between cell clusters as well as a linear model incorporating cellstate covariates as fixed effects (likelihoodratio test, OLSLRT). Both alternatives were fitted to empirical allelic rates. All three tests used the observed Leiden clustering as a cellstate representation. We simulated data from a model assuming no heterogeneous imbalance, varying the number of clusters (cellstate dimensions) while keeping the number of simulated cells constant. We considered four different sample sizes: 250, 500, 1000, and 5000 cells per peak with nonzero allelic measurements. All experiments were performed for 1000 peaks. We computed the average inflation factor log_{10}(median P)/log_{10}(0.5) (Additional file 1: Fig. S1d) across 25 different random initializations, finding the OLSLRT to produce inflated pvalues for a large number of cellstate dimensions relative to the sample size. This is consistent with results on multipledegreesoffreedom tests reported previously [21]. We therefore excluded the OLSLRT from further simulation experiments.
We verified the uniform distribution of pvalues for all three scDALI models and the oneway ANOVA when simulating data from their respective null models (1000 peaks and 5000 cells randomly sampled from the full data), considering different levels of pervasive (cell state independent) effects (α ∼ N(0, ν^{2}), where ν^{2} ∈ {0, 0.01, 0.05, 0.1}; Additional file 1: Fig. S1b, c). Here, all models used the 8dimensional VAE embedding as a cellstate representation. Additionally, we show that a modified version of scDALIHet using a Binomial (rather than BetaBinomial) likelihood model will lead to false positive results at relevant levels of overdispersion.
We compared power to detect homogeneous vs. heterogeneous effects for scDALIHet, scDALIHom and scDALIJoint (Fig. 1c, Additional file 1: Fig. S3a). Let ρ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1} denote the relative extent of heterogeneous imbalance and ν^{2} ∈ { 0.01, 0.05, 0.1} the total variance explained by allelespecific effects (both heterogeneous and homogeneous). We simulated data from the scDALI model (Eq. (1–3)), where α ∼ N(0, ρ · ν^{2}) and σ^{2} = (1 − ρ) · ν^{2}, using observed total counts and cellstate representations for 5000 cells and 1000 ATAC peaks randomly chosen from the observed data. All three models as well as the simulation procedure were run using the 8dimensional VAE embedding as a cellstate representation. Statistical power was calculated as the fraction of simulated regions discovered at an αlevel of 0.05 and averaged across 25 random seeds.
Lastly, we assessed power to detect discrete vs. continuous heterogeneous effects, using a weighted combination of the VAE and Leiden cluster kernels
In this scenario we assumed no additional homogeneous effects (α = 0) and considered a range of weights η and kernel scaling parameters. (η ∈ {0, 0.2, 0.4, 0.6, 0.8, 1} and σ^{2} ∈ {0.1, 0.05, 0.1}, Fig. 1d, Additional file 1: Fig. S3b). Compared were scDALIHet using the VAE representation vs. a oneway ANOVA model based on the discrete Leiden clusters. As above, we simulated data for 5000 cells and 1000 peaks and averaged power estimates across 25 random initializations.
Analysis of allelic imbalance in populationscale iPSC data
Total gene expression counts for all genes and allelespecific quantifications for 4470 previously identified SNPgene pairs (4422 eQTL lead variants) were obtained as described in the primary publication [5]. Briefly, reads were initially mapped to reference and alternative alleles for each heterozygous SNP in every cell and subsequently assigned relative to the genotype of each chromosome using known phasing information. Allelespecific read counts were aggregated at the gene level, by summing up the counts for each SNPs contained in exonic regions. Finally, for each eQTL (geneSNP pair), genelevel allelespecific counts were interpreted relative to the eQTL variant to obtain a consistent definition of ASE across cells from different donors that were heterozygous for that variant. SNPgene pairs were filtered by requiring at least 50 cells with nonzero allelespecific counts, leading to 3966 pairs to be tested using scDALI. We performed principal component analysis (PCA) of total gene expression counts from 34,254 cells and used the leading k principal components (PCs) and a linear kernel function to define cell state kernels. We chose k = 1 to focus on timespecific allelic imbalance (see also Additional file 1: Fig. S1) while k = 10 was used to model more general cellstate effects.
To assess the effect of donorspecific effects on heterogeneous allelespecific expression, we permuted the leading 10 PC coordinates among cells from the same donor. We then compared two implementations of scDALIHet that either did or did not account for the donor background using a onehotencoded representation of the donor identities for each cell as an additional covariate matrix. Additionally, we compared the difference in the number of discoveries for each model when using cellstate kernels based on 1, 2, 3, 4, 5, 10, 15, and 20 (unpermuted) PCs.
Availability of data and materials
All raw sequencing data from the Drosophila F1 study have been submitted to the EMBLEBI ArrayExpress database (https://www.ebi.ac.uk/arrayexpress/) and are available under accession number EMTAB10240 [53]. Processed data, including the sciATAC peaks per genotype and stage, can all be downloaded from http://furlonglab.embl.de/data. The processed iPSC scRNAseq data is available on zenodo (https://zenodo.org/record/3625024#.YcM1GS8w0eb) and all HipSci genotyping data can be accessed under http://www.hipsci.org. A Python implementation of the scDALI framework including the VAE model for cell state inference is available under the BSD3 license from https://github.com/PMBio/scdali [54]. Code to reproduce the specific analyses is available from https://github.com/PMBio/scdali_analyses and was deposited on zenodo (https://zenodo.org/record/5710797) [55].
References
Li X, Kim Y, Tsang EK, Davis JR, Damani FN, Chiang C, et al. The impact of rare variation on gene expression across tissues. Nature. 2017;550:239–43.
Ferraro NM, Strober BJ, Einson J, Abell NS, Aguet F, Barbeira AN, et al. Transcriptomic signatures across human tissues identify functional rare genetic variation. Science. 2020;369:eaaz5900.
GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318–30.
Cannavò E, Koelling N, Harnett D, Garfield D, Casale FP, Ciglar L, et al. Genetic variants regulating expression levels and isoform diversity during embryogenesis. Nature. 2017;541:402–6.
Cuomo ASE, Seaton DD, McCarthy DJ, Martinez I, Bonder MJ, GarciaBernardo J, et al. Singlecell RNAsequencing of differentiating iPS cells reveals dynamic genetic effects on gene expression. Nat Commun. 2020;11:810.
Jerber J, Seaton DD, Cuomo ASE, Kumasaka N, Haldane J, Steer J, et al. Populationscale singlecell RNAseq profiling across dopaminergic neuron differentiation. Nat Genet. 2021;53:304–12.
Kumasaka N, Rostom R, Huang N, Polanski K, Meyer K. Mapping interindividual dynamics of innate immune response at singlecell resolution. bioRxiv [Internet]. bioRxiv. 2021. Available from: https://doi.org/10.1101/2021.09.01.457774
Benaglio P, Newsome J, Han JY, Chiou J, Aylward A, Corban S, et al. Mapping genetic effects on cell typespecific chromatin accessibility and annotating complex trait variants using single nucleus ATACseq [Internet]. bioRxiv. 2020. Available from: https://doi.org/10.1101/2020.12.03.387894
Kumasaka N, Knights AJ, Gaffney DJ. Finemapping cellular QTLs with RASQUAL and ATACseq. Nat Genet. 2016;48:206–13.
Cuomo ASE, Heinen T, Vagiaki D, Horta D, Marioni JC, Stegle O. CellRegMap: a statistical framework for mapping contextspecific regulatory variants using scRNAseq [Internet]. bioRxiv. 2021. Available from: https://doi.org/10.1101/2021.09.01.458524
Knowles DA, Davis JR, Edgington H, Raj A, Favé MJ, Zhu X, et al. Allelespecific expression reveals interactions between genetic variation and environment. Nat Methods. 2017;14:699–702.
Sun W. A statistical framework for eQTL mapping using RNAseq data. Biometrics Wiley. 2012;68:1–11.
van de Geijn B, McVicker G, Gilad Y, Pritchard JK. WASP: allelespecific software for robust molecular quantitative trait locus discovery. Nat Methods. 2015;12:1061–3.
Mohammadi P, Castel SE, Brown AA, Lappalainen T. Quantifying the regulatory effect size of cisacting genetic variation using allelic fold change. Genome Res. 2017;27:1872–84.
Sun M, Zhang J. Allelespecific singlecell RNA sequencing reveals different architectures of intrinsic and extrinsic gene expression noises. Nucleic Acids Res. 2020;48:533–47.
Jiang Y, Zhang NR, Li M. SCALE: modeling allelespecific gene expression by singlecell RNA sequencing. Genome Biol. 2017;18:74.
Fan J, Wang X, Xiao R, Li M. Detecting celltypespecific allelic expression imbalance by integrative analysis of bulk and singlecell RNA sequencing data. PLoS Genet. 2021;17:e1009080.
Chen H, Lareau C, Andreani T, Vinyard ME, Garcia SP, Clement K, et al. Assessment of computational methods for the analysis of singlecell ATACseq data. Genome Biol. 2019;20:241.
Saelens W, Cannoodt R, Todorov H, Saeys Y. A comparison of singlecell trajectory inference methods. Nat Biotechnol. 2019;37:547–54.
Floc’hlay S, Wong E, Zhao B, Viales RR, ThomasChollier M, Thieffry D, et al. Cisacting variation is common across regulatory layers but is often buffered during embryonic development. Genome Res. 2021;31:211224.
Moore R, Casale FP, Jan Bonder M, Horta D. BIOS Consortium, Franke L, et al. A linear mixedmodel approach to study multivariate geneenvironment interactions. Nat Genet. 2019;51:180–6.
Svensson V, Teichmann SA, Stegle O. SpatialDE: identification of spatially variable genes. Nat Methods. 2018;15:343–6.
Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, et al. Computational analysis of celltocell heterogeneity in singlecell RNAsequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33:155–60.
Lin X. Variance component testing in generalised linear models with random effects. Biometrika. 1997;84:309–26.
Zhang D, Lin X. Hypothesis testing in semiparametric additive mixed models. Biostatistics. 2003;4:57–74.
Cusanovich DA, Reddington JP, Garfield DA, Daza RM, Aghamirzaie D, MarcoFerreres R, et al. The cisregulatory dynamics of embryonic development at singlecell resolution. Nature. 2018;555:538–42.
Kingma DP, Welling M. Autoencoding variational Bayes [Internet]. arXiv [stat.ML]. 2013. Available from: http://arxiv.org/abs/1312.6114v10
Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for singlecell transcriptomics. Nat Methods. 2018;15:1053–8.
Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing wellconnected communities. Sci Rep. 2019;9:5233.
Wolf FA, Angerer P, Theis FJ. SCANPY: largescale singlecell gene expression data analysis. Genome Biol. 2018;19:15.
GonzálezBlas CB, Minnoye L, Papasokrati D, Aibar S, Hulselmans G, Christiaens V, et al. cisTopic: cisregulatory topic modeling on singlecell ATACseq data. Nat Methods. 2019;16:397–400.
Schep AN, Wu B, Buenrostro JD, Greenleaf WJ. chromVAR: inferring transcriptionfactorassociated accessibility from singlecell epigenomic data. Nat Methods. 2017;14:975–8.
Reddington JP, Garfield DA, Sigalova OM, Karabacak Calviello A, MarcoFerreres R, Girardot C, et al. Lineageresolved enhancer and promoter usage during a time course of embryogenesis. Dev Cell. 2020;55:648–64.e9.
Picelli S, Faridani OR, Björklund AK, Winberg G, Sagasser S, Sandberg R. Fulllength RNAseq from single cells using Smartseq2. Nat Protoc. 2014;9:171–81.
Park N, Yoo JC, Ryu J, Hong SG, Hwang EM, Park JY. Copine1 enhances neuronal differentiation of the hippocampal progenitor HiB5 cells. Mol Cells. 2012;34:549–54.
Cuomo ASE, Alvari G, Azodi CB, singlecell eQTLGen consortium, McCarthy DJ, Bonder MJ. Optimizing expression quantitative trait locus mapping workflows for singlecell studies. Genome Biol. 2021;22:188.
Ma S, Zhang B, LaFave LM, Earl AS, Chiang Z, Hu Y, et al. Chromatin potential identified by shared singlecell profiling of RNA and chromatin. Cell. 2020;183:1103–16.e20.
McCulloch CE, Searle SR. Generalized, linear, and mixed models. New Jersey: Wiley; 2004.
Rasmussen CE. Gaussian Processes in Machine Learning. In: Bousquet O, von Luxburg U, Rätsch G, editors. Advanced lectures on machine learning: ML Summer Schools 2003, Canberra, Australia, February 2  14, 2003, Tübingen, Germany, August 4  16, 2003, Revised Lectures. Berlin, Heidelberg: Springer Berlin Heidelberg; 2004. p. 63–71.
AGG M, Van Der Wilk M, Nickson T, Fujii K, Boukouvalas A, LeónVillagrá P, et al. GPflow: A Gaussian Process Library using TensorFlow. J Mach Learn Res. 2017;18:1–6.
Titsias M. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In: van Dyk D, Welling M, editors. Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics. Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA: PMLR; 2009. p. 567–74.
Grønbech CH, Vording MF, Timshel PN, Sønderby CK, Pers TH, Winther O. scVAE: variational autoencoders for singlecell gene expression data. Bioinformatics. 2020;36:4415–22.
Wang D, Gu J. VASC: Dimension reduction and visualization of singlecell RNAseq data by deep variational autoencoder. Genomics Proteomics Bioinformatics. 2018;16:320–31.
Xiong L, Xu K, Tian K, Shao Y, Tang L, Gao G, et al. SCALE method for singlecell ATACseq analysis via latent feature extraction. Nat Commun. 2019;10:4576.
Xu C, Lopez R, Mehlman E, Regier J, Jordan MI, Yosef N. Probabilistic harmonization and annotation of singlecell transcriptomics data with deep generative models. Mol Syst Biol EMBO. 2021;17:e9620.
Chu W, Ghahramani Z. Gaussian processes for ordinal regression. J Mach Learn Res. 2005;6:10191041.
Mackay TFC, Richards S, Stone EA, Barbadilla A, Ayroles JF, Zhu D, et al. The Drosophila melanogaster Genetic Reference Panel. Nature. 2012;482:173–8.
Rossi MJ, Lai WKM, Pugh BF. Simplified ChIPexo assays. Nat Commun. 2018;9:2842.
Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20.
Langmead B, Salzberg SL. Fast gappedread alignment with Bowtie 2. Nat Methods. 2012;9:357–9.
Ntranos V, Yi L, Melsted P, Pachter L. A discriminative learning approach to differential expression analysis for singlecell RNAseq. Nat Methods. 2019;16:1636.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc. 1995;57:289–300.
Heinen T, Secchia S, et al. scDALI: modelling allelic heterogeneity in single cells reveals contextspecific genetic regulation. Datasets Array Express https://www.ebi.ac.uk/arrayexpress/experiments/EMTAB10240/ (2021).
Heinen T, Secchia S, et al. scDALI: modelling allelic heterogeneity in single cells reveals contextspecific genetic regulation. Github. https://github.com/PMBio/scdali (2021).
Heinen T, Secchia S, et al. scDALI: modelling allelic heterogeneity in single cells reveals contextspecific genetic regulation. Zenodo. https://zenodo.org/record/5710797 (2021).
Acknowledgements
The authors thank members of the Furlong and Stegle labs for helpful comments and discussions. This work was technically supported by the EMBL Flow Cytometry, Genomics and Protein Expression and Purification Core Facilities.
Review history
The review history is available as Additional file 4.
Peer review information
Barbara Cheifet was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Funding
T.H. received support from the German Cancer Research Center International PhD Program. This work was financially supported by core funding from EMBL (to E.E.F. and O.S), the German Cancer Research Center (O.S.), and the European Research Council (ERC) grant agreement ID: 787611 (DeCRyPT) to E.E.F and grant agreement ID: 810296 (DECODE) to O.S. Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Contributions
T.H. developed and implemented scDALI. T.H. and S.S. analyzed data and generated figures. S.S. adapted the sciATACseq protocol with advice from J.R., and S.S. and J.R. generated the sciATACseq data. B.Z. collected the F1 embryos. T.H., S.S., E.F., and O.S. interpreted the results and wrote the manuscript with input from all authors. E.F. and O.S. conceptualized, supervised, and funded the project. The authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Heinen, T., Secchia, S., Reddington, J.P. et al. scDALI: modeling allelic heterogeneity in single cells reveals contextspecific genetic regulation. Genome Biol 23, 8 (2022). https://doi.org/10.1186/s13059021025938
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13059021025938
Keywords
 Singlecell, Regulatory genomics, Statistical methods