GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership

Parts-based representations, such as non-negative matrix factorization and topic modeling, have been used to identify structure from single-cell sequencing data sets, in particular structure that is not as well captured by clustering or other dimensionality reduction methods. However, interpreting the individual parts remains a challenge. To address this challenge, we extend methods for differential expression analysis by allowing cells to have partial membership to multiple groups. We call this grade of membership differential expression (GoM DE). We illustrate the benefits of GoM DE for annotating topics identified in several single-cell RNA-seq and ATAC-seq data sets. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-03067-9.

In this paper, we investigate the question of how to interpret the individual dimensions of a parts-based representation learned by fitting a topic model (in the topic model, the dimensions are also called "topics").For topics that assign observations to discrete clusters, one could apply a standard method for differential expression analysis [61,62] to compare expression between topics, then annotate these topics by the genes that are differentially expressed.The question, therefore, is what to do with topics that do not assign observations to discrete clusters.To tackle this question, we extend models that compare expression between groups by allowing observations to have partial membership in multiple groups.This more flexible differential expression analysis is implemented by taking an existing model and modifying it to allow for partial memberships to groups or topics.This modified model is a "grade of membership" model [63], so we call our new method grade of membership differential expression (GoM DE).The idea is that, by generalizing existing methods, we can continue to take advantage of existing elements of differential expression analysis but now apply them to learn about different types of cell features beyond discrete cell populations.
We describe the GoM DE approach more formally in the next section.Then, we evaluate the GoM DE approach in simulations, showing, in particular, that it recovers the same results as existing differential expression analysis methods when the cells can be grouped into discrete clusters.In case studies, we demonstrate how the GoM DE analysis analysis can be used to uncover and interpret a variety of cell features from single-cell RNA-seq and ATAC-seq data sets.

Methods overview and illustration
We begin by giving a brief overview of the topic model; then, we describe the new methods for annotating topics.To illustrate key concepts, we analyze a single-cell RNA-seq (scRNA-seq) data set obtained from peripheral blood mononuclear cells (PBMCs) [29] that has been used in several benchmarking studies (e.g., [4,7,8,64,65]).We refer to these data as the "PBMC data."

Learning expression topics from single-cell RNA-seq data
The original aim of the topic model was to discover patterns from collections of text documents, in which text documents were represented as word counts [45,50,[66][67][68].By substituting genes for words and cells for documents, topic models can also be used to learn a reduced representation of cells by their membership in multiple "topics" [47].
When applied to scRNA-seq data generated using UMIs, the topic model assumes a multinomial distribution of the RNA molecule counts in a cell, where s i = x i1 + • • • + x im , and m is the number of genes, that is, the number of RNA molecules x ij observed for gene j in cell i is a noisy observation of an underlying true expression level, π ij [8,69].
For n cells, the topic model is a reduced representation of the underlying expression, where , L, F are n × m , n × K , m × K matrices, respectively, with entries π ij , l ik , f jk .Each cell i is represented by its "grade of membership" in K topics, a vector of proportions l i1 , . . ., l iK , such that l ik ≥ 0 , K k=1 l ik = 1 , and each "expression topic" is rep- resented by a vector of (relative) expression levels f 1k , . . ., f mk , f jk ≥ 0 (these are also constrained to sum to 1, which ensures that the π ij s are multinomial probabilities).To efficiently fit the topic model to large single-cell data sets, we exploit the fact that the topic model is closely related to the Poisson NMF model [48].
The matrix L in (2), which contains the membership proportions for all cells and top- ics, can be visualized using a "Structure plot." Structure plots have been used to visualize the results of population genetics analyses (e.g., [70][71][72]) and, more recently, to visualize the topics learned from bulk and single-cell RNA-seq data [47].
A Structure plot visualizing the topic model fit to the PBMC data, with K = 6 topics, is given in Fig. 1.In this data set, the cells have been "sorted" into different cell types which provides a cell labeling to compare against.From the Structure plot, it is apparent (1) x i1 , . . ., x im ∼ Multinomial(s i ; π i1 , . . ., π im ).
(2) = LF T , Fig. 1 A and B give two views of the topic model fit to the PBMC data [29] (n = 94,655 cells, K = 6 topics) using Structure plots [70,71].Cells are arranged horizontally; bar heights correspond to cell membership proportions.In A, the cells are arranged using the estimated membership proportions only.In B, the cells are grouped by the FACS labels (the "T cells" label combines all sorted T cell populations other than CD8+ cytotoxic T cells).In C, the topics are annotated by distinctive genes from the GoM DE analysis (Fig. 3) and by enriched gene sets.Numbers in parentheses next to genes give posterior mean l.e.LFCs, and for gene sets, they are enrichment coefficients.An enrichment coefficient is an estimate of the expected increase in the LFC for genes that belong to the gene set relative to genes that do not belong to the gene set.Note the groupings a-g in A are intended only to aid visualization.See also Additional file 1: Fig. S1 for an alternative visualization that a subset of topics-topics 1, 2, and 3-correspond closely to the sorted subpopulations (B cells, CD14+ monocytes, CD34+ cells) (indeed, distinctive genes and enriched gene sets identified by the methods described below suggest these same subpopulations; Fig. 1C).Topics 4 and 5, on the other hand, are not confined to a single sorted cell type and instead appear to capture biological processes common to T cells and natural killer (NK) cells.CD8+ cytotoxic T cells have characteristics of both NK cells and T cells-these are T cells that sometimes become "NK-like" [73]-and this is captured in the topic model by assigning membership to both topics.Topic 6 also captures continuous structure, but, unlike topics 4 and 5, it is present in almost all cells, and therefore, its biological interpretation is not at all clear from the cell labeling.More generally, the topics, whether they capture largely discrete structure (topics 1-3) or more continuous structure (topics 4-6), can be thought of as a "soft" clustering [47].

Learning chromatin accessibility topics from single-cell ATAC-seq data
For single-cell ATAC-seq data, the observations x ij denote the number of reads mapping to region j in cell i.However, it is common to "binarize" the read counts such that x ij = 1 when at least one fragment in cell i maps to region j and x ij = 0 otherwise.
Using the topic model to analyze (binarized) single-cell ATAC-seq data was first suggested by [49].Therefore, they implicitly assumed a multinomial model (1) in which the x ij s are binarized accessibility values instead of UMI counts.A binomial model for bina- rized accessibility data was proposed in [74].As we explain in the "Methods" section, we view both models as approximations, and under reasonable assumptions the models are similar.

Differential expression analysis allowing for grades of membership
Having learned the topics, our aim now is to identify genes that are distinctive to each topic.In the simplest case, the topic is a distinct or nearly distinct cluster of cells, such as topic 1 or topic 2 in Fig. 1.
In the following, we describe methods for analyzing differences in expression, but they can also be understood as methods for analyzing differences in chromatin accessibility.Therefore, "expression, " "expressed, " and "gene" in the descriptions below may be substituted with "accessibility, " "accessible, " and "peak" (or "region").
Consider a single gene, j.Provided unmodeled sources of variation are negligible relative to measurement error, a simple Poisson model of expression should suffice: In this model, θ ij for gene j in cell i is controlled by the cell's membership in the cluster: when cell i belongs to the cluster, θ ij = p j1 ; otherwise, θ ij = p j2 .Under this model, dif- ferential expression (DE) analysis proceeds by estimating the log-fold change (LFC) in expression for each gene j, Although simple, this Poisson model forms the basis for many DE analysis methods [75][76][77][78][79][80]. (3) (4) LFC(j) = log 2 p j1 p j2 .
We now modify the Poisson model (3) in a simple way to analyze differential expression among topics.In a clustering, each cell belongs to a single cluster, whereas in the topic model, cells have grades of membership to the clusters [63] in which l ik is the mem- bership proportion for cluster or topic k.Therefore, we extend the model to allow for partial membership in the K topics: in which the membership proportions l ik are treated as known, and the unknowns p j1 , . . ., p jK represent relative expression levels (a related model is used in C-SIDE [80] to model cell-type mixtures in DE analysis of spatial transcriptomics data).Note that p jk will be similar to, but not the same as, f jk in the topic model because the DE analysis is a gene-by-gene analysis, whereas the topic model considers all genes at once.The standard Poisson model ( 3) is recovered as a special case of (5) when K = 2 and all member- ship proportions l ik are 0 or 1.
Recall, our aim is to identify genes that are distinctive to each topic.To this end, we estimate the least extreme LFC (l.e.LFC), which we define as in which LFC k,l (j) is the pairwise LFC, In words, the l.e.LFC for topic k is the LFC comparing topics k and l, in which l is chosen to be topic that results in the smallest ("least extreme") change.By this definition, a "distinctive gene" is one in which its expression is significantly different from its expression in all other topics (note the l.e.LFC reduces to the standard LFC (4) when K = 2 ).We then annotate topics by the distinctive genes.The estimation of l.e.LFCs and computation of related posterior statistics is described in the "Methods" section.
To illustrate what the least extreme LFC does and does not do, consider the following toy example with K = 10 topics (Fig. 2).Gene 1 has high expression in topic 1 and low expression in the other topics.Therefore, all the pairwise LFCs for topic 1 are large, LFC 1,k (1) = log 2 (100) , k = 2, . . ., 10 , and this results in an l.e.LFC for topic 1 of log 2 (100) ≈ 6.6 .So gene 1 is a distinctive gene for topic 1.Next consider gene 2, which has high expression in topics 1 and 2 and low expression in the other topics.For gene 2, the pairwise LFCs for topic 1 are mostly large, LFC 1,k (2) = log 2 (100) , k = 3, . . ., 10 , except for LFC 1,2 (2) = 0 .So, the l.e.LFC for topic 1 is zero and, as a result, gene 2, although potentially helpful for interpreting topic 1, is not a distinctive gene for topic 1.

Illustration of GoM DE analysis in PBMC data set
To illustrate, we applied the GoM DE analysis to the topic model shown in Fig. 1 and visualized the results in "volcano plots" (Fig. 3).We then used the GoM DE results (5) LFC k,l (j) := log 2 p jk p jl .
For the topics that closely correspond to cell types, the GoM DE analysis, as expected, identified genes and gene sets reflecting these cell types.For example, topic 1 corresponds to FACS B cells and is characterized by overexpression of CD79A (posterior mean l.e.LFC = 13.05) and enrichment of B cell receptor signaling genes (enrichment coefficient = 0.72).Topic 2 corresponds to myeloid cells and is characterized by overexpression of S100A9 (l.e.LFC = 15.45) and enrichment of genes down-regulated in hematopoietic stem cells (enrichment coefficient = 0.90).
The close correspondence between topics 1 and 2 and FACS cell types (B cells, myeloid cells) provides an opportunity to contrast the GoM DE analysis with a standard DE analysis of the FACS cell types (Fig. 4).This is not a perfect comparison because the topics and FACS cell populations are not exactly the same, but the LFC estimates correlate well (Fig. 4A, B).This comparison illustrates to two key differences: 1.Many more l.e.LFCs are driven toward zero in the GoM DE analysis (Fig. 4C), so the l.e.LFCs more effectively draw attention to the "distinctive genes" (Fig. 4A, B).This includes genes that are distinctively underexpressed such as ID2 in B cells [81].2. The GoM DE analysis yields much larger LFC estimates of the cell-type-specific genes.This is because the topic model isolates the biological processes (topics 1 and 2) related to cell type while removing background biological processes (topic 6) that do not relate to cell type.
Fig. 3 GoM DE analysis of the PBMC data using the topic model shown in Fig. 1.The volcano plots show posterior mean estimates of the l.e.LFC vs. posterior z-scores for 17,055 genes.The posterior z-score is defined as the posterior mean l.e.LFC divided by the posterior standard error.Genes are colored according to the local false sign rate (lfsr) [82].A few genes with extreme posterior z-scores are shown with smaller posterior z-scores so that they fit within the y-axis range.See also the detailed GoM DE results (Additional file 2: Table S1), detailed GSEA results (Additional file 3: Table S3, S4), and the interactive volcano plots (Additional file 4) Other topics capture more continuous structure, such as topics 4 and 5 (Fig. 1).Although the GoM DE analysis of these topics is not comparable to a standard DE analysis, many of the the distinctive genes and gene sets suggest NK and T cells, which are precisely the FACS-labeled cells with greatest membership to these topics: for example, for topic 4, overexpression of NKG7 (posterior mean l.e.LFC = 14.09), enrichment of cytolysis genes (enrichment coefficient = 2.22); for topic 5, overexpression of CD3D (l.e.LFC = 12.01), enrichment co-stimulatory signaling during T-cell activation (enrichment coefficient = 1.58).
Topic 6 captures continuous structure and is present in almost all cells, so knowledge of the FACS cell types is not helpful for understanding this topic.Still, the GoM DE results for topic 6 show a striking enrichment of ribosome-associated genes (Fig. 3, Additional file 3: Tables S3, S4) (these ribosomal protein genes also account for a large fraction of the total expression in the cells [5]).This ability to annotate distinctly nondiscrete structure is a distinguishing feature of the grade-of-membership approach, and below we will show more examples where this feature contributes to understanding of the cell populations.

Evaluation of DE analysis methods using simulated data
Having illustrated the features of this approach, we now evaluate the methods more systematically in simulated expression data sets.We began our evaluation by first considering the case of two groups in which there is no partial membership to these groups, that is, when the cells can be separated into two cell types.The GoM DE analysis should accommodate this special case and should compare well with existing DE analysis methods.We compared with DESeq2 [79] and MAST [83], both popular methods that have been shown to be competitive in benchmarking studies [61,62,85] (and are included in Seurat [25]).
To compare the ability of these methods to discover differentially expressed genes, we simulated RNA molecule count data for 10,000 genes and 200 cells in which 98% of cells were attributed to a single topic, with roughly the same number of cells assigned to each of the two topics (with membership proportions of 99% or greater).Note that although half the simulated genes had different expression levels in the two topics, most of these expression differences were small, and therefore the methods were not expected to identify most expression differences.This mimics the typical situation in gene expression studies whereby most expression differences are small.Molecule counts were simulated using a Poisson measurement model so that variation in expression across cells was due to either measurement error or true differences in expression levels between the two groups.For all DE analyses, we took group/topic assignments to be known so that incorrect assignment of cells to topics was not a source of error.Other aspects of the simulations were chosen to emulate molecule count data from scRNA-seq studies (see the "Methods" section).We repeated the simulations 20 times, and summarized the results of the DE analyses in Fig. 5 (also Additional file 1: Figs.S2, S3).
DESeq2 and the GoM DE analysis have several features in common: both are based on a Poisson model, and both use adaptive shrinkage [82,84] to improve accuracy of the LFC estimates and test statistics.Therefore, we expected the GoM DE results to closely resemble DESeq2 in these simulations.Indeed, both methods produced nearly identical posterior mean LFC estimates, posterior z-scores (Fig. 5A, B), and s-values (Additional file 1: Fig. S3) and achieved very similar performance (Fig. 5C).Although DESeq2 additionally estimates an overdispersion level for each gene, in these simulations, DESeq2 correctly determined that the level of overdispersion was small for genes with large expression differences, which explains the strong similarity of the LFC estimates and posterior z-scores.MAST, owing to an approach that is very different from DESeq2 and the GoM DE analysis, yielded estimates that were less similar (Additional file 1, Fig. S3), yet achieved comparable performance (Fig. 5C).
Next, we evaluated the GoM DE analysis methods in data sets in which the cells had varying degrees of membership to multiple topics.Since existing DE methods cannot handle the situation in which there are partial memberships to groups, we mainly sought to verify that the method behaves as expected in the ideal setting when data sets are simulated from the topic model (2).To provide some baseline for comparison, we also applied the method of Dey et al. [47], which is not strictly a DE analysis method but does provide a ranking of genes by their "distinctiveness" in each topic.This ranking is based on a simple Kullback-Leibler (K-L) divergence measure; large K-L divergences should signal large differences in expression, as well as high overall levels of expression, so large K-L divergences should correspond to small DE p-values.Since the K-L divergence is not a signed measure, we omitted tests for negative expression differences from the evaluations, which was roughly half of the total number of possible tests for differential expression.
We performed 20 simulations with K = 2 topics and n = 200 cells and another 20 simulations with K = 6 topics and n = 1,000 cells.To simplify evaluation, all genes either had the same rate of expression in all topics, or the rate was different in exactly one topic.As a result, the total number of expression differences in each data set was roughly the same regardless of the number of simulated topics.Other aspects of the simulations were kept the same as the first set of simulations (see Methods).Similar to before, we took the membership proportions to be known so that mis-estimation of the membership proportions would not be source of error in the GoM DE analysis and in calculation of the K-L divergence scores.
The largest K-L divergence scores in the simulated data sets reliably recovered true expression differences (Fig. 6A, E).Therefore, the K-L divergence scores achieved good true positives rates (i.e., good power) at low false positive rates, FPR = FP/(TN + FP) (see Fig. 5 for notation).However, for DE analysis, a more relevant performance measure is the false discovery rate, FDR = FP/(TP + FP) .Because the K-L divergence score does not fully account for uncertainty in the unknown gene expression differences, many genes with no expression differences among topics were also highly ranked, leading to poor FDR control (Fig. 6D, H).By contrast, the GoM DE analysis better accounted for uncertainty in the unknown expression levels.The GoM DE analysis also more accurately recovered true expression differences at small p-values or s-values (Fig. 6B, C, F,  G) and therefore obtained much lower false discovery rates at corresponding levels of power (Fig. 6D, H).Comparing the GoM DE analysis with and without adaptive shrinkage, the adaptive shrinkage did not necessarily lead to better performance (Fig. 6D, H) but did provide more directly interpretable measures of significance (s-values or local false sign rates) by shrinking the LFC estimates and adapting the rate of shrinkage to the data; for example, the expression differences were shrunk more strongly in the K = 6 data sets, correctly reflecting the much smaller proportion of true expression differences (compare Fig. 6C and G).
Fig. 6 Evaluation of methods for identifying expression differences in single-cell expression data sets in which cells were simulated with partial membership to 2 topics (A-D) or 6 topics (E-H).Methods compared are the Kullback-Leibler (K-L) divergence score of [47] and GoM DE with adaptive shrinkage (s-values, lfsr) and without adaptive shrinkage (p-values).The left-most panels (A, E) show the distribution of K-L divergence scores for all candidate expression differences (approximately half of 10,000 genes × 2 or 6 topics × 20 simulated data sets), shown separately for true expression differences (dark blue) and non-differences (orange).K-L divergence scores smaller than 10 −8 are plotted as 10 −8 .Similarly, B, C, F, and G show the distribution of GoM DE p-values or s-values with or without adaptive shrinkage, separately among differences and non-differences.D and H summarize performance in identifying expression differences; it shows power and FDR as the GoM DE p-value or lfsr are varied from 0 to 1 or as the K-L divergence score is varied from large to small.Note that in E and G, some bar heights are actually larger than 25,000 but are cut off at 25,000 for better visualization

Case study: scRNA-seq epithelial airway data from Montoro et al. (2018)
We reanalyzed scRNA-seq data for n = 7193 single cells sampled from the tracheal epi- thelium in wild-type mice [86].The original analysis [86] used a combination of methods, including t-SNE, community detection [87], diffusion maps [88], and partitioning around medoids (PAM) to identify 7 epithelial cell types: abundant basal and secretory (club) cells; rare, specialized epithelial cell types, including ciliated, neuroendocrine and tuft cells; a novel subpopulation of "ionocytes"; and a novel basal-to-club transitional cell type, "hillock" cells.Although not large in comparison to other modern single-cell data sets, this data set is challenging to analyze, with complex structure, and a mixture of abundant and rare cell types.In contrast to the PBMC data set, there are no existing cell annotations to interpret the topics, so we must rely on inferences made from the expression data alone to make sense of the results.
The topic model fit to the UMI counts with K = 7 topics is shown in Fig. 7A, and the results of the GoM DE analysis and subsequent GSEA are summarized in Fig. 7.

Fig. 7
Structure in mouse epithelial airway data (n = 7193 cells [86]) inferred from topic modeling (A, B), and GoM DE analysis (D) of selected topics using the membership proportions matrix L shown in A. In C, the topics are annotated by selected distinctive genes (numbers in parentheses are posterior mean l.e.LFCs) and selected enriched gene sets (numbers in parentheses are posterior mean estimates of the enrichment coefficients).In A, to better visualize the rare cell types, the cells were divided into two groups, "abundant" and "rare, " based on the estimated membership proportions, then the "abundant" cells were subsampled.The Structure plot in B was obtained by fitting another topic model, with K = 5 topics, to rare epithelial cell types (defined as the subset of 637 cells i with at least 10% membership to topic 6).The volcano plots show posterior estimates of l.e.LFC vs. posterior z-scores for 18,388 genes.A small number of genes with extreme posterior z-scores are shown with smaller posterior z-scores so that they fit within the y-axis range.See also the interactive volcano plots (Additional file 5: S6), GoM DE results (Additional file 2: Table S2, Additional file 1: Figs.S6; S7), and GSEA results (Additional file 3: Tables S5, S6) Although we do not have cell labels to compare with, distinctive genes emerging from the GoM DE analysis help connect some of the topics to known cell types.For example, the most abundant topics correspond well with predominant epithelial cell types in the lung: topic 1 shows strong overexpression of basal cell marker gene Krt5 [89] (posterior mean l.e.LFC = 4.62) and distinctive genes in topics 2 and 3 include key secretory genes in club cells such as Bpifa1/Splunc1 [90] (l.e.LFC = 4.93) and Scgb1a1 [91] (l.e.LFC = 5.90).
The "hillock" transitional cells, which were originally identified via a diffusion maps analysis [86], emerge as a single topic (topic 4, cyan), with Krt13 (l.e.LFC = 8.04) and Krt4 (l.e.LFC = 5.46) being among the most distinctive genes.The transitional nature of these cells is evoked by their mixed membership; only 237 out of the 7193 cells have > 90% membership to this topic.
Other less abundant epithelial cell types emerge as separate topics once a topic model is fit separately to the subpopulation of these rare cell types (Fig. 7B).These topics recover ciliated cells (topics 8, 9; Ccdc153, posterior l.e.LFC = 5.39), neuroendocrine cells (topic 10; Chga, l.e.LFC = 6.92), and tuft cells (topic 11; Trpm5, l.e.LFC = 6.94).Note that Foxi1+ ionocytes were previously identified as a novel cell type from a small cluster of 26 cells [86], but our analysis failed to distinguish this very rare cell type from the neuroendocrine cells (Additional file 1: Figs.S4, S5).
In summary, by taking a topic-model-based approach we identified and annotated well-characterized cell types such as basal cells, as well less distinct but potentially interesting substructures such as "Hillock" cells and club cell subtypes.

Case study: Mouse sci-ATAC-seq Atlas data from Cusanovich et al. (2018)
We reanalyzed data from the Mouse sci-ATAC-seq Atlas [97], comprising 81,173 single cells in 13 tissues.First, to provide an overview of the primary structure in the whole data set, we fit a topic model with K = 13 topics to these data.The topics correspond closely to the clusters identified in [97] (Additional file 1: Fig. S8), and several different tissues are distinguished by different topics (Fig. 8A).For the 4 tissues that have replicates, the replicates show a similar composition of the topics (Fig. 8A).
Next, we performed a more detailed analysis of just the kidney (6431 cells), fitting a topic model with K = 10 to just these cells.We focussed on the kidney cells because, as noted previously [97,98], both expression and chromatin accessibility vary in relation to the spatial organization of the renal tubular cells, and we predicted that this spatial structure could be better captured by topics rather than by traditional clustering methods.To interpret these topics obtained from chromatin accessibility data, we first used the GoM DE analysis to identify differentially accessible peaks for each topic; then, we used "co-accessibility" as predicted by Cicero [95,97] to connect genes to peaks representing distal regulated sites.Finally, we performed a simple enrichment analysis to identify the "distinctive genes" for each topic, which we defined as the genes with many distal regulatory sites that were differentially accessible.
The results of these analyses are shown in Fig. 8.Many of the distinctive genes (Fig. 8, Additional file 1: Fig. S9, Additional file 6: Table S7) clearly relate topics to known kidney cell types.For example, topic 1 is enriched for genes Klf5 and Elf5 which relate to the collecting duct [98,99]; topic 3 is enriched for genes Umod and Slc12a1 associated with the loop of Henle [98,100]); and topics 2, 6, and 7 are respectively enriched for Fig. 8 A Structure in Mouse Atlas sci-ATAC-seq data (n = 81,173) inferred from topic modeling, with K = 13 topics.B Topic model fit to kidney cells (n = 6,431) with K = 10 topics.C, D Gene-based enrichment analysis of differentially accessible peaks for the kidney cell topics shown in B, in which peaks are linked to genes using Cicero [95].In A, the cells are grouped by tissue, and replicates (for bone marrow, large intestine, lung and whole brain) are shown as separate tissues.Numbers in parentheses next to each tissue give the number of cells in that tissue.In D, marker genes for S1 (topic 4) and S3 (topic 5) proximal epithelial tubular cells are highlighted in red (see Table 1 of [96])."Mean l.e.LFC" is the average l.e.LFC among all peaks connected to the gene, restricted to l.e.LFCs with lfsr < 0.05.Log-Bayes factors greater than 200 are shown as 200 in the volcano plots.See Additional file 1: Fig. S9 and Additional file 6: Table S7 for more gene enrichment results.In B, the cells are subdivided into 5 groups (a-f ) only to improve visualization.See also Additional file 1: Fig. S10 which compares the topics in B to cell-type predictions based on clustering [97] genes related to the distal convoluted tubule (Wnk1), podocytes (Col1a2) and glomerular endothelial cells (Ptprb).
Most interestingly, spatial organization of the proximal tubule is captured by two topics; topic 4 is enriched for Slc5a2 (also known as Sglt2) and Slc2a2 (also known as Glut2), associated with the S1 segment of the proximal tube [96,101,102], and topic 5 is enriched for Slc5a8 (Smct1) and Atp11a, related to the S3 segment [96,103].This result illustrates the ability of the topic model to capture continuous variation in membership of two somewhat complementary processes, which traditional clustering methods are not designed for.

Case study: chromatin accessibility profiles of the hematopoietic system from Buenrostro et al. (2018)
Buenrostro et al. [104] studied 2034 single-cell ATAC-seq profiles of 10 cell populations isolated by FACS to characterize regulation of the human hematopoietic system.Both PCA and t-SNE showed, visually, the expected structure into the main developmental branches (Fig. 2 in [104]).However, neither PCA nor t-SNE isolated these branches as individual dimensions of the embedding.Identifying these branches may allow for more precise characterization of the underlying regulatory patterns.Here, by fitting a topic model to the data, the main developmental branches are identified as individual topics (Fig. 9A): topic 3, pDC; topic 4, erythroid (MEP); topic 5, lymphoid (CLP); and topic 6, myeloid (GMP and monocytes).Another topic captures the cells at the top of the developmental path (topic 1; HSC and MPP).Other cells at intermediate points in the developmental trajectory, such as CMP, GMP and LMPP cells, are more heterogeneous, and this is reflected by their high variation in topic membership.
Fig. 9 Structure in human hematopoietic system data [104] (n = 2034 cells) inferred from the topic model with K = 10 topics (A) and HOMER motif enrichment analysis [105] applied to the results of the GoM DE analysis (B).In the Structure plot, the cells are grouped by FACS, as well as an unknown population from human bone marrow [104].B shows HOMER enrichment results for selected motifs (for the full results, see Additional file 7: Table S8).Acronyms used: common lymphoid progenitor (CLP); common myeloid progenitor (CMP); granulocyte-macrophage progenitor (GMP); hematopoietic stem cell (HSC); lymphoid-primed multipotent progenitor (LMPP); megakaryocytic-erythroid progenitor (MEP); multi-potent progenitor (MPP); plasmacytoid dendritic cells (pDC) To better interpret the regulatory patterns behind each topic, we identified transcription factor (TF) motifs that were enriched for differentially accessible regions in each topic (Fig. 9B, Additional file 7: Table S8).Many of the top TF motifs (as ranked by HOMER p-values [105]) point toward regulation of the main developmental trajectories, such as EBF motifs in topic 5 (lymphoid), CEBP motifs in topics 6 and 7 (myeloid), and Hox motifs in topic 1 (HSC and MPP cells).A few topics (topics 8-10) are much less abundant and do not align well with the FACS cell types, and their motif enrichment results were correspondingly more difficult to interpret.
A complication that arose in analyzing these data, which was also noted in [104], is that the cells were obtained from different sources, and this shows up as systematic variation in the chromatin accessibility.This donor effect is captured by topics 1 and 2 in HSC and MPP cells and, to a lesser extent, in CMP and LMPP cells (Additional file 1: Fig. S11).Topic 1 is enriched for Jun and Fos TF motifs, similar to what was found in [104].

Discussion
The GoM DE analysis is part of a topic-model-based pipeline for analysis of single-cell RNA-seq [47] or ATAC-seq data [49].This pipeline includes the following steps: (1) fit a topic model to the data; (2) visualize the structure inferred by the topic model; (3) run the GoM DE analysis with the estimated topics; and, optionally, (4) perform other downstream analyses using the results of the GoM DE analysis, e.g., gene set enrichment analysis (for RNA-seq data) or motif enrichment analysis (for ATAC-seq data).Unlike most analysis pipelines for clustering and dimensionality reduction (e.g., [4,19,23,26,27]), the topic-model-based pipeline is directly applied to the "raw" count data and therefore does not require an initial step to transform and normalize the data which can lead to downstream issues in the statistical analysis [8,[106][107][108].We presented several case studies illustrating the use of the topic-model-based pipeline to analyze single-cell RNAseq and ATAC-seq data sets.From these case studies, we have drawn a few lessons on the practical challenges that may arise in applying topic modeling approaches to singlecell data, and we share these lessons here (see also [47,49] for related discussion).
One practical question is how to choose K, the number of topics.Many papers have suggested different criteria for determining K. Our view, following [47], is that there is no single "best" K, and we recognize the advantages of learning topics at multiple settings of K; in some data sets, different Ks can reveal structure at different levels of granularity (for example, increasing the number of topics in the Mouse sci-ATAC-seq Atlas data revealed more structure within tissues; see https:// tinyu rl.com/ 2p99s wdk).We have found that it is often helpful to start with a smaller K to elucidate the less granular structure, which is often easier to interpret, then rerun the topic modeling with larger K to identify finer structure.
We proposed annotating topics by distinctive genes identified using the l.e.LFC.One drawback is that this does not reveal the commonalities that may exist among multiple topics, for example, topics corresponding to subpopulations within a common class of cells.A simple alternative to the l.e.LFC, which is also implemented in the fast-Topics R package, is to compare against expression under the "null model" (see the "Methods").We view this as a complementary LFC metric that may reveal additional insights into the topics.
Donor, batch or other technical effects in the single-cell RNA-seq or ATAC-seq data can complicate the analysis and interpretation of the topics if these effects are not small.Since these effects are usually not known, usually we must assess their impact indirectly [109].For example, the Mouse sci-ATAC-seq Atlas data included several replicates, but the replicate effects appeared to be small judging by the fact that the replicates showed a similar composition of topics.By contrast, the donor effects in the human hematopoietic system data were much larger, and in the topic model, these donor effects were at least partially captured by individual topics.The broader question of how to deal with non-ignorable donor or batch effects-in particular, how to separate technical effects from biological effects of interest-remains a question of considerable debate and continued investigation [25,39,[109][110][111][112][113][114][115][116].In particular, it has been noted that attempting to "correct" for effects can sometimes remove differences that we would like to learn about such as differences in cell-type proportions among the batches.
For modeling UMI counts, an open question is whether the Poisson or multinomial model ( 1) is sufficient or whether more flexible models are needed (this question was investigated in [69] for single-gene models, but not for multi-gene models).Alternative models such as the negative binomial [117] or Poisson log-normal [80,118], which can capture additional random variation ("overdispersion") in underlying expression or measurement error, may result in more robust estimation of the topics.
In single-cell ATAC-seq data, the GoM DE analysis identifies differentially accessible peaks or regions.Usually, these peak-level results need to be translated into biological units that are more useful for annotating the topics (e.g., genes, gene sets, transcription factors).In the analysis of the hematopoietic system single-cell ATACseq data, we used HOMER [105] to identify TF motifs enriched for differentially accessible peaks.In the analysis of the Mouse sci-ATAC-seq Atlas data, we identified genes enriched for differentially accessible distal regularity sites.Clearly, the quality of the gene enrichment results will depend on our ability to accurately associate peaks with genes.For this, we used the scores computed in [97] using Cicero [95].However, there are now several alternatives to Cicero that may be preferred [19,27,28,[119][120][121][122], and in principle any of these approaches could be combined with the peak-level GoM DE results to identify relevant genes.
Recently developed technologies profile both transcription and chromatin accessibility in single cells [123,124].For such data, one could fit two topic models, one to the RNA-seq data and another to the ATAC-seq data.With a careful initialization of the topic model fitting algorithm, the topics may be more consistent across the two modalities.But it would be preferrable to analyze the multimodal data jointly for improved accuracy [125][126][127][128][129][130].Potentially, the strategy used in MOFA [131,132] could be adapted for topic modeling-that is, the transcripts and accessibility profiles would share the same membership proportions, L , but each modality would have a different F .However, it remains to be seen how well this strategy works in practice.

Conclusions
To summarize, we have described a new method that aids in annotating and interpretating the "parts" of cells learned by fitting a topic model to scRNA-seq data or singlecell ATAC-seq data.Our method, GoM DE (differential expression analysis allowing for grades of membership), can be viewed as an extension of existing differential expression methods that allows for mixed membership to multiple groups or topics.

Models for single-cell ATAC-seq data
In single-cell ATAC-seq data, x ij is the number of unique reads mapping to peak or region j in cell i.Although x ij can take non-negative integer values, it is common to "binarize" the accessibility data (e.g., [19,74,[133][134][135]), meaning that x ij = 1 when at least one read in cell i maps to region j and x ij = 0 otherwise.For this reason, one might prefer to model the binarized accessibility values as binomial (Bernoulli) random variables.A multinomial model, on the other hand, should better capture the sampling process for reads mapping to regions but does not account for the truncation of read counts above 1.Therefore, we view both the binomial and multinomial models as approximations.As we explain next, under reasonable assumptions the binomial and multinomial models are similar to each other so it may not matter which model one chooses.
The multinomial topic model for analyzing single-cell ATAC-seq data was suggested by [49].They assumed the multinomial model (1) in which the x ij s are binarized acces- sibility values instead of UMI counts.
A binomial model was proposed in [74], where t i > 0 is a cell-specific factor that depends on sequencing coverage and other properties (e.g., amplification, read post-processing [136]), r j > 0 is a region-specific factor (say, proportional to the size of the region), and the θ ij s capture additional vari- ation in accessibility across cells and regions.Moving forward, we make the simplifying assumption that the regions are all approximately the same size; that is, r j = 1 for all j = 1, . . ., m.
The binomial model ( 8) is closely related to a multinomial model.To make the connection, we first note that the binomial model with r j = 1 for all j can be approximated by a Poisson model, This will be a good approximation when the θ ij s are small and the cell-specific factors t i are large, which is usually the case in single-cell ATAC-seq data.Next, we note that the Poisson model ( 9) and multinomial model (1) are closely related if we choose the size factors to be t i = s i [69,137]; this implies ≈ , where is the n × m matrix with entries θ ij .By these arguments, the binomial model ( 8) (also the model used in [74]) and the multinomial model ( 1) (also the model used in [49]) are similar, and connecting the two models clarifies the assumptions made by each of the models.In particular, the multinomial topic model (1-2) used here and in [49] assumes a low-rank structure in the θ ij s across cells and regions; i.e., ≈ LF T .(8) x ij ∼ Binom(1, t i r j θ ij ), (9) x ij ∼ Pois(t i θ ij ).

Derivation of GoM DE model
In the "Methods overview, " we motivated the GoM DE model (3) as extending a basic Poisson model expression to allow for partial membership to K groups or topics.The GoM DE model can also be motivated from an approximation to the topic model.Recall, the topic model, is a multinomial model (1) in which the multinomial probabilities π ij are given by affine combinations of the expression levels f jk in the K top- ics, π ij = K k=1 l ik f jk .The non-negativity constraints l ik ≥ 0, f jk ≥ 0 and sum-to-one constraints K k=1 l ik = 1, p j=1 f jk = 1 ensure that the π ij s are multinomial probabilities.From a basic identity relating the multinomial and Poisson distributions [138,139], the multinomial likelihood for the topic model can be replaced with a likelihood formed by a simple product of independent Poissons, that is, where x i = (x i1 , . . ., x im ) and π i = (π i1 , . . ., π im ) .The approximation then comes from no longer requiring the π ij s to be multinomial probabilities by removing the constraint that f 1k + • • • + f mk = 1 .This allows us to analyze the genes j = 1, . . ., m independently.This is a good approximation so long as s i is large and the f jk s are small (a similar approx- imation was used for GLM-PCA [8]).To be explicit about this approximation, we say π ij ≈ θ ij (which are no longer guaranteed to be multinomial probabilities) and f jk ≈ p jk (which are no longer guaranteed to sum to one), resulting in the GoM DE model, which for convenience we restate here:

"Null" model
The simplest Poisson model of the form (3) is one in which θ ij is the same across all cells i, that is, θ ij = p j0 for all i = 1, . . ., n .We treat this a "null" model, which can be used to make certain comparisons, e.g., to estimate changes in expression in relative to expression in all cells.The maximum-likelihood estimate (MLE) of p j0 under the null model is

Estimation of log-fold change
In practice, we have found the l.e.LFC to work well, so in our results we use the l.e.LFC.But the l.e.LFC may not be appropriate in all circumstances, and for this reason ,we note that the GoM DE analysis framework is quite general and accommodates alternatives to the l.e.LFC.Two alternatives are implemented in the software.One alternative is to compare with the "null" model, (10) Another treats one topic l as a "reference topic", and compares all other topics k = l to l using (4).

Maximum-likelihood estimation
A convenience of the Poisson model allowing for grades of membership is that we can reuse Poisson NMF computations (described below and in more detail in [48]) to compute MLEs of the unknowns p jk : if we consider all genes j = 1, . . ., m simultaneously, we recover a Poisson NMF model, x ij ∼ Poisson( ij ), ij = K k=1 h ik w jk , by setting h ik = s i l ik , w jk = p jk .Therefore, we can reuse the Poisson NMF algorithms to compute MLEs of the unknowns p jk .

Maximum a posteriori estimation
To improve numerical stability in the parameter estimation, we compute maximum a posteriori (MAP) estimates of p j1 , . . ., p jK in which each p jk is assigned a gamma prior, p jk ∼ Gamma(α, β) , with α = 1 + ε , β = 1 , and ε > 0 .Typically, ε will be some small, positive number, e.g., ε = 0.1 .Here, we use the parameterization of the gamma distribu- tion from [140] in which α is the shape parameter and β is the inverse scale parameter; under this parameterization, the mean is α/β and the variance is α/β 2 .The maximum- likelihood computations can be reused for MAP estimation with this gamma prior by adding "pseudocounts" to the data; specifically, MAP estimation of p j1 , . . ., p jK given counts x 1j , . . ., x nj and membership proportions L and is equivalent to maximum-likeli- hood estimation of p j1 , . . ., p jK given counts x 1j , . . ., x nj , ε, . . ., ε and membership pro- portions matrix L I K , where I K is the K × K identity matrix.Unless otherwise stated, we added ε = 0.1 pseudocounts to the data.

Quantifying uncertainty and stabilizing LFC estimates
We implemented a simple Markov chain Monte Carlo (MCMC) algorithm [141,142] to quantify uncertainty in the LFC estimates.Although normal approximations to likelihoods are typically used by DE methods to quickly obtain analytical measures of uncertainty (e.g., standard errors, confidence intervals) for LFCs, we found that normal approximations to the likelihoods from (5) were sometimes poorly behaved, particularly for lowly expressed genes.Another consideration was that the analytical solutions provide confidence intervals for the unknowns p jk , but ultimately we are interested in quan- tifying uncertainty in the l.e.LFCs (6) which do not have a simple linear relationship to the p jk s.Therefore, it is unclear whether the standard analytical solutions can be applied to the l.e.LFCs without making further approximations or simplifications.MCMC is typically computationally intensive, but with careful implementation (e.g., use of sparse matrix operations and multithreaded computations) the MCMC algorithm is quite fast.Other benefits of using MCMC is that the algorithm can straightforwardly (13) LFC null k (j) := log 2 p jk p j0 .
accommodate different choices of LFC statistics and no normality assumptions are needed.
The basic idea behind the MCMC algorithm is as follows: for a given gene j, simulate the posterior distribution of the LFC statistic by performing a "random walk" on g j = (g j1 , . . ., g jK ) , where g jk := log p jk , k = 1, . . ., K .The random walk generates a sequence of states g 1 j , . . ., g (n s ) j , in which n s denotes the pre-specified length of the simu- lated Markov chain.After choosing an initial state g (0)  j , each new state g (s+1) j is generated from the current state g (s)  j by the following procedure: first, a topic k ∈ {1, . . ., K } is chosen uniformly at random; next, a proposed state g ⋆ j is generated as g ⋆ jk = g jk + δ , δ ∼ N (0, σ 2 ) , with g ⋆ jk ′ = g jk ′ for all k ′ � = k .Assuming an (improper) uniform prior for the unknowns, Pr(p jk ) ∝ 1 , the proposed state is accepted into the Markov chain with probability in which x j is the jth column of the counts matrix X , x j = (x 1j , . . ., x nj ) , and Pr(x j | p j ) is the likelihood at p j , Pr(x j | p j ) = n i=1 Poisson(x i ; s i θ i ) (note that x j may include pseu- docounts).The standard deviation of the Gaussian proposal distribution, σ , is a tuning parameter (unless otherwise stated, we used σ = 0.3 ) The additional p ⋆ jk /p (s) jk term in the acceptance probability is needed to account for the fact that we are simulating the logtransformed parameters g j , not p j ( [143], p. 11).When the proposal is not accepted, the new state is simply copied from the previous state, g (s+1) j = g (s)  j .Most of the effort in running the MCMC goes into computing the acceptance probabilities ( 14), so we have carefully optimized these computations.For example, we have taken advantage of the fact that the count vectors x j are typically very sparse.Addition- ally, these computations can be performed in parallel since the Markov chains are simulated independently for each gene j.
Once Monte Carlo samples g (s) , for s = 1, . . ., n s , have been simulated by this ran- dom-walk MCMC, we compute posterior mean LFC estimates and quantify uncertainty in the LFC estimates.For example, expressing the l.e.LFC for gene j and topic k as a function of the unknowns, LFC l.e.k (p j ) , the posterior mean l.e.LFCs are calculated as E[LFC l.e.k (p j )] ≈ n s s=1 LFC l.e.k (p (s) j )/n s .The final step in the GoM DE analysis is to perform adaptive shrinkage [82] to stabilize the posterior mean estimates.To implement this step, we used the ash function from the ashr R package [144].We used the same settings as DESeq2 to replicate as closely as possible the performance of DESeq2 with adaptive shrinkage.DESeq2 calls ash with method = "shrink", which sets the prior to be a mixture of uniforms without a point-mass at zero.
The adaptive shrinkage method takes as input a collection of effect estimates β1 , . . ., βm and associated standard errors ŝ1 , . . ., ŝm .In this setting, it is not immediately obvious what are the standard errors, in part because the posterior distribution of the unknowns is not always symmetric about the mean or median.To provide a reasonable substitute summarizing uncertainty in the estimates, we computed Monte Carlo estimates of highest posterior density (HPD) intervals.A (1 − α) HPD interval is the smallest interval that ( 14) (1 − α)% of the probability mass [145,146].Specifically, let [a jk , b jk ] denote the (1 − α) HPD interval for the LFC estimate of gene j in topic k, and let βjk denote the posterior mean.We defined the standard error as ŝjk = b jk − βjk when βjk < 0 ; oth- erwise, ŝjk = βjk − a jk .Defining the standard errors in this way prevented overshrink- ing of estimates that were uncertain but had little overlap with zero.We set the size of the HPD intervals to 1 − α = 0.68 so that the ŝjk would recover conventional standard error calculations when the posterior distribubtion is well approximated by the normal distribution.The revised posterior means and standard errors returned by the adaptive shrinkage method were then used by ashr to calculate test statistics including posterior z-scores (defined as the posterior mean divided by the posterior standard error [147]), local false sign rates (lfsr), and s-values.
An important question is the choice of n s .One heuristic way to assess whether n s is large enough is to perform two independent MCMC runs initialized with different pseudorandom number generator states ("seeds") and check consistency of the posterior estimates from the two runs (we checked consistency of the posterior estimates after stabilizing the estimates using adaptive shrinkage, as described above).In simulated data sets (below), comparison of two independent MCMC runs suggested that n s = 10,000 was sufficient to obtain reasonably accurate estimates of posterior means and posterior z-scores for all genes (Additional file 1: Fig. S2).Therefore, we performed initial MCMC simulations for all single-cell data sets using n s = 10,000 .The runtimes for per- forming these MCMC simulations on the single-cell data sets (described below), with n s = 10,000 , are given in Table 1.
Although this consistency check suggested that running a simulation with n s = 10,000 would be "good enough, " to provide additional assurance we performed another consistency assessment in the PBMC data set.We found that even better consistency was achieved with n s = 100,000 (Additional file 1: Fig. S12).Therefore, to provide more reli- able results, the final GoM DE results were generated with n s = 100,000.
The GoM DE analysis methods are implemented in the de_analysis function in the fastTopics package [148].

Single-cell data sets
All data sets analyzed were stored as sparse n × m matrices X , where n was the number of cells and m was the number of genes or regions.The data sets are summarized in Table 1.

Table 1
GoM DE simulation running times for the single-cell data sets with n s = 10, 000 simulation states; n is the number of cells, m is the number of genes or accessibility peaks analyzed, and K is the number of topics.See "Computing environment" for more details

Preparation of scRNA-seq data
Since the topic model is a multinomial model of count data, no log-normalization or other transformation of the scRNA-seq molecule counts was needed.Furthermore, we kept all genes other than those with no variation in the data set (this is done in part to demonstrate that our methods are robust to including genes with little variation).Also note that due to the use of sparse matrix techniques in our software implementations, including genes with low variation did not greatly increase computational effort.

Preparation of single-cell ATAC-seq data
As previously suggested [19,[133][134][135]), we "binarized" the single-cell ATAC-seq data, that is, we assigned x ij = 1 ("accessible") when least one fragment in cell i mapped to peak or region j; otherwise, x ij = 0 ("inaccessible").There are at least a couple reasons for doing this.For small peaks (say, < 5 kb), read counts do not provide a reliable quanti- tative measure of accessibility in single cells.This is because the (random) first insertion restricts the space for subsequent insertions.Additionally, insertions could occur within the same site on the same allele or on each of the two alleles, complicating interpretation of the read counts.
Like the RNA molecule count data (see above), we kept all regions except those that showed no variation.

Epithelial airway data from Montoro et al. (2018)
We analyzed a mouse epithelial airway data set from [86,150].These were gene expression profiles of trachea epithelial cells in C57BL/6 mice obtained using droplet-based 3′ scRNA-seq, processed using the GemCode Single Cell Platform.We downloaded file GSE103354_Trachea_droplet_UMIcounts.txt.gz.This file also contained the cluster assignments that we compared with (in [86], the samples were subdivided into 7 clusters using a community detection algorithm).After removing genes that were not expressed in any of the cells, the data set contained molecule counts for 7193 cells and 18,388 genes (90.7% of counts were zero).

Mouse Atlas data from Cusanovich et al. (2018)
Cusanovich et al. [97] profiled chromatin accessibility by single-cell combinatorial indexing ATAC-seq (sci-ATAC-seq) [151,152] in nuclei from 13 distinct tissues of a 8-weekold male C57BL/6J mouse.Replicates for 4 of the 13 tissues were obtained by profiling chromatin accessibility in a second mouse.We downloaded the (sparse) binarized peak × cell matrix in RDS format, atac_matrix.binary.qc_filtered.rds,from the Mouse sci-ATAC-seq Atlas website [153].We also downloaded cell_metadata.txt which included cell types estimated by a clustering of the cells (see Table S1 in [97]).The full data set used in our analysis (13 tissues, including 4 replicated tissues) consisted of the binary accessibility values for 81,173 cells and 436,206 peaks (1.2% overall rate of accessibility).Note that all peaks had fragments mapping to at least 40 cells, so no extra step was taken to filter out peaks.
Separately, we analyzed the sci-ATAC-seq data from kidney only, in which peaks with fragments mapping to fewer than 20 kidney cells were removed, resulting in data set containing binary accessibility values for 6431 cells and 270,864 peaks.Base-pair positions of the peaks were based on Mouse Genome Assembly mm9 (NCBI and Mouse Genome Sequencing Consortium, Build 37, July 2007).
From the Mouse sci-ATAC-seq website, we also downloaded the file master_cic-ero_conns.rds containing the Cicero co-accessibility predictions [95,153], which we used to link chromatin accessibility peaks to genes.For the kidney data, we connected a peak given in the "Peak2" column of the Cicero co-accessibility data table to a gene given in the "peak1.tss.gene_id"column if the "cluster" column was 11, 18, 22, or 25 (these four clusters were the main kidney-related clusters identified in [97]).This extracted, for each gene, the distal and proximal sites connected to the gene associated with Peak1 (specifically, a gene in which the transcription start site overlaps with Peak1).Among the 22,194 genes associated with at least one peak, the median number of peaks connected to a gene was 19, and the largest number of peaks was 179 (for Bahcc1 on chromosome 11).Among the 270,864 peaks included in the topic modeling analysis, 113,489 (42%) were connected to at least one gene, 95% of peaks were connected to 10 genes or fewer, and the largest number of connected genes was 60.

Human hematopoietic system data from Buenrostro et al. (2018)
Buenrostro et al. [104] used FACS to isolate 10 hematopoietic cell populations from human bone marrow and blood; then, the cells were assayed using single-cell ATAC-seq.
The processed single-cell ATAC-seq data were downloaded from [154], specifically file GSE96769_scATACseq_counts.txt.gzcontaining the fragment counts and file GSE96769_PeakFile_20160207.bed.gzcontaining peaks obtained from bulk ATAC-seq data [104].Although there may be benefits to calling peaks using aggregated single-cell data instead [155], we used the original accessibility data based on the bulk ATAC-seq peaks so that our analysis was more directly comparable to the analysis of [104].Following [104,155], we extracted the 2034 samples passing quality control filters; then, we "binarized" the counts.The list of 2034 cells considered "high quality" was obtained from file metadata.tsvincluded in the online benchmarking repository [155].After removing peaks with fragments mapping to fewer than 20 cells, the final data set used in our analysis consisted of binary accessibility values for 2034 cells and 126,719 peaks (4.6% overall rate of accessibility).Base-pair positions of the peaks were based on human genome assembly 19 (Genome Reference Consortium Human Build 37, February 2009).
In [104], a large, patient-specific batch effect was identified in the accessibility profiles for the HSC cells, and therefore, steps were taken in [104] to normalize the accessibility data before performing PCA.We instead fit the topic model to the unnormalized binary accessibility values, in part to find out how well the topic model can cope with the complication of a batch effect.In agreement with [104], this batch effect is at least partly captured by the topics, although in our analysis, the batch effect also appeared in MPP cells and, to a lesser extent, in CMP cells (Additional file 1: Fig. S11).

Fitting the topic models
In brief, we took the following steps to fit a topic model.All these steps are implemented in the R package fastTopics.
First, we fit a Poisson NMF model [37,156], where ∈ R n×m is a matrix of the same dimension as X with entries ij ≥ 0 giving the Poisson rates for the counts x ij .The parameters of the Poisson NMF model are stored as two matrices, H ∈ R n×K , W ∈ R m×K , with non-negative entries h ik , w jk .fastTopics has efficient implementations of algorithms for computing maximum-likelihood estimates (MLEs) of W, H [48]. Second, we recovered MLEs of F, L from MLEs of W, H by a simple reparameteriza- tion [48].
In an empirical comparison of Poisson NMF algorithms with count data sets, including scRNA-seq data sets [48], we found that a simple co-ordinate descent (CD) algorithm [157,158], when accelerated with the extrapolation method of Ang and Gillis [159], almost always produced the best Poisson NMF (and topic model) fits, and in the least amount of time.To confirm this, we compared topic model fits obtained by running the same four algorithms that were compared in [48]-EM and CD, with and without extrapolation-on the PBMC data set and assessed the quality of the fits.We evaluated the model fits in two ways: using the likelihood and using the residuals of the Karush-Kuhn-Tucker (KKT) first-order conditions (the residuals of the KKT system should vanish as the algorithm approaches maximum-likelihood estimates of W, H ). Following [48], to reduce the possibility that multiple optimizations converge to different local maxima of the likelihood, which could complicate these comparisons, we first ran 1000 EM updates; then, we examined the performance of the algorithms after this initialization phase (Additional file 1: Figs.S13, S14).Consistent with [48], the extrapolated CD updates always produced the best fit, or at the very least a fit that was no worse than the other algorithms, and almost always converged on a solution more quickly than the other algorithms.Therefore, subsequently we used the extrapolated CD updates to fit the topic models.(15) In more detail, the pipeline for fitting topic models consisted of the following steps: (1) initialize W using Topic-SCORE [160], (2) perform 10 CD updates of H , with W fixed, (3) perform 1000 EM updates (without extrapolation) to get close to a solution ("prefitting phase"), ( 4) run an additional 1000 extrapolated CD updates to improve the fit ("refinement phase"), and (5) recover F, L from W, H by a simple transformation.The prefitting phase was implemented by calling fit_poisson_nmf from fastTopics with these settings: numiter = 1000, method = "em", control = list(numiter = 4).The refinement phase was implemented with a second call to fit_poisson_nmf, with numiter = 1000, method = "scd", control = list(numiter = 4,extrapolate = TRUE), in which the model fit was initialized using the fit from the prefitting phase.The topic model fit was recovered by calling poisson2multinom in fastTopics.Note that only the estimates of L were used in the GoM DE analysis.
For each data set, we fit topic models with different choices of K and compared the fits for each K by comparing their likelihoods (Additional file 1: Fig. S5).

Visualizing the membership proportions
The membership proportions matrix L can be viewed as an embedding of the cells i = 1, . . ., n in a continuous space with K − 1 dimensions [50] (it is K − 1 dimensions because of the constraint that the membership proportions for each cell must add up to 1).A simple way to visualize this embedding in 2-d is to apply a nonlinear dimensionality reduction technique such as t-SNE [11,161] or UMAP [12] to L ([49] used t-SNE).We have also found that plotting principal components (PCs) of the membership proportions can be an effective way to explore the structure inferred by the topic model (Additional file 1: Figs.S1, S4).However, we view these visualization techniques as primarily for exploration, and a more powerful approach is to visualize all K − 1 dimensions simultaneously using a Structure plot [70,71].Here, we describe some improvements to the Structure plot for better visualization.These improvements are implemented in the structure_plot function in fastTopics.
When cells were labeled, we compared topics against labels by grouping the cells by these labels in the Structure plot.We then applied t-SNE to the L matrix, separately for each group, to arrange the cells on a line within each group.For this, we used the R package Rtsne [162] (in fastTopics, we also implemented options to arrange the cells in each group using UMAP or PCA, but in our experience we found that UMAP and PCA produced "noisier" visualizations).
Arranging the cells by 1-d t-SNE worked best for smaller groups of cells with less complex structure.For large groups of cells, or for unlabeled single-cell data sets, we randomly subsampled the cells to reduce t-SNE runtime (when cells number in the thousands, it is nearly impossible to distinguish individual cells in the Structure plot anyhow).Even with this subsampling, the Structure plot sometimes did not show fine-scale substructures or rare cell types.Therefore, in more complex cases, we first subdivided the cells into smaller groups based on the membership proportions, then ran t-SNE on these smaller groups.These groups were either identified visually from PCs of L or in a more automated way by running k-means on PCs of L (see [163]).

Gene enrichment analysis based on differential accessibility of peaks connected to genes
Here, we describe a simple approach to obtain gene-level statistics from the results of a differential accessibility analysis.This approach was applied in the topic modeling analysis of the Mouse Atlas kidney cells.
Cusanovich et al. [97] used the Cicero co-accessibility predictions and the binarized single-cell ATAC-seq data to compute a "gene activity score" R ki for each gene k and cell i.Here, we have a related but different goal: we would like to use the results of the differential accessibility analysis, which generates differential accessibility estimates and related statistics for each peak and each topic, to rank genes according to their importance to a given topic.A difficulty, however, with ranking the genes is that the Cicero co-accessibility predictions are uncertain, and they are only partially informative about which peaks are relevant to a gene.In aggregate, however, the expectation is that the "most interesting" genes will be genes that are (a) predicted by Cicero to be connected many peaks that are differentially accessible and (b) the differences in accessibility are mainly in the same direction.This suggests an enrichment analysis in which, for each gene, we test for enrichment of differential acccessibility among the peaks connected to that gene.Here, we describe a simple enrichment analysis for (a) and (b).
For (a), we computed a Bayes factor [164] measuring the support for the hypothesis that at least one of the peaks is differentially accessible (the LFC is not zero) against the null hypothesis that none of the peaks are differentially accessible.For (b), we computed the average LFC among all differentially accessible peaks (that is, peaks with nonzero LFC according to some significance criterion).
We implemented this gene enrichment analysis by running adaptive shrinkage [82] separately for each gene and topic.This had the benefit of adapting the shrinkage separately to each gene in each topic.In particular, in comparison to the usual adaptive shrinkage step for a GoM DE analysis (see above), it avoided overshrinking differences for genes exhibiting strong patterns of differential accessibility.We took the following steps to implement this adaptive shrinkage analysis.First, we ran function ash from the ashr package [144] once on the posterior mean l.e.LFC estimates βjk and their standard errors ŝjk for all topics k and all peaks j, with settings mixcom- pdist = "normal", method = "shrink".This was done only to determine the variances in the mixture prior and to get a "default" model fit to be used in the subsequent adaptive shrinkage analyses.
Next, we ran ash separately for gene and each topic k using the l.e.LFC estimates βjk and standard errors ŝjk from the peaks j connected to that gene.We set the vari- ances in the mixture prior to the variances determined from all the l.e.LFC estimates, and used ash settings mixcompdist = "normal" and pointmass = FALSE.One issue with running adaptive shrinkage using only the l.e.LFC estimates for the peaks connected to a gene is that some genes have few Cicero connections, leading to potentially unstable fits and unreliable posterior estimates.We addressed this issue by encouraging the fits toward the "default model" that was fitted to all genes and all topics; specifically, we set the Dirichlet prior on the mixture proportions to be Dirichlet(α 1 , . . ., α K ) with prior sample sizes α k = 1.01 + n 0 π default k , where here K denotes the number of components of the prior mixture (not the number of topics), π default k denotes the kth mixture proportion in the adaptive shrinkage prior for the fitted "default" model, and n 0 = 20 .This stabilized the fits for genes with few Cicero connections while still allowing some ability to adapt to genes with many connections.
Finally, we used the logLR output from ash as a measure of support for enrichment (this is the Bayes factor on the log-scale), and we computed the mean l.e.LFC as the average of the posterior mean estimates of the l.e.LFCs taken over all peaks j connected to the gene and with posterior lfsr < 0.05.

Motif enrichment analysis for differentially accessible regions
We used HOMER [105] to identify transcription factor (TF) motifs enriched for differentially accessible regions, separately for each topic estimated from the single-cell ATAC-seq data.For each topic k = 1, . . ., K , we applied the HOMER Motif Analysis tool findMotifsGenome.pl to estimate motif enrichment in differentially accessible regions; specifically, we took "differentially accessible regions" to be those with p-value less than 0.05 in the GoM DE analysis (Additional file 1: Fig. S16).These differentially accessible regions were stored in a BED file positions.bed.The exact call from the command-line shell was findMotifsGenome.plpositions.bedhg19 homer -len 8,10,12 -size 200 -mis 2 -S 25 -p 4 -h.
Note that the adaptive shrinkage step was skipped in the GoM DE analysis, so these are the p-values for the unmoderated l.e.LFC estimates.The reason for skipping the adaptive shrinkage step is that the shrinkage is performed uniformly for the LFC estimates for all regions, and since the vast majority of regions have l.e.LFC estimates are that are indistinguishable from zero, the result is that very few differentially accessible regions remain shrinkage.

Gene set enrichment analysis
We took a simple multiple linear regression approach to the gene set enrichment analysis (GSEA), in which we modeled the l.e.LFC estimate for gene i in a given topic, here denoted by y i , as y i = µ i + n j=1 x ij b j + e i , e i ∼ N (0, σ 2 ) , in which x ij ∈ {0, 1} indicates gene set membership; x ij = 1 if gene i belongs to gene set j, otherwise x ij = 0 (we rep- resented the gene-set membership as a sparse matrix since most x ij s are zero).Here, n denotes the number of candidate gene sets, and σ 2 is the residual variance to be esti- mated.The idea behind this simple approach was that the most relevant gene sets are those that best explain the log-fold changes y i , and therefore, in the multiple regression, we sought to identify these gene sets by finding coefficients b j that were nonzero with high probability.See [173,174] for similar ideas using logistic regression.Additionally, since many genes were typically differentially expressed in a given topic, modeling LFCs helped distinguish among DE genes that showed only a slight increase in expression versus those that were highly overexpressed [175,176].Of course, this simple multiple linear approach ignores uncertainty in the LFC estimates y i , which is accounted for in most gene set enrichment analyses.We addressed this issue by shrinking the l.e.LFC estimates prior to running the GSEA, that is, we took y i to be the the posterior mean LFC estimate after applying adaptive shrinkage, as described above (see the "Quantifying uncertainty and stabilizing LFC estimates" section).The result was that genes that we were more uncertain about had have an l.e.LFC estimate y i that was zero or near zero.
In detail, the GSEA was performed as follows.We performed a separate GSEA for each topic, k = 1, . . ., K .Specifically, for each topic k, we ran the susieR function susie with the following options: L = 10, intercept = TRUE, standardize = FALSE, estimate_residual_variance = TRUE, refine = FALSE, compute_univariate_zscore = FALSE and min_abs_corr = 0. We set L = 10 so that SuSiE returned at most 10 credible sets.For a given topic k, we reported a gene set as being enriched if it was included in at least one CS.We organized the enriched gene sets by (95%) credible sets.We also recorded the Bayes factor for each CS, which gives a measure of the level of support for that CS.For each gene set included in a CS, we reported the posterior inclusion probability (PIP) and the posterior mean estimate of the regression coefficient b j .In the results, we refer to b j as the "enrichment coefficient" for gene set j since it is an estimate of the expected increase in the l.e.LFC for genes that belong to gene set j relative to genes that do not belong to the gene set.Often, a CS contained only one gene set, in which case the PIP for that gene set was close to 1.In several other cases, the CS contained multiple similar gene sets; in these cases, the smaller PIPs indicated that it was difficult to choose among the gene sets because they are similar to each other (note that the sum of the PIPs in a 95% CS should always be above 0.95 and less than 1).Occasionally, SuSiE returned a CS with a small Bayes factor containing a very large number of gene sets.We excluded such CSs from the results.
We repeated these gene set enrichment analyses with two collections of gene sets: (1) all gene sets other than the MSigDB collections C1, C3, C4, and C6 and "archived" MSigDB gene sets and (2) only gene sets from curated pathway databases, specifically Pathway Commons, NCBI BioSystems and "canonical pathways" (CP) in the MSigDB C2 collection, and Gene Ontology (GO) gene sets in the MSigDB C5 collection.In all cases, we removed gene sets with fewer than 10 genes and with more than 400 genes.Table 2 gives the exact number of gene sets included in each GSEA.

Simulations
For evaluating the DE analysis methods, we generated matrices of UMI counts X ∈ R n×m for m = 10,000 genes and n = 200 or n = 1000 cells.We simulated the UMI counts x ij from a Poisson NMF model (15) in which W and H were chosen to emulate UMI counts from scRNA-seq experiments.
The matrices W and H were generated as follows.First, for each cell i, we gener- ated membership proportions l i1 , . . ., l iK then set h ik = s i l ik , for k = 1, . . ., K , where s i is the total UMI count.To simulate the wide range of total UMI counts often seen in scRNA-seq data sets, total UMI counts s i were normally distributed on the log-scale, s i = 10 u i , u i ∼ N (0, 1/5) , where N (µ, σ ) denotes the univariate normal distribution with mean µ and standard deviation σ.
Membership proportions l ik for each cell i were generated so as to obtain a wide range of mixed memberships, according to the following procedure: the number of nonzero proportions was set to K ′ ∈ {1, . . ., K } with probability 2 −K ′ ; the K ′ selected topics t 1 , . . ., t K ′ ⊆ {1, . . ., K } were drawn uniformly at random (without replacement) from 1, . . ., K ; then, the membership proportions for the selected topics were set to 1 when K ′ = 1 , or, when K ′ > 1 , they were drawn from the Dirichlet distribution with shape parameters α t 1 , . . ., α t K ′ .
Expression rates w jk were generated so as to emulate the wide distribution of gene expression levels observed in single-cell data sets and to allow for differences in expression rates among topics.The procedure for generating the expression rates for each gene j was as follows: with probability 0.5, the expression rates were the same across all topics and were generated as f j1 = • • • = f jK = 2 v j , v j ∼ N (−4, 2) .Otherwise, with probability 0.5, the expression rates were the same in all topics except for one topic.The differing topic k ′ was chosen uniformly at random from 1, . . ., K ; then, the expression rate for topic k ′ was set to f jk ′ = 2 v j +e j , e j ∼ N (0, 1) .As a result, the expression rates were roughly normally distributed on the log-scale, and the expression differences were also normally distributed on the log-scale.About half of genes had an expression difference among the topics.
Using this simulation we generated three collections of data sets.The simulation settings were altered slightly for each collection.In the first, data sets were simulated with K = 2 , α = (1/100, 1/100) , n = 200 so that most membership proportions were equal or very close to 0 or 1.In the second, we used K = 2 , α = (1, 1) , n = 200 to allow for a range of mixed memberships.In the third, we generated data sets with K = 6 , α = (1, . . ., 1) , n = 1,000.
DESeq was called from the DESeq2 R package (version 1.34.0) using settings recommended in the package vignette: test = "LRT", reduced = ∼ 1, useT = TRUE, minmu = 1e-6, minReplicatesForReplace = Inf.Size factors were calculated using the calculateSumFactors method from scran version 1.22.1 [23].The LFC estimates returned by DESeq were subsequently revised using adaptive shrinkage [82] by calling lfcShrink in DESeq2 with type = "ashr", svalue = TRUE (as in the GoM DE analysis, the DESeq2 posterior z-scores were defined as the posterior means divided by the posterior standard errors returned by the adaptive shrinkage).
To perform the GoM DE analysis in each of the simulations, we first fit a Poisson NMF model to the simulated counts X using fit_poisson_nmf from the fastTopics R package [48,148] (version 0.6-97).The loadings matrix H was fixed to the matrix used to simulate the data, and W was estimated by running 40 co-ordinate ascent updates on W alone (update.loadings= NULL, method = "scd", numiter = 40).The equivalent topic model fit was then recovered.Three GoM DE analyses were performed using the de_analysis function from the fastTopics R package, with the topic model fit provided as input: one analysis without adaptive shrinkage (shrink.method= "none"), and two analyses with adaptive shrinkage (shrink.method= "ash", ashr version 2.2-51 [144]) in which the MCMC was initialized with different pseudorandom number generator states.In all three runs, posterior calculations were performed with n s = 10,000, ε = 0.01 .Comparison of the two MCMC runs (with adaptive shrink- age) suggested that n s = 10, 000 was sufficient to obtain reasonably accurate posterior estimates in these simulations (Additional file 1: Fig. S2).

Computing environment
Most computations on real data sets were run in R 3.5.1 [185], linked to the OpenBLAS 0.2.19 optimized numerical libraries, on Linux machines (Scientific Linux 7.4) with Intel Xeon E5-2680v4 ("Broadwell") processors.For performing the Poisson NMF optimization, which included some multithreaded computations, as many as 8 CPUs and 16 GB of memory were used.The DESeq2 analysis of the PBMC data was performed in R 4.1.0,using 4 CPUs and 264 GB of memory.The evaluation of the DE analysis methods in simulated data sets was performed in R 4.1.0,using as many and 8 CPUs as 24 GB of memory.More details about the computing environment, including the R packages used, are recorded in the workflowr pages in the companion code repositories [186,187].

Fig. 4
Fig.4 GoM DE analysis vs. DESeq2 analysis in PBMC data.A and B compare differential expression in topics 1 and 2 (Fig.1) with their closely corresponding FACS cell populations.Genes are only shown if the posterior z-score was greater than 2 in magnitude in at least one of the DE analyses.Genes are colored by the "null model" expression rate.The Q-Q plot (C) compares the overall distribution of posterior z-scores for B cells and myeloid cells (x-axis) and for topics 1 and 2 (y-axis).For better visualization of quantiles near zero, posterior z-scores larger than 20 in magnitude are shown as 20 or −20.Analysis of differential expression among the 6 FACS cell populations was performed using DESeq2[79,84]

Fig. 5
Fig. 5 Evaluation of DE analysis methods in single-cell expression data sets in which cells were simulated from two groups without partial membership to these groups.A and B compare posterior mean LFC estimates and posterior z-scores returned by DESeq2 [79] and GoM DE.Each plot shows 200,000 points for 10,000 genes × 20 simulated data sets.C summarizes performance in identifying differentially expressed genes in all simulated data sets; it plots power and false discovery rates (FDR) for the three methods compared as the p-value (MAST [83]), s-value (DESeq2), or lfsr threshold (GoM DE) is varied from 0 to 1. Power and FDR are calculated from the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) as FDR = FP/(TP + FP) and power = TP/(TP + FN).See also Additional file 1: Figs.S2, S3

Table 2
Number of gene sets included in each GSEA