### Software implementations and applications

All analyses were carried out using R version 3.1.1 [47]. The method MAST [18] was implemented using the MAST R package version 0.931, obtained from GitHub at https://github.com/RGLab/MAST. The adjustment for cellular detection rate as recommended in [18] was included in the case study, but not in the simulation study (only the normal component of the test was considered here since no difference in dropout rate was simulated). The method SCDE [17] was implemented using the scde R package version 1.0, obtained from http://pklab.med.harvard.edu/scde/index.html. No adjustment for cellular detection rate was carried out since SCDE cannot accommodate covariates. Since SCDE requires raw integer counts as input, and expected counts are non-integer valued, the ceiling function was applied to the unnormalized counts. For each approach, the target FDR was controlled at 5 %. Specifically, both MAST and SCDE provide gene-specific *p* values and use the method of [35] to control FDR. We followed the same procedure here.

Our method is implemented using version 1.1.0 of the scDD R package, available at https://github.com/kdkorthauer/scDD. The analysis involves a computationally intensive permutation step, which is executed in parallel on multiple cores if available. On a Linux machine using 12 cores and up to 16 gigabytes of memory, this step took approximately 60 minutes for 1000 permutations of 1000 genes in the simulation of 50 samples per condition. Computation time scales approximately linearly with sample size, and this same task takes approximately 90 minutes for 100 samples per condition, and 300 minutes for a sample size of 500 per condition. The computation time to analyze the simulated datasets for SCDE (MAST) ranged from approximately 3 to 30 (0.5 to 5) minutes across the different sample sizes.

### hESC culture and differentiation

All cell culture and scRNA-seq experiments were conducted as described previously [30, 48]. Briefly, undifferentiated H1 and H9 hESCs were routinely maintained at the undifferentiated state in E8 medium on Matrigel (BD Bioscience) coated tissue culture plates with daily medium feeding [49]. HESCs were passaged every 3 to 4 days with 0.5 mM ethylenediaminetetraacetic acid (EDTA) in phosphate-buffered saline (PBS) at 1:10 to 1:15 ratio for maintenance. H1 were differentiated according to previously established protocols [50, 51]. All the cell cultures performed in our laboratory have been routinely tested as negative for mycoplasma contamination.

For DECs, H1 cells were individualized with Accutase (Life Technologies), seeded in E8 with BMP4 (5 ng/ml), Activin A (25 ng/ml) and CHIR99021 (1 *μ*M) for the first 2 days, then withdraw CHIR99021 for the remaining period of differentiation. DECs were harvested at the end of day 5, and sorted for the CXCR4-positive population for scRNA-seq experiments. For NPCs, the undifferentiated H1-SOX2-mCherry reporter line was treated with 0.5 mM EDTA in PBS for 3 to 5 min and seeded in E6 (E8 minus FGF2, minus TGF *β*1), with 2.5 *μ*g/ml insulin, SB431542 (10 *μ*M) and 100 ng/ml Noggin. NPCs were harvested and enriched at the end of day 7, after sorting for the Cherry-positive population for scRNA-seq experiments. All differentiation media were changed daily.

### Read mapping, quality control, and normalization

For each of the cell types studied, expected counts were obtained from RSEM [52]. In each condition there are a maximum of 96 cells, but all have fewer than 96 cells due to removal by quality control standards. Some cells were removed due to cell death or doublet cell capture, indicated by a post cell capture image analysis as well as a very low percentage of mapped reads. For more details on read mapping and quality control, see [30, 48]. DESeq normalization [53] was carried out using the MedianNorm function in the EBSeq R package [54] to obtain library sizes. The library sizes were applied to scale the count data. Further, genes with a very low detection rate (detected in fewer than 25 *%* of cells in either condition) are not considered.

### Publicly available scRNA-seq datasets

Processed FPKM-normalized data from human myoblast cells [31] were obtained from GEO [55] using accession number GSE52529. In this study, we examined the set of cells cultured on standard growth medium (samples labeled with T0) as well as those treated with differentiation-inducing medium for 72 h (samples labeled with T72). Processed TPM-normalized data from mESCs [32] were also obtained from GEO under accession number GSE60749. In this study, we examined the samples labeled as mESC (cultured in standard medium), along with the samples labeled as TwoiLIF (cultured in 2i + LIF differentiation-inhibitory medium).

### Publicly available bulk RNA-seq datasets

The modality of the gene expression distributions in bulk RNA-seq was explored using large, publicly available datasets and the results are displayed in Fig. 2. In this figure, the red bars depict the bulk RNA-seq results, and datasets are labeled according to their source and sample size. Datasets GE.50, GE.75, and GE.100 are constructed by randomly sampling 50, 75, and 100 samples from GEUVADIS [56] to obtain sample sizes comparable to the single-cell sets under study (obtained from the GEUVADIS consortium data browser at www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/analysis_results/GD660.GeneQuantCount.txt.gz). Dataset LC consists of 77 normal lung tissue samples from the TCGA lung adenocarcinoma study [57] (obtained from GEO [55] using accession number GSE40419). All datasets were normalized using DESeq normalization [53] except for LC, for which the authors supplied values already normalized by RPKM.

### Mixture model formulation

#### Dirichlet process mixture of normals

Let \({Y^{c}_{g}}=(y^{c}_{g1},\ldots,y^{c}_{g{J_{c}}})\) be the log-transformed nonzero expression measurements of gene *g* for a collection of *J*
_{
c
} cells in condition *c* out of 2 total conditions. For simplicity of presentation, we drop the dependency on *g* for now, and let the total number of cells with nonzero measurements be *J*. We assume that under the null hypothesis of equivalent distributions (i.e., no dependency on condition), *Y*={*Y*
^{c}}_{
c=1,2} can be modeled by a conjugate DPM of normals given by

$$ \begin{aligned} {y^{c}_{j}} &\sim N(\mu_{j}, \tau_{j}) \\ \mu_{j}, \tau_{j} &\sim G \\ G &\sim \operatorname{DP}(\alpha, G_{0}) \\ G_{0} & = \operatorname{NG} (m_{0}, s_{0}, a_{0}/2, 2/b_{0}) \\ \end{aligned} $$

(1)

where DP is the Dirichlet process with base distribution *G*
_{0} and precision parameter *α*, *N*(*μ*
_{
j
},*τ*
_{
j
}) is the normal distribution parameterized with mean *μ*
_{
j
} and precision *τ*
_{
j
} (i.e., with variance \(\tau _{j}^{-2}\)), and NG(*m*
_{0},*s*
_{0},*a*
_{0}/2,2/*b*
_{0}) is the normal-gamma distribution with mean *m*
_{0}, precision *s*
_{0}
*τ*
_{
j
}, shape *a*
_{0}/2, and scale 2/*b*
_{0}. Let *K* denote the number of components [unique values among \((\mu, \tau) = \{\mu _{j}, \tau _{j}\}_{j=1}^{J}\)]. Note that two observations indexed by *j* and *j*
^{′} belong to the same component if and only if \((\mu _{j}, \tau _{j})=(\mu _{j^{\prime }}, \phantom {\dot {i}\!}\tau _{j^{\prime }})\).

#### Product partition models

The posterior distribution of (*μ*,*τ*) is intractable even for moderate sample sizes. This is because the number of possible partitions (clusterings) of the data grows extremely rapidly as the sample size increases (according to the Bell number). However, if we let *Z*=(*z*
_{1},…,*z*
_{
J
}) be the vector of component memberships of gene *g* for all samples, where the number of unique *Z* values is *K*, the probability density of *Y* conditional on *Z* can be viewed as a PPM [58, 59]. Thus, it can be written as a product over all component-specific densities:

$$ f(Y|Z) = \prod_{k=1}^{K} f(y^{(k)}) \\ $$

(2)

where *y*
^{(k)} is the vector of observations belonging to component *k* and *f*(*y*
^{(k)}) is the component-specific distribution after integrating over all other parameters. In the conjugate normal-gamma setting, this has a closed form given by

$$ f(y^{(k)}) \propto \frac{\Gamma(a_{k}/2) }{(b_{k}/2)^{a_{k}/2}} s_{k}^{-1/2}. $$

(3)

The posterior for the parameters (*μ*
_{
k
},*τ*
_{
k
}) conditional on the partition is

$$ (\mu_{k}, \tau_{k}) | Y, Z \sim \operatorname{NG} (m_{k}, s_{k}, a_{k}/2, 2/b_{k}). $$

(4)

The posterior parameters (*m*
_{
k
}, *s*
_{
k
}, *a*
_{
k
}, *b*
_{
k
}) also have a closed form due to the conjugacy of the model given by Eq. 1. These parameters are given by

$$ \begin{aligned} s_{k} &= s_{0} + n^{(k)} \\ m_{k} &= \frac{s_{0} m_{0} + \sum y^{(k)}}{s_{k}} \\ a_{k} &= a_{0} + n^{(k)} \\ b_{k} &= b_{0} + \sum (y^{(k)})^{2} + s_{0}{m_{0}^{2}} - s_{k}{m_{k}^{2}} \\ \end{aligned} $$

(5)

where *n*
^{(k)} is the number of observations in component *k*. It follows that the marginal posterior distribution of *μ*
_{
k
} conditional on the partition is

$$ \mu_{k} | Y, Z \sim t_{a_{k}} \left(m_{k}, \frac{b_{k}}{a_{k} s_{k}}\right) $$

(6)

where *t*
_{
a
}(*b*,*c*) denotes the generalized Student’s *t* distribution with *a* degrees of freedom, noncentrality parameter *b*, and scale parameter *c*. The product partition DPM model can be simplified as follows:

$$ \begin{aligned} y_{j} \, | z_{j}=k, \mu_{k}, \tau_{k} &\sim N(\mu_{k}, \tau_{k}) \\ \mu_{k}, \tau_{k} &\sim \operatorname{NG} (m_{0}, s_{0}, a_{0}/2, 2/b_{0}) \\ z &\sim \frac{\alpha^{K} \Gamma(\alpha)}{\Gamma(\alpha+J)} \prod_{k=1}^{K} \Gamma(n^{(k)}). \end{aligned} $$

(7)

Then we can obtain the joint predictive distribution of the data *Y* and partition *Z* by incorporating Eq. 7:

$$ \begin{aligned} f(Y, Z) &= f(Z) \prod_{k=1}^{K} f(y^{(k)}) \\ & \propto \alpha^{K} \prod_{k=1}^{K} \frac{\Gamma(n^{(k)}) \Gamma(a_{k}/2) }{ (b_{k}/2)^{a_{k}/2}} s_{k}^{-1/2}. \end{aligned} $$

(8)

#### Model-fitting

The fitting of the model given in Eq. 7 involves obtaining an estimate \(\hat {Z}\) of the partition. The goal is to find the partition that yields the highest posterior mass in Eq. 8, referred to as the maximum a posteriori (MAP) partition estimate. Under this modeling framework, the solution for the MAP estimate is not deterministic and several computational procedures have been developed utilizing Polya urn Gibbs sampling [60–62], agglomerative greedy search algorithms [63, 64], or an iterative stochastic search [65].

These procedures generally involve evaluation of the posterior at many different candidate partitions, and as such tend to be computationally intensive. To avoid this challenge, we recognize the relation to the corresponding estimation problem in the finite mixture model framework, where the partition estimate can be obtained by optimizing the Bayesian information criterion (BIC) of the marginal density *f*(*Y*|*Z*) [66]. In fact, for certain settings of the prior distribution over partitions, the MAP estimate is identical to the estimate obtained by optimizing the BIC [59]. In practice, even when these settings are not invoked, the performance of partition estimates obtained via BIC optimization show comparable performance (see Additional file 1: Section 1). We obtain the partition estimate \(\hat {Z}\) that optimizes the BIC using the Mclust R package [66] and satisfies the criteria for multi-modality described in the next section.

The hyperparameters for the component-specific mean and precision parameters were chosen so as to encode a heavy-tailed distribution over the parameters. Specifically, the parameters were set to *μ*
_{0}=0, \({\tau _{0}^{2}}=0.01\), *a*
_{0}=0.01, and *b*
_{0}=0.01. The Dirichlet concentration parameter was set to *α*=0.01, and choosing this is shown in Additional file 1: Section 1 to be robust to many different settings in a sensitivity analysis.

#### Partition estimation

The partition estimate \(\hat {Z}\) is obtained that optimizes BIC using Mclust [66], in addition to the following filtering criteria. Note that the only constraint imposed on the number of components *K* in the modeling framework is that *K*≤*J*. However, under the sample sizes in this study, we consider only *K*≤5. The first filtering criterion is based on the notion that a two-component mixture model is not necessarily bimodal [67], and relaxes the requirement that the MAP estimate corresponds to the model with the lowest BIC. Specifically, for each candidate model fitted by BIC with *K* components, a split step (if *K*=1, obtain a new partition estimate \(\hat {Z}\) with *K*=2 unique elements) or a merge step (if *K*≥2, obtain a new partition estimate \(\hat {Z}\) restricted to *K*−1 unique elements) is carried out to generate a new candidate partition. The candidate partition with the larger value of *K* becomes the partition estimate only if the component separation suggests multi-modality. Component separation between any pair of components is assessed with the bimodality index (BI) [68]:

$$ \text{BI} = 2 \times \sqrt{\frac{n_{1} n_{2}}{(n_{1}+n_{2})^{2}}} \left(\frac{| \mu_{1} - \mu_{2} | }{\sigma} \right) $$

where the component means *μ*
_{1} and *μ*
_{2} are estimated via maximum likelihood, the common within-component standard deviation *σ* is conservatively estimated with the maximum within-component standard deviation among all components, and *n*
_{1} and *n*
_{2} are the number of cells belonging to each component. BI thresholds for the split and merge step were determined empirically and vary by sample size, as multiple modes are more easily detected as sample size increases [68] (for more details see Additional file 1: Section 4).

The second filtering criterion is designed to reduce the impact of outlier cells. Specifically, components with fewer than three cells are not considered, and the merge step is also carried out if one of the components present has an extremely large variance compared to the others (more than 20 times larger than any other component). Likewise, the split step is not carried out if one of the proposed components has a variance more than 10 times larger than any other component.

### Simulation details

#### Component means and variances

Each gene was simulated based on the characteristics of a randomly sampled unimodal gene with at least 25 *%* nonzero measurements in the H1 dataset. For unimodal genes, the mean and variance were chosen to match the observed mean and variance; for bimodal genes, the component means and variances were selected to be near the observed mean and variance. The proportion of zeroes is chosen to match that observed in the randomly sampled gene, and is not varied by condition. Details are provided in the following sections.

Distances between (log-scale) component means *Δ*
_{
μ
}
*σ* in the multi-modal genes were chosen such that components were separated by a minimum of two and a maximum of six standard deviations, where the standard deviation *σ* is assumed constant (on the log-scale) across components. The specific values of *σ* used for the simulated genes are empirical estimates of the standard deviations of the unimodal case study genes (assuming a lognormal distribution on the raw scale). In this setting, the component distance can also be thought of as a fold-change within condition (across components), where the ratio of the component means (untransformed-scale) is equal to \(\mathrm {e}^{\Delta _{\mu }\hat {\sigma }}\). The ratio of the component standard deviations (raw scale) is also equal to this same fold-change (see Additional file 1: Section 2.1 for more details). The component mean distance values were chosen to represent a range of settings for which the difficulty of detecting multi-modality is widely varied, as well as to reflect the range of observed component mean distances detected empirically in the case studies.

#### Unimodal genes

The parameters of the negative binomial distribution for unimodal genes were estimated from the randomly sampled observed genes using the method of moments. These empirical parameters were used as is to simulate both conditions of EE genes, and condition 1 of DE and DB. Condition 1 of DM was simulated by decreasing the mean by half the value of *Δ*
_{
μ
}. The second condition for DE genes was simulated based on condition 1 parameters using randomly sampled fold-changes that were between two and three standard deviations of the observed fold-changes between H1 and DEC.

#### Bimodal genes

The parameters for the mixture of negative binomial distributions in bimodal genes were also generated using empirically estimated means and variances. The first (lower) component mean was decreased by half the value of *Δ*
_{
μ
} and the second (higher) component mean was increased by half the value of *Δ*
_{
μ
}.

### DD classification algorithm

Genes detected as significantly DD from the permutation test of the Bayes factor score are categorized into patterns of interest. The genes that are not classified as DE, DP, DM, or DB are considered to be no calls, abbreviated NC. These represent patterns that are not of primary interest, such as those that differ only in variance (but not in the number of components or their means). This type of difference may result from cell-specific differences in technical variation [17], which can only be decomposed from biological variation in experimental protocols that allow for independent estimation of technical effects using spike-in controls, for example [69].

An additional step to improve the power to detect genes in the DP category was also implemented. This step was motivated by the observation that the Bayes factor score tends to be small when the clustering process within each condition is consistent with that overall, as in the case of DP. Thus, for genes that were not significantly DD by permutation but had the same number of components within condition as overall, Fisher’s exact test was used to test for independence with biological condition. If the *p* value for that test is less than 0.05, then the gene was added to the DP category (this did not result in the addition of any false positives in the simulation study). In addition, since the Bayes factor score depends on the estimated partition, we increase the robustness of the approach to detect DD genes under possible misspecification of the partition by also assessing evidence of DD in the form of an overall mean shift for genes not significant by the permutation test (using a *t*-statistic with FDR controlled by [35]). This resulted in the detection of between 121 and 689 additional genes in the hESC comparisons and did not add any false positives in 94 *%* of simulation replications (with only a single false positive gene in the other 6 *%* of replications).

Here we present pseudocode for the classification of DD genes into the categories DE, DP, DM, or DB. For every pair of components, we obtain a sample of 10,000 observations from the posterior distribution of the difference in means. The components are considered to overlap if the 100 *%* credible interval contains 0.

### DD classification algorithm