Obstacles to detecting isoforms using full-length scRNA-seq data

Background: Early single-cell RNA-seq (scRNA-seq) studies suggested that it was unusual to see more than one isoform being produced from a gene in a single cell, even when multiple isoforms were detected in matched bulk RNA-seq samples. However, these studies generally did not consider the impact of dropouts or isoform quantification errors, which may have confounded their results.

Results: In this study, we take a simulation-based approach in which we explicitly account for dropouts and isoform quantification errors. We use our simulations to ask to what extent it is possible to study alternative splicing using scRNA-seq. Additionally, we ask what limitations must be overcome to make splicing analysis feasible. We find that the high rate of dropouts associated with scRNA-seq is a major obstacle to studying alternative splicing. In mice and other well-established model organisms, the relatively low rate of isoform quantification errors poses a lesser obstacle to splicing analysis. We find that different models of isoform choice meaningfully change our simulation results.

Conclusions: To accurately study alternative splicing with single-cell RNA-seq, a better understanding of isoform choice and the errors associated with scRNA-seq is required. An increase in the capture efficiency of scRNA-seq would also be beneficial. Until some or all of the above are achieved, we do not recommend attempting to resolve isoforms in individual cells using scRNA-seq.

Fig. S20: Some models of isoform choice are more plausible than others. We model the probability of picking any given isoform as a Normal distribution, a Bernoulli distribution and a constant probability, all with the same mean (0.25) (top row of graphs). In the following rows, we show the distributions of the mean number of isoforms per gene per cell detected when each model of isoform choice is used. The second row is H1 hESCs sequenced at 4 million reads, the third row is H9 hESCs sequenced at 1 million reads, the fourth row is H9 hESCs sequenced at 4 million reads.

Fig. S21: Some models of isoform choice are more plausible than others. We model the probability of picking any given isoform as a Normal distribution, a Bernoulli distribution and a constant probability, all with the same mean (0.25) (top row of graphs). In the following rows, we show the distributions of the overlap fraction when each model of isoform choice is used. The second row is H1 hESCs sequenced at 1 million reads per cell, the third row is H1 hESCs sequenced at 4 million reads, the fourth row is H9 hESCs sequenced at 1 million reads, the fifth row is H9 hESCs sequenced at 4 million reads.

Table S9: Results of K-sample Anderson-Darling test, which tests whether multiple collections come from the same population. The test was applied to the simulation results generated using the Normal, Bernoulli and p=0.25 models of isoform choice to test whether the distributions generated by different isoform choice models significantly differ.

Fig. S1: Negative control model for H1 hESCs. In the simulation results displayed, no dropouts or quantification errors were simulated. The simulation procedure was otherwise unchanged.

Fig. S2: Negative control model for H1 hESCs. In the simulation results displayed, no dropouts were simulated. The simulation procedure was otherwise unchanged.

Fig. S3: Negative control model for H1 hESCs. In the simulation results displayed, no quantification errors were simulated. The simulation procedure was otherwise unchanged.

Fig. S4: The effect of sequencing depth on isoform detection. a Distributions of the mean number of isoforms detected per gene per cell for H9 hESCs whose cDNA was split and sequenced at approximately 1 million reads per cell or 4 million reads per cell on average. b Distributions of the overlap fraction with the ground truth.

Fig. S5: The impact of dropouts on isoform detection. a shows the distribution of the probabilities of dropouts (p(Dropout)) in each group of H9 hESCs and an approximation of these distributions using a Beta distribution. At 1 million reads per cell, α = 1.31 and β = 0.74 in the approximated Beta distribution. At 4 million reads per cell, α = 0.72 and β = 1.03 in the approximated Beta distribution. b shows five Beta distributions from which dropout probabilities were sampled in the simulations used to generate c and d. In c, the distribution of the mean number of isoforms detected per gene per cell is shown for simulations in which one isoform was produced per gene per cell. Each plot corresponds to a simulation in which dropout probabilities were sampled from one of the distributions shown in b. d shows the overlap fraction with the ground truth for each simulation. Plots shown in c & d are for H9 hESCs sequenced at 4 million reads per cell.
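
To make the role of the Beta parameters in Fig. S5 concrete, the following is a minimal sketch, not the simulation code used in this study. It assumes one isoform is produced per gene, draws a dropout probability for each isoform from a Beta(α, β) distribution, and counts the isoform as detected if it does not drop out. The function name `mean_isoforms_detected`, the gene count and the seed are hypothetical, and quantification errors are ignored.

```python
import random


def mean_isoforms_detected(n_genes, alpha, beta, seed=0):
    """Mean number of isoforms detected per gene when exactly one isoform
    is produced per gene and each isoform drops out with a probability
    drawn from Beta(alpha, beta). Illustrative only."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_genes):
        p_dropout = rng.betavariate(alpha, beta)  # per-isoform dropout probability
        if rng.random() >= p_dropout:             # isoform survives the dropout step
            detected += 1
    return detected / n_genes


# Beta parameters quoted in the caption for H9 hESCs.
mean_1m = mean_isoforms_detected(20000, alpha=1.31, beta=0.74)  # ~1 million reads per cell
mean_4m = mean_isoforms_detected(20000, alpha=0.72, beta=1.03)  # ~4 million reads per cell
print(f"1M reads: {mean_1m:.3f}, 4M reads: {mean_4m:.3f}")
```

Under these assumptions the expected detection rate is 1 − α/(α + β), so the fitted parameters imply that roughly a third of the produced isoforms would be detected at 1 million reads per cell and somewhat more than half at 4 million reads per cell.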

Fig. S9: Different models of isoform choice alter our ability to detect isoforms. a Distributions of the mean number of isoforms detected per gene per cell for H1 hESCs sequenced at approximately 1 million reads per cell using the Weibull model of isoform choice [2, 3]. b shows the same distributions when the random model is used. c shows the distributions when the inferred probabilities model is used. d shows the distributions when the cell variability model is used. See the main text for a detailed description of each model.

Fig. S10: Different models of isoform choice alter our ability to detect isoforms. a Distributions of overlap fraction with the ground truth for H1 hESCs sequenced at approximately 1 million reads per cell using the Weibull model of isoform choice [2, 3]. b shows the same distributions when the random model is used. c shows the distributions when the inferred probabilities model is used. d shows the distributions when the cell variability model is used. See the main text for a detailed description of each model.

Fig. S11: Different models of isoform choice alter our ability to detect isoforms. a Distributions of the mean number of isoforms detected per gene per cell for H9 hESCs sequenced at approximately 4 million reads per cell using the Weibull model of isoform choice [2, 3]. b shows the same distributions when the random model is used. c shows the distributions when the inferred probabilities model is used. d shows the distributions when the cell variability model is used. See the main text for a detailed description of each model.

Fig. S12: Different models of isoform choice alter our ability to detect isoforms. a Distributions of overlap fraction with the ground truth for H9 hESCs sequenced at approximately 4 million reads per cell using the Weibull model of isoform choice [2, 3]. b shows the same distributions when the random model is used. c shows the distributions when the inferred probabilities model is used. d shows the distributions when the cell variability model is used. See the main text for a detailed description of each model.

Fig. S13: Different models of isoform choice alter our ability to detect isoforms. a Distributions of the mean number of isoforms detected per gene per cell for H9 hESCs sequenced at approximately 1 million reads per cell using the Weibull model of isoform choice [2, 3]. b shows the same distributions when the random model is used. c shows the distributions when the inferred probabilities model is used. d shows the distributions when the cell variability model is used. See the main text for a detailed description of each model.

Fig. S14: Different models of isoform choice alter our ability to detect isoforms. a Distributions of overlap fraction with the ground truth for H9 hESCs sequenced at approximately 1 million reads per cell using the Weibull model of isoform choice [2, 3]. b shows the same distributions when the random model is used. c shows the distributions when the inferred probabilities model is used. d shows the distributions when the cell variability model is used. See the main text for a detailed description of each model.

Fig. S15: Distributions of the probabilities of dropouts for the isoforms selected by the Weibull model when one, two, three and four isoforms were picked by the model.

Fig. S16: Distributions of the probabilities of dropouts for the isoforms selected by the Random model when one, two, three and four isoforms were picked by the model.

Fig. S17: Distributions of the probabilities of dropouts for the isoforms selected by the inferred probabilities model when one, two, three and four isoforms were picked by the model.

Fig. S18: Distributions of the probabilities of dropouts for the isoforms selected by the cell variable model when one, two, three and four isoforms were picked by the model.

Fig. S19: Distributions of the mean number of isoforms detected per gene per cell under different isoform choice models when dropout probabilities are sampled from the Beta distributions in Figure 3B in the main text.

Fig. S22: a Histograms of mean isoform expression, ordered by isoform rank. b Histograms of dropout probability, ordered by isoform rank. All plots shown are for H1 hESCs sequenced at 4 million reads per cell.

Fig. S23: a Histograms of mean isoform expression, ordered by isoform rank. b Histograms of dropout probability, ordered by isoform rank. All plots shown are for H9 hESCs sequenced at 1 million reads per cell.

Fig. S24: a Histograms of mean isoform expression, ordered by isoform rank. b Histograms of dropout probability, ordered by isoform rank. All plots shown are for H9 hESCs sequenced at 4 million reads per cell.

Fig. S25: Mixture models. a and b Distributions of detected isoforms per gene per cell (blue) and log normal fitted distributions (orange) for H1 cells sequenced at 1 million reads per cell (a) or 4 million reads per cell (b) under the random model [2]. c and d Mixing fractions vs iterations of expectation maximisation for 1 million reads per cell (c) and 4 million reads per cell (d). Each coloured line represents the distributions for one, two, three or four isoforms being simulated as expressed per gene per cell.
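
The mixing-fraction traces described in the mixture model captions can be illustrated with a minimal sketch of expectation maximisation for a mixture of log-normal components whose shapes are held fixed, so that only the mixing fractions are updated. This is an illustration under stated assumptions, not the fitting code used in this study: the function names, the three component parameters, the true fractions and the synthetic data are all hypothetical.

```python
import math
import random


def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal distribution with log-scale mean mu and sd sigma."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def em_mixing_fractions(data, components, n_iter=200):
    """EM for the mixing fractions of a mixture with fixed log-normal components.
    Returns the final fractions and their value after every iteration."""
    k = len(components)
    fractions = [1.0 / k] * k          # start from equal mixing fractions
    history = [fractions[:]]
    # Component densities do not change, so evaluate them once per data point.
    dens = [[lognormal_pdf(x, mu, s) for mu, s in components] for x in data]
    for _ in range(n_iter):
        totals = [0.0] * k
        for row in dens:
            joint = [f * d for f, d in zip(fractions, row)]
            norm = sum(joint)
            for j in range(k):
                totals[j] += joint[j] / norm   # E-step: responsibility of component j
        fractions = [t / len(data) for t in totals]  # M-step: average responsibility
        history.append(fractions[:])
    return fractions, history


# Hypothetical components and synthetic data, for illustration only.
rng = random.Random(1)
components = [(0.0, 0.2), (0.7, 0.2), (1.1, 0.2)]   # (mu, sigma) on the log scale
true_fractions = [0.6, 0.3, 0.1]
data = []
for _ in range(3000):
    mu, sigma = rng.choices(components, weights=true_fractions)[0]
    data.append(rng.lognormvariate(mu, sigma))

fractions, history = em_mixing_fractions(data, components)
print([round(f, 3) for f in fractions])
```

Plotting each column of `history` against the iteration index gives one line per component, which is the kind of trace shown in panels c and d.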

Fig. S26: Mixture models. a and b Distributions of detected isoforms per gene per cell (blue) and log normal fitted distributions (orange) for H1 cells sequenced at 1 million reads per cell (a) or 4 million reads per cell (b) under the inferred model [2]. c and d Mixing fractions vs iterations of expectation maximisation for 1 million reads per cell (c) and 4 million reads per cell (d). Each coloured line represents the distributions for one, two, three or four isoforms being simulated as expressed per gene per cell.

Fig. S27: Mixture models. a and b Distributions of detected isoforms per gene per cell (blue) and log normal fitted distributions (orange) for H1 cells sequenced at 1 million reads per cell (a) or 4 million reads per cell (b) under the cell variable model [2, 4]. c and d Mixing fractions vs iterations of expectation maximisation for 1 million reads per cell (c) and 4 million reads per cell (d). Each coloured line represents the distributions for one, two, three or four isoforms being simulated as expressed per gene per cell.

Fig. S28: Mixture models. a and b Distributions of detected isoforms per gene per cell (blue) and log normal fitted distributions (orange) for H9 cells sequenced at 1 million reads per cell (a) or 4 million reads per cell (b) under the Weibull model [2, 3]. c and d Mixing fractions vs iterations of expectation maximisation for 1 million reads per cell (c) and 4 million reads per cell (d). Each coloured line represents the distributions for one, two, three or four isoforms being simulated as expressed per gene per cell.

Fig. S29: Mixture models. a and b Distributions of detected isoforms per gene per cell (blue) and log normal fitted distributions (orange) for H9 cells sequenced at 1 million reads per cell (a) or 4 million reads per cell (b) under the random model [2]. c and d Mixing fractions vs iterations of expectation maximisation for 1 million reads per cell (c) and 4 million reads per cell (d). Each coloured line represents the distributions for one, two, three or four isoforms being simulated as expressed per gene per cell.

Fig. S30: Mixture models. a and b Distributions of detected isoforms per gene per cell (blue) and log normal fitted distributions (orange) for H9 cells sequenced at 1 million reads per cell (a) or 4 million reads per cell (b) under the inferred model [2]. c and d Mixing fractions vs iterations of expectation maximisation for 1 million reads per cell (c) and 4 million reads per cell (d). Each coloured line represents the distributions for one, two, three or four isoforms being simulated as expressed per gene per cell.

Fig. S31: Mixture models. a and b Distributions of detected isoforms per gene per cell (blue) and log normal fitted distributions (orange) for H9 cells sequenced at 1 million reads per cell (a) or 4 million reads per cell (b) under the cell variable model [2, 4]. c and d Mixing fractions vs iterations of expectation maximisation for 1 million reads per cell (c) and 4 million reads per cell (d). Each coloured line represents the distributions for one, two, three or four isoforms being simulated as expressed per gene per cell.

Table S1: Results of K-sample Anderson-Darling test, which tests whether multiple collections come from the same population. The test was applied to each row of graphs in Figure 5, in other words testing whether the distributions generated by different isoform choice models are significantly different.

Table S2: Results of K-sample Anderson-Darling test, which tests whether multiple collections come from the same population. The test was applied to the simulation results generated using the Inferred Probabilities vs the Cell Variable models of isoform choice in Figure 5 to test whether the distributions generated by different isoform choice models significantly differ.

Table S3: Results of K-sample Anderson-Darling test, which tests whether multiple collections come from the same population. The test was applied to each row of graphs in Supplementary Figure 8, in other words testing whether the distributions generated by different isoform choice models are significantly different.

Table S4: Results of K-sample Anderson-Darling test, which tests whether multiple collections come from the same population. The test was applied to the simulation results generated using the Inferred Probabilities vs the Cell Variable models of isoform choice in Supplementary Figure 8 to test whether the distributions generated by different isoform choice models significantly differ.

Table S6: Results of K-sample Anderson-Darling test, which tests whether multiple collections come from the same population. The test was applied to the simulation results generated using the Inferred Probabilities vs the Cell Variable models of isoform choice in Supplementary Figure 10 to test whether the distributions generated by different isoform choice models significantly differ.

Table S7: Results of K-sample Anderson-Darling test, which tests whether multiple collections come from the same population. The test was applied to each row of graphs in Supplementary Figure 12, in other words testing whether the distributions generated by different isoform choice models are significantly different.

Table S8: Results of K-sample Anderson-Darling test, which tests whether multiple collections come from the same population. The test was applied to the simulation results generated using the Inferred Probabilities vs the Cell Variable models of isoform choice in Supplementary Figure 12 to test whether the distributions generated by different isoform choice models significantly differ.