Single-cell transcription and methylation data was generated from a single donor from the Human Induced Pluripotent Stem Cells Initiative (HipSci) [15, 16], using the previously described protocol for single-cell methylation and transcriptome sequencing in the same cells (scM&T-seq) (see [14] for details). Line joxm_1, an induced pluripotent stem cell (iPSC) line derived from fibroblasts cells from HipSci project, was cultured and triggered into differentiation towards endoderm. scM&T-seq data was generated for 93 cells (together with 1 empty well as negative control and two 15-cell and 50-cell positive controls) at the undifferentiated time point (iPS) and the definitive endoderm time point (endoderm), yielding 186 cells for analysis.
Cell handling and differentiation
The joxm_1 IPSC line was cultured in Essential 8 (E8) media (LifeTech) according to the manufacturer’s instructions. For dissociation and plating, cells were washed × 1 with DPBS and dissociated using StemPro Accutase (Life Technologies, A1110501) at 37 °C for 3–5 min. Colonies were fully dissociated through gentle pipetting. Cells were washed × 1 with MEF medium [23] and pelleted gently by centrifuging at 285×g for 5 min. Cells were re-suspended in E8 media, passed through a 40-μm cell strainer, and plated at a density of 60,000 cells per well of a gelatin/MEF-coated 12-well plate in the presence of 10 μM Rock inhibitor—Y27632 [10 mM] (Sigma, Cat # Y0503—5 mg). Media was replaced with fresh E8 free of Rock inhibitor every 24 h post-plating. Differentiation into definitive endoderm commenced 72 h post-plating as previously described [23].
FACS preparation and analysis of cells
During all staining steps, cells were protected from light. Cells were dissociated into single cells using Accutase and washed × 1 with MEF medium as described above. Approximately 1 × 106 cells were resuspended in 0.5 mL of differentiation state-specific medium containing 5 μL of 1 mg/mL Hoechst 33342 (Thermo Scientific). Staining with Hoechst was carried out at 37 °C for 30 min. Unbound Hoechst dye was removed by washing the cells with 5 mL PBS + 2% BSA + 2 mM EDTA (FACS buffer); BSA and PBS were nuclease-free. For the staining of cell surface markers Tra-1-60 (BD560380) and CXCR4 (eBioscience 12-9999-42), cells were resuspended in 100 μL of FACS buffer with enough antibodies to stain 1 × 106 cells according to the manufacturer’s instructions and were placed on ice for 30 min. Cells were washed with 5 mL of FACS buffer, passed through a 35-μM filter to remove clumps, and re-suspended in 250 μL of FACS buffer for live cell sorting on the BD Influx Cell Sorter (BD Biosciences). Live/dead marker 7AAD (eBioscience 00-6993) was added just prior to analysis according to the manufacturer’s instructions, and only living cells were considered when determining the differentiation capacities. Living cells stained with Hoechst but not Tra-1-60 or CXCR4 were used as gating controls.
scM&T-seq
As previously described in Angermeuller et al. [14], scM&T-seq library preparation was performed following the published protocols for G&T-seq [24] and scBS-seq [25], with minor modifications as follows. G&T-seq washes were performed with 20 μl volumes, reverse transcription and cDNA amplification were performed using the original Smart-seq2 volumes [26], and Nextera XT libraries were generated from 100 to 400 pg of cDNA, using 1/5 of the published volumes. RNA-seq libraries were sequenced as 96-plexes on a HiSeq 2000 using v4 chemistry and 125 bp paired-end reads. BS-seq libraries were sequenced as 24-plexes using the same machine and settings, which yielded a mean of 7.4 M raw reads after trimming.
Gene expression quantification
For single-cell RNA-seq data, adapters were trimmed from reads using Trim Galore! [27,28,29], using default settings. Trimmed reads were mapped to the human reference genome build 37 using STAR [30] (version: 020201) in two-pass alignment mode, using the defaults proposed by the ENCODE consortium (STAR manual). Expression quantification was performed separately using Salmon [31] (version: 0.8.2), using the “--seqBias,” “--gcBias,” and “VBOpt” options on transcripts derived from ENSEMBL 75. Transcript-level expression values were summarized at the gene level (estimated counts) and quality control of scRNA-seq data was performed using scater [32]. Cells with the following features were retained for analysis: (i) at least 50,000 counts from endogenous genes, (ii) at least 5000 genes with non-zero expression, (iii) less than 90% of counts are assigned to the top 100 expressed genes per cell, (iv) less than 20% of counts are assigned to ERCC spike-in sequences, and (v) a Salmon mapping rate of at least 40%. These filters jointly removed 9 iPS cells and 36 endoderm cells from our analysis.
Splicing quantification
Of the 186 cells, 84 (iPS) and 57 (endoderm) cells passed QC on gene expression data as described above. Exon splicing rates in individual cells were quantified using the data-dependent module of BRIE [8]. BRIE calls splicing at predefined cassette exons and quantifies splicing using exon reads in single-cell data. By default, BRIE combines informative prior learned from sequence features and a likelihood calculated from RNA-seq reads by a mixture modeling framework that is similar to MISO [33]. As our aim is to model the local and global determinants of splicing, we used splicing rate estimates based on the observed data at individual exons only. We detected and quantified splicing for between 1386 and 4917 exons per cell (minimum coverage 5 reads, in total considered 6265 (iPS) and 3873 (endoderm) cassette exons that were detected in at least 10 cells for further analysis.
The following settings were used to quantify splicing with BRIE: exons have to be located on autosomes and input chromosomes and should not be overlapped by any other alternatively spliced exon. The surrounding introns have to be longer than 100 bp, the length of the alternative exon regions has to be between 50 and 450 bp with a minimum distance of 500 bp from the next TSS or TTS, and the exon has to be surrounded by AG-GT. The default annotation file gencode.v19.annotation.gtf and the reference genome GRCh37.p13.genome.fa were downloaded from https://www.gencodegenes.org/human/release_19.html (May 2018) and used for subsequent analyses.
We used three different measurements to quantify splicing ratios (PSI), namely single-cell splicing ratios, pseudo-bulk splicing ratios, and variance of splicing ratios. To calculate single-cell PSI per cassette exon per cell, we only considered splicing events that were supported by at least five reads and limited the analysis to cassette exons which were observed in at least ten cells. To derive pseudo-bulk PSI per cassette exon, we aggregated the single-cell PSI values per cassette exon. The variance of PSI per cassette exon was defined as the standard deviation of PSI across single cells.
DNA methylation pre-processing and quantification
For DNA methylation data, single-cell bisulfite sequencing (scBS-seq) data was processed as previously described [25]. Reads were trimmed with Trim Galore! [27,28,29], using default settings for DNA methylation data and additionally removing the first 6 bp. Subsequently, Bismark [34] (v0.16.3) was used to map the bisulfite data to the human reference genome (build 38), in single-end non-directional mode, which was followed by de-duplication and DNA methylation calling using default settings. We removed cells with low alignment rates (alignment rate < 15%) and cells with a library size of less than 1 M reads, resulting in 84 iPS cells and 53 endoderm cells with RNA and DNA methylation information.
To mitigate typically low coverage of scBS-seq profiles (20–40%; [17]), we applied DeepCpG [17] to impute unobserved methylation states of individual CpG sites. DNA methylation profiles in iPS and endoderm cells were imputed separately. The cell type-specific models were built using CpG and genomic information according to DeepCpG’s setup of a joint model (see [17] for details and default values; see Additional file 1: Table S1 for imputation accuracy as measured on a validation set per sample).
Predicted methylation states were binarized according to DeepCpG probability outputs as follows: sites with a probability of equal to or lower than 0.3 were set to 0 (un-methylated base), all methylation sites with a probability of greater than 0.7 were set to 1 (methylated base). Intermediate methylation levels were handled as missing. After imputation the methylation data was aligned back to human genome version 37 to match the expression data, using the UCSC lift-over tool [35].
We integrated the imputed methylation information into the DNA sequence by distinguishing methylated (M) and un-methylated (U) cytosines. Cytosines without methylation information after imputation were assigned the value of the closest cytosine with methylation information. If there was no methylation information within 900 bp around the cytosine, its state was set to un-methylated.
Cell and gene model assumptions
To assess if our PSI variation patterns follow the gene or the cell model [18], we compared the distribution of splicing rates to a binomial distribution that is expected according to the cell model and to the expected distribution according to the gene model.
The cell model assumes that each individual cell expresses only a single splice isoform, and hence models PSI variation as a bimodal distribution at the single cell level. The alternative gene model assumes splicing regulation on the gene level. The mean PSI of a gene is determined by the sequence. Each time a gene is transcribed, the probability of exon inclusion equals mean PSI. However, the limited number of transcripts leads to fluctuation in the observed PSI, and the binomial distribution is restrained by the upper boundary of the standard deviation. To obtain this upper boundary, we simulated the PSI of each cell as a binomial distribution and calculated the standard deviation across the cells. We only considered genes that were covered by at least 5 reads per cell in least 10 cells. To obtain the mean standard deviation, we repeated this simulation 400 times.
Sequence features
The genomic features used to predict the splicing ratios and its variance were based on the features described by BRIE and Xiong et al. [5, 8]. As these features were specifically designed to study exon skipping events at cassette exons, they capture sequence variation around the alternatively spliced exon. This region is first split in five genomic contexts: the alternative exon itself, the two neighboring exons and the introns between the exons. Logarithmic length, relative length, and the strength of the splice site motifs at the exon-intron boundaries were calculated per genomic context. The strength of the splice site was defined as the similarity between this splice site and known splice motives. Additional features were calculated on seven genomic contexts, the three exons and the 5′ and 3′ boundaries of the two introns. Only the two boundary contexts of the introns (300 bp length) were used since intron length is highly variable and the boundaries are found to be the most relevant contexts for splicing.
Altogether, 607 features were calculated for these genomic contexts per cassette exon: PhastCons scores [36] that describe sequence conservation, length of the sequence contexts, and sequence composition-based k-mer frequencies (with k ≤ 3) (“genomic” features, the “Methods” section, Additional file 5: Table S4). The k-mers reflect the percentage of nucleotides in the context that match the respective specific motif. The PhastCons scores were retrieved for alignments of 99 vertebrate genomes with the human genome from hg19.100way.phastCons.bw from UCSC (May 2018) [35].
In addition to the genomic features, we defined up to 826 DNA methylation features derived from the imputed DNA methylation information, including an extended k-mer alphabet that takes the methylation status into account, as well as DNA methylation average and variance (across CpG sites), in each of the 7 sequence contexts of a cassette exon. Methylation features describe the methylation patterns of either individual cells (“genomic and cell methylation” features) or averaged across cells (“genomic and mean methylation” features; Additional file 5: Table S4). More specifically, for the single-cell PSI model, we considered cell-specific methylation levels; the k-mer features were extended by including un-methylated (U) and methylated (M) cytosine into the alphabet as follows: Cytosines without methylation information after imputation were assigned the value of the closest cytosine with methylation information. If there was no methylation information within 900 bp around the cytosine, its state was set to un-methylated. For the bPSI model, we included the mean frequencies of the k-mers that contained “M” or “U” across cells and the averaged methylation values as described above.
Splicing categories
In bulk RNA-seq data, splicing events can be broadly categorized into two major categories: included and excluded. Leveraging the single-cell information, we defined more fine-grained splicing categories that reflect both splicing rates and splicing variability across cells (inspired by Song et al. [12]): (1) excluded (mean PSI < 0.2), (2) included (mean PSI > 0.8), (3) overdispersed, (4) underdispersed, and (5) multimodal (Fig. 3a). The latter three categories categorize the extent of splicing variation across cells, since cassette exons with intermediate average splicing rates (here 0.2 ≤ mean PSI ≤ 0.8, Fig. 1) exhibit substantial differences in splicing variance. To characterize cells into these three categories, we calculated the distribution of the distance between the observed and the expected variation per cell type. The expected variation was calculated by a scaled binomial standard deviation, where the scaling factor and the mean splice rate of the alternative exon [18] are fit to all data points. We then defined the overdispersed cassette exons as those for which the deviation from the expected PSI was higher than the third quartile plus 1.5x interquartile range (IQR) (corresponding to > 0.016 in iPS and > 0.022 in endoderm). Likewise, for the definition of the underdispersed cassette exons, we used the first quartile minus 1.5x IQR as the threshold (corresponding to less than − 0.032 in iPS and less than − 0.039 in endoderm cells). The remaining cassette exons were assigned to the multimodal category.
Relating DNA methylation heterogeneity and splicing
We applied Spearman correlation to link splicing at a single locus to variation in DNA methylation observed between cells. The test was performed per sequence context of the cassette exon (Fig. 1c). We only considered cassette exons where variation in splicing and variation of DNA methylation of the relevant context were observed. In total, 5280 iPS and 2622 endoderm cassette exons were tested. The P values were adjusted for multiple testing using the Q value [37, 38] package in R. The gene enrichment across the cassette exons was performed using g:Profiler [20] (version: 2017-10-25, g:Profiler Ensembl 90), using all observed cassette exons per cell type as background. Multiple testing correction for the enrichments was performed within g:Profiler.
Prediction of PSI and categories
We applied linear ridge regression to model single-cell and pseudo-bulk PSI and (multi-class) logistic ridge regression to model PSI categories. The models are based on only the genomic features or on both genomic and DNA methylation features. The performance of linear models was evaluated using Pearson r2 between predicted and observed splicing rates. For the multi-class prediction models, we applied a one-versus-rest scheme and report the per-category and the macro-average area under the receiver operating curves (AUC). To determine the most relevant individual features, we additionally trained regression models based on each single feature. Per feature, we report, in the case of the linear models, Pearson correlation (r, r2) and, in the case of the logistic models, the absolute weight multiplied by the standard deviation of the feature and the AUC. We assessed the performance and parameters of the models by using a tenfold cross validation (CV) with fixed training-validation splits. To assess the variability of prediction performances, we repeated the CV procedure four times with different CV splits. Error bars indicate ± 1 standard deviation of the respective statistic (AUC, r2).
Replication cohort
To replicate our results, we processed the mouse ES single-cell scM&T-seq data (n = 80) presented in Angermueller et al. [14]. We reprocessed the aligned RNA and DNA methylation data to quantify splicing following the same protocols that were applied to the human data, with the following changes: GRCm38 was used as a reference for imputation, genome and transcriptome annotations were based on gencode v18 (“GRCm38.p6.genome.fa” as genomic, “gencode.vM18.annotation.gff3” as transcriptomic reference, available at ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/ [August 2018]), and conservation scores were taken from “mm10.60way.phastCons.bw” downloaded from UCSC [35] (August 2018).
Out of the 80 cells, in total, 12 cells did not pass quality control on the transcriptome data, Cells with less than 500,000 sequenced reads or had less than 80% of the reads aligned to the genome were removed. Additionally, 4 cells did not pass quality on the DNA methylome data. Cells with less than 1 million reads aligned and bismark mapping efficiency below 7% were discarded. The filters yielded 68 cells that were used for the splicing analysis and 64 that are used for the analyses including DNA-methylation data. In these cells, we quantified between 649 and 1433 cassette exons per mouse ES cell (minimum coverage of 5 reads); in the replication analysis, we considered 2194 exons that were supported by at least 1 cells.
Availability of source code
Python and R were used for data processing, modeling, and visualization of the results. All regression models are based on implementations available in the package scikit-learn [39]. Software and scripts are available as jupyter notebooks at https://github.com/PMBio/scmt_splicing [40].