Splicing features
A junction score for the closest exon boundary of each iSNV was calculated based on the position weight matrices (PWMs) derived from canonical splice sites [42]. The junction score was measured by summing the information contents of positions from − 3 to + 7 for donor sites, and positions from − 13 to + 1 for acceptor sites. In addition, for onss iSNVs, the change in junction score caused by allele substitution was also computed and used as a feature.
The impact of iSNVs on RBP binding affinity was measured based on a total of 201 PWMs (position weight matrices) obtained from the RBPDB and cisBPRNA databases [57, 58]. For each iSNV, we compute the following two measures to quantify the impact on the binding of a specific RBP:

i)
The magnitude (M value) of RBP binding change upon each iSNV, which was calculated as the logarithmic ratio between the customdefined RBP binding scores with respect to the alternative allele and the reference allele.

ii)
An estimated probability (P value) of RBP binding event switch upon each iSNV, i.e., an RBP binding site switches to a nonbinding site or vice versa given the two respective alleles.
Denote p as an RBP with a binding motif with length k and q = {q_{i} = A, T, C, G, i = 1…k} as a nucleotide sequence with length k. As a prerequisite for calculating the M and P values from the PWM data, we first calculated the matching score of the sequence q to the PWM of p as:
$$ {S}_q^p={\sum}_{i=1}^k{s}_{i,{q}_i}^p $$
$$ {s}_{i,j}^p={\log}_2\frac{\left({n}_{i,j}^p+{c}_{i,j}\right)/\left(N+{\sum}_{j\in \left\{A,C,G,T\right\}}{c}_{i,j}\right)}{d_j},i=1,\dots, k,j=A,T,C,G $$
where \( {s}_{i,j}^p \) is the logarithmic ratio of the observed frequency of a specific nucleotide j in the ith position of the PWM of p versus the random background frequency, \( {n}_{i,j}^p \) is the count of base j = A, T, C, G on the ith position in the PWM of p, and c_{i, j} is a pseudocount to avoid the negative infinite value of \( {s}_{i,j}^p \) when \( {n}_{i,j}^p=0. \) N is the total number of binding sites used to derive the PWM, and d_{j} is the prior frequency of base j. In this study, we set c_{i, j} as \( \sqrt{N} \) and assumed a constant d_{j} = 0.25 for j = A, T, C, G.
To evaluate if a certain iSNV may significantly impact the binding event for an RBP binding site, we first estimate two empirical distributions—(1) the distribution of \( {S}_q^p \) for the nucleotide sequences that are true RBP binding site of p, and (2) the distribution of \( {S}_q^p \) for the nucleotide sequences that are not RBP binding site of p.
As demonstrated in previous studies [33, 36], it is rational to assume the empirical distributions of the matching scores \( {S}_q^p \) follows a Gaussian distribution. Specifically, mean and variance of \( {S}_q^p \) can be estimated by M_{p} and V_{p} defined below:
$$ {M}_p={\sum}_{i=1}^k\sum \limits_{j\in \left\{A,C,G,T\right\}}{f}_{i,j}^p\times {s}_{i,j}^p $$
$$ {V}_p={\sum}_{i=1}^k\sum \limits_{j\in \left\{A,C,G,T\right\}}\left({f}_{i,j}^p\times {s_{i,j}^p}^2{\left({f}_{i,j}^p\times {s}_{i,j}^p\right)}^2\right) $$
where \( {f}_{i,j}^p \) is the frequency of base j = A, T, C, G at the ith position of the RBP binding site p. With the PWM of p, for the true binding site, \( {f}_{i,j}^p\triangleq \frac{2^{s_{i,j}^p}}{4}=\frac{n_{i,j}^p+{c}_{i,j}}{N+{\sum}_{j\in \left\{A,C,G,T\right\}}{c}_{i,j}} \). To estimate the mean and variance of \( {S}_q^p \) for the sequences that do not form an RBP binding site, we assume \( {f}_{i,j}^p=0.25 \) for j = A, T, C, G at any position i, i.e., an even distribution of the background. The rationale of this assumption is that only a very small number of specific sequences may form the binding site; hence, in their complement set, the frequency of each base at each position tends to be even.
With these assumptions, the two empirical distributions of \( {S}_q^p \) for the nucleotide sequences that serve as true RBP binding site or nonRBP binding sites of p can be computed. It is noteworthy the two Gaussian distribution of the matching scores for binding (B) and nonbinding (NB) sites always exhibit different means and variations, (as illustrated in Additional file 1: Figure S10). With the two empirical distribution computed, we further computed the M value—the magnitude of how an iSNV affects RBP binding, as detailed below:
Denote A as a sequence with the alternative allele of the iSNV, and R as the sequence with the reference allele, their matching scores to the PWM of RBP binding site p were first computed and denoted as \( {S}_A^p \) and \( {S}_R^p \). We developed a custom score, denoted Ω(S) = Φ(S, B)/(1 − Φ(S, NB)), where S = \( {S}_A^p \) or \( {S}_R^p \). Ω is the ratio of a Φ score for a binding event against the one for a nonbinding event, given the matching score S of a specific allele. Here, Φ(S, B) or Φ(S, NB) is the cumulative distribution function (CDF) of the Gaussian distribution characterized by M_{p} and V_{p} (described earlier) for any PWM matching score S (Additional file 1: Figure S10):
$$ \Phi \left(S,B\right)={\int}_{\infty}^{\mathrm{S}}\frac{1}{\sqrt{2\pi {V}_p(B)}}{e}^{\frac{1}{2}{\left(\frac{x{M}_p(B)}{\sqrt{V_p(B)}}\right)}^2}d(x) $$
$$ \Phi \left(S, NB\right)={\int}_{\infty}^{\mathrm{S}}\frac{1}{\sqrt{2\pi {V}_p(NB)}}{e}^{\frac{1}{2}{\left(\frac{x{M}_p(NB)}{\sqrt{V_p(NB)}}\right)}^2}d(x) $$
where M_{p}(B) (M_{p}(NB)) and V_{p}(B) (V_{p}(NB)) are the mean and variance for binding (nonbinding) events.
As shown in Additional file 1: Figure S10, Φ(S, B) is the redshaded area under the density curve of binding events, symbolizing how likely to observe a particular S given that the sequence is a RBP binding site. On the other hand, 1 − Φ(S, NB) is the blueshaded area under the density curve of nonbinding events, indicating how likely to observe the S if the sequence is a nonbinding site. They are designed in this way because we assume that it is more likely to observe a larger matching score to the PWM of an RBP if the specific allele promotes RBP binding, and in contrast, it should be less probable to observe a large matching score if the allele disrupts RBP binding [36].
Thus intuitively, Ω(S), the ratio between Φ(S, B) and 1 − Φ(S, NB), is the relative extent of how likely the sequence is a binding site compared to a nonbinding site, given a specific allele (S = \( {S}_A^p \) or \( {S}_R^p \)). Then, we defined:
$$ \boldsymbol{M}={\log}_2\frac{\Omega \left({S}_A^p\right)}{\Omega \left({S}_R^p\right)} $$
for each RBP (p).
The probability (P value) that a locus switches between RBP binding and nonbinding with and without the variant can be defined as the sum of two probabilities in which the different alleles correspond to binding (B) and nonbinding (NB) events given the observed matching scores \( {S}_A^p \) and \( {S}_R^p \). For each specific RBP binding site p, we simplifiy the denotation as \( {S}_A^p\triangleq {S}_A \) and \( {S}_R^p\triangleq {S}_R \) in all the formulas following below:
$$ \boldsymbol{P}\left(\mathrm{Switch}\right)=P\left(R=B,A= NB{S}_R,{S}_A\right)+P\left(\mathrm{R}= NB,A=B{S}_R,{S}_A\right) $$
Here, R (or A) = B means the reference (or alternative) allele corresponds to an RBP binding event, and R (or A) = NB indicates it is a nonbinding event. We assume that the allele identity of the sequence (R and A) is independent of each other, and observations of the matching scores (S_{R} and S_{A}) are also independent. Therefore, by the Bayes law, the above probability can be transformed into the following cascade of equations:
$$ {\displaystyle \begin{array}{c}P\left(\mathrm{Switch}\right)=P\left(R=B,A= NB{S}_R,{S}_A\right)+P\left(R= NB,A=B{S}_R,{S}_A\right)\\ {}=\frac{P\left(R=B,A= NB\right)P\left({S}_R,{S}_AR=B,A= NB\right)}{P\left({S}_R,{S}_A\right)}+\frac{P\left(R= NB,A=B\right)P\left({S}_R,{S}_AR= NB,A=B\right)}{P\left({S}_R,{S}_A\right)}\\ {}=\frac{P(B)P(NB)P\left({S}_RB\right)P\left({S}_A NB\right)}{P\left({S}_R\right)P\left({S}_A\right)}+\frac{P(NB)P(B)P\left({S}_R NB\right)P\left({S}_AB\right)}{P\left({S}_R\right)P\left({S}_R\right)}\\ {}=\frac{P(B)P(NB)\left[P\left({S}_R\Big\Vert B\right)P\left({S}_A NB\right)+P\left({S}_R NB\right)P\left({S}_AB\right)\right]}{\left(P(B)P\left({S}_RB\right)+P(NB)P\left({S}_R NB\right)\right)\left(P(B)P\left({S}_AB\right)+P(NB)P\left({S}_A NB\right)\right)}\\ {}=\frac{P(B)\left(1P(B)\right)\left[P\left({S}_RB\right)P\left({S}_A NB\right)+P\left({S}_R NB\right)P\left({S}_AB\right)\right]}{\left(P(B)P\left({S}_RB\right)+\left(1P(B)\right)P\left({S}_R NB\right)\right)\left(P(B)P\left({S}_AB\right)+\left(1P(B)\right)P\left({S}_A NB\right)\right)}\end{array}} $$
given that P(NB) = 1 − P(B) for any sequence.
Since S_{R} and S_{A} follow Gaussian distribution for both B and NB (Additional file 1: Figure S10), the following equalities are established:
$$ P\left({S}_RB\right)=\Phi \left({S}_R,B\right) $$
$$ P\left({S}_R NB\right)=\Phi \left({S}_R, NB\right) $$
$$ P\left({S}_AB\right)=\Phi \left({S}_A,B\right) $$
$$ P\left({S}_A NB\right)=\Phi \left({S}_A, NB\right) $$
and each Φ value is calculated as described earlier.
Here, P(B) is a prior probability that a sequence is an RBP binding site. Since it is unknown, we denote it as x ∈ [0, 1] and assume x follows a beta distribution. Shape parameters α and β for the beta distribution were specific to each sequence and estimated by enforcing the following constrains: (1) α, β > 1; (2) mode of the beta distribution (mode = (α − 1)/(α + β − 2)) equals to 1/2^{IC}, where IC is the information content (i.e., sum of the expected selfinformation of the elements) of the PWM; and (3) the beta cumulative distribution function at 1/10th of the mode (i.e., mode/10) equals a predefined level 0.005.
Given any value of x, the P(Switch) becomes the conditional probability with respect to the given x:
$$ \boldsymbol{P}\left(\mathrm{Switch}\ \right\ x\Big)=\frac{x\left(1x\right)\left[\Phi \left({S}_R,B\right)\Phi \left({S}_A, NB\right)+\Phi \left({S}_R, NB\right)\Phi \left({S}_A,B\right)\right]}{\left(x\Phi \left({S}_R,B\right)+\left(1x\right)\Phi \left({S}_R, NB\right)\right)\left(x\Phi \left({S}_A,B\right)+\left(1x\right)\Phi \left({S}_A, NB\right)\right)} $$
with all calculated Φ values substituted in.
According to the Bayesian theorem, the probability P(Switch) can be calculated by the following integral:
$$ \boldsymbol{P}\left(\mathrm{Switch}\right)={\int}_0^1\boldsymbol{P}\left(\mathrm{Switch}\ \right\ x\Big)\bullet {f}_{\mathrm{beta}}\left(x,\alpha, \beta \right) dx $$
where f_{beta} is the probability density function of beta distribution, with α and β estimated as described above.
Based on all above calculations, the two measures M and P were derived for each of the 201 RBPs and used as the splicing features.
Protein structural features
For each iSNV, the protein structural features of its closest exons were evaluated. The proteindisorder score, secondary structure, and solvent accessible surface area (ASA) were precomputed for all known proteincoding genes using SPINED and SPINEX [43, 44]. The known protein domains were extracted from the Pfam database [45]. Percentages of the closest exon regions that overlap with Pfam domains were measured. The posttranslational modification sites (PTMs) were extracted from the dbPTM 3.0 database [46]. The number of PTM sites per 100 amino acids encoded by the closest exons was also calculated.
Evolutionary conservation features
Basewise conservation scores (PhyloP) of 99 vertebrate genomes were downloaded from UCSC Genome Browser [59]. The scores on the iSNVs loci, as well as the average scores of the 3bp and 7bp window regions around the iSNVs, were extracted and used in machine learning.
Machine learning model
Separate random forest classifiers were built for onss and offss iSNVs, respectively. A grid search strategy with threefold crossvalidation was used on the training set to finetune the hyperparameters, such as the number of trees and the maximum tree depth. The source code can be accessed by public repository or DOI reference links [60].
ClinVar database iSNVs
ClinVar (version 2016/05/31) was downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/). We extracted the SNVs located in intronic regions. To ensure the quality of data, we only included the iSNVs that were confirmed by at least two submitters, with the exception of pathogenic offss iSNVs where we only required a single submitter due to the limited number of such iSNVs.
GTEx and ExAC database iSNVs
SNVs from wholegenome sequencing data in GTEx release v6 were downloaded in the VCF format. We focused on the iSNVs within 300 bp of exonintron boundaries. In total, there were 17,194 onss iSNVs and 630,557 offss iSNVs. Similarly, we also downloaded the variants from wholeexome sequencing data in ExAC r0.3.1, where there were 368,489 onss iSNVs and 2,314,839 offss iSNVs. The corresponding allele frequencies were calculated for correlation analysis with predicted diseasecausing probabilities.
ASSETseq plasmid construction
The modified Exontrap plasmid is shown in Fig. 5a, and the sequence is provided in Additional file 4: Text S1. The test oligos consisted of 11 bp of the upstream exon and 60 bp of the adjacent intron containing the iSNV to be tested. Additional 22bp exonic and 19bp intronic sequences homologous to the vector were also included for the seamless insertion of the oligos into the plasmid body. Further, a single nucleotide barcode was introduced to indicate whether the transcript came from the wildtype or variant construct. The 113bp oligos containing different test iSNVs were synthesized in parallel as a pool using OligoMix (LC Sciences, Houston, TX). In the present study, the ASSETseq assays contained 82 pairs of reference and variant test sequences. The synthesized oligos were then cleaved from the chip and amplified via highfidelity PCR with primers paired to the exon and intron homology sequences (Additional file 4: Text S1). The pooled oligos were directionally inserted into the Exontrap plasmid using the NEBuilder HiFi DNA Assembly Reaction (New England Biolabs, Ipswich MA). The assembled plasmids were transformed into bacteria and plated on LB agar plates containing ampicillin. The resulting colonies were scraped and grown in LB + ampicillin medium. Plasmid DNA was isolated using HiSpeed Plasmid Maxi kit (Qiagen, Germantown, MD).
Transfection of cell culture
The plasmid library was used to transfect three human cell lines: HeLa, HEK293, and HepG2. The cells were seeded at a density of 0.9 × 10^{5} in 24well plates. Each plate contained 5 biological replicates per cell line. Twentyfour hours after cell plating, 500 ng of the library pool was complexed with 1.5 μL of Lipofectamine 3000 Reagent (Thermo Fisher Scientific, Waltham, MA) in 50 μL of OptiMEM media as per the manufacturer’s instructions before adding the transfection mixture to each well. Cell culture and transfection reagents were used without antibiotics. HeLa, HEK293, and HepG2 were authenticated using IDEXX Bioanalytics’ CellCheck 9 Plus (Columbia, MO). All were found to be within IDEXX’s range of positive identity matching.
RNA isolation and cDNA synthesis
Transfected cells were lysed in situ 48 h after transfection, and total RNA was isolated using miRNeasy mini kit with the optional DNase digestion step (Qiagen, Germantown MD) following the manufacturer’s protocol. Using 285–400 ng RNA, cDNA was synthesized with QuantiTect Reverse Transcription kit (Qiagen, Germantown, MD) following the manufacturer’s protocol.
Molecular barcoding
To identify the source cell line and replicate of the RNA transcripts, cDNA generated from the plasmid library in the transfected cells was PCR amplified using barcoded primers. A unique 6nt sequence was added to the 5′end of the forward and reverse primer for identification (Additional file 4: Text S1). In separate PCR reactions, 2 μL cDNA were amplified in a 50μL volume containing 2× Invitrogen Platinum SuperFi PCR Master Mix (Thermo Fisher Scientific, Waltham, MA) using 1 μM (final concentration) barcoded primers. PCR conditions used were 98 °C for 30 s, 66.6 °C for 5 s, and 72 °C for 15 s for 28 cycles. PCR samples were purified using the MinElute PCR Purification kit (Qiagen, Germantown, MA) following the manufacturer’s protocol and quantified with Qubit dsDNA BR Assay kit (Thermo Fisher Scientific, Waltham, MA). For sequencing by the NextSeq 500 platform (Illumina, San Diego, CA), 150 ng of each sample was pooled. Pooled samples also contained an equal amount of the original plasmid library with identification barcodes added by PCR as above. Pooled samples contained 20 uniquely barcoded sequence groups representing 5 biological replicates for 3 cell lines plus 5 input plasmid libraries.
Nextgeneration sequencing
The pooled PCR products were sequenced using the NextSeq 500 platform (Illumina, Inc., San Diego, CA). The sequencing library was created by endpolishing the barcoded PCR products, followed by adapter ligation and amplification. The resulting library was quantified, and its quality was assessed with the Agilent Bioanalyzer (Agilent Technologies, Santa Clara, CA). Approximately 90 million usable reads were generated. Raw reads were generated as fastq files for bioinformatics analysis.
Bioinformatics analysis for sequencing data
Illumina sequencing adapters were first removed from the raw reads in the fastq files using the tool cutadapt (v1.9.1) [61]. Then, the reads were demultiplexed into the 20 sequence groups according to the barcodes. Sequencing reads were aligned to the transcripts using STAR (Spliced Transcripts Alignment to a Reference, v2.5.3a) [62]. The reference sequence for the alignment was built based on the plasmid (Additional file 4: Text S1), and read counts for the spliced and aberrant (including unspliced) transcripts were documented.
Statistical analysis
Assuming heterogeneity among the sample replicates, we applied the generalized linear mixedeffect model to characterize the difference of splicing patterns between reference and alternative alleles in each of the three experimental cell lines. The splicing outcome was described by the counts of sequencing reads supporting the spliced and aberrant transcripts. Assuming Y is the sequencing read count following negative binomial distribution and x_{allele} and x_{splice} are two binary variables (i.e., 0 or 1), we set up the following regression equation with respect to the (logarithm of) expectation E(Y):
$$ \log \left(E(Y)\right)={\beta}_0+{\beta}_1\bullet {x}_{\mathrm{allele}}+{\beta}_2\bullet {x}_{\mathrm{splice}}+{\beta}_3\bullet {x}_{\mathrm{allele}}:{x}_{\mathrm{splice}}+b\bullet {\varepsilon}_{\mathrm{replicate}} $$
$$ {x}_{\mathrm{allele}}\in \left\{0\mathrm{ref},1\mathrm{alt}\right\},{x}_{\mathrm{splice}}\in \left\{0\mathrm{spliced},1\mathrm{aberrant}\right\} $$
Here, ε is the random effect among the multiple replicates. The p value of coefficient β_{3} determines whether there is a significant change in splicing outcome between the two alleles, i.e., x_{splice} significantly correlates to x_{allele}.