RegSNPs-intron: a computational framework for predicting pathogenic impact of intronic single nucleotide variants

Single nucleotide variants (SNVs) in intronic regions have yet to be systematically investigated for their disease-causing potential. Using known pathogenic and neutral intronic SNVs (iSNVs) as training data, we develop the RegSNPs-intron algorithm, a random forest classifier that integrates RNA splicing, protein structure, and evolutionary conservation features. RegSNPs-intron shows excellent performance in evaluating the pathogenic impact of iSNVs. Using a high-throughput functional reporter assay called ASSET-seq (ASsay for Splicing using ExonTrap and sequencing), we evaluate the impact of RegSNPs-intron predictions on splicing outcome. Together, RegSNPs-intron and ASSET-seq enable effective prioritization of iSNVs for disease pathogenesis.

Training data were collected from HGMD and the 1000 Genomes Project. iSNVs were first split into two classes, on-ss and off-ss. Two-thirds of the data were used for training and the remaining one-third for validation. Features were extracted, and two separate random forest classifiers were built and then used to predict the disease-causing probabilities of on-ss and off-ss iSNVs. An independent test set from the ClinVar database was used for additional model validation. Prediction performance was evaluated separately for on-ss and off-ss iSNVs. Genomic features are features related to DNA/RNA, e.g. exon-intron junction score, RBP binding, and PhyloP sequence conservation score. Structural features are features related to protein domains or functions, e.g. ASA score, intrinsic disorder score, secondary structure score, and PTM or Pfam score.

Figure S2. Data pre-processing.
Pathogenic iSNVs (HGMD, top) are closer to exons than neutral iSNVs (1000 Genomes, middle). Thus, the two groups were not balanced in their distance from splice junction sites, and hence in their proportions of on-ss and off-ss iSNVs. Since the numbers of on-ss iSNVs were comparable in the pathogenic and neutral groups, only the large number of off-ss neutral iSNVs needed to be downsampled to avoid potential bias in machine learning. Downsampled off-ss iSNV data are shown at the bottom.
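
The pre-processing and data split described above can be illustrated with a minimal sketch. It assumes simple uniform random downsampling of the off-ss neutral set to the size of the off-ss pathogenic set; the source does not state the sampling scheme or the target size, so both are assumptions, as are the function names and the toy identifiers.

```python
import random

def downsample(items, target_size, seed=0):
    """Return a uniform random subset with at most `target_size` elements.

    Assumption: uniform sampling; the original procedure may differ.
    """
    rng = random.Random(seed)
    if len(items) <= target_size:
        return list(items)
    return rng.sample(items, target_size)

def split_two_thirds(items, seed=0):
    """Shuffle and split into (training, validation) at a 2/3 : 1/3 ratio."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy identifiers, not real variants.
pathogenic_off_ss = [f"path_{i}" for i in range(30)]
neutral_off_ss = [f"neut_{i}" for i in range(3000)]

# Balance the classes, then split each class separately.
neutral_down = downsample(neutral_off_ss, len(pathogenic_off_ss))
train_neut, valid_neut = split_two_thirds(neutral_down)
train_path, valid_path = split_two_thirds(pathogenic_off_ss)
```

Splitting each class separately keeps the pathogenic-to-neutral ratio the same in the training and validation sets.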

Figure S3. Distribution of changes in splice-junction scores.
Pathogenic iSNVs (red) had significantly lower junction scores than neutral iSNVs (black) at both acceptor splice sites (A) and donor splice sites (B).

Figure S4. Average RBP binding score changes.
RBP binding score changes for on-ss (A) and off-ss (B) iSNVs. Each dot represents one RBP.
The x-axis is the average RBP binding score change induced by pathogenic iSNVs (binding score with the alternative allele minus binding score with the reference allele). The y-axis is the average RBP binding score change induced by neutral iSNVs.
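
As a small illustration of how each dot could be computed, the sketch below averages the per-variant score change (alternative minus reference) for each RBP, once over pathogenic and once over neutral iSNVs. The record layout, the function name, and the toy scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def average_score_change(records):
    """Map each RBP to the mean of (alt_score - ref_score) over its records.

    `records` is an iterable of (rbp, ref_score, alt_score) tuples.
    """
    changes = defaultdict(list)
    for rbp, ref_score, alt_score in records:
        changes[rbp].append(alt_score - ref_score)
    return {rbp: mean(values) for rbp, values in changes.items()}

# Hypothetical toy scores, not real data.
pathogenic = [("RBP_A", 0.5, 0.25), ("RBP_A", 0.75, 0.25), ("RBP_B", 0.25, 0.5)]
neutral = [("RBP_A", 0.5, 0.5), ("RBP_B", 0.25, 0.25)]

x = average_score_change(pathogenic)  # x-axis: pathogenic iSNVs
y = average_score_change(neutral)     # y-axis: neutral iSNVs

# One (x, y) point per RBP present in both groups.
points = {rbp: (x[rbp], y[rbp]) for rbp in x.keys() & y.keys()}
```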

Figure S5. Cumulative probability density of protein structural features.
Pathogenic on-ss (red, A) and off-ss (red, D) iSNVs had lower disorder scores than the neutral iSNVs (black in A and D), indicating that pathogenic iSNVs are more likely to be located near exons encoding structured peptide regions. In addition, pathogenic iSNVs (on-ss, red in B; off-ss, red in E) tend to be located near exons with smaller average ASA scores than neutral iSNVs (black in B and E). Moreover, pathogenic iSNVs (on-ss and off-ss, red in C and F respectively) are also more likely to be located in the vicinity of exons encoding protein regions that overlap with Pfam domains.

Pathogenic iSNVs are at loci that have significantly higher PhyloP scores than neutral iSNVs (on-ss, A; off-ss, B).

The matching scores for RBP binding sites (red curve) and non-binding sites (blue curve) each follow a Gaussian distribution. The shape of each distribution (mean and variance) was determined from the actual data. For any matching score S derived from the PWM for a given RBP, the score for being a binding event equals Φ(S, B), the red-shaded area under the red curve, whereas the score for being a non-binding event equals 1 - Φ(S, NB), the blue-shaded area under the blue curve.
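
The two Gaussian-based scores can be sketched as follows, reading Φ(S, ·) as the cumulative distribution function of the fitted Gaussian evaluated at S. The fitting step (sample mean and standard deviation), the function names, and the toy matching scores are assumptions for illustration only.

```python
from statistics import NormalDist

def fit_gaussian(scores):
    """Fit a Gaussian to observed matching scores (sample mean and stdev)."""
    return NormalDist.from_samples(scores)

def binding_score(s, binding_dist):
    """Phi(S, B): area under the binding curve to the left of S."""
    return binding_dist.cdf(s)

def non_binding_score(s, non_binding_dist):
    """1 - Phi(S, NB): area under the non-binding curve to the right of S."""
    return 1.0 - non_binding_dist.cdf(s)

# Hypothetical PWM matching scores for one RBP, not real data.
binding = fit_gaussian([8.0, 9.0, 10.0, 11.0, 12.0])
non_binding = fit_gaussian([1.0, 2.0, 3.0, 4.0, 5.0])

s = 9.5
b = binding_score(s, binding)
nb = non_binding_score(s, non_binding)
```

Under this reading, a higher matching score raises the binding score and lowers the non-binding score.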