Analysis of transcript-deleterious variants in Mendelian disorders: implications for RNA-based diagnostics

At least 50% of patients with suspected Mendelian disorders remain undiagnosed after whole-exome sequencing (WES), and the extent to which non-coding variants that are not captured by WES contribute to this fraction is unclear. Whole transcriptome sequencing is a promising supplement to WES, although empirical data on the contribution of RNA analysis to the diagnosis of Mendelian diseases on a large scale are scarce. Here, we describe our experience with transcript-deleterious variants (TDVs) based on a cohort of 5647 families with suspected Mendelian diseases. We first interrogate all families for which the respective Mendelian phenotype could be mapped to a single locus to obtain an unbiased estimate of the contribution of TDVs at 18.9%. We examine the entire cohort and find that TDVs account for 15% of all “solved” cases. We compare the results of RT-PCR to in silico prediction. Definitive results from RT-PCR are obtained from blood-derived RNA for the overwhelming majority of variants (84.1%), and only a small minority (2.6%) fail analysis on all available RNA sources (blood-, skin fibroblast-, and urine renal epithelial cells-derived), which has important implications for the clinical application of RNA-seq. We also show that RNA analysis can establish the diagnosis in 13.5% of 155 patients who had received “negative” clinical WES reports. Finally, our data suggest a role for TDVs in modulating penetrance even in otherwise highly penetrant Mendelian disorders. Our results provide much needed empirical data for the impending implementation of diagnostic RNA-seq in conjunction with genome sequencing.


Introduction
Genome sequencing, enabled by the advent of next-generation sequencing (NGS) technologies, has changed the landscape of diagnostics in the Mendelian diseases space [1]. Whole-exome sequencing (WES) is the most popular NGS diagnostic application and has achieved a diagnostic rate of 25-52% across the spectrum of Mendelian disorders, although higher figures have been reported for certain phenotypic categories [2][3][4][5]. The minimal boost of diagnostic yield offered by whole-genome sequencing (WGS) over WES suggests that the bottleneck is not in the capture/calling of the causal variants in the sequencing stage but rather in their interpretation [6,7]. This notion is supported by studies showing the value of careful reinterpretation of "negative" WES and how misinterpreting the causal variants in WES is a major challenge that cannot be circumvented by WGS [7,8]. Therefore, there is a growing interest in exploring transcriptomics to improve variant interpretation [9]. Indeed, published data suggest an enrichment of "negative" WES cases for cryptic splice-altering variants that are not easily predicted in silico [10,11].
Coding genomic variants modulate phenotypes through their effect on proteins while non-coding variants (NCV) mediate their effects through RNA either directly (transcript-level) or indirectly (chromatin-level). In the context of Mendelian diseases, estimates vary widely on the contribution of variants that affect splicing to the overall mutation pool (15-60% of disease-causing variants) [12]. Two major challenges preclude accurate estimation of this important class of disease-causing mutations. First, many "coding" variants that are presumed to exert their pathogenicity at the protein level are in fact splicing variants whose effect on splicing was never empirically determined. These not only include single base-pair substitutions that may or may not alter the amino acid sequence (nonsynonymous and synonymous missense), but also include protein-truncating variants [13]. Another major challenge is the clear reporting bias in the literature where variants that impact consensus splicing codes are more likely to be tested and reported. Deep intronic, UTR and promoter/enhancer variants are far less likely to be uncovered by conventional Sanger or WES and, even when captured by WGS, are very difficult to interpret using in silico tools despite their clear contribution to Mendelian diseases [14][15][16][17].
Transcriptomics, therefore, holds a promising role in delineating Mendelian phenotypes that are caused by variants that are deleterious at the transcript level [18]. These include variants that reduce the abundance of the transcript, e.g., nonsense-mediated decay (NMD), as well as those that create aberrant splicing. Early experience with RNA-Seq (massively parallel sequencing of RNA) suggests its potential to reveal variants that have been missed at the sequencing stage as well as those that have been missed at the interpretation stage [10,11,[19][20][21]. It is also clear from these studies, however, that there are unique computational challenges to this technology, and although several computational tools have been developed, there is a growing need for a deeper understanding of the nature of transcript-deleterious variants to inform better tools. We have previously shown in a pilot study the power of positional mapping as a tool that is agnostic to the underlying class of mutation to provide unbiased estimate of NCVs [8]. In this study, we provide based on comprehensive positional mapping of 5647 families with suspected Mendelian phenotypes a detailed overview of transcriptlevel deleterious variants and their contribution to Mendelian phenotypes in humans.
We then interrogate the translational potential of that knowledge by exploring the role of RNA-based approaches in patients with "negative" clinical WES results.

Human subjects
Subjects described in this study represent combined cohorts recruited under individual IRB-approved research protocols (KFSHRC RAC# 2121053, 2080006 and 2070023). In each of these protocols, we selectively recruited individuals with at least one of the following features: (a) positive family history consistent with a Mendelian inheritance of the disorder and (b) phenotypic presentation consistent with a previously published Mendelian disease. Informed consent was obtained from all subjects prior to their enrollment. Phenotypic data were collected from all subjects. Blood was collected in EDTA tubes for DNA extraction and in sodium heparin tubes for the establishment of lymphoblastoid cell lines (LCL). Occasionally, blood collected in PAXGene tubes was the only source of RNA. In a subset of cases, cultured skin-derived fibroblasts and urine-derived renal epithelial cells were also obtained as an additional source of RNA.

Positional mapping, WES, and variant identification
The method of combining positional mapping and variant identification using WES has been described elsewhere [1,22]. Briefly, all samples were genotyped on an Axiom SNP platform, and the regions of homozygosity (ROH) were determined to guide the search for the likely causal variant whenever the phenotype and family history are compatible with autosomal recessive inheritance. WES was performed as described before, and the resulting variants were filtered by the autozygome coordinates [3,23]. Variants were filtered using gnomAD and a local population database (2379 exomes) for allele frequency of < 0.001 and were interpreted by following the ACMG guidelines [24] to determine the likely causal variants. Although protein-truncating variants may exert their pathogenic effect at the level of the final transcript via NMD, we have chosen to exclude them because it is very difficult to disentangle their effect on protein from that on RNA. Variants were highlighted as candidate transcript-deleterious variants (TDVs) if they were compatible with pathogenicity potential in terms of frequency and segregation, and involved one of the following six categories: (a) canonical splice donor or acceptor sites (the first and last 2 bp of each intron), (b) the first or last base pair of an exon, (c) non-canonical splice site intronic variants, i.e., other than the first and last 2 bp of an intron, (d) coding exons other than the first or last base pair (regardless of whether the resulting missense is synonymous or nonsynonymous), (e) UTR (5′ and 3′), and (f) promoter/enhancer elements. Variants in categories c, d, e, and f were only considered if no alternate candidate variants were identified. A small subset of cases for which no candidate variants were identified, were subjected to RNA-Seq (see below).

RTPCR
Variants suspected to be deleterious at the transcript level were interrogated by RTPCR using cDNA-specific primers and RNA from blood (LCL or PAXgene) and/or skin fibroblasts. When the index who is homozygous for variant was unavailable, we attempted to test the obligate heterozygous parents. RTPCR followed a standard number of 35 cycles and 2000 ng of RNA as a template. If this standard protocol resulted in a visible band on a gel, the gene was considered "expressed." If additional cycles or higher amount of RNA were needed, the gene was considered "poorly expressed," otherwise, the gene was labeled as "not expressed." The products were analyzed by Sanger sequencing directly and if there was evidence of multiple products, cloning was pursued followed by Sanger sequencing. In cases where no evidence of aberrant splicing was identified, we attempted quantifying the transcript using q-RTPCR.

RNA-Seq and computational analysis
RNA samples of the subjects were prepared at KFSHRC and sent to the KAUST core lab for RNA sequencing. The quality of each RNA sample was determined based on its RNA Integrity Number (RIN) using Agilent 2100 BioAnalyzer. Those samples that scored RIN < 6.0 were not considered further. The sequencing libraries were prepared using Illumina TruSeq Stranded mRNA. Paired-end 150 bp reads were generated on Illumina NovaSeq6000. GTEx RNA-Seq samples [25] for blood and skin tissue types were downloaded from the Database of Genotypes and Phenotypes (dbGaP) and transformed into the fastq format using SRA Toolkit (https://www.ncbi.nlm.nih.gov/sra/ docs/toolkitsoft/). Samples with RIN < 8.0 were not included in our GTEx controls. RNA-Seq reads from both patients and GTEx were also aligned to hg38 (GENCODE 25) using STAR 2.6 [26] with the two-pass option. Only reads mapped to chromosomes 1-22 and X were considered. SAMtools [27] and BEDTools [28] were applied to the BAM files to quantify the occurrence of annotated and unannotated splicing junctions, as well as to count nonsplit reads mapped to intronic regions. Splicing junctions with < 5 read supports were filtered out. To quantify the transcript abundance levels, RNA-Seq reads were also mapped to the reference transcript sequences for hg38 (GENCODE 25) using Kallisto [29]. Using the generated BAM files and transcript abundance levels, error-free normal transcript abundance levels were estimated with the omega quantification [30]. Briefly, omega computes an adjusted count per million (CPM) value for each coding gene g, ω g , as follows: where T g is the set of mRNA transcripts for gene g, w t is the rate to express annotated, normal transcript t based on the RNA splicing data, and x t is the CPM level of transcript t. Thus, low values of ω g can indicate low abundance outliers or splicing outliers that escaped NMD.
From the GTEx data, RNA-Seq datasets of the "Cells -EBV-transformed lymphocytes" and "Cells -Cultured fibroblasts" tissue types were selected as the control for cases derived from blood and skin tissue types, respectively. To ensure the use of an appropriate set of samples in control for each patient, we measured the median of the ω g values of each coding gene for all the blood and skin tissue types in the GTEx datasets and confirmed that the selected tissue type gave the highest level of correlation with the patient data.
Based on the second percentile of the ω g values in the corresponding control, two scores, α g and β g , were measured to analyze the severity of transcriptional aberrations in gene g for each patient. Let ω g (i) and ω g (k i ) represent the value of ω g for patient i and for the second percentile value of the corresponding control k i , respectively. Then, score α g (i) was computed as follows: where ε is a small factor set to 0.001 to avoid division by zero. The α score measures the significance of genes as the low abundance outliers or splicing outliers. The other score, β g , was derived by first computing the fraction of normal transcripts, ρ g which is defined to be the ratio of ω g to X t∈T g x t : Given this definition, score β g (i) for patient i can be expressed as: Thus, the β score measures the significance of likeliness to express transcripts with splicing error. A high alpha score means that the abundance level of normal transcripts of a given gene is lower compared with a lower-end abundance of the same gene in the control set. Similarly, a high beta score means that the fraction of the normal transcripts of a given gene is lower compared with a lower end of the same gene in the control set. With these scores, each coding gene g was selected as a causative candidate for each patient i if all of the following criteria are met: For all the other patients j with RNA samples being the same cell type, α g (i) < α g (j).
Note that criterion 2 is based on 11 RNA-Seq datasets from RNA samples with RIN > 8.5 (4 from fibroblasts and 7 from LCL) and that this criterion was set specifically for comparison based on a small number of patients. To visualize splicing events, BAM files were first converted into the hg19 coordinate using CrossMap [31] and Integrative Genomic Viewer [32] was used. We have also attempted to compare our method to previously published methods as explained in Additional file 1: Supplemental file 1.

Quantifying the contribution of transcript-deleterious variants
Our cohort included 5647 families with suspected Mendelian phenotypes (Fig. 1). The vast majority (94% and 91%) are consanguineous and multiplex, i.e., > 1 affected member, respectively. A likely causal variant was identified in 2438 of these families (n = 1807 nonredundant variants), 272 (15%) of which represent TDVs (TDVs are listed in Additional file 2: Table S1, and their population frequencies are summarized in Additional file 3: Table S2). One limitation of this estimate is the potential for bias against the identification of more challenging classes of transcript-deleterious variants. Therefore, we decided to exploit the agnostic nature of positional mapping to derive unbiased estimate of transcript-deleterious variants. We singled out all families in which we were able to map their recessive Mendelian phenotype to a single locus (n = 157) since these lend themselves more readily to focused and thorough investigations to reveal the underlying variant including the most challenging ones. Indeed, each of these loci was thoroughly interrogated and this resulted in the identification of a likely causal variant (n = 148, two variants were observed in four cases) in 95.5% of cases (150 out of 157, Fig. 2 and Additional file 4: Table S3). The breakdown of variant classes within these loci shows that TDVs accounted for 18.9% of variants (28 out of 148), which suggests that the above figure of 15% may indeed represent an underestimate based on bias against more challenging transcript-deleterious variants. Interestingly, only 2% of the 18.9% are variants not expected to be captured by WES (> 50 bp from the nearest exon), which suggests that, at least in the case of recessive phenotypes in consanguineous families, the overwhelming majority of causal variants are captured by WES pipelines.

RNA as a tool to solve "negative" WES cases
In order to investigate the contribution of RNA analysis to solving "negative" WES cases, we recruited 155 cases for which clinical WES did not reveal a likely causal variant (Fig. 3, Table 1 and Additional file 5: Table S4). A likely causal variant was subsequently identified in 60.6% (88 unique variants in 94 out of 155 cases). Additional file 5: Table S4 shows that many of these cases harbored a likely deleterious variant in a gene that was novel at the time of clinical reporting, i.e., cases unlikely to have benefited from RNA analysis. TDVs accounted for 22.7% of all identified variants (20 out 88 unique variants). As expected, class (a) variants (those affecting the canonical splice sites) were under-represented (21% vs. 64% in the original cohort, see below) since these would have been readily flagged at the time of reporting. On the other hand, the more challenging classes were overrepresented (79% vs. 36% in the original cohort, see below). These include a very deep (+ 335) variant in ABCB4 causing cholestatic disease in all available affected members of an extended family (see Additional file 6: Figure S1). Thus, the hypothetical diagnostic yield of RNA-Seq in the setting of a "negative" WES is 13.5% (21 out 155 cases), at least in the setting of recessive phenotypes. To test this empirically, we set out to investigate the six cases (no RNA was available from the seventh case) whose autosomal recessive Mendelian phenotypes map to single loci, and have "negative" WES, using RNASeq. First, we aimed to establish the sensitivity of our RNA-Seq pipeline and comparing it to previously published pipelines by testing five cases with established transcript-deleterious variants and found that 100% were correctly called, i.e., the mutated gene was chosen among the top or only candidate gene for each of the five cases (Additional file 6: Figure S2 and Additional file 7: Table S5) as follows: 10DG0840 (a case of Troyer syndrome and a class (d) variant in SPG20, see Additional file 2: Table S1): The RNA-Seq-based prediction generated 167 candidates. Among them, SPG20 was ranked 157th on the alpha score and 25th on the beta score. With the autozygome coordinate-based filtering, SPG20 was found to be the only candidate. 11DG0165 (a case of congenital muscular dystrophy and a class (c) variant in POMT2, see Additional file 2: Table S1): The RNA-Seq-based prediction generated 195 candidates. Among them, POMT2 was ranked 2nd on the alpha score and 13th on the beta score. With the autozygome coordinate-based filtering, POMT2 was found to be the top among the 14 final candidates. 15DG2154 (a case of microcephalic primordial dwarfism and a class (c) variant in DONSON, see Additional file 2: Table S1): The RNA-Seq-based prediction generated 324 candidates. Among them, DONSON was ranked 200th on the alpha score  Table S1): The RNA-Seq-based prediction generated 129 candidates. Among them, PEX19 was ranked 17th on the alpha score and 120th on the beta score. With the autozygome coordinate-based filtering, PEX19 was found to be the only candidate. 16DG1620 (a case of osteopetrosis and a class (c) variant in CLCN7, see Additional file 2: Table S1): The RNA-Seq-based prediction generated 112 candidates. Among them, CLCN7 was ranked 79th on the alpha score and 42nd on the beta score. With the autozygome coordinate-based filtering, CLCN7 was found to be the top of the three remaining candidates.
To test this empirically, we set out to investigate six cases whose autosomal recessive Mendelian phenotypes map to single loci with "negative" WES and for whom RNA sources were available. While no likely causal variant was identified in five of these cases, RNA-Seq analysis of blood-derived RNA on patient 15DG2234 (microcephaly, abnormality of the cerebral white matter and intellectual disability) highlighted KCTD3 as the only likely candidate within the candidate autozygome (139 candidates were highlighted prior to the autozygome filter). Indeed, subsequent RTPCR confirmed that this pattern was created by a partial exonic deletion of 38 bps (NM_016121.3:c.1036_1073del:p.(Pro346Thrfs*4)) that was missed by WES and led to the creation of an additional aberrant band in which the involved exon was completely skipped (Additional file 6: Figure S2).

The landscape of transcript-deleterious variants in Mendelian diseases
Additional file 2 Table S1 lists all likely causal TDVs (272 unique variants) identified through our detailed analysis of 5647 families with suspected Mendelian phenotypes. The breakdown of the six classes of TDVs is summarized in Fig. 4 and is described below in detail.
(1) Class (a) variants: a total of 175 (representing 64.3% of all TDVs) unique variants involving the canonical intronic splice sites were identified. RTPCR data were available for 93 (4 from literature and 89 from this cohort). Although this class is generally classified as "loss of function," we note that several resulted in in-frame rather than frameshift indel (Additional file 2: Table S1). More concerning was the finding of canonical splicing variants in established disease genes with no resulting phenotype, i.e., non-penetrance (Additional file 8: Table S6). For example, the variant NM_001172818.1:c.300 + 1G > A in PGM1, was identified in homozygosity in individuals with no phenotype despite its deleterious effect on splicing (confirmed by RTPCR), which explains its high population frequency. Similarly, we have identified an individual with ambiguous genitalia who is homozygous for LRP4 (NM_002334.2:c.796+2T>C) but lacks all features of established LRP4-related syndromes. On the other hand, the finding of ARHGAP31 (NM_020754.2:c.539+1G>A) in asymptomatic individuals despite its deleterious effect on splicing (confirmed by RTPCR) can be attributed to the fact that previously reported mutations in this gene were proposed to be gain-of-function. The SBDS founder variant (NM_016038.2:c.258+2T>C) is also worth highlighting since this is the most commonly reported variant in Schwachman-Diamond syndrome (SDS) and yet we identified it in homozygosity in at least three individuals who lack SDS features. Upon further investigation, we found that this is a leaky splicing variant and that all previously reported SDS patients were compound heterozygous for a more severe truncating variant (Additional file 8: Table S6). Finally, we note the unusual result of normal RTPCR on a patient with Marfan syndrome and a de novo FBN1 (NM_000138.4:c.6872-1G>A) variant, which suggests that the effect of splicing may be tissue-specific (Additional file 8: Table S6). (2) Class (b) variants: RNA was available for 11 of the 13 variants involving the first or last bp of an exon, and in each of these cases an aberrant transcript was observed. This includes a variant in TMX2 in case 19DG2556 with microcephaly and lissencephaly, which represents an independent confirmation of the very recently described TMX2-related disorder [33]. We suggest that this class should be combined with class (a) as canonical splice site variants. This is further supported by the consistently pathogenic prediction these variants received in silico (see below). (3) Class (c) variants: The range of non-canonical splice site intronic variants was remarkable ranging from 3 bp to 649 bp deep in our cohort. Since current capture techniques in WES usually capture < 50 bp of the flanking intronic sequence, we divided class c variants into those amenable for capture by WES, i.e., within 50 bp (n = 65) and those that are not, i.e., more than 50 bp from the nearest exon/intron junction (n = 8). The challenging nature of these variants is amplified when the phenotype is atypical (Additional file 8: Table S6). For instance, the NM_020751.2:c.1167-24A > G and NM_020751.2:c.695-8 T > G variants in COG6 resulted in a phenotype sufficiently different from CDG that it is listed in OMIM as a separate disorder, i.e., Shaheen syndrome [34]. Similarly, we note the surprising finding of NM_182894.2:c.456-6C > G variant in VSX2 causing ectopia lentis rather than the established microphthalmia phenotype, which supports a previously published case report [35]. Perhaps most surprising was the finding of a homozygous NF1 variant (NM_001128147.2:c.586 + 5G>A) in a young child with juvenile myelomonocytic leukemia but the parents did not have any manifestations of neurofibromatosis (Additional file 8: Table S6). In an example of the challenge in proving the pathogenicity of this class of variants, we note that the previously published COL6A2 (NM_001849.3:c.1459-63G>A) variant, which fully segregated with the expected phenotype of Ullrich muscular dystrophy, did not show abnormal RTPCR pattern suggesting the possibility of a tissue-specific splicing effect. (4) Class (d) variants: A total of 6 (2 exonic variants (excluding the first and last bp) were tested by RTPCR and found to be indeed transcript-deleterious. These include 3 that predict silent changes at the protein level. (5) Class (e) variants: Only three UTR (1.1%) variants were identified in the entire cohort (two 3′ UTR and one 5′ UTR mutation), suggesting their rarity, which is further supported by our unbiased analysis of families that map to single loci (Additional file 2: Table S1). (6) Class (f) variants: Only two variants (0.73%) were identified in the promoter or other regulatory regions of genes. The first is a TATA box mutation in UGT1A1 (NM_000463.2:c.-41_-40dupTA) [36]. The second is a deletion (chr16:1480850_1483950del) in a patient with unexplained diarrhea, and this deletion was reported very recently to be the cause of chronic diarrhea secondary to its regulatory effect on PERCC1 [37].

The role of in silico prediction
We have applied four (SpliceAI, TraP-score, S-CAP-score, and CADD) [38][39][40][41] in silico prediction tools to all variants that have been empirically tested for their transcriptdeleterious effect in this cohort (n = 169, including 4 from the literature) (Additional file 9: Table S7). To simplify the analysis, we used the default cutoff value suggested in each of these tools to classify variants as "deleterious" or "non-deleterious." We found that none of these tools achieved > 71% sensitivity in predicting the pathogenic nature of the variants we tested at the RNA level (SpliceAI (65%), TraP-score (63%), S-CAP-score (61%), and CADD (71%) and that at least one of the four tools failed to predict the pathogenicity of 25% of the variants. However, the yield of these tools was widely different between the different classes. Only 8% of class (a) variants compared to 44.8% of the other classes combined (18% for class b, 46% for class c, 33% for class d, 10% for class e, neither of the two class f variants was empirically tested in this study) received inconsistent prediction in silico. In agreement with our suggestion that class (b) variants should be lumped with class (a) (for the purpose of assigning a canonical splicing score on the ACMG classification), we show that in no instance did the four tools disagree on classifying these variants as pathogenic.

In search of tissue-specific aberrant transcripts
In addition to the 169 variants for which patient RNA material was available and tested, we also tested the expression of the genes containing the remaining 103 TDVs, in blood, skin, and urine (renal epithelial cells, see "Materials and methods") derived RNA, since these are the readily available sources of RNA clinically. Please note that blood-derived RNA was extracted from PAXGene and/or LCL and these are listed separately in Additional file 2: Table S1. We found that 84.1% (195 out of 232) of the tested genes are expressed in the blood-derived RNA, 85.8% (199 out of 232) in fibroblast-derived RNA and 90% (209 out of 232) in the renal epithelial cells-derived RNA. The majority of genes were expressed in all three sources of RNA (75.5%), while only 2.6% (6 out of 232 genes) were not expressed in any of these sources. We were able to detect the aberrant transcript associated with TDVs in controls who lack the respective variant in only 11/169 (6.5%) of those that were empirically tested. In all these instances, the aberrant transcript was much less abundant in controls, and in none of these cases was the aberrant transcript listed in Ensembl or UCSC Genome Browser. In the 12 patients for whom we had both skin-and blood-derived RNA, we found no instance of an aberrant transcript that was solely present in one but not the other whenever the gene was expressed in both (n = 11, Additional file 2: Table S1). However, we did encounter two instances of pathogenic variants that did not reveal aberrant transcripts in blood-derived RNA (FBN1:NM_000138.4:c.6872-1G>A and COL6A2:NM_ 001849.3:c.1459-63G>A). We conclude that the deleterious effects of these variants may be tissue-specific.

Discussion
RNA has long been exploited to investigate the effect of variants suspected to alter the final transcript. However, unbiased sequencing of all transcripts in an RNA sample (RNA-Seq) was only possible recently thanks to technological advancements. It is not surprising, therefore, that there is much enthusiasm about RNA-Seq as a supplemental test to genome sequencing to diagnose Mendelian conditions, among other indications. Although the effect of noncoding variants with GWAS significance on splicing is increasingly appreciated, the goal of this study was to study variants only in the context of Mendelian diseases since this is the area that stands to benefit most from the current applications of RNA-Seq [10,19,[42][43][44]. Unlike the relatively homogeneous DNA, RNA is highly heterogeneous spatially and temporally. In addition, there is marked variability in the abundance of different transcripts even in a given cell. Finally, the effect of pathogenic variants on RNA is far more nuanced than the simple "present" or "absent" that characterizes DNA variants (even mosaic DNA variants are either present or absent in a given cell). These factors make the use of RNA-Seq in clinical diagnostics challenging and highlight the need for empirical data, e.g., mapping splicing variations in clinically accessible tissue, that inform the development of computational tools that unlock the full potential of this technology [45,46]. This study is an attempt to contribute to the literature on RNA-based diagnosis of Mendelian diseases. The large volume of our cohort (2438 molecularly characterized Mendelian families) spanning 1807 Mendelian genes, and our unique resource of families that map to single loci and thus offer an unbiased window into the breakdown of disease-causing variants in Mendelian diseases, allowed us to draw several conclusions. First, we estimate the contribution of TDVs to be at least 15% of the overall Mendelian mutation pool, although our unbiased estimate based on single locus families suggests a higher contribution of 18.9%. This has important implications because it suggests that RNA-Seq has a great potential in solving Mendelian phenotypes. Unfortunately, it is not possible to compare this hypothetical yield to what has been achieved in the few reported studies since those studies heavily focused on cases that could not be diagnosed by WES or WGS [10,11], including a recent study involving 94 individuals with undiagnosed rare diseases that suggested a diagnostic rate of 16.7% [19]. This yield is similar to our estimated yield (13.5%) based on extensive positional mapping and RNA analysis of 155 Mendelian cases that could not be diagnosed by WES. Second, our finding of aberrant transcripts not described in databases that are detected in controls, despite their very low frequency, recapitulates the challenge described in previous studies in identifying the signal from noise when interpreting RNA-Seq. We suggest that while greater in magnitude, this challenge is no different in principle from the challenge of identifying the candidate causal variant in WES/WGS, and that filters that improve the signal/noise ratio are even more acutely needed in RNA-Seq. For example, we show in this study that the use of autozygome coordinates drastically reduces the search space in RNA-Seq (up to a factor of 300 in one case). While we acknowledge this filter is not always applicable, it should be pursued even in the absence of clear history of consanguinity since its integration into existing WES/WGS is straightforward as has been shown before [3]. Third, despite the significant investment in the development of in silico prediction tools, these remain far from perfect and our data clearly show that at least 25% of transcript-deleterious variants would be missed by tested tools. This suggests that these tools cannot replace RNA-Seq, which will likely become a standard clinical test for cases with negative WES/WGS. Fourth, and reassuringly, our data also seem to alleviate concerns about access to the relevant tissue since only < 10% of the tested genes were not expressed at all in the three sources of RNA available to us. Whenever the gene was expressed, we were able to demonstrate the effect of splicing in at least one of the two sources of RNA, with only two instances where a clearly pathogenic splicing variant did not result in aberrant transcript in the only tissue available for the respective patient, i.e., blood, despite abundant expression. It should be emphasized here that the overwhelming majority of the tested variants involved brain pathologies in their phenotypic expression. Fifth, we show several instances of abnormal splicing with no resulting phenotype as well as normal splicing with resulting phenotype for well-established disease genes. The apparent non-penetrance in the former scenario could be alternatively explained by a tissue-specific effect, which could also explain the latter scenario. Fortunately, these appear to be the exception; however, they are useful reminders of the expected limitation of RNA-Seq on clinically accessible samples. Sixth, although our study did not specifically aim to compare splicing to other classes of variants, we think that the examples we encountered with respect to the phenotypic expression of homozygous vs compound heterozygous regulatory variants is noteworthy. This phenomenon, first described in the context of thrombocytopeniaabsent radius (TAR) syndrome, has only rarely been invoked since, e.g., SNORD118-related cerebral microangiopathy leukoencephalopathy with calcifications and cysts [47], and TXNL4A-related Burn-McKeown syndrome [48]. We have previously shown that a non-canonical splice-site variant in DONSON causes microcephalic primordial dwarfism when inherited in trans with a hypomorphic variant, but results in an embryonically lethal microcephaly-micromelia syndrome when homozygous [49]. Here, we show that homozygosity for the most common disease-causing mutation in SBDS is not sufficient to cause SDS and that its inheritance in trans with a more severe mutation seems necessary. This calls for caution in inferring pathogenicity of a previously reported and confirmed pathogenic variants depending on their zygosity, and we suggest that regulatory and splicing variants may be particularly prone to this phenomenon.
In conclusion, we report the largest cohort of Mendelian phenotypes with comprehensive analysis of their underlying transcript-deleterious variants. The lessons learned from this cohort expand our knowledge of this class of variants and provide much needed empirical data for the clinical implementation of RNA-Seq as a promising supplemental tool to genome sequencing.