In order to explore the extent to which mature RNA molecules and pre-mRNA molecules form long and stable dsRNA structures, we employ a cross-species whole transcriptome approach, studying the full transcriptome of 49 organisms included in the Ensembl database [21], from yeast to human (Additional file 1: Table S1). Altogether, we analyzed pre-mRNA (genomic sequence from the beginning of the first exon to the end of the last exon, including introns) and the much shorter mature RNA sequences for 724,071 different genes (14,777 per organism, on average), covering 21.7 Gbp.
Despite important recent technological advancements, predicting the detailed structure of RNA molecules, including structural motifs, protein-binding regions, and RNA-RNA interactions, is yet a major challenge. Numerous computational tools are available for predicting RNA secondary structures, but their reliability in predicting full-length molecule structure is limited [22]. Thus, full understanding of the RNA folds requires intricate experimental approaches, currently inapplicable at large scale [23]. However, here, we are not interested in the full and accurate structure, but rather focus on long dsRNA stems, > 40 bp long. These substructures are easily detected by standard sequence alignment tools, e.g., BLAST [24]—if a long region of a molecule is highly similar to the reverse complement of another region in the same molecule, these two regions are likely to pair together and form a long and stable dsRNA. We thus use BLAST to align each of the mRNA and the pre-mRNA sequences to itself, and count the number of reversely oriented duplicated sequences (inverted duplicated sequences, IDS) as well as the total number of nucleotides involved. BLAST is not a perfect predictor of dsRNA stems, as it does not take into account the different pairing energies of A:T and G:C pairs and ignores the binding energy of G:U pairs. However, the size of the database we searched prevents using the much slower RNA folding algorithms, and for long and nearly perfect stems, BLAST provides a reasonable approximation. As a natural control, we look at the number of same-strand hits, showing similar sequence identity for regions on the same strand (tandem duplicated sequences, TDS), which are not relevant to the secondary double-stranded structure (Fig. 1).
One immediately notices a global depletion of IDSs in mRNA molecules, compared to the control TDSs (see Fig. 2). Looking at all organisms combined, one finds only 6525 IDSs in mRNAs covering 786 kbp, compared with 42,946 TDSs covering 3.01 Mbp (p < 1e−100; see Additional file 1: Table S2). This depletion indicates a strong selection against long and stable duplexes in the mRNA, consistent across most organisms. Furthermore, the selection against dsRNAs is much weaker in the pre-mRNA, suggesting it is driven by cytoplasmic processes (57.5M IDSs covering 1.26 Gbp, compared with 61.3M TDSs covering 1.51 Gbp; p < 1e−100). Limiting the gap between the 2 duplicated sequences to 2000 bp does not change the results, qualitatively (Additional file 2: Figure S1). This is consistent with the notion of avoidance of innate immunity activation by endogenous RNA, as viral dsRNA sensors are active in the cytoplasm [25]. Note, however, that stronger depletion of dsRNAs in pre-mRNAs is observed in a few species. For example, yeast introns are uncommon [26], and thus, the pre-mRNA is almost identical to the mRNA. Similarly, while honeybees do have an ADAR enzyme [27], very few retroelements have been identified in the honeybee genome [28], which could account for the lower number of dsRNAs in introns.
Long and nearly perfect duplexes are the prime candidates to provoke an innate immune response, as they resemble viral dsRNAs [25, 29]. Consistently, we observe a stronger genomic depletion for longer and highly base-paired structures (Fig. 3 and Additional file 2: Figure S2). There are only 246 almost perfectly base-paired IDSs (putative helices with > 96% identity), compared with 3436 almost perfect TDSs (32,245 bp vs. 425,753 bp) in all organisms (p < 1e−100). Moreover, depletion of almost perfect hits is noticeable even for pre-mRNAs (1.07M TDSs vs. 0.78M IDSs (p < 1e−100); 205 Mbp vs. 81 Mbp, respectively; Additional file 2: Figure S2 and Additional file 1: Table S3). Similarly, strong depletion is observed for long (> 300 bp) IDSs in mRNAs, with only 195 structures covering 91,310 bp, compared with 3915 TDSs covering 1.41 Mbp (p < 1e−100). In pre-mRNA molecules, one finds only 1.57M long IDSs compared with 2.31M long TDSs (p < 1e−100; 401 Mbp vs. 608 Mbp, respectively) (Additional file 2: Figure S2 and Additional file 1: Table S4). Finally, IDSs that are both long and almost perfectly matching are extremely rare, only 4 such examples are present in mRNA sequences of all organisms examined (manual inspection suggests these are unreliable), compared with 258 such TDSs (p = 2.6e−71; 54,479 vs. 144,989 structures in pre-mRNA) (Fig. 3 and Additional file 1: Table S5). We conclude that long and nearly perfect dsRNA structures are almost nonexistent in mature RNAs and selected against even if they are present only in the pre-mRNA molecules. A possible explanation for the depletion in pre-mRNA is that regions annotated as introns might actually be mis-annotated exons (constitutive or alternative), especially in less-explored transcriptomes. In addition, there might be a selective pressure even against intronic dsRNAs due to the occasional intron retention as a result of imperfect splicing. Although rare, dsRNAs present in these aberrantly spliced transcripts may trigger the immune system and therefore are selected against.
To further support the idea that depletion of IDSs within transcripts is related to the potential risk of endogenous dsRNAs, we study the expression level of all IDSs identified within RefSeq human transcripts in a pool of 30 RNA-seq GTEx [30] samples, originating from 15 different tissues (Additional file 1: Table S6). We find that the expression of IDSs negatively correlates with their length and identity (Fig. 4a, b), as expected if the driving selective disadvantage relates to the expressed RNA molecules. To exclude the possibility of reverse-transcription artifacts related to the secondary structures, we verified that the regions that show no expression in mRNA-seq data are actually well-covered in total RNA datasets, suggesting that the reverse transcription does allow their amplification (Additional file 2: Figure S3). Note that expression here refers to that of the IDS region, which could be much lower than the expression level of the hosting gene, as the IDS is mostly due to intronic sequences.
The above results suggest that the main suppressor of innate immune response that may be triggered by endogenous double-stranded RNAs is a tight purifying genomic selection. However, the balance between the continuous introduction of new putative dsRNAs, mainly due to the proliferation of repetitive elements, and the pruning of dangerous dsRNAs by purifying selection may lead to a residual number of potentially dangerous structures. In fact, we do see that most dsRNA structures reside in repetitive sequences (counting unique dsRNA genomic nucleotides overlapping known repeats: 531/786 kbp in mRNAs and 707/1260 Mbp in pre-mRNAs, for organisms with repetitive elements annotation; Additional file 1: Table S7). Furthermore, the depletion of IDSs in mRNA (but not pre-mRNA) is stronger for dsRNAs not associated with repetitive elements (Additional file 2: Figure S4). These observations are consistent with the view that dsRNAs associated with repetitive elements are being continuously added to the genome, and the purifying selection against them is still ongoing, maintaining the number of long and nearly perfect self-dsRNAs under control.
It was suggested recently that editing by ADAR1 plays an important role in preventing the activation of cytosolic response [7,8,9,10]. Using the abovementioned pool of GTEx tissues, we studied the editing levels in all 197 human IDSs within RefSeq transcripts that are both longer than 300 bp and with identity above 96% (Additional file 1: Table S8). Of these, in only 20 (10%), both arms of the IDSs are expressed at a level exceeding FPKM = 0.01 (to give perspective, FPKM = 1 corresponds, roughly [31], to 0.2–2 RNA molecules per cell). Seven of these are indeed edited strongly enough to bring the edited structure below the 96% identity cutoff (Fig. 4c). Only 2 IDSs are expressed at levels exceeding FPKM = 0.1, and both become much less base-paired by editing (one of these is presented in detail in Fig. 4d). Looking at another pool of 30 samples from the weakly edited muscle tissue (Additional file 1: Table S6), only 12 IDSs (6%) pass the FPKM = 0.01 cutoff, and in 5 of them, editing brings the identity below 96%. Not a single IDS is expressed stronger than FPKM = 0.1 in this pool (Additional file 1: Table S8).
We thus find that purifying genomic selection is the main contributor to protection against false activation of a cytosolic response to canonical endogenous transcripts appearing in a reference transcriptome, such as RefSeq. To the extent ADAR1 editing has any role in this protection, it is limited to a handful of putative dsRNA structures. Clearly, the vast majority of ADAR1 editing activity is irrelevant for this protective role. Furthermore, many of the essential targets of ADAR1 reside, likely, out of canonical transcripts.