Most expressed human snoRNAs are produced from intron-embedded genes
To determine the tissue distribution of snoRNAs and their relative abundance within the human transcriptome, we sequenced total ribodepleted fragmented RNA from seven healthy human tissues (breast, ovary, prostate, testis, skeletal muscle, liver, and brain). Each tissue was sourced from 3 different individuals and sequenced using TGIRT-Seq methodology, which was shown to reliably quantify the abundance of different types of RNA in a same sample [23, 29]. Indeed, in general our ranking of the abundance of RNAs was in agreement with the Genotype-Tissue Expression (GTEx) estimates for protein-coding genes (Additional file 1 - Tables S1A-G) [34]. The clustering of the quantified transcripts of all detected biotypes supports the quality of our datasets. Indeed, despite the expected differences between individuals and variations in sample cell composition, we notice little variability between samples of same tissue (Additional file 1 - Figure S1). Using this sequencing method, we detected RNA (> 1 transcript per million (TPM) in at least one tissue sample) generated from 475 (50%) snoRNA genes out of a total 947 annotated human snoRNA genes (Additional file 1 - Table S2). This is consistent with the fact that most RNAs are poorly expressed and only a minority of the transcriptome is highly expressed (Additional file 1 - Figure S1), as we have previously reported [23]. The majority (433 out of 475 snoRNAs, i.e., 91%) of the expressed snoRNA genes are located in introns, while only 9% (42 out of 475 snoRNAs) are located in intergenic regions and thus likely expressed from an independent promoter (Fig. 1A). In contrast, 21% of all annotated snoRNAs are located in intergenic regions, suggesting that most annotated intergenic snoRNA genes are not expressed. Indeed, intergenic snoRNAs contribute only to 2% of the total snoRNA abundance, confirming the mostly intronic origin of human snoRNAs [35]. Interestingly, most expressed box H/ACA snoRNAs (67%) are found in protein-coding HGs while expressed box C/D snoRNAs do not show clear HG biotype preference (Fig. 1A). Variations in the number of snoRNA embedded in each HG are also observed between the two types of snoRNAs. The majority of box H/ACA snoRNAs (50%) are the only snoRNA embedded within their HG (Fig. 1A middle panel, mono-intronic HG), while the majority of box C/D snoRNAs (78%) are encoded with multiple snoRNAs in separate introns of the same HG (Fig. 1A right panel, multi-intronic HG). Together, these results indicate that the two types of snoRNA have distinct embedding preferences.
SnoRNAs are among the most abundant RNAs in the cell
To evaluate the relative contribution of snoRNAs to the transcriptome of the different human tissues, we compared their abundance to other RNA biotypes detected in each of the tissues examined. Overall, the highest percentage of expressed non-rRNA transcripts was detected within tRNAs where 84% of the annotated genes are expressed at least in one tissue, followed by the protein-coding genes and snoRNA genes (Additional file 1 - Table S2). The lowest proportion of expressed genes was detected in the snRNA and lncRNA biotypes, which put the snoRNAs at the interface between translation associated RNAs and RNAs associated with RNA processing and regulation. Comparison of the number of transcripts (in TPM) generated from each biotype indicates that tRNA genes generate the highest number of transcripts regardless of the tissues examined (Fig. 1B), which is in accordance with biochemical estimates [36]. On the other hand, the snoRNA and snRNA biotypes compete for the second place in the transcriptome in a tissue-dependent manner. In the tissues derived from reproductive organs, except for testis, the snoRNAs are more abundant than snRNAs, while the snRNAs are more abundant in the other tissues, with the highest relative proportion of snRNA abundance detected in testis (Fig. 1B). However, it is important to note that unlike snoRNAs, the snRNA transcripts are generated by only 24% of the annotated snRNA genes and are driven by only a few genes that each generate more than 1000 TPMs like 7SK and spliceosomal snRNA genes (Additional file 1 - Table S2, Figures S2B and S3B). In contrast, half of the annotated snoRNAs generate around 15–20% of non-rRNA transcripts which is half-way between the tRNAs at one extreme where 84% of the annotated genes generated 45% of transcripts and protein-coding RNAs where 73% of the genes generate only 5% of transcripts (Fig. 1B and Additional file 1 - Table S2). In general, box C/D snoRNAs are on average 3 times more abundant than box H/ACA snoRNAs across tissues (Additional file 1 - Figure S4A). This ratio represents a lower abundance difference than what was previously reported between the two snoRNA types [22, 27, 30, 31], which is likely explainable by the low structure bias approach we used. Nonetheless, both box C/D and H/ACA snoRNAs are mostly abundant to at least 1 TPM in all the studied tissues (Additional file 1 - Figure S4B), underlining the widespread importance of both snoRNA types in all human tissues. Overall, the abundance of most snoRNAs and tRNAs is more than 10 TPM in each tissue, whereas the abundance of other biotypes is mostly between 0 and 10 TPM (Additional file 1 - Figure S2). We conclude that on average snoRNA genes generate the highest diversity and number of non-rRNA transcripts after tRNAs in the human genome.
Tissue-dependent distribution of RNA accumulation identifies two snoRNA abundance classes
In most cases, variations of RNA abundance are often taken as a basis for gene regulation and tissue specificity. Accordingly, we examined the pattern of snoRNA abundance in the different tissues and compared it to that of other RNA biotypes. As with snRNAs and tRNAs, the cumulative abundance curves seen with snoRNAs are less variable between tissues than those observed with protein-coding RNAs and lncRNAs (Additional file 1 - Figure S3), highlighting the widespread distribution of housekeeping RNAs across tissues. Of note, the most extreme examples of tissue specialization were observed in the case of the genes coding for albumin (ALB) and haptoglobin (HP), which produce as high as 20% of all protein-coding transcripts in liver (Additional file 1 - Figure S3D). Similarly, most tissues express a very small number of lncRNAs except testis which is known for its permissive chromatin environment (Additional file 1 - Figure S3E) [37]. To enable direct comparison between the tissue distribution patterns of the different RNAs, we calculated the coefficient of variation (CV) for each RNA based on its abundance across the studied tissues (see “Methods” for more details). This metric allows us to numerically differentiate between the different degrees of tissue uniformity and enrichment of the different transcripts. Uniformly expressed RNAs are identified by low CV value, while tissue-enriched RNAs are identified by high CV value. Interestingly, comparison of the CV value of the different biotypes indicates that snoRNAs occupy a middle ground between the highly uniform tRNAs and snRNAs (CV < 125) and highly variable protein-coding RNAs and lncRNAs (CV > 125) (Fig. 1C). In general, the uniformly expressed biotypes like tRNA and snRNA display a single peak with a median CV of around 65. In contrast, the tissue-enriched biotypes like protein-coding RNA and lncRNA display a bimodal distribution of CV, which indicates the presence of two RNA subpopulations, the first peak around a CV of 65 and the other around 260. Like the tissue-enriched protein-coding RNAs and lncRNAs, snoRNAs include two RNA subpopulations, the main one peaking at a CV of 70. However, unlike these tissue-enriched RNAs, the right-most snoRNA peak is much smaller and centered around a CV of 180. This bimodal distribution of snoRNA CVs can be split into two snoRNA abundance classes separated by a CV threshold of 125 (Fig. 1C and Additional file 1 - Figure S5; see “Methods” for more information). Accordingly, we termed the snoRNAs with a CV < 125 “Uniformly expressed” or “UE” and snoRNAs with a CV > 125 “Tissue-enriched” or “TE.” Taken together, these results indicate that snoRNA abundance is at the interface between that of housekeeping RNAs and regulatory RNAs and that snoRNAs can be categorized into two distinct abundance classes.
The majority of tissue-enriched snoRNAs are enriched in brain and reproductive tissues
To understand the origin and distribution of the two snoRNA abundance classes, we followed the accumulation of each RNA of these two classes in the different tissues. As indicated in Fig. 2A, TE and UE snoRNAs generally clustered separately, validating the group identity of most RNA in each class. In addition, snoRNA abundance results in an adequate clustering of the tissues, once again confirming the validity of our datasets (Fig. 2A). Analysis of individual snoRNA distribution indicates that the majority of snoRNAs (n = 390) are uniformly expressed across tissues, whereas 85 snoRNAs are enriched in specific tissues (Fig. 2B). Overall, 47 TE snoRNAs are enriched in the brain and 38 are enriched in male or female reproductive tissues (Fig. 2A and Additional file 1 - Figure S6B). The brain-enriched snoRNAs include the previously established brain-specific snoRNA family SNORD115 (Additional file 1 - Figure S6A) [26], validating our CV-based classification of snoRNAs. Interestingly, four snoRNAs with known rRNA targets (SNORA81, SNORA19, SNORD36A, and SNORD111B) are highly enriched in both studied female reproductive tissues (Additional file 1 - Figure S6).
Most UE snoRNAs are present at an abundance of > 1 TPM in all the examined tissues and the majority has an abundance greater than 100 TPM whereas, in contrast, many TE snoRNAs have an abundance below 1 TPM in most tissues and the majority has an abundance that is less than 100 TPM (Additional file 1 - Figure S7A; Fig. 2C and D, left panel). Interestingly, most of TE snoRNA total abundance is attributable to their expression in the brain, whereas UE snoRNA total abundance is mostly attributable to their expression in reproductive tissues (except for testis) (Fig. 2D, right panel).
The snoRNA abundance classes exhibit distinct RNA levels, target preference, and conservation patterns
The discovery of two snoRNA abundance classes raises the question of whether the tissue-dependent expression of snoRNAs reflects functional specialization, different evolutionary origin, snoRNA type, or simple stochastic variation in expression. To differentiate between these possibilities, we first examined the variation in the abundance of the UE and TE classes in each of the different tissues. As indicated in Fig. 3A, all tissues display a broad spectrum of RNA abundance for both groups. Notably, we observe a loose and subtle inverse correlation between the abundance of the two groups: the tissue expressing the lowest amount of TE snoRNAs (Fig. 3A, breast tissue) appears to express the highest level of the UE class and vice versa. This suggests that the distribution of these two classes is not random but reflects a tissue-specific expression program that chooses between the housekeeping UE snoRNAs and the specialized TE snoRNAs. To determine whether the abundance classes are driven at least in part by snoRNA type, we then compared the proportion of box C/D and H/ACA snoRNAs in each class. As indicated in Fig. 3B, box C/D snoRNAs are well represented in both classes, but the greatest difference is observed with box H/ACA snoRNA, which are significantly more represented in the UE class (Fisher’s exact test, p < 0.001). These differences in abundance and snoRNA type appear to reflect a degree of functional specialization of the snoRNA abundance classes. Indeed, examining the type of RNA targeted by the snoRNA classes, we notice clear differences in the groups’ target preferences. In general, most targets of the UE class are in rRNA or snRNA, while most TE snoRNAs have no known canonical targets (Fisher’s exact test, p < 2 × 10−14) (Fig. 3C).
To further characterize the differences between the two snoRNA abundance classes, we compared the genomic organization and conservation of the genes in each class. Interestingly, we found that while the majority of UE snoRNAs are embedded in the introns of protein-coding genes, the majority of the TE snoRNAs are embedded in the introns of non-coding HGs (mainly lncRNAs) (Fisher’s exact test, p < 2 × 10−10) (Fig. 3D). The presence of snoRNAs in non-coding HGs also suggests a more modern evolutionary origin, since many lncRNAs show low sequence conservation [38]. Indeed, comparison of the gene conservation between the two snoRNA groups indicates that the UE class is much more conserved among vertebrates than TE snoRNAs (Mann-Whitney U test, p < 8 × 10−11) (Fig. 3E). TE snoRNAs also tend to be slightly more conserved across primates than vertebrates, but still significantly less than UE snoRNAs (Mann-Whitney U test, p < 4 × 10−9) (Additional file 1 - Figure S7B and Fig. 3E), highlighting the fact that some TE snoRNAs are potentially only conserved in humans. Altogether, these results indicate that the snoRNA abundance classes represent two groups of snoRNAs with distinct genomic context, conservation, expression patterns, and function.
The snoRNA abundance classes display different degrees of correlation with their HG depending on their HG function and characteristics
Since most snoRNAs in the human genome are embedded in introns [39, 40], it is presumed that their expression is linked to that of their HG. To further characterize the relationship between the abundance of snoRNAs and their HG, we thus calculated Pearson correlation coefficients (Pearson’s r or correlation of abundance) and their associated false discovery rate (FDR)-adjusted p value based on the abundance of the different snoRNA/HG pairs across tissues (Fig. 4A). Surprisingly, we find that 40% of expressed snoRNAs are either non-correlated (− 0.25 ≤ correlation of abundance ≤ 0.25) or anticorrelated (correlation of abundance < − 0.25) with the abundance of their HG transcripts, suggesting that not all snoRNAs are linked to the expression of their HG and supporting recent findings in other models [22,23,24]. Indeed, only 60% of snoRNAs are positively correlated with their HG (correlation of abundance > 0.25) (Fig. 4A). The difference in the correlation patterns is not linked to the abundance of snoRNAs as we find that anticorrelated snoRNAs are expressed at similar levels to non-correlated or positively correlated snoRNAs (Additional file 1 - Figure S8A). On the other hand, snoRNAs are generally more abundant than their HG, and the anticorrelated group in particular is significantly more abundant than their HGs compared to non- or positively correlated snoRNAs (Mann-Whitney U test, p < 0.05 and p < 0.0005, respectively) (Additional file 1 - Figure S9). In addition, we find that in general anticorrelated snoRNAs, regardless of their HG biotype, are more evolutionarily conserved than the other two correlation classes, which underlines their importance in the snoRNome (Additional file 1 - Figure S8B).
Since snoRNA abundance spans a wide and variable range of correlation with the HG abundance (Fig. 4A), we next wanted to uncover where the two snoRNA abundance classes occur within this broad range of correlation. Interestingly, the TE snoRNAs are much more likely to be correlated with the abundance of their HG transcripts than the UE class, which is represented all along the spectrum of correlation of abundance with the HG (Fig. 4B). Since UE and TE snoRNAs have distinct embedding preferences (Fig. 3D), we then re-examined the distribution of correlation of abundance, but this time by splitting the two snoRNA abundance classes based on their HG coding potential (Fig. 4C). Remarkably, non-coding HGs display clear positive correlation of abundance with either UE or TE snoRNAs, whereas protein-coding HGs exhibit a more complex abundance relationship with their embedded snoRNAs (Mann-Whitney U test, p < 4 × 10−15 and p < 1 × 10−5, respectively for UE and TE snoRNAs) (Fig. 4C). Overall, these findings suggest that snoRNAs are not always strictly linked to the expression of their HGs and that the snoRNA abundance classes display distinct patterns of correlation with their HG.
Given that snoRNA abundance classes displayed differences in their HG coding potential, we examined the possibility of a link between the snoRNA abundance patterns and the function of their protein-coding genes. Remarkably, we find that UE and positively correlated snoRNAs are predominantly embedded in HGs coding for ribosomal protein (Fisher’s exact test, p < 2 × 10−4) (Fig. 4E, left panel). On the other hand, most anticorrelated UE snoRNAs are located in genes coding for RNA processing and ribosome biogenesis factors (Fisher’s exact test, p < 0.05) (Fig. 4E, left panel). A similar pattern is observed in the few protein-coding HGs harboring TE snoRNAs, but the small number of HGs prevents accurate estimation of statistical significance (Fig. 4E, right panel). Following the same logic but with non-coding HGs, we explored the possibility that lncRNA functionality could be associated with a snoRNA’s correlation of abundance. Indeed, based on previous characterizations of lncRNAs [41], those with documented functions are significantly more positively correlated with the abundance of their embedded snoRNAs than lncRNAs with no reported function (Mann-Whitney U test, p < 2 × 10−21) (Additional file 1 - Figure S10). Altogether, these results indicate that correlation between the snoRNAs and their HG reflects at least in part the functional relationship of these pairs.
An important characteristic of snoRNA HG groups is the differing stability of their transcripts. Indeed, as reported in [42], mature transcripts encoding ribosomal proteins have a significantly lower decay rate than transcripts from other host gene groups (Mann-Whitney U test, p < 0.01) (Additional file 1 - Figure S11A). The abundance of the highly stable mRNA encoding ribosomal protein correlates better in general with their embedded snoRNAs than other HG types (Additional file 1 - Figure S11B). This increased correlation between the abundance of snoRNA and their host ribosomal protein mRNA may reflect high demand for ribosome synthesis and the need for coordinating the different steps of ribosome biogenesis. We do not observe strong links between the snoRNA-HG correlation of abundance and the stability of non-ribosomal protein host mRNA (Additional file 1 - Figure S11B). In addition, we note that the stability of the host mRNA will affect the ratio of snoRNA to HG transcript abundance within a tissue, but not their correlation of abundance since the decay rate is presumed to be constant across tissues.
To understand the basis of the difference in the abundance pattern of anti-, non-, and positively correlated HGs, we evaluated the susceptibility of HGs to NMD, bearing in mind that NMD could regulate HG transcript levels and thereby modulate the correlation of abundance. NMD-sensitive HGs were defined as such based on their previously determined response to the depletion of NMD factors (see “Methods” for more details) [21]. Interestingly, we find an increased susceptibility to NMD in anticorrelated non-coding HGs which are enriched in the UE snoRNA class (Fig. 4D, top panel). In contrast, we find no association with NMD in the TE class of snoRNAs. This is due to the lack of anticorrelated non-coding HGs of TE snoRNAs and also because non- and positively correlated non-coding HGs of TE snoRNAs are not subject to NMD (Fig. 4D, bottom panel), which is consistent with the fact that most TE snoRNAs are highly correlated with the abundance of their non-coding HG transcripts (Fig. 4C). Of note, NMD does not seem to modulate alone the correlation of abundance between protein-coding HGs and their embedded snoRNAs, as we observe no significant trend across correlations of abundance for either UE or TE snoRNAs (Fig. 4D). Taken together, these findings indicate that NMD may provide means to repress the expression of the HGs without affecting the expression of the embedded snoRNAs and thus enable the uncoupling of the HG and snoRNA expression.
Dual initiation of transcription uncouples the expression of the host and snoRNA genes and generates various snoRNA abundance patterns
Since it was recently suggested that promoters with dual transcription initiation sites may uncouple the expression of host and snoRNA genes [25], we compared the number of HGs with only one type of transcription initiation site (termed here simple-initiation sites (SI)) to those with dual-initiation (DI) promoters in both the UE and TE classes of snoRNAs. In addition to a canonical initiation promoter with pyrimidine/purine (YR) dinucleotide, DI promoters carry an additional intertwined polypyrimidine initiation site (YC or 5’TOP) [25]. We therefore defined HGs with DI promoters based on the presence of both YR and YC initiation sites within the HG, which were previously reported using Cap analysis gene expression sequencing (CAGE-Seq) [25] (see “Methods” for more details). All other HGs were considered as containing an SI promoter. Interestingly, we find that DI promoters are significantly more present in non- and anticorrelated snoRNA/HG pairs regardless of whether they are UE or TE (p < 6 × 10−4 and p < 2 × 10−5, respectively for UE and TE snoRNAs) (Fig. 5A). Furthermore, significantly more HGs with DI promoters than SI promoters are detected in the UE class of snoRNAs (Fisher’s exact test, p < 3 × 10−8) (Fig. 5B). This is consistent with the increased number of non- and anticorrelated genes detected in the UE class of snoRNAs (Fig. 4B) and supports the duality of transcription initiation as a means for uncoupling the HG and snoRNA expression. The initiation pattern-dependent uncoupling of either UE and TE snoRNA abundance is also supported by the increased susceptibility of HG transcripts produced from DI promoter to NMD when compared to those generated from an SI promoter (Fisher’s exact test, p < 2 × 10−14 and p < 0.01, respectively for UE and TE snoRNAs) (Fig. 5C). Strikingly, TE snoRNAs produced from DI and SI promoters have distinct tissue distribution patterns. The SI types are mainly enriched in brain and display positive correlation between the snoRNA and HG, whereas the DI types are highly abundant in breast and ovary tissues and are mostly non- or anticorrelated with their HG (Fig. 5D). Collectively, these results indicate that DI promoters present a way for cells to independently optimize the expression of the HG and snoRNA to meet the difference in the functional requirements of human tissues.