The RNAs of pseudogenes tend to have higher m6A levels than their cognate mRNAs
In order to study the functions and evolution of m6A on the RNAs of pseudogenes, we first ask whether the RNAs of pseudogenes are methylated differently from their cognate mRNAs. We previously developed m6A-LAIC-seq technology to quantify the m6A levels in transcriptome-wide scale [28]. Here, we took advantage of our previously published m6A-LAIC-seq data of GM12878, which is a human B-lymphoblastoid cell line sequenced deeply in 1000 genome project, as well as H1, which is a human embryonic stem cell line, to study the m6A levels of the RNAs of pseudogenes [28].
Since pseudogenes and their cognate protein-coding genes have similar sequences, cross-mapping can happen frequently when mapping the short-reads of next-generation sequencing to the genome, which will distort the expression patterns of pseudogenes. Here, we improved the mapping procedure to minimize the occurrence of cross-mapping by only allowing perfect matches or mismatches at known SNPs when we aligned the reads to hg19 human genome using HISAT2 [29]. Because the SNPs of GM12878 were mostly included in the known SNP database, the new procedure is more powerful for GM12878 and we mainly focused on this cell line for the downstream analyses. Compared with the conventional HISAT2 mapping procedure allowing 3 mismatches, the new procedure resulted in a reduced number of mapped reads for a large number of genes, reflecting the stricter mapping criteria of the new mapping procedure. In contrast, we observed similar numbers of pseudogenes with more or less mapped reads using the new mapping procedure, suggesting that the stricter new procedure cause re-assignment of mapped reads on pseudogenes and their homologous genomic loci compared with the conventional procedure (Additional file 1: Figure S1a-c). As shown in Additional file 1: Figure S1d and e, we observed dramatic read coverage changes in certain regions of pseudogene RPS7P1 based between the two mapping procedures, due to the re-assignment of reads from its protein-coding cognate RPS7 to pseudogene PRS7P1 in the new mapping procedure; in another case, some reads were re-assigned from pseudogene PPP1R14BP3 to its cognate protein-coding gene PPP1R14B in the new procedure.
We took advantage of the annotations from GENCODE, Vega, and psiCube [30] databases to compile a list of 12,143 one-to-one pairs of pseudogenes and corresponding protein-coding genes, including 4078 protein-coding genes with different numbers of pseudogenes (Additional file 2: Table S1). Among them, only 223 transcribed pseudogenes with reliable m6A levels were detected in the m6A-LAIC-seq data of GM12878 cell line, reflecting that most pseudogenes are not transcribed [31]. The cognate protein-coding genes of these pseudogenes were enriched in mRNA catabolic process, endoplasmic reticulum localization, and translational initiation, consistent with the previous report about transcribed pseudogene [9] (Additional file 1: Figure S2a, b). We found m6A levels of pseudogenes were higher than protein-coding genes, but lower than lincRNA and antisense RNAs (Fig. 1a).
We then performed a pairwise comparison between the m6A levels of pseudogenes and their cognate protein-coding genes. We found the m6A levels of pseudogenes were much higher than their cognate protein-coding genes, the differences were much more dramatic in processed pseudogenes than unprocessed pseudogenes (Fig. 1b, c). Consistently, it was more dramatic in intronless pseudogenes than pseudogenes with multiple exons (Additional file 1: Figure S2c, d; Additional file 2: Table S2). Similar results were observed in H1 hESC based on 190 pseudogenes with reliable m6A levels (Additional file 1: Figures S2e, f and S3a-c; Additional file 2: Table S3). These results suggest that the m6A of pseudogenes may get promoted in the evolution after separation from their cognate protein-coding genes.
Because m6A-LAIC-seq quantifies the m6A at gene level, we used the m6A-seq data of GM12878 [32] to test whether the elevation of m6A levels on pseudogenes was due to de novo formation of m6A peaks on pseudogenes or facilitated methylation on ancestral m6A sites. We found 65% of the pseudogenes having m6A peaks; in contrast, 52% of their cognate protein-coding genes had m6A peaks (P = 0.007, two-tailed chi-square test) (Fig. 1d). As shown in Fig. 1e, protein-coding gene DSTN does not have m6A-seq identified m6A peak and the m6A-LAIC-seq data show that the full-length RNAs are mostly in m6A negative fraction. In contrast, m6A-seq identified three m6A peaks on its pseudogene DSTNP2, and the m6A-LAIC-seq data showed that its full-length RNAs were greatly enriched in m6A positive fraction (Fig. 1e). Similar results were also observed for other pairs of protein-coding genes and pseudogenes in GM12878 (Fig. 1e) as well as H1 cell lines (Additional file 1: Figure S3d). The above results suggest that the m6A levels of pseudogenes are increased on novel sites, indicating de novo formation of m6A motifs on pseudogenes. As shown in Additional file 1: Figure S3e, there are pseudogene-specific m6A motifs in the pseudogene-specific peaks of DSTNP2 and NAP1L4P1, suggesting that mutations can generate novel m6A motifs on pseudogenes.
Convergent evidence support the m6A motifs on pseudogenes evolved under positive natural selection
If de novo formation of m6A motifs on pseudogenes is under positive natural selection, we would expect to see a higher probability of obtaining novel m6A motifs than losing m6A motifs on pseudogenes in the evolution. To test this, we used the m6A motifs on the antisense strand as a neutral background, because of equal natural mutation probabilities on both strands. As shown in Fig. 2a, a significantly higher proportion of pseudogene on sense strand than on antisense strand has gained m6A motifs, supporting positive natural selection of the gained m6A motifs on pseudogenes (P = 1.4× 10−35, two-tailed chi-square test).
To further elucidate the evolutionary landscape of the m6A sites on pseudogenes, we took advantage of INSIGHT (Inference of Natural Selection from Interspersed Genomically coHerent elemenTs) to estimate ρ, which represents the proportion of negatively selected m6A sites on pseudogenes. Our estimate of ρ of the m6A sites on pseudogenes was 0.2 ~ 0.3 (Fig. 2b), lower than the previously estimated ρ of 0.33~0.56 for the m6A sites on 3′UTRs of mRNAs [33], suggesting that the m6A on pseudogenes are less conserved. In addition, the estimated ages of the m6A sites, based on a phylogenetic tree of human and representative non-human primates (Fig. 2c), were much younger than the non-m6A “DRACH motifs” on pseudogenes, suggesting that the birth rate of m6A sites is faster than random drift in primates evolution (Fig. 2d). We then calculated the rejected substitution scores, which measure the nucleotide-level constraint [34], to study the relationship between m6A and site conservation. We found that the less conserved m6A sites had significantly higher m6A peak intensities than conserved m6A sites (Fig. 2e). These results are consistent with our finding that there is a selection pressure of gaining new m6A motifs on pseudogenes, further supporting that m6A sites on pseudogenes are positively selected.
We were then interested in whether evolutionarily older pseudogenes tended to have stronger m6A modification. Based on the above calculated ages of these m6A sites, we found that the older the “DRACH” motifs are, the higher proportions of them are within detectable m6A peaks in GM12878 cells (Fig. 2f). In contrast to that only 13% of human-specific “DRACH” motifs were detected in m6A peaks in GM12878 cells, 22% of human and rhesus shared “DRACH” motifs were within m6A peaks detected in the same cell line. Consistently, we found a significant positive correlation between ages and intensities of m6A peaks (Fig. 2g). These results suggest that generating “DRACH” motifs works as the critical first step, followed by long evolution for them to get strong m6A methylation. In addition, we found a significant positive correlation between the sequence divergences and m6A levels for processed pseudogenes (P = 2.5× 10−12, Pearson correlation) and unprocessed pseudogenes (P = 0.006, Pearson correlation) respectively (Fig. 2h, i). Considering that the peak intensities of m6A peaks are overall conserved between human and mouse [35], the above results support that pseudogenes especially processed pseudogenes were originally methylated at a low level but evolved to have high methylation level quickly. Interestingly, highly m6A methylated pseudogenes, especially processed pseudogenes, tend to have cognate protein-coding genes with lower dN/dS ratios (ratio of nonsynonymous substitution rate to synonymous substitution rate), which represent the genes with higher essentiality and stronger functional constraint in the evolution, suggesting that important genes are more likely regulated by the modification of their processed pseudogene transcripts (Fig. 2j, k). On the other hand, we found the highly m6A methylated processed pseudogenes tend to have cognate protein-coding genes with shorter 5′UTRs, longer CDSs and 3′UTRs, lower GC contents, suggesting that these gene features of cognate coding genes may also relate to the m6A evolution of processed pseudogenes (Additional file 1: Figure S4a-d). Similar results were observed in H1 cells (Additional file 1: Figure S5a-l).
m6A facilitates the cytosolic degradation of processed pseudogenes
In order to understand the evolutionary pressure for pseudogenes to become highly methylated, we explored the functional consequences of m6A methylation on pseudogenes. m6A has been reported to promote nuclear export [24] and cytosolic degradation [21,22,23], we first tested whether m6A affected the gene expression of pseudogenes using the input of m6A-LAIC-seq data. We found the highly methylated (m6A level ≥ 0.6) (P < 0.001, two-tailed Wilcoxon test) and moderately methylated (0.6 > m6A level ≥ 0.3) (P = 0.030, two-tailed Wilcoxon test) processed pseudogenes had dramatically lower gene expression than lowly methylated (m6A level < 0.3) processed pseudogenes (Fig. 3a) in GM12878 cells, suggesting that m6A promotes the degradation of processed pseudogenes. Nevertheless, we did not observe a similar result for unprocessed pseudogenes (Fig. 3b), suggesting that the m6A-induced cytosolic degradation might be a specific function to processed pseudogenes. To further confirm whether different expression of processed pseudogenes is due to m6A, we sequenced the input, cytoplasmic, and nuclear RNAs of control and METTL3-knockdown GM12878 cells. Expression differences among the processed pseudogenes with different categories of m6A levels were observed in control cells (Additional file 1: Figure S6a). However, the differences became not significant in METTL3-knockdown cells, indicating the expression differences of processed pseudogenes are dependent on m6A (Fig. 3c).
To test whether m6A facilitates the nuclear export of pseudogene RNAs, we took advantage of the CSHL RNA-seq of separated cytosol and nucleus RNAs in GM12878 and H1 cell lines [36]. We found the highly methylated processed pseudogenes had dramatically higher nuclear indexes (ratio of expression between nucleus and cytosol) than moderately methylated (P = 0.034, two-tailed Wilcoxon test) and lowly methylated processed pseudogenes (P < 0.001, two-tailed Wilcoxon test) (Fig. 3d) in GM12878 cell line, suggesting stronger nuclear retention or cytosolic depletion of m6A methylated processed pseudogenes. We did not observe a similar result for unprocessed pseudogenes (Fig. 3e), which is consistent with the above result that the m6A of processed other than unprocessed genes is negatively correlated with gene expression (Fig. 3a, b). Similar results for processed pseudogenes were also observed in control GM12878 cells (Additional file 1: Figure S6b), but the differences were not significant in METTL3-knockdown cells, indicating that the nuclear index differences of processed pseudogenes are due to m6A differences (Fig. 3f). Because m6A has been reported to promote the nuclear export other than nuclear retention of RNAs [21], m6A likely facilitates the cytosolic degradation of processed pseudogenes. We then found the m6A of processed pseudogenes were negatively correlated with the gene expression in cytosol (Fig. 3g), but not significant with the gene expression in nucleus (Fig. 3h), strongly supporting the role of m6A on degrading the processed pseudogenes in cytosol.
Interestingly, we also found that the m6A levels of processed other than unprocessed pseudogenes were negatively correlated with the gene expression of their cognate protein-coding genes (Fig. 3i; Additional file 1: Figure S6c) and positively correlated with the nuclear indexes of their cognate protein-coding genes (Additional file 1: Figure S6d, e), suggesting that the m6A on the RNAs of pseudogenes may affect mRNA expression of their cognate protein-coding genes. Similar results were observed in H1 cells (Additional file 1: Figures S7a-f; S8a-d).
m6A of processed pseudogenes disrupt their crosstalk with their cognate protein-coding genes
To further address whether m6A of pseudogenes affects the crosstalk between pseudogenes and their cognate protein-coding genes, we analyzed the RNA-seq data of B-lymphoblastoid cell lines (BLCL), the same cell type as GM12878, from 462 European participants of 1000 genome project [37, 38]. We found that 164 out of 223 pairs of pseudogenes and cognate protein-coding genes showed significant correlations (FDR < 0.05) of gene expressions, 74% of them are positive correlations, which is consistent with the positive regulatory roles of ceRNAs (Examples of positive correlations are shown in Fig. 4a, b).
To test whether the m6A of pseudogenes affects the crosstalk between pseudogenes and their cognate protein-coding genes, we compared the correlation coefficients of the gene expressions between pseudogenes and their cognate protein-coding genes in three categories of gene pairs with different m6A levels of pseudogenes. We found that the processed pseudogenes with higher m6A levels had significantly lower correlation coefficients (P = 0.018, two-tailed Wilcoxon test) (Fig. 4c), suggesting that m6A of processed pseudogenes disrupt their crosstalk with their cognate protein-coding genes. However, we did not observe a similar result for unprocessed pseudogenes (Fig. 4d).
Experimental validation of the m6A on two processed pseudogenes reduce their crosstalk with their cognate protein-coding genes via promoting the decay of pseudogenes
To experimentally test whether the m6A of pseudogenes affects the expression of their cognate protein-coding genes, we selected two representative processed pseudogenes DSTNP2 and NAP1L4P1 for further validation in GM12878 cell line.
First, we tested whether DSTNP2 can regulate its protein-coding cognate DSTN. We found knockdown of DSTNP2 by siRNA significantly down-regulated mRNA as well as the protein expression of DSTN (Fig. 5a, b; Additional file 1: Figure S10a), while overexpression of DSTNP2 significantly upregulated the expression of DSTN (Fig. 5c), consistent with our observation that their gene expressions were positively correlated in a population of B-lymphoblastoid cell lines (Fig. 4a). In order to test whether DSTNP2 modules DSTN via ceRNA mechanism, we first confirmed that knockdown of DSTNP2 could promote the degradation of DSTN (Fig. 5d). We then predicted the miRNA binding sites and found 6 miRNAs potentially targeting both DSTNP2 and DSTN (Additional file 1: Figure S9a). We selected miR-362-5p for experimental validation due to their higher expression level according to ENCODE miRNA-seq data of GM12878 [39]. We found inhibition of miR-362-5p significantly increased the expression of both DSTNP2 and DSTN, indicating that miR-362-5p targets both DSTNP2 and DSTN (Fig. 5e).
To test whether the m6A of DSTNP2 affects the crosstalk between DSTNP2 and DSTN, we mutated all the 8 m6A sites of DSTNP2 without disrupting the binding sites of miR-362-5p (Additional file 1: Figure S9b), and then overexpressed the wild type and mutant respectively into GM12878 cell line. Compared with wild-type DSTNP2, we found overexpressing DSTNP2 mutant resulted in significantly higher mRNA as well as protein expression of DSTN (Fig. 5f, g; Additional file 1: Figure S10b). We then separated the cytosolic and nuclear RNAs and found the mutations of DSTNP2 m6A sites specifically affected the expression of cytosolic RNAs of DSTN (Fig. 5f). To test whether the m6A of DSTNP2 affects the stabilities of cytoplasmic RNAs, we measured the degradation rates of DSTNP2 and DSTN RNAs. We found that DSTNP2 mutant had significantly higher stability than wild type DSTNP2 (Fig. 5h), and overexpression of DSTNP2 mutant resulted in significantly higher stability of its cognate protein-coding gene DSTN (Fig. 5i; Additional file 1: Figure S10c), suggesting that the m6A of DSTNP2 promotes the degradation of DSTNP2 and indirectly decreases the stability of DSTN. When miR-362-5p was inhibited, overexpression of wild-type and mutant DSTNP2 did not show significantly different effects on the gene expression, RNA stability, and protein expression of DSTN, suggesting that m6A mediated degradation of DSTNP2 reduced the ceRNA effects of DSTNP2 on DSTN (Fig. 5j–l).
Similar results were observed for the other processed pseudogene NAP1L4P1. NAP1L4P1 can also upregulate the expression of NAP1L4 (Additional file 1: Figure S11a, b). Overexpression of NAP1L4P1 m6A-mutant resulted in significantly higher cytoplasmic abundance as well as stability of its cognate protein-coding gene NAP1L4 (Additional file 1: Figure S11c-e), suggesting that the m6A of NAP1L4P1 promotes the degradation of NAP1L4P1 and indirectly decreases the stability of NAP1L4. However, we did not succeed in identifying the miRNA that mediates this process.