An integrated computational pipeline for designing CRISPRi sgRNAs that target human pseudogenes
We developed an integrated computational pipeline to design a CRISPRi sgRNA library for the annotated human pseudogenes. As the effectiveness of CRISPRi-based transcriptional repression relies heavily on the precise recruitment of the effector complex to the target gene transcription start site (TSS) [19], an accurate annotation of the TSSs of individual genes is essential to the success of designing effective sgRNAs. However, the TSSs of pseudogenes are relatively poorly annotated in comparison with that of protein-coding genes. To address this issue, we first integrated the FANTOM5 cap analysis of gene expression (CAGE) [21] data with GENCODE V22 transcriptome annotation to define the TSSs on a transcriptome-wide level, including the pseudogenes (Fig. 1A and “Methods”), as described previously [22]. A total of 97,074 CAGE clusters were assigned as the TSSs of transcripts in GENCODE V22. Next, we used the algorithm Sequence Scan for CRISPR (SSC) [23] to scan for sgRNA targets based on the genomic sequence within a 500-bp window centered on each TSS. After filtering out sgRNA sequences of low quality from the initial 770,965 sgRNAs, we selected 359,082 uniquely mapped sgRNAs that target the TSS-proximal regions of 42,609 genes. There was a total of 57,031 uniquely mapped sgRNAs that target TSS-proximal regions of 7762 pseudogenes (Additional File 1: Table S1).
As a proof-of-principle systematic study of human pseudogene function with the designed CRISPRi sgRNAs, we focused on breast cancer, which is the most commonly diagnosed cancer and the leading cause of cancer death in women worldwide and has well-defined molecular subtypes [24]. Luminal A is the most common subtype and triple negative/basal-like is a more aggressive subtype in breast cancer. To generate a CRISPRi sgRNA library that target the expressed pseudogenes in MCF7 and MDA-MB-231 cell lines, the two breast cancer cell line models representing luminal A and triple negative/basal-like breast cancer, we used the RNA-seq data from The Cancer Cell Line Encyclopedia (CCLE) [25] to filter out the lowly expressed pseudogenes (FPKM < 0.5 in both breast cancer cell lines). We also filtered out the pseudogenes with less than three designed sgRNAs. After filtering, there were 5703 designed sgRNAs corresponding to 850 pseudogenes, with the median number of 6 sgRNAs per pseudogene (Additional File 1: Table S1; Additional File 2: Fig. S1A). Furthermore, to interrogate the function of both parent genes (if available) and pseudogenes in the same screen, sgRNAs targeting high-confidence parent genes [26] (Fig. 1A) that passed the same filters as for pseudogenes were included in the library, resulting in 3727 sgRNAs that target 380 parent genes (“Methods,” Additional File 1: Table S1). In addition to 9430 gene-specific CRISPRi sgRNAs targeting pseudogenes and parent genes, we included 568 sgRNAs targeting 71 core fitness genes as positive controls, and 267 sgRNAs targeting AAVS1 and 83 non-targeting sgRNAs as negative controls, as described previously [27] (Fig. 1B, Additional File 1: Table S1). The pseudogenes in our screen covered all three major classes of pseudogenes, including processed (n = 642), unprocessed (n = 185), and unitary pseudogenes (n = 17), as well as non-classified/misc (n = 6) pseudogenes (Fig. 1C). For the majority of parent genes that were included, there were no more than two corresponding pseudogenes with targeting sgRNAs in the library (Fig. 1D).
A CRISPRi screen identifies functional human pseudogenes of different categories
To identify the pseudogenes that critically contribute to luminal A breast cancer cell growth and/or survival (fitness), we conducted a pooled CRISPRi screen (Fig. 2A and “Methods”). We conducted the screen in MCF7 cell line that stably express the streptococcus pyogenes (Sp) dCas9-KRAB fusion protein [20] in triplicates, in a similar way to the CRISPR-Cas9 or CRISPRi screen performed previously [20, 28]. Briefly, the oligonucleotides containing both sgRNAs and flanking linker sequences were synthesized as a pooled library and the resultant library was amplified and cloned into the lentiviral vectors. Cells transduced with the lentiviral vectors encoding sgRNA library were selected with puromycin (puro). The puro-selected cells were then split into three replicates and passaged for 21 days. We collected individual replicates on day 0 (D0) and day 21 (D21). The abundance change of individual sgRNAs between the initial and final cell populations were quantified by next-generation sequencing to identify the genes that are important for cell fitness. The sgRNA abundance from the three replicates showed a significant correlation (p < 2.2 × 10−16) with each other on day 0 and day 21 (Additional File 2: Fig. S1B). As expected for the working positive controls, we observed a notable depletion in the abundance of the sgRNAs targeting positive control core essential genes (Additional File 2: Fig. S1C and D) in final cell populations (D21) compared with the initial ones (D0). Interestingly, the sgRNAs targeting parent genes also showed a general decrease in abundance in final cell populations (Additional File 2: Fig. S1C and D), suggesting that many of them are essential to cell fitness. Moreover, we used the parent genes included in our library that were previously identified as essential genes in MCF7 cells by CRISPR-Cas9 knockout screens [29] to serve as independent positive controls. As expected, the sgRNAs that target the essential parent genes previously identified by CRISPR-Cas9 knockout screens showed a statistically significant larger fold change of decrease between D21 and D0, compared with the ones targeting all parent genes (Mann-Whitney U test, p < 1.21 × 10−4, Additional File 2: Fig. S1D). Similar result was observed in the gene-level histograms where the fold change of each gene was represented by the 2nd largest fold change of its corresponding sgRNAs, based on the output of MAGeCK program [30] (Additional File 2: Fig. S1E).
To identify the genes that are essential to cell fitness, we used MAGeCK [30] to assess the statistical significance of the level of sgRNA depletion and identify the genes that were under significantly negative selection in the screens (p < 0.05, FDR < 0.25 and log2Fold-Change≤-log2(1.5), “Methods”). Bidirectional promoters are an important source of off-target effect for CRISPRi and can result in false positives in cell fitness screens [27]. To control for the false positives caused by bidirectional promoters, we excluded the negatively selected genes from screen hits, if their TSSs were within 1 kb from the TSSs of another gene based on GENCODE V22 annotation (“Methods,” Additional File 3: Table S2). We used 1 kb as a cutoff, based on our previous finding that genes located up to 1 kb from an essential gene are more likely to be scored as an essential one in a fitness screen, due to CRISPRi off-target effect [27]. After filtering out the potential false positives due to bidirectional promoters, we identified 69 pseudogene hits (out of 850 pseudogenes) that were negatively selected in MCF7 cells (Fig. 2B). In contrast, we did not find any significant positively selected genes in the screen (p < 0.05, FDR < 0.25 and log2Fold-Change≥log2(1.5), Additional File 3: Table S2). The negatively selected pseudogene hits contained all three major classes of pseudogenes (Fig. 2B). In addition, we identified 69 parent gene hits (out of 380 parent genes) that were negatively selected (Fig. 2C). Interestingly, we found that the negatively selected parent genes showed a larger magnitude of sgRNA depletion than their corresponding pseudogenes (paired t-test, p < 3.03 × 10−52, Fig. 2D), suggesting that in general, parent genes are functionally more important for cell fitness than their pseudogene counterpart. However, a small fraction of pseudogenes showed a larger magnitude of sgRNA depletion than their corresponding parent genes, suggesting that the function of these pseudogenes might be less dependent on their parent genes (Fig. 2D).
To investigate the effect of potential off-targeting sgRNAs on the screen results, we used Cas-OFFinder [31] to predict the putative off-target sites of individual sgRNAs in the human genome (“Methods”). Because the off-target effect is much weaker when there are > 1 nucleotides (nt) of mismatches [32] or there is any RNA/DNA bulge [33] in the potential off-target sites, we focused on the predicted off-target sites with 1-nt mismatch from a given sgRNA sequence. We found that most sgRNAs targeting pseudogene/parent gene were associated with no or very small number (≤ 1) of predicted off-target sites in the human genome (Additional File 2: Fig. S1F). Moreover, the number of predicted genomic off-target sites associated with off-targeting sgRNAs did not show significant difference between pseudogene/parent gene hits and non-hits (Mann-Whitney U test, p ≥ 0.18, Additional File 2: Fig. S1G). Importantly, we found that the pseudogene/parent gene hits did not have a significantly larger proportion of off-targeting sgRNAs with a large number (≥ 10) of off-target sites, compared with the other pseudogenes/parent genes (Fisher’s exact test, p > 0.32). Collectively, these results suggest that in overall, the potential off-targeting sgRNAs may have little impact on differentiating the screen hits from the other genes and thus the results of our CRISPRi screens. Aside from the global analysis of potential off-target effect, we further investigated whether the pseudogene hits identified from our screen could be confounded by the potential off-targeting sgRNAs from its corresponding parent genes or vice versa. We found that out of the 69 pseudogene and 69 parent gene hits, 15 pseudogenes and their corresponding parent genes were both identified as hits. Among a total of 30 (15 pseudogene and 15 parent gene) hits, we found 6 of them have one and the only one significant negatively selected sgRNA (p < 0.05, log2Fold-Change≤-log2(1.5)) that harbors a predicted off-target site within [− 2 kb,+ 1 kb] from the TSS of its corresponding pseudogene/parent gene (“Methods,” Additional File 3: Table S2). After removing the only one putative functional off-targeting sgRNA for all six pseudogene/parent gene hits, four of them still had at least two significant negatively selected sgRNAs and two of them had one significant negatively selected sgRNA. These results indicate that the vast majority of the pseudogene/parent gene hits are not confounded by the predicted off-targeting sgRNAs from their corresponding pseudogenes/parent genes.
Validating top pseudogene hits with an upregulated expression in breast cancer
To validate the top pseudogene hits from our screen that are relevant to breast cancer, we focused on the pseudogenes, whose targeting sgRNAs showed the strongest growth inhibitory effect in MCF7 cells and that were significantly upregulated in breast cancer (Fig. 3A), compared with normal breast tissues (log2Fold-Change≥log2(1.5) and FDR < 0.05, “Methods”). We selected four candidates DDX12P, TUBBP5, MGAT4EP, and PRELID1P1 that had at least 50% effective and negatively selected sgRNAs scored by MAGeCK for functional validation. To determine the role of these four pseudogenes in cancer cell fitness, we examined their loss-of-function phenotype in MCF7 cells with CRISPRi-mediated gene silencing. For each pseudogene, we selected the top two sgRNAs (> 55 bp apart from each other in the genome) that showed the strongest growth inhibitory effect in CRISPRi screen for gene silencing (“Methods”). Real-time quantitative reverse transcription PCR (qRT-PCR) experiments confirmed that the selected gene-specific sgRNAs effectively reduced RNA level of the corresponding pseudogenes in comparison with the non-targeting sgRNA (Fig. 3B, Additional File 4: Table S3). The effective depletion of these pseudogenes by either of the two gene-specific sgRNAs inhibited the growth of MCF7 cells (Fig. 3C–F) and impaired their clonogenic capacity (Fig. 3G, H). To assess the robustness of our observations to the use of different negative controls, we included two genome-targeting negative control sgRNAs with one targeting the Adeno-Associated Virus Integration Site 1 (AAVS1) region (sg-AAVS1) and the other targeting the genomic region distant from AAVS (sg-nAAVS1) in the loss-of-function study (Additional File 4: Table S3). The sg-AAVS1 and sg-nAAVS1 were selected from our CRISPRi sgRNA library and a CRISPR-Cas9 sgRNA library used in previous knockout screens [34], respectively, because they showed an insignificant abundance change across conditions. Similar to the case of using a non-targeting sgRNA as a negative control, gene-specific sgRNAs effectively reduced RNA level of the corresponding pseudogenes compared with negative controls of genome-targeting sgRNAs (Additional File 2: Fig. S2A). The effective depletion of these pseudogenes consistently impaired the clonogenic capacity of MCF7 cells and inhibited their growth (Additional File 2: Fig. S2B and C) when the genome-targeting sgRNAs were used as a negative control. Taken together, these results indicate that the loss-of-function phenotypes we observed are robust to different negative controls. To rule out the possibility that the observed loss-of-function phenotype for these pseudogenes are caused by CRISPRi-mediated off-target effect, we further performed rescue experiments, by overexpressing the corresponding cDNAs of these pseudogenes in the presence of CRISPRi-mediated knockdown. We found that cDNA expression was able to rescue the CRISPRi-mediated loss-of-function phenotype in both cell growth and clonogenic formation (Additional File 2: Fig. S2D and E). These results indicate that the observed loss-of-function phenotypes for these four pseudogenes are not due to off-target effect. In addition, we aligned the spacer sequences of the sgRNAs used in the validation experiments against the promoter sequences of the corresponding parent genes (if available) and chose the first PRELID1P1-targeting sgRNA (Additional File 4: Table S3), which only has 1-nt mismatch to the promoter of its parent gene, to assess its specificity in inhibiting pseudogene transcription. Importantly, we found CRISPRi-mediated silencing by this sgRNA significantly reduced Pol II binding to the promoter of PRELID1P1, but not to that of its parent gene PRELID1 (Additional File 2: Fig. S3A). This result confirmed that our CRISPRi screen enabled transcriptionally inhibiting pseudogene transcription without directly interfering with the transcription of its parent gene, which is critical for systematically interrogating the function of pseudogenes independent from their parent genes.
MGAT4EP is a cancer-testis unitary pseudogene that promotes the growth of breast cancer cells
Among the four pseudogene hits that we validated, MGAT4EP showed a consistently strong loss-of-function phenotype in both cell growth and clonogenic assay, and a larger fold change in expression between breast cancer and normal breast tissues. Therefore, we focused on this unitary pseudogene for a detailed functional characterization. Interestingly, MGAT4EP not only showed a significant upregulation in breast cancer compared with normal breast tissue based on the Cancer Genome Atlas [35] (TCGA) RNA-seq data [36], but also showed a much higher expression in testis than other normal tissues based on the RNA-seq data generated by the Genotype-Tissue Expression (GTEx) project [37] (Additional File 2: Fig. S3B and C), indicating that MGAT4EP is a cancer-testis unitary pseudogene. We further performed 5′ and 3′ RACE (rapid amplification of 5′/3′ complementary DNA ends), and confirmed that the experimentally determined 5′ and 3′ end of MGAT4EP transcript (NR_038135.2) were consistent with its original RefSeq annotation (Additional File 2: Fig. S3D).
Some of the human pseudogenes were found to undergo translation and might express functional proteins [38]. To rule out the possibility that MGAT4EP encodes a protein/micropeptide and has a coding-dependent function, we first analyzed the publically available [39] and in-house ribosome profiling (ribo-seq) data (unpublished) in MCF7 cell line (“Methods”) and found no ribo-seq reads that support the ribosome occupancy on MGAT4EP RNAs. Second, we predicted putative ORFs encoded by MGAT4EP using an ORF prediction module that solely relies on the sequence information and is implemented in the Ribo-TISH package [40] (“Methods”), and searched the publically available mass-spectrometry (MS) data in MCF7 and T47D cells [41] for the MS/MS spectra that matched the protein sequences corresponding to these putative ORFs (“Methods”). We found no MS evidence of the protein products encoded by these putative ORFs. Finally, we performed an in vitro translation assay (“Methods”) and found no evidence of any protein products generated by MGAT4EP translation (Additional File 2: Fig. S3E). These results indicate that MGAT4EP does not encode a protein/micropeptide and it functions as an ncRNA.
To validate the function of MGAT4EP using an alternative loss-of-function approach to CRISPRi, we performed siRNA-mediated silencing of MGAT4EP to assess its effect on cell growth (“Methods”). Consistent with the results obtained by CRISPRi method, we found that the effective siRNA-mediated depletion of MGAT4EP (Additional File 2: Fig. S3F) inhibited the growth of both MCF7 and T47D, the two independent luminal A breast cancer cell lines (Additional File 2: Fig. S3G).
MGAT4EP is predominantly localized in the nucleus and interacts with transcription factor FOXA1
An important mechanism, whereby many pseudogenes [6,7,8] exert their function, is through competing for miRNA binding with its parent gene or other protein-coding genes, the so-called sponge [11]/ceRNA [12, 42] mechanism. This mechanism is not specific to pseudogenes that have parent genes with high sequence homology (i.e., processed/unprocessed pseudogenes). A previous study [8] revealed that Pbcas4, a mouse unitary pseudogene that does not have functional protein-coding counterparts in the mouse genome and lost its protein-coding capability specifically during rodent evolution, can also serve as a ceRNA with conserved miRNA target sites. A critical factor determining the efficacy of a sponge/ceRNA regulation is the cytoplasmic localization of the involved RNAs [43] where most miRNA-based regulation occurs. Because sub-cellular localization is important for dictating mechanism of pseudogene function, we determined sub-cellular localization of MGAT4EP, using nuclear/cytoplasmic fractionation coupled with qRT-PCR. The good quality of the nuclear/cytoplasmic fractionation was supported by the enrichment of nuclear/cytoplasmic protein (Additional File 2: Fig. S3H) and RNA controls (Fig. 4A) in their respective sub-cellular compartments. Interestingly, we found that MGAT4EP was dominantly localized in the nucleus (Fig. 4A), suggesting that it may exert a nuclear function, which is distinct from a traditional sponge/ceRNA mechanism.
To infer its nuclear function and systematically identify the nuclear proteins that may form physical interaction with MGAT4EP, we used an RNA pull-down method [44] (“Methods”) based on in vitro transcription of 3′ end-biotin-labeled RNAs and affinity purification of RNA-interacting nuclear proteins, followed by mass spectrometry (MS). The antisense (AS) sequence of MAGAT4EP transcript (NR_038135.2) was used as a negative control RNA in pull-down experiment, to filter out non-specific interactions (Fig. 4B). We found that among the proteins specifically identified in MAGAT4EP pull-down group (i.e., at least two unique MS-identified peptides in the MAGAT4EP group and zero MS-identified peptide in the AS negative control group), eight transcription factors/epigenetic regulators showed significant upregulation in the luminal A breast cancer subtype compared with normal breast tissues (Additional File 2: Fig. S4A). Notably, one of them is FOXA1 (forkhead box A1), also known as HNF3α (hepatocyte nuclear factor 3α), a transcription factor that impacts estrogen receptor signaling and is key to mammary ductal development and progression of luminal A subtype breast cancer [45]. We confirmed by western blot analysis that FOXA1 was enriched by MAGAT4EP RNA pull-down (Fig. 4C). In contrast, SP1, a negative control protein that was not identified by MS, was not detected (Fig. 4C). To identify the regions in the MGAT4EP RNA that was required for its interaction with FOXA1, we generated four serial deletion mutants with the deletion of 1–700, 700–1400, 1400–2100, or 2100–2819 bps, respectively. The RNA pull-down of antisense, full-length, and serial deletion mutants of MGAT4EP RNA followed by anti-FOXA1 western blotting showed that the deletion of 1400–2100 bps of MGAT4EP abolished its interaction with FOXA1 (Fig. 4D), suggesting that this region is critical for MGAT4EP-FOXA1 interaction. We further performed RNA immunoprecipitation (RIP) coupled with qRT-PCR for MGAT4EP RNA and two negative controls: GAPDH mRNA for cytoplasmic RNAs and MALAT1 for nuclear RNAs, respectively. Indeed, FOXA1 was associated with MGAT4EP RNA, but not the negative control of GAPDH mRNA and MALAT1 (Fig. 4E).
MGAT4EP upregulates the expression of FOXM1, a direct target of FOXA1, by enhancing FOXA1 binding to its promoter
To identify common protein-coding gene targets that are co-regulated by MGAT4EP and FOXA1 and are important for mediating their tumor-promoting function in luminal A breast cancer cells, we performed an integrated analysis of the RNA-seq data generated from cells with/without sgRNA-mediated MGAT4EP depletion, FOXA1 ChIP-seq data in luminal A breast cancer cell lines [46], and TCGA breast cancer data [36] (Fig. 5A). First, using our RNA-seq data, we identified 103 downregulated protein-coding genes (log2Fold-Change≤ −log2(1.5) and FDR < 0.05) in MGAT4EP knockdown cells (Additional File 5: Table S4; Additional File 2: Fig. S4B). Next, using publicly available FOXA1 ChIP-seq data in MCF7 and T47D cells [46], we identified 5310 protein-coding genes that harbored at least one FOXA1 binding site in their promoter-proximal regions (− 1.5 kb, 0.5 kb) in either cell line. In total, 19 protein-coding genes showed downregulation upon MGAT4EP knockdown and harbored at least one FOXA1 binding site in their promoter regions, thereby representing potential common targets co-regulated by MGAT4EP and FOXA1. Finally, through an analysis of TCGA data, we found that four of the 19 candidates, including FOXM1, MALRD1, RMI2, and TGFB3, showed a significant upregulation in the luminal A breast cancer compared with normal breast tissues (log2Fold-Change≥1 and FDR < 0.05). Given that FOXM1 is an established oncogenic transcription factor [47] which is upregulated in a variety of human cancers, we focused on characterizing the mechanism, whereby MGAT4EP/FOXA1 axis regulates its expression.
Both FOXA1 and FOXM1 showed a significant upregulation in luminal A breast cancer subtype compared with normal breast tissues (Additional File 2: Fig. S4C). To validate the regulation of FOXM1 expression by MGAT4EP, we assessed the effect of sgRNA-mediated depletion of MGAT4EP on FOXM1 expression and found that sgRNA-mediated knockdown of MGAT4EP significantly reduced FOXM1 expression at both RNA and protein level in MCF7 cells (Additional File 2: Fig. S4D and E). Consistent with the results from CRISPRi-based silencing, siRNA-mediated depletion of MGAT4EP reduced FOXM1 expression at both RNA and protein level in MCF7 and T47D cells (Fig. 5B, C). In addition, effective siRNA-mediated depletion of FOXA1 markedly reduced FOXM1 expression at RNA and protein level in MCF7 and T47D cells (Fig. 5D, E). These results confirmed that FOXM1 as a common downstream target of FOXA1 and MGAT4EP. We also found that siRNA-mediated depletion of MGAT4EP did not affect the protein level of FOXA1 (Additional File 2: Fig. S4F), indicating that MGAT4EP did not regulate FOXM1 expression by impacting the FOXA1 protein level. Given the previous findings that long noncoding RNAs (lncRNAs) can promote the recruitment of epigenetic modifiers to specific genomic locations [48], we hypothesized that unitary pseudogene MGAT4EP may regulate FOXM1 expression by enhancing FOXA1 binding to its promoter region. To test this hypothesis, we evaluated the effect of siRNA-mediated depletion of MGAT4EP on FOXA1 binding to the promoter region of FOXM1 by ChIP-qPCR. Based on the FOXA1 binding site in the FOXM1 promoter that was identified using publicly available ChIP-seq data [46] (Fig. 5F), we found that FOXA1 bound to FOXM1 promoter and its binding was indeed impaired upon siRNA-mediated depletion of MGAT4EP (Fig. 5G). To determine the role of FOXM1 in luminal A breast cancer cells, we examined its loss-of-function phenotype in MCF7 and T47D cells with short hairpin RNA (shRNA)-mediated gene silencing. Consistent with the oncogenic role of FOXM1 in other cancers, the effective depletion of FOXM1 (Fig. 5H) by either gene-specific shRNA inhibited the growth of MCF7 and T47D cells (Fig. 5I) and impaired their clonogenic capacity (Additional File 2: Fig. S4G).
Many unitary pseudogenes show clinically relevant expression patterns in human cancer
Unitary pseudogenes are a special class of pseudogenes that derive from acquisition of disrupting mutations in functional protein-coding genes, without duplication or retrotransposition events. They do not have functional protein-coding gene counterparts in the same genome. With the discovery of MGAT4EP, a novel functional cancer-testis unitary pseudogene, we further investigated among 170 annotated unitary pseudogenes (GENCODE V22), whether there are other unitary pseudogenes, the expression of which is elevated in tumor compared with the corresponding normal tissues and/or is associated with clinical outcomes including patient overall survival (OS) and/or relapse-free survival (RFS) across different types of cancers, via integrative analyses of TCGA data. Interestingly, the majority of unitary pseudogenes showed a significant differential expression (|log2Fold-Change| ≥ log2(1.5), p < 0.05 and FDR < 0.25) between tumors and the corresponding normal tissues (Fig. 6A and Additional File 6: Table S5), with 132 unitary pseudogenes showing a significant up-/downregulation in at least three cancer types. In addition, the expressions of 111 and 34 unitary pseudogenes were associated with OS and/or RFS of patients (Fig. 6B, C and Additional File 6: Table S5) respectively in at least two cancer types, based on multivariate Cox proportional hazards regression analysis (p < 0.05 and FDR < 0.25).
Some examples of unitary pseudogenes, whose expression was significantly associated with OS and/or RFS in the same cancer type or different cancer types, are of particular interest. The first example is CMAHP (cytidine monophospho-N-acetylneuraminic acid hydroxylase, pseudogene), a unitary pseudogene that encodes a dysfunctional version of the cytidine monophospho-N-acetylneuraminic acid hydroxylase (Cmah) from other mammals. The enzyme encoded by Cmah in non-human mammals hydroxylates N-acetylneuraminic acid (Neu5Ac), producing N-glycolylneuraminic acid (Neu5Gc) [49]. Neu5Ac and Neu5Gc are two most common forms of sialic acid in many non-human mammals. In contrast, Neu5Gc is not detectable in normal human tissues and is immunogenic in human [49]. Higher CMAHP expression was associated with better patient OS in lung adenocarcinoma (LUAD, log-rank test, p = 0.00142)) and cutaneous melanoma (SKCM, log-rank test, p = 7.47 × 10−6), respectively (Fig. 6D). It was also significantly downregulated in these two cancer types compared with the corresponding normal tissues (data not shown), suggesting its tumor-suppressive role. The second example is CPHL1P (ceruloplasmin And Hephaestin Like 1, pseudogene). Different from CMAHP, higher expression of unitary pseudogene CPHL1P was associated with worse patient OS (log-rank test, p = 4.98 × 10-6) in clear cell renal cell carcinoma (KIRC) and worse patient RFS (log-rank test, p = 0.00245) in prostate cancer (PRAD), respectively (Fig. 6E). It was also significantly upregulated in KIRC and PRAD compared with the corresponding normal tissues (data not shown), suggesting its tumor-promoting role. The third example is MYH16 (myosin heavy chain 16 pseudogene). Like MGAT4EP, MYH16 is a cancer-testis unitary pseudogene, which was upregulated in solid tumors such as pancreatic adenocarcinoma (PAAD) and showed a much higher expression in testis than other normal tissues (data not shown). It encodes a deficient sarcomeric myosin heavy chain that is otherwise expressed and functional in non-human primate masticatory muscles [50]. The pseudogenization of the sarcomeric myosin heavy chain in human lineage was associated with the marked size reductions in individual muscle fibers and entire masticatory muscles in human [50]. Interestingly, a higher MYH16 expression was associated with both worse patient OS (log-rank test, p = 0.000454) and RFS (log-rank test, p = 0.00363) in PAAD (Fig. 6F).