Aberrant splicing isoforms detected by full-length transcriptome sequencing as transcripts of potential neoantigens in non-small cell lung cancer
Genome Biology volume 22, Article number: 9 (2021)
Long-read sequencing of full-length cDNAs enables the detection of structures of aberrant splicing isoforms in cancer cells. These isoforms are occasionally translated, presented by HLA molecules, and recognized as neoantigens. This study used a long-read sequencer (MinION) to construct a comprehensive catalog of aberrant splicing isoforms in non-small-cell lung cancers, by which novel isoforms and potential neoantigens are identified.
Full-length cDNA sequencing is performed using 22 cell lines, and a total of 2021 novel splicing isoforms are identified. The protein expression of some of these isoforms is then validated by proteome analysis. Ablations of a nonsense-mediated mRNA decay (NMD) factor, UPF1, and a splicing factor, SF3B1, are found to increase the proportion of aberrant transcripts. NetMHC evaluation of the binding affinities to each type of HLA molecule reveals that some of the isoforms potentially generate neoantigen candidates. We also identify aberrant splicing isoforms in seven non-small-cell lung cancer specimens. An enzyme-linked immune absorbent spot assay indicates that approximately half the peptide candidates have the potential to activate T cell responses through their interaction with HLA molecules. Finally, we estimate the number of isoforms in The Cancer Genome Atlas (TCGA) datasets by referring to the constructed catalog and found that disruption of NMD factors is significantly correlated with the number of splicing isoforms found in the TCGA-Lung Adenocarcinoma data collection.
Our results indicate that long-read sequencing of full-length cDNAs is essential for the precise identification of aberrant transcript structures in cancer cells.
In cancer cells, transcript regulations are disrupted at various steps, resulting in the accumulation of substantial numbers of aberrant transcripts . In normal cells, these aberrant transcripts, even if transcribed, are subsequently degraded by the mRNA quality control system in a process known as nonsense-mediated mRNA decay (NMD). NMD factors scan each transcript and subject mRNA-harboring premature translation termination codons (PTCs) to degradation . However, the NMD regulation mechanism is also frequently disrupted in cancer cells [3,4,5], allowing the aberrant transcripts to escape degradation. Therefore, accumulations of aberrant transcripts are frequently observed in cancer cells. For example, UPF1, a highly conserved core NMD factor, is frequently mutated in pancreatic adenosquamous carcinoma . Disruption of the UPF1 function causes the accumulation of aberrant transcripts in the tumor [6, 7]. The relationship between NMD and tumor immunity has been reported  and considered to be a therapeutic target for cancers in some conditions .
Aberrations are also caused by impaired gene expression of splicing-related factors . Indeed, the relationship between the aberrations and disrupted splicing regulation has been reported for various cancer types. U2AF1 and SF3B1, the core components of spliceosomes, are frequently mutated in cases of myelodysplastic syndrome (MDS) [11, 12] and several types of solid tumor [13, 14]. BUD31 was reported as an MYC-synthetic lethal gene in mammary epithelial cells and was found to disrupt processing in precursor mRNA . Further, some specific aberrant isoforms directly contribute to tumorigenesis and malignancy . Thus, splicing factors have also been identified as therapeutic targets for cancer [17, 18].
Recently, it has been suggested that putative functionally neutral aberrant transcripts have the potential to serve as novel molecular markers for immune check point inhibitor (ICI) therapy, which uses ICIs such as nivolumab, pembrolizumab, and atezolizumab. For current ICI treatment, the tumor mutational burden (TMB), which is determined by the total number of nonsynonymous point mutations, has been frequently considered to be a biomarker to identify patients with expected effects [19,20,21]. These nonsynonymous mutations alter amino acids and produce novel cancer-specific neoantigens, some of which are subsequently expressed in the HLA molecule and recognized by T cell receptors, which activates cytotoxic T cells to kill the cancer cells. [22, 23] The assumption is that the higher the expression of neoantigens in cancer cells, the greater the potential for the cancer cell to be attacked by immune cells following ICI treatment.
Currently, massive parallel sequencing techniques, such as whole-exome sequencing and panel sequencing, have been employed to measure TMB in various conditions. Several studies have reported that the TMB correlates with the efficacy of ICI treatment, especially in melanoma patients . However, it has been gradually revealed that the TMB may not be a perfect marker of ICIs in many types of cancer [25,26,27]. Recent studies have reported that transcripts that harbor frameshift mutations and aberrant splicing patterns also produce antigenic peptides under the presumed condition of disrupted NMD [28,29,30,31]. Under such circumstances, the aberrant transcripts may have no less important potential as biomarkers for ICIs.
To consider all the effects of mutations and aberrant isoforms, proper peptide sequences from full-length transcript structures are needed. However, it is difficult to precisely identify the complete structure and aberration patterns of a transcript from the fragmented reads produced by short-read RNA sequencing technology. Recently, long-read sequencing technology has enabled us to obtain full-length transcriptome profiles. Particularly, MinION, a nanopore-type sequencer from Oxford Nanopore Technologies, can produce full-length cDNA data, covering the entire transcript at sufficient sequencing depth. Recent papers have also shown that the long-read sequencing achieves approximately 90% accuracy , which is sufficient for the precise extraction of splicing patterns.
In this study, in combination with the long-read and short-read sequencing technologies, we attempt to construct a catalog of full-length transcript structures in a series of lung cancer cells. Based on this catalog, we examined whether the deduced full-length protein sequences with mutations and/or frameshift aberrations can serve as predictive markers for ICIs in addition to TMB. For this purpose, we started with a cell line panel of lung adenocarcinoma as the model. We then applied the developed method to clinical samples from Japanese non-small cell lung cancer (NSCLC) patients.
Cataloging full-length transcript isoforms of lung cancer cell lines
MinION full-length DNA sequencing was performed for 22 NSCLC cell lines (Additional file 1: Table S1), in which characteristic genomic driver mutation patterns and transcriptomic profiles of lung cancers were collectively represented. The schematic representation of the analytical pipeline constructed for this study is shown in Additional file 12: Fig. S1A. An average of 3.5 million reads at an average read length of 1.6 kb were generated from each cell line (Additional file 12: Figs. S1B, S2 and Additional file 2: Table S2). The obtained full-length cDNA reads were mapped to the human genome by Minimap2 as previously reported . To overcome the inaccuracy of MinION sequencing , low-quality and ambiguously aligned reads were filtered out (Additional file 2: Table S2). All splicing junctions were confirmed using the Illumina short-read RNA sequencing dataset of the same cell lines. The obtained splicing patterns were further compared to the current transcript models of the Reference Sequence (RefSeq) database (see “Materials and methods” for further details on the procedure for the identification and characterization of splicing junctions). Among the reads mapped to RefSeq transcripts, more than 50% of reads successfully covered full-length transcripts of up to 8000 genes (Additional file 12: Fig. S1C). For each gene, the number of reads per million (RPM) of MinION and transcripts per million (TPM) calculated from short-read sequencing were strongly correlated (Additional file 12: Fig. S1D and Fig. S3).
We identified the complete exon-intron structures of transcript isoforms and classified them into the following types: unannotated exon, exon skipping, exon shuffling, intron retention, alternative first exon, alternative last exon, alternative 5′ splice site, and alternative 3′ splice site (Fig. 1a and Additional file 3: Table S3). As a result, we identified 3474 non-RefSeq isoforms from all the cell lines and named them putative “aberrant splicing isoforms,” hereinafter referred to as “isoforms.” Of these, 2021 isoforms harbored at least one splicing event that was not represented in the RefSeq or the GENCODE datasets (Fig. 1b). We also identified novel combinations of these splicing events on a single full-length transcript, which would have been difficult to detect via the fragmented reads of short-read sequencing data. Collectively, these patterns accounted for 13.6% of the unannotated isoforms (Fig. 1c).
The patterns and numbers for each of the NSCLC cell lines represent characteristic aberrant splicing events (Fig. 1d and Additional file 3: Table S3). The number of isoforms ranged from 323 to 725. The compositions of isoform patterns differed among the cell lines. For example, intron retention and alternative last exon composed the largest proportions of isoforms in RERF-LC-Ad2, while unannotated exon and exon skipping were characteristics in H1819. These results suggest the diversity of aberrant splicing isoform patterns even in cell lines. An example of a novel complicated combination isoform of splicing events is shown in Fig. 1e. CTSV in ABC-1 cells expressed two isoforms, one of which included the combination of alternative last exon and intron retention that occurred between exons 2 and 3.
Genes with multiple isoforms were also observed. For example, HNRNPA2B1 showed four isoforms that contained alternative splicing events within 3′ UTR of PC-3 cells (Fig. 1f). This extensive accumulation of splicing events in the 3′ UTR of HNRNPA2B1 was also observed in a previous study that performed a UPF1 knockdown experiment in HeLa cells . We identified 663 genes with multiple isoforms in total (Fig. 1g). These results reflect the great potential of long-read sequencing to comprehensively detect novel and complicated isoforms.
To validate of our computational pipeline, we compared the isoform patterns in VMRC-LCD cells detected by our pipeline and by the TALON software. The TALON software, which was launched by the ENCODE project , identifies novel isoforms using long-read sequencing datasets. As a result, 708 of 725 isoforms (98%) detected in our pipeline were also detected in TALON, and 676 of them (93%) successfully passed TALON’s filtering conditions (Fig. 1h). The 17 isoforms that could not be detected by TALON were supported by reads whose junctions were altered by the long-read error correction tool TranscriptClean , a part pf TALON pipeline. Although 21,745 isoforms were detected only in TALON, 99% were covered by reads that were filtered out by our previous filtering conditions (attributed to ambiguous alignments or low expression levels, data not shown). Our filtering conditions aimed to avoid false-positive detection and to keep higher expressed transcripts, which can be translated as neoantigens; therefore, our pipeline extracted more conservative isoform patterns than the isoform patterns extracted by the TALON pipeline.
To confirm the effect of repetitive element loci, which can lead to misalignment of the reads, we evaluated 5508 splice junctions detected in 22 cell lines. We extracted ± 50 bp regions around the splice sites and searched repetitive regions using RepeatMasker. As a result, 94.9% of these splicing sites did not overlap repetitive sequences (Additional File 12: Fig. S1E). This result suggests that novel splice sites were not particularly associated with the repetitive sequences.
Some gene-fusion events were reported to function as driver genes contributing to tumorigenicity in many cancers [37, 38]. To address this issue, we firstly manually inspected the presence or absence of the previously reported frequently “fused” genes. We successfully identified the EML4-ALK fusion transcript, as a driver fusion event, in H2228 (Additional file 12: Figs. S4A). In addition, we identified the ERGIC2-CHRNA6 transcript in H1437 (Additional file 12: Figs. S4B). This is a fusion event causing the frameshift, invoking a drastic amino acid change, thus may serve as an important source of a novel neoantigen. For these fusions, it is true that our previous studies have suggested their presences by the short-read sequencing [1, 39]. However, we firstly identified their complete transcript forms by using the MinION long reads in this study. Particularly interestingly for the ERGIC2-CHRNA6 fusion gene, we identified four alternative splice isoforms, consisting of an unannotated exon and an alternative last exon.
Characterization of the detected isoforms
We further characterized the novel isoforms and found that 45% of the isoforms were shared among the cell lines, although 1894 isoforms were uniquely detected in each cell line (Fig. 2a). A total of 1354 genes harbored aberrant splicing isoforms in two or more cell lines (Fig. 2b and Additional file 12: Fig. S5).
To characterize isoform-enriched genes, we compared lengths, the number of exons, and expression levels of genes with/without isoforms. Genes harboring at least one isoform showed significantly shorter lengths and consisted of a smaller number of exons compared to those of genes without isoforms (Additional file 12: Figs. S6, S7, and S9); however, expression levels were found to be significantly higher in genes that harbored isoforms (Additional file 12: Figs. S8 and S9). We also found that isoform variety was associated with gene length and expression level. Several cell lines showed significant differences in gene length and expression level between genes with one isoform and those with two or more isoforms. For example, in VMRC-LCD, isoform-enriched genes tend to have shorter length and higher expression than genes with only one isoform (Fig. 2c–e).
To examine the relation between the expression levels and the detection probability, we subsampled sequencing reads of VMRC-LCD cells to 1/2, 1/5, 1/10, 1/50, and 1/100 (n = 100) and calculated the detection probability for each isoform. We divided these isoforms into three classes, High, Middle, and Low, for each of the three categories as below: (1) TPM of the gene (total gene expression), which was calculated from the short-read RNA sequencing data (Additional file 12 Fig. S9B); (2) isoform-reads ratio (isoform frequency within the gene): [the number of MinION reads covering the isoform pattern] / [the total number of MinION reads assigned to the gene] (Additional file 12 Fig. S9C); (3) isoform-reads (absolute count of the isoform): the number of MinION reads covering the isoform pattern (Additional file 12 Fig. S9D). As a result, we found that three classes of the isoforms showed similar detection probability in the categories of “TPM” and “isoform-reads ratio,” suggesting that the isoform detection was somewhat independent of these categories. On the other hand, “isoform-reads” showed significant differences among the classes for the detection probabilities, suggesting the low level isoforms were, indeed, difficult to be detected at low sequencing depth. However, at the sequencing depth conducted in this study, each case seems to have reached to a plateau to some extent. Also note that, on average, 360,000 mRNA molecules are estimated to exist in a single lung cancer cell . Thus, one copy of mRNA per cell corresponds to 3 TPM. The minimum TPM of the gene for which at least one isoform was detected was 6 TPM in VMRC-LCD. These facts suggest that we were able to identify the isoforms to the very low expression level within a cell.
We also conducted a gene ontology analysis using genes with at least one isoform and found that RNA-binding proteins involved in the translation pathway were significantly enriched (Fig. 2f and Additional file 12: Fig. S10). This result is consistent with a previous study of clinical specimens of MDS . The study reported that aberrant splicing events in patients with mutations of SF3B1, U2AF1, and SRSF2 were enriched in mRNA metabolic processes, including translation . Expressions of some of the ribosomal protein genes and splicing-related genes are generally regulated via alternative splicing events and thus may be vulnerable to alternative splicing [42, 43]. In general, ribosomal protein genes were found to be shorter and to have higher expression levels than other genes . This feature may contribute to the tendency of isoform-enriched genes to have the shorter gene length and higher expression levels (Fig. 2c–e) that were observed in this study.
No significant association was found between the number of aberrant splicing isoforms and EGFR, KRAS, or NRAS driver mutations in the cell lines (Fig. 2g). Significantly, we found that the number of aberrant splicing isoforms showed poor correlation with the genomic TMB, which is a source of neoantigens and one of the known markers for ICI efficacy (r = 0.30, Fig. 2h). The aberrant splicing isoforms detected in this study also have the potential to be translated and presented as neoantigens; therefore, we expect further analysis of these splicing patterns to provide an independent measure to define cancer cells in addition to TMB.
Biological validation of the aberrant splicing isoforms
Next, we investigated the potential causes of aberrant splicing isoforms in cancer cells. As we were able to analyze the isoforms as a form of full-length transcript, we counted aberrant isoforms containing PTCs, which would have been targeted by NMD. We found that ~ 30% of the aberrant isoforms contained PTCs (Fig. 3a and Additional file 3: Table S3). Indeed, when we examined the case of VMRC-LCD, which showed the highest number of aberrant splicing isoforms, we found this cell line harbored a splice site mutation in UPF1, which is a key NMD factor (Additional file 12: Fig. S11A). To more directly validate the possible cause of the accumulation of aberrant isoforms, we conducted an siRNA knockdown experiment for UPF1 in A549 (Fig. 3b). We similarly analyzed the obtained full-length cDNA MinION reads in combination with the Illumina short-read RNA sequencing data. As expected, the proportion of NMD-targeted isoforms was significantly increased by the UPF1 knockdown (Fig. 3c). For example, an intron-retained isoform in the SURF2 gene was detected only in the UPF1 knockdown cells despite this isoform containing a PTC and being potentially targeted by NMD (Fig. 3d). To validate the expression of this isoform specifically in UPF1-depleted cells, we perform RT-PCR with primers flanking the isoform-specific junction (Additional file 12: Fig. S11C). The alternative 5′ splice site in exon 2 of SURF2 showed a two- to three-fold increase in response to the UPF1 knockdown experiment (Fig. 3e). This increase was also detected as size differences among PCR products (Additional file 12: Fig. S11D).
In the previous study, four p-UPF1-binding motifs (CCUGGGG, CCUGGGA, CCUGGAA, and CCUGAGA) were reported . We searched them in the 3′UTR of the isoforms that were detected only in the UPF1 knockdown in A549 cells. As a result, 34 of the 91 isoforms included the p-UPF1 motifs. We also evaluated the motif enrichment in UPF1-depleted A549 cells using the MEME Suite  and found that possible GC-rich motifs were enriched in the 3′UTR of isoforms. Further studies would be necessary to confirm whether they can be actually bound by p-UPF1 (Additional file 12: Fig. S12A).
SF3B1 is a well-known splicing factor that is mutated in several diseases and causes the increase of aberrant splicing isoforms [47, 48]. We assessed the effect of impairment of splicing by SF3B1 knockdown to investigate whether aberrations of splicing factors influence isoform generation (Additional file 12: Fig. S10B). We found a significant increase in the proportion of exon skipping, which was concordant with the previous report using an SF3B1 inhibitor  (Fig. 3f). For example, exons 3 and 6 of PSMD7 were altered in the knockdown condition (Fig. 3g). The expressions of exon-skipping isoforms of PSMD7 were significantly increased, and RefSeq types were decreased conversely in SF3B1-depleted A549 cells (Fig. 3h, Additional file 12: Fig. S10E and S10F). To confirm consensus sequence in the proximal regions of the splice sites, we collected exon-skipping isoforms detected only in the SF3B1 knockdown in A549 cells. For this analysis, we considered ± 10 bp regions surrounding regions of novel splicing junctions and skipped exons. Although the canonical splice consensus sequences of GT/AG were found in almost all the isoforms similar to the total RefSeq transcripts, no other distinct feature related to the SF3B1 knockdown event was detected (Additional file 12: Figs. S12C and D). In addition, the genes with exon-skipping isoforms showed significant enrichment in the translation and ubiquitin-proteasome pathways (Additional file 12: Fig. S13). Similar enrichment of ubiquitin-proteasome pathways has been previously reported in a SF3B1 knockdown experiment using myeloid cell lines . Interruption of these factors, therefore, may alter at least some of the aberrant splicing isoforms and may contribute to their accumulation in lung cancer cells.
K700E is one of the most common hotspot mutations located in the HEAT-repeat domain of SF3B1 [48, 50]. As shown in a previous study, the K700E hotspot mutation downregulates intron retention and upregulates alternative 3′ splice site events . Except for exon skipping, the most affected splicing was intron retention (Additional file 12: Fig. S12B). This result was as expected, since it is considered to be a gain-of-function variant  and our knockdown assay of SF3B1 should produce the opposite results. Alternative 3′ splice site events were not significantly affected, which was not always consistent with the previous results, suggesting other cellular contexts also play a role (Additional file 12: Fig. S12B).
Aberrant transcripts as potential templates for producing neoantigens
Accumulated aberrant splicing isoforms in tumors have been reported as possible sources of neoantigens [29,30,31]. To investigate whether the detected isoforms can be represented as neoantigens, we attempted to evaluate the potential antigenicity of the aberrant peptides encoded by these aberrant splicing isoforms. For this purpose, we estimated the binding affinity between peptides and HLA molecules using the standard method of NetMHC (Additional file 12: Fig. S14A). We conducted whole-genome sequencing (WGS) as well as sequence-based HLA typing for each cell line. Four-digit HLA types and the presence or absence of somatic mutations for each cell line was confidently determined (Additional file 12: Fig. S14B, Additional file 1: Table S1 and Additional file 4: Table S4). For the peptide side, based on the somatic mutations detected by the WGS, we deduced the altered peptide sequences of the isoforms by considering full-length transcript structures for all possible 9-mer peptides. Indeed, the aberrant splicing isoforms frequently and drastically altered the protein sequences by causing frameshift or early termination of translation (Additional file 12: Fig. S14C). Since these peptides were not represented in RefSeq or GENCODE databases, they may be peptides that have never been exposed to immune cells under normal conditions in healthy individuals. These neoantigens accounted for the greatest proportion of total potential neoantigens in most of the cell lines (Fig. 4a) Aberrant splicing isoforms and frameshift mutations contributed to the production of more of these novel peptides than missense or in-frame mutations (Fig. 4b). As expected, the number of neoantigens that were predicted as the “strong binder” by NetMHC was also greater both in splicing isoforms and frameshift mutations (Fig. 4c). In a comparison of the highest NetMHC scores of peptides from each isoform, the peptides that were derived from those aberrant isoforms showed higher score distribution than peptides of the missense and in-frame mutations that would be usually identified using the TMB detection approach (Fig. 4d).
To experimentally validate whether the aberrant isoforms were translated into proteins, we employed a shotgun proteomics approach using liquid chromatography coupled with tandem mass spectrometry (LC/MS/MS) for 11 NSCLC cell lines (A427, A549, H1650, H2228, II-18, PC-9, RERF-LC-Ad1, RERF-LC-Ad2, RERF-LC-KJ, RERF-LC-MS, and VMRC-LCD). For peptide identification using the Mascot search engine, we customized peptide sequence database based on MinION data for each cell line. As previously noted , short-read RNA sequencing showed a higher coverage of genes than LC/MS/MS proteomics due to its sequencing capacity (Fig. 4e and Additional file 12: Fig. S15). The number of peptides per gene detected by LC/MS/MS proteomics and TPM calculated by RNA sequencing data correlated (r = 0.52, Fig. 4e) and most of the genes detected by LC/MS/MS proteomics were also covered by RNA sequencing (Fig. 4f). We successfully detected 7 peptides translated from aberrant splicing isoform-specific regions (Table 1 and Additional file 12: Fig. S16). For example, the peptide derived from the isoform with alternative 5′ splicing in exon 3 of KRT7 existed in RERF-LC-Ad1 (Fig. 4g). This isoform was not found in the GENCODE database, but the isoform-specific junction was observed in ENST00000547613 (KRT7-204 in Additional file 12: Fig. S17) which was considered as processed transcript without an open reading frame. The expression was also confirmed in H1437, H2126, and II-18 by MinION (Table 1). Furthermore, this isoform had the potential to produce several neoantigens predicted by NetMHC (Additional file 5: Table S5). These results suggest that at least some aberrant splicing isoforms were truly translated into peptides and could play a role in producing neoantigens in cancer.
Aberrant splicing isoforms in lung cancer specimens
To examine whether aberrant splicing isoforms also exist in cancer cells in vivo, we next analyzed clinical lung cancer specimens (Additional file 6: Table S6). We applied the same analytical scheme used for the cell line analysis of clinical samples (Additional file 7: Table S7 and Additional file 8: S8). Using this method, we were again able to identify aberrant splicing isoforms in each patient (Fig. 5a). We selected the isoforms that were expressed at least twofold higher in the tumor samples than in the non-cancer counterparts (Fig. 5b) and identified 982 cancer-enriched splicing isoforms among all the clinical samples. Of these, 448 isoforms were not represented in either RefSeq or GENCODE (Additional file 12: Fig. S18A). There were no significant correlations between the number of aberrant isoforms and the TMB (Fig. 5c). As an example of the detected isoforms, novel alternative first exons in SMOC2 are represented in Fig. 5d and were expressed only in the tumor of case 3. Similar to the results from the cell line analysis, we could identify several unique combination patterns of independent splicing events, and these patterns accounted for 14.5% of these unannotated isoforms (Additional file 12: Fig. S18B). Potential NMD-targeted isoforms that remained in the cancer cells were also identified, which suggests that the NMD mechanism is disrupted in the corresponding cancers (Additional file 12: Fig. S18C). Notably, in cases 3 and 4, the largest number of potential NMD-targeted isoforms were identified to harbor either frameshift or nonsense mutations in the key NMD factors, UPF3B and SMG8 (Fig. 5a).
Similar to the method used for analysis of the cell line datasets, NetMHC analysis was conducted to identify peptides that could be potential neoantigens. For this, we identified somatic mutations and HLA types in each patient using the genomic sequencing data of the clinical samples (Additional file 12: Fig. S18D, Additional file 6: Table S6 and Additional file 9: Table S9). We detected 101–255 neoantigen candidates per case (Fig. 5e). We found that peptides derived from the splicing isoforms showed higher distribution scores compared to the peptides derived from missense mutations (Fig. 5f). Indeed, they accounted for the majority of total neoantigen candidate peptides in most samples (Fig. 5e). These results support the fact that aberrant splicing events in clinical samples can be detected by MinION and have great potential to produce more neoantigen candidate peptides than do missense mutations.
Evaluation of the potential neoantigens derived from aberrant splicing isoforms in clinical samples
To evaluate the antigenicity of peptides that were identified from the aberrant splicing isoforms and frameshift mutations, we immunized HLA-A24 transgenic mice with the candidate peptides according to the scheme shown in Fig. 6a. We chose 17 candidate peptides based on the NetMHC scores for HLA-A:24:02 (Table 2 and Additional file 12: Fig. S19). We confirmed that the peptide sequences showed no similarity with those of human or mouse protein databases by BLAST-P, but non-cancer-specific regions, which were similar to those of RefSeq transcripts, were represented in both human and mouse. One week after the last vaccination, we isolated splenocytes from mice and subjected the isolates to an enzyme-linked immune absorbent spot (ELISpot) assay. By doing so, we attempted to detect neoantigen-specific spleen lymphocyte responses. The ELISpot results showed that 8 out of 17 peptides induced significantly high IFN-γ production (n = 2) compared to the PBS and adjuvant-alone groups (Fig. 6b, c). These results demonstrate that peptides derived from splicing isoforms and frameshift mutations could activate the T cell response through interaction with HLAs.
Re-evaluation of the TCGA short-read sequencing data from the viewpoint of potential neoantigen candidates
Finally, to further utilize our aberrant isoform catalog based on MinION sequencing data, we re-evaluated the whole-exome sequencing and short-read RNA sequencing datasets of 436 lung adenocarcinoma (LUAD) and 46 matched normal samples registered in The Cancer Genome Atlas (TCGA). We mapped their RNA sequencing data to our isoform catalog. We then counted the reads which were aligned to the isoform-specific regions according to our catalog (Additional file 12: Fig. S20A, see detail in “Materials and methods”). To eliminate the isoforms in non-tumor lung tissues, we also constructed an isoform panel of normal samples. We selected isoforms that expressed at least two-fold higher in tumor samples compared with those in the panel of normal. In addition, we removed isoforms whose junctions were expressed in the lung tissues samples stored in the Genotype-Tissue Expression (GTEx) project database  and counted the number of isoforms for each specimen (see “Materials and methods” for additional details on the procedure). Among the novel splice junctions, which were not represented in either RefSeq or GENCODE, only 12.5% were represented in the lung specimens of GTEx, and 52% were novel junctions even in all tissues taken together in GTEx (Additional file 12: Fig. S20B). As a result, we found 13 specimens harboring damaging mutations on the NMD factors of frameshift, nonsense, and splice site mutations (Fig. 7a, upper right panel), and these cases showed significantly higher distributions of the number of isoform distributions compared to the others (Fig. 7a, upper left panel). We further expanded this analytical scheme to other cancer types, such as lung squamous cell carcinoma (LUSC), colon adenocarcinoma (COAD), and pancreatic adenocarcinoma (PAAD). We determined that some of the isoforms were commonly expressed (Fig. 7b–d). The number of isoforms did not correlate with the TMB or cancer stage in any types of cancer (Additional file 12: Figs. S20C and D). We also found that the damaging profile of NMD factors was not always consistent with the number of isoforms, depending on cancer subtypes (Fig. 7b–d). NMD efficiency differed depending on the PTC position in isoforms  and tissues . Moreover, the impact of NMD in individual cancer types has been partly indicated by recent studies [56, 57]. By expanding our analysis scheme to a pan-cancer approach, these aberrant splicing isoforms would facilitate a new field to search neoantigen targets and provide insights into NMD in a variety of cancer types.
For more detailed characterizations of each sample, we applied a gene set variation analysis (GSVA)  and calculated the correlation between the number of isoforms and the GSVA enrichment scores. Notably, we found positive correlations between the number of isoforms and the splicing-related signature scores in all cancer types (Fig. 7a–d, lower panel). These results suggest that the upregulation of splicing-related genes may contribute to the increase in the number of aberrant isoforms. Otherwise, there was no significant relationship between the number of isoforms and the specific immune cell signatures for each cancer type (Additional file 12: Figs. S20E) [59, 60]. Since the clinical specimens we analyzed did not have any immune status information, such as the proportion of tumor-infiltrating lymphocytes or the ICI response, further study is necessary to conclude the effects of these aberrant isoforms on immune system response to tumors.
In this study, we examined the full-length transcript structures of aberrant splicing isoforms by using long-read sequencing data. We constructed a catalog of such isoforms for the NSCLCs. To address the inaccurate sequencing of MinION, we used short-read sequencing data to identify the exact splicing junctions. This approach revealed a substantial number of novel aberrant transcripts and some of these isoforms were detected as peptides by the proteome analysis. Furthermore, an ELISpot assay demonstrated that at least some of these isoforms, having strong antigenicity, had the potential of becoming neoantigens. These results clearly show the capability of long-read sequencing to search for novel isoforms and neoantigens that may be overlooked by the current short-read sequencing approaches.
In the clinical setting, the TMB, which is determined by the total number of nonsynonymous mutations, is considered to be a predictor of the effectiveness of immunotherapy for cancer. However, the prediction accuracy is not sufficient for many types of cancer. The current study shows that aberrant splicing isoform and frameshift mutations have a significant potential for producing a larger number of neoantigens. Multiple previous reports suggest that this potential may have an influence on shaping the tumor immune landscape of the patients in addition to the genomic TMB [28,29,30]. When transcriptomes cannot be directly examined in a given clinical setting, the NMD status and other splicing factors may provide important information. Indeed, previous reports have indicated that the number of frameshift mutations that are predicted not to trigger NMD is higher in responders to immunotherapy and that factor combined with the TMB improves the accuracy of prediction for immunotherapy responders .
We mainly focused on NSCLC in this study. However, we were able to identify potential aberrant isoforms for other types of cancer by considering our full-length catalog. Obviously, when as more cancer types are added to the catalog, the sensitivity and selectivity of the detection will further increase. Also note that, since GTEx is based on short-read RNA sequencing datasets, the representation of the junctions in GTEx do not necessarily indicate the existence of the isoform in their full-length forms in normal tissues. To calculate and extract truly cancer-specific isoform patterns, we should generate a larger dataset of long-read isoforms for normal tissues in future study. Further in-depth analyses would elucidate the possibility that these transcriptomic features provide a complementary indicator for predicting the effectiveness of immunotherapy, in addition to genomic features.
n this study, we indicated that the long-read sequencing of full-length cDNAs in tumors is essential to precisely identify aberrant transcript structures that have been overlooked by short-read sequencing. The aberrant splicing isoforms showed a significant potential for producing a larger number of neoantigens in tumors. These novel transcriptomic features obtained from the long-read sequencing would be helpful for estimating the tumor immune landscapes, which could potentially improve the accuracy of prediction of responders to immunotherapy when used in combination with the current indicators that only use genomic mutations.
Material and methods
Twenty two lung adenocarcinoma cell lines (A427, A549, ABC-1, H1299, H1437, H1648, H1650, H1819, H2126, H2228, H2347, H322, II-18, PC-14, PC-3, PC-9, RERF-LC-Ad1, RERF-LC-Ad2, RERF-LC-MS, RERF-LC-OK, RERF-LC-KJ, and VMRC-LCD, Additional file 1: Table S1) were cultured as previously described . Raw sequence data were obtained from the DNA Data Bank of Japan (DDBJ) with accession numbers DRA001859 (whole-genome sequencing), DRA001846 (short-read RNA sequencing) , and DRA008295 (long-read RNA sequencing of RERF-LC-KJ, RERF-LC-MS, and PC-9, 62].
Clinical samples were obtained with the appropriate informed consent from the National Cancer Center Japan. Surgical specimens from seven patients were pathologically checked. All seven patients were previously diagnosed with primary lung cancer, including four cases of lung squamous cell carcinomas and three cases of lung adenocarcinomas. Fresh frozen surgical specimens were used to extract DNA and RNA as described below.
Full-length RNA sequencing using MinION
Total RNA was isolated and purified using an RNeasy Mini Kit (Qiagen). Library preparation for full-length transcriptome analysis using MinION (Oxford Nanopore Technologies) was performed as previously described . In brief, 1.5 μg of the full-length-cDNA library was applied for 1D2 sequencing according to the manufacturer’s protocol with some modifications using the 1D2 sequencing kit, SQK-LSK308, and R9.5 flow cell, FLO-MIN107 (Oxford Nanopore Technologies).
Conventional short-read RNA sequencing
For A549 used for knockdown assays, total RNA was isolated and purified using an RNeasy Mini Kit (Qiagen). Libraries were prepared using a TruSeq Stranded mRNA kit. Then RNA sequencing was conducted using the NovaSeq 6000 platform (Illumina) with 150 bp paired-end reads according to the manufacturer’s protocol.
For clinical samples, total RNA was extracted by an RNeasy Mini Kit and re-purified with an RNeasy MinElute Cleanup Kit (Qiagen) for some cases. Library preparation was performed using a TruSeq Stranded mRNA kit. Sequencing was performed using the HiSeq 2500 platform (Illumina) with 75 bp paired-end reads.
Conventional short-read DNA sequencing
Genomic DNA was extracted using a DNeasy Blood & Tissue kit (Qiagen). In total, 150 ng aliquots of genomic DNA from paired cancerous and non-cancerous samples were fragmented by a Covaris-E220 evolution instrument (Covaris) to provide DNA fragments with a base pair peak at 150 to 200 bp. The DNA fragments were end-repaired and ligated with paired-end adaptors (SureSelect XT Library Prep Kit, Agilent Technologies). The resulting DNA library was purified using an Agencourt AMPure XP Reagent (Beckman Coulter) and amplified by PCR (9 cycles). In total, 750 ng aliquots of the adaptor-ligated libraries were hybridized for 16–24 h at 65 °C with biotinylated oligo RNA bait, SureSelect Human All Exon V5 (Agilent Technologies). The hybridized genomic DNA was subjected to ten cycles of PCR re-amplification. Following the manufacturer’s standard protocols, the whole-exome DNA library was sequenced on the Illumina HiSeq 2500 platform (Illumina) with 75 bp paired-end reads.
Analysis of long-read sequencing data
Base calling of the FAST5 data from MinION was performed using Albacore 2.2.7 and Guppy v3.3.0, then was converted into FASTQ files. Only 1D reads for the following analyses . Reads were aligned to the reference human genome, GRCh38, using Minimap2 (v2.2.14) . To eliminate pseudogene mapping and low-quality reads, aligned reads that met any of the following five conditions were discarded: (1) secondary or supplementary aligned flag; (2) mapping identity, defined as the percentage of matched bases to the sum of matched bases, substitutions, insertions, and deletions, lower than 0.8; (3) unmapped length of reads within splice junctions longer than 10 bp; (4) exon length shorter than 25 bp; (5) overlapping the pseudogene region of GENCODE v27. After these filtering steps, introns shorter than 50 bp were corrected as exons and extracted reads were assigned to a single gene for further analysis. Seqtk (https://github.com/lh3/seqtk) was used for subsampling reads in the fastq file.
Analysis of short-read RNA sequencing
Raw reads were trimmed with quality < 20 and adaptor filtered using Trimmomatic (v0.32) . rRNA sequences were removed by Bowtie2 (v 22.214.171.124) . Cleaned up reads were aligned to the reference human genome, GRCh38, and transcripts per million (TPM) was calculated using STAR (v2.6.1d)  and RSEM (v1.3.1) .
Analysis of short-read DNA sequencing
Raw reads with a quality < 20 were trimmed and adaptor filtered using Trimmomatic (v0.32) then were aligned to reference human genome GRCh38 by BWA-mem (v0.7.17) . We preprocessed the BAM files including marking duplicates and base recalibration steps using GATK (v126.96.36.199) . For variant calling, we used HaplotypeCaller (GATK) for lung cancer cell line samples and filtered low confidence SNPs using the recommended hard filter settings: QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum − 12.5, ReadPosRankSum < − 8.0. The following filters were also applied to remove low confidence indels: QD < 2.0, FS > 200.0, ReadPosRankSum < − 20.0. Mutect2 and FilterMutectCalls were used for clinical samples with default parameters and tumor-normal mode. We filtered out variants with allele depth < 5 and annotated them with Variant Effect Predictor (VEP, v95) . To extract somatic variants from cell line sample data, we eliminated variants that were registered in dbSNP (v151) unless they were also present at least 5 samples in COSMIC (v90) from the dataset.
Computational analysis for detection of transcript isoforms from MinION data
We compared all junctions in MinION reads with the RefSeq transcripts (downloaded from the UCSC Genome Browser and Blat software (University of California Santa Cruz) in July 2017, allowing for a margin of 20 bp gaps. Then, we removed reads that were identical to RefSeq model transcripts (Type A) or were truncations of RefSeq model transcripts (Type B).
Type A satisfied three conditions: (1) It ignored the difference between the five-prime and the three-prime end; (2) it had the same number of exons; (3) the junctions from the read were the same as those of the RefSeq model transcripts.
Type B included the remaining reads that satisfied the following three conditions: (1) The putative truncated end of the read was located within the exon; (2) the number of exons was different; (3) the junctions from the read were the same as those of the RefSeq model transcripts.
Considering the inaccuracy of MinION reads, we compared the remaining reads to the short-read sequencing junction sets detected in at least five reads using STAR. If a junction in the read was not identical to that of an RefSeq transcript and was confirmed with short-read sequencing, we classified the read as an isoform. We compared all junctions in the MinION reads with those of the RefSeq transcripts.
After merging isoforms that contained the same junctions, the isoforms were filtered using the following condition: MinION read coverage ≥ total MinION reads/100,000. Then, we create a GTF file of the isoforms and RefSeq transcripts for each sample. Using the GTF file as a reference, we mapped the reads from the short-read sequencing and calculated the TPM. To remove low confidence isoforms, the following filters were applied: TPM ≥ 10, isoform read counts/total read counts assigned to the same gene > 10%. Finally, we classified isoforms by the following nine types: alternative 5′ splice site, alternative 3′ splice site, alternative first exon, alternative last exon, intron retention, exon shuffling, intron retention, unannotated exon, and the corresponding combination patterns (Fig. 1a).
Detection of transcript isoforms using TALON
After aligning the reads, TranscriptClean (v2.0.2)  was performed for error correction of the SAM file using the junction file that was obtained from the short-read RNA sequencing by STAR. TALON (v5.0.0)  was executed using the parameter setting --cov 0.5. We counted isoform patterns that harbored the same splice junctions ignoring strand.
siRNA knockdown in A549
Reverse siRNA transfections were performed using Lipofectamine RNAiMAX (Thermo Fisher Scientific) and siRNA sets (Additional file 10: Table S10). siRNA for UPF1 was kindly provided by Dr. R. Onoguchi-Mizutani and Dr. N. Akimitsu from The University of Tokyo . We applied 75 pmol of siRNA for UPF1 knockdown and mixed two different siRNAs (12.5 pmol each) for SF3B1. Lipofectamine RNAiMAX (7.5 μl) and each siRNA were diluted in 400 μl of OptiMEM, then incubated for 20 min at room temperature. We seeded 1.2 × 105 cells of A549 in antibiotic-free DMEM into each well of a 6-well plate. After incubation, we added 400 μl of the RNAiMAX and siRNA mixture to each well. After 24 h, we replenished the medium and re-added the RNAiMAX and siRNA mixture. Cells were collected 48 h after the last transfection.
RT-PCR assay and bioanalyzer
Total RNA was isolated using an RNeasy Mini Kit (Qiagen), and cDNA synthesis was performed using a SMART-Seq v4 Ultra Low Input RNA Kit for Sequencing (Takara Bio) as previously described . RT-PCR was performed using Power SYBR Green Master Mix (Thermo Fisher Scientific) and the LightCycler 96 system (Roche). The data were normalized to the expression of GAPDH or common exonic regions of each gene using the 2−ΔΔCt method. We quantified and compared each PCR product size using a bioanalyzer (Agilent Technologies). Primers were designed using Primer3 Plus  and are listed in Additional file 11: Table S11. The RT-PCR experiments were conducted in triplicate.
Identification of repetitive elements in splice sites
RepeatMasker (v4.1.0)  was performed to detect repetitive elements with the “-species human” option. We extracted ±50 bp regions around the splice sites and counted sequences with > 80% to repetitive elements.
Motif enrichment analysis
To find splice site consensus motifs, we collected exon-skipping isoforms in SF3B1- depleted A549 cells. We extracted ± 50 bp regions around the splice sites or skipping exons and generated sequence logos using WebLogo (v2.8.2) . Motif enrichment analysis was performed by using MEME Suite (v5.1.1)  with the maximum width of motifs set to ten. The sequences of 3′UTR in aberrant splicing isoforms in UPF1-depleted A549 cells were extracted as the input for the analysis. 3′UTR sequences of all RefSeq transcripts were used to generate the background model based on the hidden Markov model.
Computational analysis for the detection of antigenic peptides
All possible 9-mer peptides were computed considering the variants and isoform patterns. We filtered out peptides listed in GENCODE v31 or RefSeq to extract cancer-specific aberrant peptides. HLA class I loci (HLA-A, B, and C) were typed at four-digit resolution using OptiType (v1.3.1) . Binding affinities between peptides and HLA alleles were predicted using NetMHCpan4.0 , and strong binders were defined as having %rank < 0.5.
Materials for LC/MS/MS proteomics
Ammonium bicarbonate, sodium deoxycholate (SDC), sodium N-lauroylsarcosinate (SLS), tris (hydroxymethyl) aminomethane (Tris), dithiothreitol (DTT), iodoacetamide (IAA), lysyl endopeptidase (Lys-C), ethyl acetate, acetonitrile (ACN), acetic acid, and trifluoroacetic acid (TFA) were obtained from FUJIFILM Wako. Protease inhibitor cocktail and phosphatase inhibitor cocktail 2 and 3 were obtained from Sigma-Aldrich. Trypsin was obtained from Promega. Empore disks for StageTips were obtained from GL Sciences.
Sample preparation for LC/MS/MS proteomics
Protein digestion was performed according to the phase transfer surfactant (PTS) protocol . Briefly, cell lysates were lysed in a PTS buffer (12 mM SDC, 12 mM SLS, 1% protease inhibitor, and 1% phosphatase inhibitors 2 and 3 in 100 mM Tris-HCl, pH 9.0). Proteins were reduced with 10 mM DTT and alkylated with 50 mM IAA. After five-fold dilution with 50 mM ammonium bicarbonate, Lys-C and trypsin were added at a 1:100 (w/w) protease-to-protein ratio, followed by incubation overnight at 37 °C. Then an equal volume of ethyl acetate was added, and the solution was acidified with TFA. After removing the organic phase, the samples were dried by SpeedVac and reconstituted in 5% ACN and 0.1% TFA. The peptides were fractionated into eight fractions with SCX-StageTips and desalted using SDB-XC StageTips [76, 77].
The peptides were analyzed by Orbitrap Fusion Lumos mass spectrometry (Thermo Fisher Scientific) coupled with an UltiMate 3000 RSLCnano (Thermo Fisher Scientific) pump and an HTC-PAL autosampler (CTC Analytics). Self-pulled needle columns (150 mm length, 100 μm ID, 6 μm needle opening) packed with ReproSil-Pur C18-AQ (3 μm, Dr. Maisch) were used as analytical columns. The injection volume was 5 μl and the flow rate was 500 nl/min. The mobile phases consisted of 0.5% acetic acid (A) and 0.5% acetic acid in 80% ACN (B). The gradient program was as follows: 5–40% B (65 min), 40–99% B (5 min), 99% B (10 min), 99–5% B (0.1 min), and 5% B (29.9 min). Both MS1 and MS2 scans were obtained with Orbitrap. The MS1 survey scan was performed at 120,000 resolution in the scan range of m/z 300–1500 with an AGC target value of 4e5. The MS2 scan was performed at a resolution of 15,000 with an AGC target value of 5e4, and the scan cycle was 3 s. Dynamic exclusion was set to 30 s, and the normalized collision energy for HCD was 38%.
Identification and quantification of peptides and proteins
Peak lists were generated from the raw MS/MS spectra using MaxQuant ver.188.8.131.52 . Then the resulting mgf files were searched against our peptide sequence database based on MinION data using the Mascot search engine (Matrix Science) with a precursor mass tolerance of 5 ppm and a fragment ion mass tolerance of 20 ppm. Carbamidomethylation on cysteine was set as fixed modifications. Oxidation on methionine was set as a variable modification. Up to one missed cleavage was allowed for trypsin and Lys-C digestion. The results were filtered at a 1% peptide-level false discovery rate (FDR).
HLA-A24 transgenic (A24Tg) mice were kindly provided by Institute Pasteur  (Paris, France), and bred in Sankyo-Lab Service. Young adult (7- to 20-week-old) mice were maintained under specific pathogen-free conditions in our animal facility and were used for all experiments.
Peptide vaccination and ELISpot assays
A24Tg mice were immunized subcutaneously three times (at days 0, 7, and 14) with 50 μg peptides pooled from 5 predicted peptides, and with 8 μg poly-ICLC (Hiltonol, ONCOVIR) used as the adjuvant. One week after the last vaccination, the A24Tg mice were euthanized and splenocytes were harvested. To detect peptide-specific immune response, IFN-γ ELISpot assays were performed using a BD ELISPOT kit for Mouse IFN-γ (BD Bioscience) according to the manufacturer’s protocols. 2 × 106 splenocytes were incubated with 10 μg of each peptide for 20 h at 37 °C and 5% CO2. A peptide (QYDPVAALF) derived from cytomegalovirus pp65 was used as a positive control. All analyses were performed in triplicate for two mice. The spots were automatically counted by the Eliphoto system (Minerva Tech). The immune response was considered to be positive when the P value for replicates was < 0.05 and the ratio of spot counts for wells vs. the negative controls was > 5.
Computational analysis of GTEx data
The exon-exon junction read count matrix was downloaded from the GTEx Portal (https://gtexportal.org/home/) . We extracted junction sets detected from at least 20 reads for each sample, then collected junctions that were expressed as “normal junctions” in each tissue in more than 50% of specimens.
Computational analysis of TCGA data
BAM files for short-read RNA sequencing mapped by STAR and VCF files for somatic mutations detected by the MuTect2 algorithm in the TCGA were downloaded from the National Cancer Institute Genomic Data Commons data portal (https://portal.gdc.cancer.gov/) .
We built genome index files for STAR using the reference human genome and our constructed isoform catalog. BAM files were converted into FASTQ files and remapped using the index. We extracted isoforms covered by at least 20 reads that were mapped to isoform-specific exons or junctions. The panel of normal samples was constructed using the max RPM of each isoform expressed in at least five matched normal samples. Finally, we identified tumor-specific isoforms that expressed at least twofold higher than in the panel of normal samples and not represented in normal junction datasets of GTEx.
Statistical analysis was performed using Python3.6.5 and R3.5.0 software. For a nonparametric multiple comparison of the distributions, we used the nonparametric Kruskal–Wallis test followed by Dunn’s post hoc test adjusted with Bonferroni correction. Fisher’s exact tests were performed for gene ontology enrichment analysis and to compare the proportion of isoforms in knockdown experiments. Welch’s t-test was used to compare the number of isoforms with driver mutation. Student’s t test was used for ELISpot assay and the two-tailed Mann–Whitney U test was used for the TCGA analysis.
Gene ontology analysis
The Cytoscape Bingo plugin was used with gene ontology annotations . To remove redundancies, we took account of gene ontology terms with > 50 and < 1000 genes. The P value was adjusted by Benjamini and Hochberg FDR correction. The result of the enrichment analysis was visualized using a REVIGO treemap .
Gene set variation analysis
For single-sample gene set enrichment, we used the gene set variation analysis  (GSVA) program to derive the absolute enrichment scores for gene signatures using the following datasets: (1) the C5 gene ontology biological process subset of the Molecular Signature Database version 7.2 , and (2) gene sets representing immune cell populations reported in several previous publications [59, 60]. We applied log2TPM values calculated using RSEM for GSVA with the options “min.size = 10” and “max.size = 1000.” Spearman’s rank correlation coefficient was calculated for the number of isoforms and GSVA scores.
Availability of data and materials
The sequencing data of cell lines have been published in the DNA Data Bank of Japan (DDBJ)  under accession numbers DRA010214  and DRA010215 . The sequencing data for clinical samples have been deposited to the Japanese Genotype-phenotype Archive (JGA) , which is hosted by the National Bioscience Database Center (NBDC) and DDBJ, under accession number JGAS00000000245 . The proteomics data have been deposited to the ProteomeXchange Consortium  via the jPOST partner repository [91, 92] with the dataset identifier PXD019915 .
Suzuki A, Makinoshima H, Wakaguri H, Esumi H, Sugano S, Kohno T, et al. Aberrant transcriptional regulations in cancers: genome, transcriptome and epigenome analysis of lung adenocarcinoma cell lines. Nucleic Acids Res. 2014;42:13557–72.
Nicholson P, Yepiskoposyan H, Metze S, Orozco RZ, Nicole Kleinschmidt OM. Nonsense-mediated mRNA decay in human cells. Cell Mol Life Sci. 2010;67:677–700.
Lu JW, Plank TD, Su F, Shi XJ, Liu C, Ji Y, et al. The nonsense-mediated RNA decay pathway is disrupted in inflammatory myofibroblastic tumors. J Clin Invest. 2016;126:3058–62.
Liu C, Karam R, Zhou Y, Su F, Ji Y, Li G, et al. The UPF1 RNA surveillance gene is commonly mutated in pancreatic adenosquamous carcinoma. Nat Med. 2014;20:596–8.
Popp, Maximilian W. LEM. Nonsense-mediated mRNA decay and cancer. Curr Opin Genet Dev. 2018;48:44–50.
Cowen LE, Tang Y. Identification of nonsense-mediated mRNA decay pathway as a critical regulator of p53 isoform β. Sci Rep. 2017;7:1–9.
Karam R, Carvalho J, Bruno I, Graziadio C, Senz J, Huntsman D, et al. The NMD mRNA surveillance pathway downregulates aberrant E-cadherin transcripts in gastric cancer cells and in CDH1 mutation carriers. Oncogene. 2008;27:4255–60.
Pastor F, Kolonias D, Giangrande PH, Gilboa E. Induction of tumour immunity by targeted inhibition of nonsense-mediated mRNA decay. Nature. 2010;465:227–30.
Bokhari A, Jonchere V, Lagrange A, Bertrand R, Svrcek M, Marisa L, et al. Targeting nonsense-mediated mRNA decay in colorectal cancers with microsatellite instability. Oncogenesis. 2018;7:1–9.
Seiler M, Peng S, Agrawal AA, Palacino J, Teng T, Zhu P, et al. Somatic mutational landscape of splicing factor genes and their functional consequences across 33 cancer types. Cell Rep. 2018;23:282–96. e4.
Graubert TA, Shen D, Ding L, Okeyo-Owuor T, Lunn CL, Shao J, et al. Recurrent mutations in the U2AF1 splicing factor in myelodysplastic syndromes. Nat Genet. 2012;44:53–7.
Yoshida K, Sanada M, Shiraishi Y, Nowak D, Nagata Y, Yamamoto R, et al. Frequent pathway mutations of splicing machinery in myelodysplasia. Nature. 2011;478:64–9.
The Cancer Genome Atlas Research Network. Comprehensive molecular profiling of lung adenocarcinoma. Nature. 2014;511:543–50.
Furney SJ, Pedersen M, Gentien D, Dumont AG, Rapinat A, Desjardins L, et al. SF3B1 mutations are associated with alternative splicing in uveal melanoma. Cancer Discov. 2013;3:1122–9.
Hsu TYT, Simon LM, Neill NJ, Marcotte R, Sayad A, Bland CS, et al. The spliceosome is a therapeutic vulnerability in MYC-driven cancer. Nature. 2015;525:384–8.
El Marabti E, Younis I. The cancer spliceome: reprograming of alternative splicing in cancer. Front Mol Biosci. 2018;5:1–11.
Lee SCW, Abdel-Wahab O. Therapeutic targeting of splicing in cancer. Nat Med. 2016;22:976–86.
Seiler M, Yoshimi A, Darman R, Chan B, Keaney G, Thomas M, et al. H3B-8800, an orally available small-molecule splicing modulator, induces lethality in spliceosome-mutant cancers. Nat Med. 2018;24:497–504.
Yarchoan M, Hopkins A, Jaffee EM. Tumor mutational burden and response rate to PD-1 inhibition. N Engl J Med. 2017;377:2500–1.
Büttner R, Longshore JW, López-Ríos F, Merkelbach-Bruse S, Normanno N, Rouleau E, et al. Implementing TMB measurement in clinical practice: considerations on assay requirements. ESMO Open. 2019;4:1–12.
Chan TA, Yarchoan M, Jaffee E, Swanton C, Quezada SA, Stenzinger A, et al. Development of tumor mutation burden as an immunotherapy biomarker: utility for the oncology clinic. Ann Oncol. 2019;30:44–56.
Matsushita H, Vesely MD, Koboldt DC, Rickert CG, Uppaluri R, Magrini VJ, et al. Cancer exome analysis reveals a T-cell-dependent mechanism of cancer immunoediting. Nature. 2012;482:400–4.
Cohen CJ, Rosenberg SA, Robbins PF, Cohen CJ, Gartner JJ, Horovitz-fried M, et al. Isolation of neoantigen-specific T cells from tumor and peripheral lymphocytes. J Clin Invest. 2015;125:3981–91.
Van Allen EM, Miao D, Schilling B, Shukla SA, Blank C, Zimmer L, et al. Genomic correlates of response to CTLA-4 blockade in metastatic melanoma. Science. 2015;350:207–12.
McDermott DF, Huseni MA, Atkins MB, Motzer RJ, Rini BI, Escudier B, et al. Clinical activity and molecular correlates of response to atezolizumab alone or in combination with bevacizumab versus sunitinib in renal cell carcinoma. Nat Med. 2018;24:749–57.
Miao D, Margolis CA, Gao W, Voss MH, Li W, Martini DJ, et al. Genomic correlates of response to immune checkpoint therapies in clear cell renal cell carcinoma. Science. 2018;359:801–6.
M.D. Hellmann, L. Paz-Ares, R. Bernabe Caro, B. Zurawski, Kim SW, Carcereny Costa, K, Park, A, Alexandru, L. Lupinacci, E. de la Mora Jimenez, H. Sakai, I. Albert, A. Vergnenegre, S. Peters, K. Syrigos, F. Barlesi, M. Reck, H. Borghaei, J.R. Brahmer, K.J. O’Byrne, W.J. Geese, P. Bhagavatheeswaran, S.K. Rabindran, R. S and SSR. Nivolumab plus ipilimumab in advanced non–small-cell lung cancer. N Engl J Med. 2019;381:2020–2031.
Turajlic S, Litchfield K, Xu H, Rosenthal R, McGranahan N, Reading JL, et al. Insertion-and-deletion-derived tumour-specific neoantigens and the immunogenic phenotype: a pan-cancer analysis. Lancet Oncol. 2017;18:1009–21.
Smart AC, Margolis CA, Pimentel H, He MX, Miao D, Adeegbe D, et al. Intron retention is a source of neoepitopes in cancer. Nat Biotechnol. 2018;36:1056–8.
Kahles A, Lehmann K Van, Toussaint NC, Hüser M, Stark SG, Sachsenberg T, et al. Comprehensive analysis of alternative splicing across tumors from 8,705 patients. Cancer Cell; 2018;34:211–224.e6.
Shen L, Zhang J, Lee H, Batista MT, Johnston SA. RNA transcription and splicing errors as a source of cancer frameshift neoantigens for vaccines. Sci Rep. 2019;9:1–13.
Seki M, Katsumata E, Suzuki A, Sereewattanawoot S, Sakamoto Y, Mizushima-Sugano J, et al. Evaluation and application of RNA-Seq by MinION. DNA Res. 2018;26:55–65.
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100.
McGlincy NJ, Tan LY, Paul N, Zavolan M, Lilley KS, Smith CWJ. Expression proteomics of UPF1 knockdown in HeLa cells reveals autoregulation of hnRNP A2/B1 mediated by alternative splicing resulting in nonsense-mediated mRNA decay. BMC Genomics. 2010;11:565.
Wyman D, Balderrama-gutierrez G, Reese F, Jiang S, Rahmanian S, Zeng W, et al. A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. bioRxiv. 2020;672931.
Wyman D, Mortazavi A. TranscriptClean : variant-aware correction of indels , mismatches and splice junctions in long-read transcripts. Bioinformatics. 2019;35:340–2.
Mano H. Non-solid oncogenes in solid tumors: EML4–ALK fusion genes in lung cancer. Cancer Sci. 2008;99:2349–55.
Yang W, Lee KW, Srivastava RM, Kuo F, Krishna C, Chowell D, et al. Immunogenic neoantigens derived from gene fusions stimulate T cell responses. Nat Med. 2019;25:767–75.
Suzuki A, Suzuki M, Mizushima-Sugano J, Frith MC, Makałowski W, Kohno T, et al. Sequencing and phasing cancer mutations in lung cancers using a long-read portable sequencer. DNA Res. 2017;24:585–96.
Cui Y, Irudayaraj J. Inside single cells: quantitative analysis with advanced optics and nanomaterials. Wiley Interdiscip Rev Nanomed Nanobiotechnol. 2015;7:387–407.
Pellagatti A, Armstrong RN, Steeples V, Sharma E, Repapi E, Singh S, et al. Impact of spliceosome mutations on RNA splicing in myelodysplasia: Dysregulated genes/pathways and clinical associations. Blood. 2018;132:1225–40.
Cuccurese M, Russo G, Russo A, Pietropaolo C. Alternative splicing and nonsense-mediated mRNA decay regulate mammalian ribosomal gene expression. Nucleic Acids Res. 2005;33:5965–77.
Ni JZ, Grate L, Donohue JP, Preston C, Nobida N, O’Brien G, et al. Ultraconserved elements are associated with homeostatic control of splicing regulators by alternative splicing and nonsense-mediated decay. Genes Dev. 2007;21:708–18.
Yoshihama M, Uechi T, Asakawa S, Kawasaki K, Kato S, Higa S, et al. The human ribosomal protein genes: sequencing and comparative analysis of 73 genes. Genome Res. 2002;12:379–90.
Imamachi N, Salam KA, Suzuki Y, Akimitsu N. A GC-rich sequence feature in the 3′ UTR directs UPF1-dependent mRNA decay in mammalian cells. Genome Res. 2017;27:407–18.
Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. MEME suite: tools for motif discovery and searching. Nucleic Acids Res. 2009;37:202–8.
Wu G, Fan L, Edmonson MN, Shaw T, Boggs K, Easton J, et al. Inhibition of SF3B1 by molecules targeting the spliceosome results in massive aberrant exon skipping. RNA. 2018;24:1056–66.
Darman RB, Seiler M, Agrawal AA, Lim KH, Peng S, Aird D, et al. Cancer-associated SF3B1 hotspot mutations induce cryptic 3′ splice site selection through use of a different branch point. Cell Rep. 2015;13:1033–45.
Dolatshad H, Pellagatti A, Fernandez-Mercado M, Yip BH, Malcovati L, Attwood M, et al. Disruption of SF3B1 results in deregulated expression and splicing of key genes and pathways in myelodysplastic syndrome hematopoietic stem and progenitor cells. Leukemia. 2015;29:1092–103.
Papaemmanuil E, Cazzola M, Boultwood J, Malcovati L, Vyas P, Bowen D, et al. Somatic SF3B1 mutation in myelodysplasia with ring sideroblasts. N Engl J Med. 2011;365:1384–95.
Tang AD, Soulette CM, van Baren MJ, Hart K, Hrabeta-Robinson E, Wu CJ, et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat Commun. 2020;11:1–12.
Wang D, Eraslan B, Wieland T, Hallström B, Hopf T, Zolg DP, et al. A deep proteome and transcriptome abundance atlas of 29 healthy human tissues. Mol Syst Biol. 2019;15:1–16.
Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, et al. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013;45:580–5.
Lindeboom RGH, Supek F, Lehner B. The rules and impact of nonsense-mediated mRNA decay in human cancers. Nat Genet. 2016;48:1112–8.
Zetoune AB, Fontanière S, Magnin D, Anczuków O, Buisson M, Zhang CX, et al. Comparison of nonsense-mediated mRNA decay efficiency in various murine tissues. BMC Genet. 2008;9:1–11.
Zhao B, Pritchard JR. Evolution of the nonsense-mediated decay pathway is associated with decreased cytolytic immune infiltration. PLoS Comput Biol. 2019;15:1–17.
Wu CC, Beird HC, Andrew Livingston J, Advani S, Mitra A, Cao S, et al. Immuno-genomic landscape of osteosarcoma. Nat Commun. 2020;11:1008.
Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics. 2013;14;1–15. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-7.
Bindea G, Mlecnik B, Tosolini M, Kirilovsky A, Waldner M, Obenauf AC, et al. Spatiotemporal dynamics of intratumoral immune cells reveal the immune landscape in human cancer. Immunity. 2013;39:782–95.
Sabarinathan R, Piulats JM, Muntasell A. A pan-cancer landscape of interactions between solid tumors and infiltrating immune cell populations. Clin Cancer Res. 2018;24:3717–29.
Lindeboom RGH, Vermeulen M, Lehner B, Supek F. The impact of nonsense-mediated mRNA decay on genetic disease, gene editing and cancer immunotherapy. Nat Genet. 2019;51:1645–51.
Sakamoto Y, Xu L, Seki M, Yokoyama TT, Kasahara M, Kashima Y, et al. Long-read sequencing for non-small-cell lung cancer genomes. Genome Res. 2020;30:1243–57.
Bolger AM, Lohse M, Usadel B. Genome analysis Trimmomatic : a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20.
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–60.
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21.
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics. 2011;12:41–74.
Li H, Durbin R. Fast and accurate long-read alignment with Burrows – Wheeler transform. Bioinformatics. 2010;26:589–95.
Mckenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit : a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.
Mclaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:1–14.
Untergasser A, Nijveen H, Rao X, Bisseling T, Geurts R, Leunissen JAM. Primer3Plus, an enhanced web interface to Primer3. Nucleic Acids Res. 2007;35:W71–4.
Smit, AFA, Hubley, R, Green P. RepeatMasker Open-4.0. Available from: http://www.repeatmasker.org. Accessed 6 Oct 2020.
Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14:1188–90.
Schubert B, Mohr C, Sturm M, Feldhahn M, Kohlbacher O. Sequence analysis OptiType : precision HLA typing from next-generation sequencing data. Bioinformatics. 2014;30:3310–6.
Jurtz V, Paul S, Andreatta M, Marcatili P, Peters B, Nielsen M. NetMHCpan-4.0: improved peptide–MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. J Immunol. 2017;199:3360–8.
Masuda T, Tomita M, Ishihama Y. Phase transfer surfactant-aided trypsin digestion for membrane proteome analysis. J Proteome Res. 2008;7;731–40. https://pubs.acs.org/doi/abs/10.1021/pr700658q.
Rappsilber J, Mann M, Ishihama Y. Protocol for micro-purification, enrichment, pre-fractionation and storage of peptides for proteomics using StageTips. Nat Protoc. 2007;2:1896–906.
Rappsilber J, Ishihama Y, Mann M. Stop and go extraction tips for matrix-assisted laser desorption/ionization, nanoelectrospray, and LC/MS sample pretreatment in proteomics. Anal Chem. 2003;75:663–70.
Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol. 2008;26:1367–72.
Pascolo S, Bervas N, Ure JM, Smith AG, Lemonnier FA, Pérarnau B. HLA-A2.1–restricted education and cytolytic activity of CD8. J Exp Med. 1997;185:2043–51.
Grossman RL, Heath AP, Ferretti V, Varmus HE, Lowy DR, Kibbe WA, et al. Toward a shared vision for cancer genomic data. N Engl J Med. 2016;375:1109–12.
Mayakonda A, Lin D, Assenov Y, Plass C, Koeffler HP. Maftools : efficient and comprehensive analysis of somatic variants in cancer. Genome Res. 2018;28:1747–56.
Maere S, Heymans K, Kuiper M. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics. 2005;21:3448–9.
Tomislav S. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS One. 2011;6:e21800.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL. Gene set enrichment analysis : a knowledge-based approach for interpreting genome-wide. PNAS. 2005;102:15545–50.
Kodama Y, Shumway M, Leinonen R. The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 2012;40:D54–6.
Suzuki Y. Aberrant transcript isoforms detected by full-length transcriptome sequencing as transcripts of potential neoantigens in non-small cell lung cancer. Datasets. DNA Data Bank of Japan. http://trace.ddbj.nig.ac.jp/DRASearch/submission?acc=DRA010214 (2020).
Suzuki Y. Aberrant transcript isoforms detected by full-length transcriptome sequencing as transcripts of potential neoantigens in non-small cell lung cancer. Datasets. DNA Data Bank of Japan. http://trace.ddbj.nig.ac.jp/DRASearch/submission?acc=DRA010215 (2020).
Kodama Y, Mashima J, Kosuge T, Katayama T, Fujisawa T, Kaminuma E, et al. The DDBJ Japanese genotype-phenotype archive for genetic and phenotypic human data. Nucleic Acids Res. 2015;43:D18–22.
Nakatsura T. Identifying aberrant splicing isoforms and potential neoantigens in non-small cell lung cancer. Datasets. Japanese Genotype-phenotype Archive. https://humandbs.biosciencedbc.jp/hum0236-v1 2020.
Deutsch EW, Csordas A, Sun Z, Jarnuczak A, Perez-Riverol Y, Ternent T, et al. The ProteomeXchange consortium in 2017: supporting the cultural change in proteomics public data deposition. Nucleic Acids Res. 2017;45:D1100–6.
Okuda S, Watanabe Y, Moriya Y, Kawano S, Yamamoto T, Matsumoto M, et al. jPOSTrepo : an international standard data repository for proteomes. Nucleic Acids Res. 2017;45:1107–11.
Moriya Y, Kawano S, Okuda S, Watanabe Y, Matsumoto M, Takami T, et al. The jpost environment: an integrated proteomics data repository and database. Nucleic Acids Res. 2019;47:D1218–24.
Suzuki Y Aberrant splicing isoforms detected by full-length transcriptome sequencing as transcripts of potential neoantigens in non-small cell lung cancer. Datasets. Japan Proteome Standard Repository/Database. https://repository.jpostdb.org/entry/JPST000874 2020.
We thank K. Imamura, K. Abe, H. Wakaguri, Y. Kuze, T. Horiuchi, and E. Ishikawa for their technical assistance; Dr. T. Yoshida, Dr. K. Aoki, Dr. H. Ichikawa, and National Cancer Center Research Institute Core Facility for sequencing the clinical samples in this study; Dr. R. Onoguchi-Mizutani and Dr. N. Akimitsu for generously providing the siRNA for UPF1. The results shown in this study were based in part on data generated by the TCGA Research Network (https://www.cancer.gov/tcga) and GTEx Portal (https://gtexportal.org/home/). The super-computing resource was provided by the Human Genome Center, the University of Tokyo (http://sc.hgc.jp/shirokane.html).
The review history is available as Additional file 15.
Peer review information
Anahita Bishop was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
This work was supported by JSPS KAKENHI Grant Numbers 16H06279 (PAGS), 17H06306, 19K16108, and 19K16792. This work was also supported in part by National Cancer Center Research and Development Fund (29-A-2 and 29-A-6).
Ethics approval and consent to participate
Patient informed consent and ethics approval for this study was obtained from the National Cancer Center Japan (IRB# 2014-165) and The University of Tokyo (IRB# 16-217). All animal experiments were approved by the Animal Care and Use Committee (Permit No. K18-M007) and were performed under institutional guidelines set by the Animal Research Committee of the National Cancer Center Hospital East. All experiments complied with the Helsinki Declaration of “Ethical Principles for Medical Research Involving Human Subjects.”
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
General information on NSCLC cell lines.
General statistics of nanopore sequencing in NSCLC cell lines.
The number of aberrant splice isoforms in NSCLC cell lines.
The number of somatic mutations in NSCLC cell lines.
The neoantigen candidates derived from isoforms detected by proteome analysis.
Clinical information on clinical samples.
General statistics of nanopore sequencing in clinical samples.
The number of aberrant splice isoforms in clinical samples.
The number of somatic mutations in clinical samples.
siRNA used for knockdown experiments in A549.
Primers used for RT-PCR analysis.
Aberrant splicing isoform dataset of cell lines (gtf). The full-length structures of isoforms detected in our dataset of cell lines.
Aberrant splicing isoform dataset of clinical specimens (gtf). The full-length structures of isoforms detected in our dataset of clinical specimens.
About this article
Cite this article
Oka, M., Xu, L., Suzuki, T. et al. Aberrant splicing isoforms detected by full-length transcriptome sequencing as transcripts of potential neoantigens in non-small cell lung cancer. Genome Biol 22, 9 (2021). https://doi.org/10.1186/s13059-020-02240-8