Pairagon+N-SCAN_EST: a model-based gene annotation pipeline
© Arumugam et al.; licensee BioMed Central Ltd. 2006
Published: 7 August 2006
This paper describes Pairagon+N-SCAN_EST, a gene annotation pipeline that uses only native alignments. For each expressed sequence it chooses the best genomic alignment. Systems like ENSEMBL and ExoGean rely on trans alignments, in which expressed sequences are aligned to the genomic loci of putative homologs. Trans alignments contain a high proportion of mismatches, gaps, and/or apparently unspliceable introns, compared to alignments of cDNA sequences to their native loci. The Pairagon+N-SCAN_EST pipeline's first stage is Pairagon, a cDNA-to-genome alignment program based on a PairHMM probability model. This model relies on prior knowledge, such as the fact that introns must begin with GT, GC, or AT and end with AG or AC. It produces very precise alignments of high quality cDNA sequences. In the genomic regions between Pairagon's cDNA alignments, the pipeline combines EST alignments with de novo gene prediction by using N-SCAN_EST. N-SCAN_EST is based on a generalized HMM probability model augmented with a phylogenetic conservation model and EST alignments. It can predict complete transcripts by extending or merging EST alignments, but it can also predict genes in regions without EST alignments. Because they are based on probability models, both Pairagon and N-SCAN_EST can be trained automatically for new genomes and data sets.
On the ENCODE regions of the human genome, Pairagon+N-SCAN_EST was as accurate as any other system tested in the EGASP assessment, including ENSEMBL and ExoGean.
With sufficient mRNA/EST evidence, genome annotation without trans alignments can compete successfully with systems like ENSEMBL and ExoGean, which use trans alignments.
There are three fundamental approaches to automated construction of exon-intron structure for protein-coding genes: native alignment - alignment of expressed sequences (including high quality cDNA sequences, expressed sequence tags (ESTs), and protein sequences) to the loci from which they were transcribed; trans alignment - non-native alignment of expressed sequences to loci that could potentially express similar sequences (can be within or between species); and de novo - prediction using the sequences of one or more genomes as the only inputs (no expressed sequences).
Native alignments of full insert, high quality cDNA sequences are the unquestioned gold standard in high-throughput annotation. However, even a concerted, high-budget effort to sequence cDNA libraries produces a full-open reading frame (ORF) sequence for only about 50% to 60% of loci in a mammalian genome . Thus, trans alignments have played a key role in producing the most trusted genome predictions, including the ENSEMBL predictions (sometimes termed 'evidence based') that have been used in the first published analyses of many new genome sequences. Nonetheless, the evidence they provide for expression is circumstantial rather than direct - for example, the annotated genomic locus may represent a pseudogene derived from the true genomic source of the expressed sequence. Even when a trans alignment identifies a functional homologous gene locus, the alignments tend to be inaccurate in their details unless the expressed sequence is highly similar to the genomic sequence [2, 3].
De novo predictions have always been viewed with some suspicion. This suspicion derives in part from the tendency of gene predictors developed in the 1990s to predict far too many false positive genes and exons. It may also result, in part, from the fact that one cannot point to the evidence supporting de novo predictions - a large ensemble of individually weak statistical patterns - the way one can point to a single expressed sequence. Nonetheless, statistical evidence is biological evidence, with a track record extending back to Gregor Mendel.
If de novo prediction were indeed inaccurate, relying heavily on trans alignments would make sense when analyzing a genome for which few EST or cDNA sequences are available. However, the rapidly increasing accuracy of de novo prediction and the large number of very high quality cDNA sequences available for human suggest the possibility that high quality annotations might be produced without using trans alignments. A system that does not use trans alignments might be more accurate than one that does, since all alignments would have near 100% identity. Even if its accuracy were merely equal to that of a system using trans alignments, the evidence supporting each prediction might be considered more direct.
To build an annotation pipeline without trans alignment, we combined a number of tools that have been recently developed in our lab. These tools include Pairagon, a cDNA-to-genome aligner, N-SCAN_EST , a multi-genome gene predictor capable of taking guidance from EST alignments, and PPFINDER , a program for eliminating pseudogenes from sets of predicted protein-coding genes.
Pairagon uses a PairHMM to produce native cDNA alignments
Our strategy was to use alignments of expressed sequences directly only when very high quality sequences were available. Thus, we applied Pairagon only to full ORF Mammalian Gene Collection (MGC) sequences [1, 9, 10] and human RefSeq mRNAs .
N-SCAN_EST threads complete gene structures through EST alignments
In the genomic regions between Pairagon's cDNA alignments, we combined EST alignments with de novo gene prediction by using N-SCAN_EST . N-SCAN_EST is based on N-SCAN [12, 13], a multi-genome de novo gene predictor, which was the most accurate de novo predictor in the EGASP assessment  by every measure except nucleotide sensitivity. (De novo includes both the 'ab initio' and 'multi-genome' assessment categories.) N-SCAN_EST is a version of N-SCAN that takes guidance from EST alignments. Specifically, it takes as input a representation of EST alignments that we call ESTseq, by analogy to the 'conservation sequence' used in TWINSCAN (a three-character alphabet representing genome sequence conservation between two species) [15, 16]. N-SCAN_EST takes guidance from EST alignments, but it does not follow them blindly. Instead, it also considers the DNA sequence of the target genome and the evolutionary conservation information provided by alignments of the target genome with the genomes of other organisms. It predicts complete transcripts by extending or merging EST alignments or by building gene structures in which some exon regions are supported by EST evidence while others are not. We have shown elsewhere that this approach increases sensitivity and specificity not only for the genes that have EST support, but even for those that do not .
Pairagon+N-SCAN_EST annotates genomes without using transalignment
To apply N-SCAN_EST, we downloaded human ESTs from dbEST and aligned them to the human genome using BLAT  (Figure 2, right side). We also downloaded alignments of the human, mouse, rat, and chicken genomes produced by MULTIZ  from the University of California Sant Cruz (UCSC) genome browser. These EST and genomic alignments were input to N-SCAN_EST. N-SCAN_EST was run on human genomic sequence that had been masked with PPFINDER, our processed pseudogene masker . After the final round of N-SCAN_EST, all predicted transcripts that overlapped Pairagon alignments were removed. The remaining transcripts (one per locus) were combined with the Pairagon alignments to produce the final gene set.
In the remainder of this paper we present accuracy statistics for both the EGASP version of the pipeline and an updated version and analyze the relative contributions of Pairagon versus N-SCAN_EST. We then examine a series of examples where our pipeline gave a revealing result, whether correct or incorrect. Finally, we draw some lessons about how the pipeline could be improved in the future.
Results and discussion
RefSeq and MGC cDNA sequences mapped to the ENCODE regions were downloaded from the UCSC Genome Browser and alignments were generated using the Stepping Stone implementation of Pairagon v0.5 as described in Materials and methods. GenBank's coding sequence (CDS) annotations of these cDNA sequences were used to produce 451 aligned transcripts annotated with GenBank ORFs (141 from MGC sequences and 310 from RefSeq sequences). Merging identical gene structures and removing inconsistent structures (for example, gap in the coding region leading to a frame shift in the genome) yielded 413 unique gene structures. N-SCAN_EST predictions were generated as described in Materials and methods. The 94 N-SCAN_EST predictions that did not overlap the 413 Pairagon gene structures were added to the gene set. We obtained seven gene structures by aligning sequences from our RT-PCR experiments. Two of these did not overlap the existing set and were included in our submission to the 'any evidence' category. We do not discuss this set in detail because it is almost identical to the submission to the 'mRNA/EST evidence' category. The accuracy statistics for this set can be found in the EGASP assessment report .
The official assessment of Pairagon+N-SCAN_EST shows high accuracy
Prediction accuracy measures of mRNA/EST evidence based gene prediction methods
Pairagon + N-SCAN_EST
Pairagon's cDNA alignments are highly accurate
Individual prediction accuracies of Pairagon alignments and N-SCAN_EST predictions in the submission
Identifying the correct splice boundaries is the crucial step in cDNA-to-genome alignment, and here Pairagon proves to be extremely accurate. Out of the 1,834 introns Pairagon predicted (both within and outside coding regions), only 22 introns from 15 transcript structures were not supported by HAVANA annotation. Three of them (from a single transcript) matched the introns of a Gencode gene labeled 'putative' and eight of them were a result of using incorrect seed exons from BLASTN (discussed in detail below). The remaining 11 were from Refseq cDNAs that have no evidence in HAVANA annotation. Two of the eleven aligned to the reference genome with numerous mismatches.
Pairagon's accuracy has improved since the official evaluation
Prediction accuracies of improved Pairagon alignments and Pairagon+N-SCAN_EST gene structures
Pairagon v0.0.95 + N-SCAN_EST
A lack of biological evidence raises questions about ORF annotation
Identifying the coding region in (even) a full-length mRNA is an extremely difficult problem. NCBI and HAVANA do not always agree in their CDS annotations of mRNA sequences, even if they agree on the exon-intron structures. Because we relied on the CDS annotations from NCBI, a few of our gene predictions are incorrect according to HAVANA, although the underlying alignment is correct. For example, GenBank's annotated translation start sites for cDNA sequences BC001940 and NM_001004759.1 are 798 bases downstream and 81 bases upstream of HAVANA's annotated translation start sites in Gencode genes AC005538.1-001 and AC011711.3-001, respectively. A few more of our ORF predictions obtained from correct alignments are labeled incorrect because HAVANA has not made any CDS annotations on the exon-intron structures yet. For example, the exon-intron structure of our gene NM_181879.1 from aligning a reviewed RefSeq mRNA NM_181879.1 matches that of Gencode reference gene AC008984.1-003, which does not have a CDS annotation. Since the biological evidence supporting the GenBank ORF annotations, if any, is not available for evaluation, we might do better by using a modified version of N-SCAN to predict ORFs on aligned cDNA sequences.
N-SCAN_EST performs well on complete GENCODE test regions
After the release of the HAVANA annotations, we found that N-SCAN_EST predictions used to fill the gaps between Pairagon alignments had a very high proportion of incorrect genes - the gene/transcript specificity of the original N-SCAN_EST predictions was 8.5% in regions that did not overlap Pairagon alignments (gene and transcript specificity are the same for programs that predict only one transcript per locus). However, this is due largely to the fact that there are high quality cDNA sequences covering most of the real genes in the ENCODE regions. When these are not used and N-SCAN_EST's predictions on the complete GENCODE test regions are evaluated, their specificity is 38.7% (Table 2).
Prediction accuracy measures of multiple-genome based gene prediction methods
The results of this exercise have demonstrated two things. First, this careful community assessment has been very valuable, particularly for the way in which it uncovered weaknesses in, and inspired improvements to, Pairagon and other systems. Second, genome annotation without trans alignments can compete successfully with systems like ENSEMBL and ExoGean, which use trans alignments, under certain circumstances. However, annotation accuracy in the EGASP assessment is determined largely by the accuracy with which high quality native cDNA sequences can be aligned, and secondarily by the accuracy with which HAVANA's ORF calls on those cDNA sequences, or lack thereof, can be anticipated. We cannot extrapolate the results of this exercise to situations in which fewer full length cDNAs and/or fewer ESTs are available. In such situations, the accuracy of our pipeline would depend more on N-SCAN and N-SCAN_EST, while the accuracy of ENSEMBL would depend more on trans alignments. In future assessments, it would be worthwhile to assess prediction pipelines under a range of scenarios between the two evaluated this time - freedom to use all available native cDNA and prohibition against using any. In particular, the selective elimination of cDNA and EST sequences from the available pool would shed light on the tradeoffs among different approaches under a range of situations of practical significance (see  for such a study on Pairagon+N-SCAN_EST).
Materials and methods
Pairagon gene predictions
The state diagram of Pairagon's pairHMM model for cDNA-to-genome alignment is given in Figure 1. The different states model different alignment columns as follows: matches and mismatches are modeled by state A; intron is modeled by a loop consisting of Entry, Intron and Exit; insertion and gap in genome are modeled by states G and C, respectively. Four additional states - RG1, RC1, RG2 and RC2 modeling unaligned genomic and cDNA sequences - were added to facilitate local alignment. Although, for simplicity, Figure 1 shows only one loop modeling introns, our model contains two such loops. One of them requires GT or GC at the splice donor site and AG at the splice acceptor site. The other requires AT and AC at those sites, respectively. Each state can emit the different columns of a cDNA-to-genome alignment with certain probabilities (emission probabilities). For each state there is also a probability of staying in that state or transitioning to different states (transition probabilities). These probabilities can be estimated using maximum likelihood from example alignments.
We implemented the Viterbi algorithm, an optimal dynamic programming algorithm for finding the most probable alignment between two sequences, in C. Although it produced accurate alignments, the time and space complexity for optimally aligning two sequences increases in proportion to the product of the sizes of the input sequences, imposing limitations on the size of the input sequences. Therefore, we adapted the Stepping Stone algorithm , a heuristic modification to the optimal algorithm. Stepping Stone relies on faster seeded alignment programs like BLASTN to identify regions of high identity between the cDNA and the genomic sequence (diagonal lines in Figure 4). It restricts the optimal dynamic programming algorithm to regions close to the approximate exons that the seed alignments correspond to (light blue region in Figure 4).
Pairagon v0.5 was trained using 15,766 BLAT alignments of 15,297 MGC [1, 9, 10] cDNA sequences to the human genome build NCBI35 (May 2005). Transition probabilities between the states were estimated from the alignments using maximum likelihood. Because this was a bootstrap procedure, and BLAT does not pay careful attention to splice sites, we assigned reasonable estimates for probabilities of GT-AG, GC-AG and AT-AC splice site combinations (98.9%, 1.0% and 0.1%, respectively). All bases were equally probable in states RG1, RC1, RG2, RC2, G, C and Intron. The probability of a match in the aligned state was estimated using maximum likelihood and was evenly distributed among the four possible combinations. Similarly, the probability of a mismatch in the aligned state was distributed among the 12 possible combinations.
Ungapped local alignments between the cDNA sequences and the unmasked ENCODE regions were generated using BLASTN  with parameters M = 1 N = -3. These approximate seed exons were then used by the Stepping Stone implementation of Pairagon v0.5 to generate an alignment. GenBank CDS annotations of the cDNA sequences were used to convert these alignments into gene structures.
N-SCAN gene predictions
N-SCAN_EST gene predictions
Human ESTs, downloaded from dbEST on 20 January 2005, were aligned to whole human genome (build NCBI35) by BLAT . For each EST sequence, the alignment with the greatest number of bases matching the genome was selected. Alignments with at least 98% of the bases in the entire EST matching the genome were chosen to generate an ESTseq for each chromosome. ESTseq parameters were estimated from regions corresponding to a set of cleaned Refseq annotations containing 17,798 transcripts. An additional 1,000 bases on either side of the genes were used to train intergenic regions. The genome sequence was masked for putative processed pseudogenes using PPFINDER . ESTseqs corresponding to the ENCODE regions were obtained by cutting the relevant sections out of the chromosomal ESTseq, and N-SCAN_EST was then used to predict genes.
A block diagram showing the steps involved in generating Pairagon gene structures and N-SCAN_EST gene predictions, and combining them is given in Figure 2. Because multiple mRNA sequences are available for some genes, identical Pairagon gene structures are merged into one gene. N-SCAN_EST predictions are added to the final set if they do not overlap the merged Pairagon gene structures. We used the Eval software package  for finding these overlapping genes.
We are grateful to Jeltje van Baren for help with her PPFINDER software for detection of processed pseudogenes in gene annotation sets. Thanks also to the organizers of the GENCODE evaluation, including especially Roderic Guigó and Paul Flicek. This work was supported in part by grants U01 HG003150 (ENCODE) and R01 HG02278 from the National Human Genome Research Institute and by Contract N01-CO-12400 from the National Cancer Institute (Mammalian Gene Collection).
This article has been published as part of Genome Biology Volume 7, Supplement 1, 2006: EGASP '05. The full contents of the supplement are available online at http://genomebiology.com/supplements/7/S1.
- The MGC Project Team: The status, quality, and expansion of the NIH full-length cDNA project: The Mammalian Gene Collection (MGC). Genome Res. 2004, 14: 2121-2127. 10.1101/gr.2596504.View ArticleGoogle Scholar
- Brent MR: Genome annotation past, present and future: How to define an ORF at each locus. Genome Res. 2005, 15: 1777-1786. 10.1101/gr.3866105.PubMedView ArticleGoogle Scholar
- Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res. 2004, 14: 988-995. 10.1101/gr.1865504.PubMedPubMed CentralView ArticleGoogle Scholar
- Wei C, Brent MR: Integrating EST alignments and de novo gene prediction using TWINSCAN. BMC Bioinformatics. 2006,Google Scholar
- van Baren MJ, Brent MR: Iterative gene prediction and pseudo-gene removal improves genome annotation. Genome Res. 2006, 16: 678-685. 10.1101/gr.4766206.PubMedPubMed CentralView ArticleGoogle Scholar
- Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 1998, Cambridge, UK: Cambridge University PressView ArticleGoogle Scholar
- Levine A, Durbin R: A computational scan for U12-dependent introns in the human genome sequence. Nucleic Acids Res. 2001, 29: 4006-4013. 10.1093/nar/29.1.300.PubMedPubMed CentralView ArticleGoogle Scholar
- Kent WJ: BLAT - the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before March 2002.PubMedPubMed CentralView ArticleGoogle Scholar
- Strausberg RL, Feingold EA, Grouse LH, Derge JG, Klausner RD, Collins FS, Wagner L, Shenmen CM, Schuler GD, Altschul SF, et al: Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc Natl Acad Sci USA. 2002, 99: 16899-16903. 10.1073/pnas.242603899.PubMedView ArticleGoogle Scholar
- Strausberg RL, Feingold EA, Klausner RD, Collins FS: The mammalian gene collection. Science. 1999, 286: 455-457. 10.1126/science.286.5439.455.PubMedView ArticleGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq) a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005, 33 (Database): D501-D504. 10.1093/nar/gki025.PubMedPubMed CentralGoogle Scholar
- Gross SS, Brent MR: Using multiple alignments to improve gene prediction. Research in Computational Molecular Biology, 9th Annual International Conference, RECOMB Cambridge, MA, USA, May14-18, 2005, Proceedings. Edited by: Miyano S, Mesirov JP, Kasif S, Istrail S, Pevzner PA, Waterman MS. 2005, Cambridge: Springer, 374-388.Google Scholar
- Gross SS, Brent MR: Using multiple alignments to improve gene prediction. J Comput Biol. 2006, 13: 379-393. 10.1089/cmb.2006.13.379.PubMedView ArticleGoogle Scholar
- Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarkis S, Ashburner M, Bajic VB, Birney E, et al: EGASP: The ENCODE Genome Annotation Assessment Project. Genome Biology. 2006, 7 (Suppl 1): S2-10.1186/gb-2006-7-s1-s2.PubMedPubMed CentralView ArticleGoogle Scholar
- Flicek P, Keibler E, Hu P, Korf I, Brent MR: Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Res. 2003, 13: 46-54. 10.1101/gr.830003.PubMedPubMed CentralView ArticleGoogle Scholar
- Korf I, Flicek P, Duan D, Brent MR: Integrating genomic homology into gene structure prediction. Bioinformatics. 2001, 17 (Suppl 1): S140-S148.PubMedView ArticleGoogle Scholar
- Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004, 14: 708-715. 10.1101/gr.1933104.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang M, Gish W: Improved spliced alignment from an information theoretic approach. Bioinformatics. 2006, 22 (1): 13-20. 10.1093/bioinformatics/bti748.PubMedView ArticleGoogle Scholar
- Brown RH, Gross SS, Brent MR: Begin at the beginning: predicting genes with 5' UTRs. Genome Res. 2005, 15: 742-747. 10.1101/gr.3696205.PubMedPubMed CentralView ArticleGoogle Scholar
- Meyer IM, Durbin R: Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics. 2002, 18: 1309-1318. 10.1093/bioinformatics/18.10.1309.PubMedView ArticleGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1006/jmbi.1990.9999.PubMedView ArticleGoogle Scholar
- Keibler E, Brent MR: Eval: a software package for analysis of genome annotations. BMC Bioinformatics. 2003, 4: 50-10.1186/1471-2105-4-50.PubMedPubMed CentralView ArticleGoogle Scholar
- UCSC Genome Browser. [http://genome.ucsc.edu]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.