Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments
© Haas et al.; licensee BioMed Central Ltd. 2008
Received: 26 September 2007
Accepted: 11 January 2008
Published: 11 January 2008
EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.
Accurate and comprehensive gene discovery in eukaryotic genome sequences requires multiple independent and complementary analysis methods including, at the very least, the application of ab initio gene prediction software and sequence alignment tools. The problem is technically challenging, and despite many years of research no single method has yet been able to solve it, although numerous tools have been developed to target specialized and diverse variations on the gene finding problem (for review, see [1, 2]). Conventional gene finding software employs probabilistic techniques such as hidden Markov models (HMMs). These models are employed to find the most likely partitioning of a nucleotide sequence into introns, exons, and intergenic states according to a prior set of probabilities for the states in the model. Such gene finding programs, including GENSCAN, GlimmerHMM, Fgenesh, and GeneMark.hmm, are effective at identifying individual exons and regions that correspond to protein-coding genes, but they are far from perfect at predicting complete gene structures, often differing from the correct gene structures in exon content or position [7–10].
The correct gene structures, or individual components including introns and exons, are often apparent from spliced alignments of homologous transcript or protein sequences. Many software tools are available that perform these alignment tasks. Tools used to align expressed sequence tags (ESTs) and full-length cDNAs (FL-cDNAs) to genomic sequence include EST_GENOME, AAT, sim4, geneseqer, BLAT, and GMAP, among numerous others. The list of programs that perform spliced alignments of protein sequences to DNA is much shorter, including the multifunctional AAT, exonerate, and PMAP (derived from GMAP). An extension of spliced protein alignment that includes a probabilistic model of eukaryotic gene structure is implemented in GeneWise, a popular homology-based gene predictor that serves a critical role in the Ensembl automated genome annotation pipeline. In most cases, the spliced protein alignments and transcript alignments (derived from ESTs) provide evidence for only part of the gene structure, delineating introns, complete internal exons, and potential portions of other exons at their alignment termini.
A comprehensive approach to eukaryotic gene structure annotation should utilize both the information intrinsic to the genome sequence itself, as is done by ab initio gene prediction software, and any extrinsic data in the form of homologies to other known sequences, including proteins, transcripts, or conserved regions revealed from cross-genome comparisons. Some of the most recent ab initio gene finding software is able to utilize such extrinsic data to improve upon gene finding accuracy. Examples of such software are numerous, and each falls within a certain niche based on the form of extrinsic data utilized. TWINSCAN, for example, uses an 'informant' genome to condition the probabilities of exons and introns in a closely related genome. Subsequently, TWINSCAN_EST combined spliced transcript alignments with the intrinsic data, and finally N-SCAN (also known as TWINSCAN 3.0) and N-SCAN_EST utilized cross-genome homologies to multiple related genome sequences in the context of a phylogenetic framework. Other tools, including Augustus, Genie, and ExonHunter, include mechanisms to incorporate extrinsic data into the ab initio gene prediction framework to improve accuracy further. Each of these programs analyzes and predicts genes along a single target genome sequence, while using homologies detected to other sequences. A more specialized approach to gene-finding is employed by the tools SLAM and TWAIN, which consider homologies between two related genome sequences and simultaneously predict gene structures within both genomes.
Early large-scale genome projects relied heavily on the manual annotation of gene structures in order to ensure genome annotation of the highest quality [28–30]. Manual annotation involves scientists examining all of the evidence for gene structures as described above using a graphical genome viewer and annotation editor such as Apollo or Artemis. These manual efforts were, and continue to be, essential to providing the best community resources in the form of high quality and accurate genome annotations. Manual annotation is limited, though, because it is time consuming and expensive, and it cannot keep pace with the advances in high-throughput DNA sequencing technology that are producing increasing quantities of genome sequences.
FL-cDNA projects have lessened the need for manual curation of every gene by providing accurate and complete gene structure annotations derived from high-quality spliced alignments. Software such as the Program to Assemble Spliced Alignments (PASA) has enabled high-throughput automated annotation of gene structures by exploiting ESTs and FL-cDNAs alone or within the context of pre-existing annotated gene structures. Other, more comprehensive computational strategies have been developed to play the role of the human annotator by combining precomputed diverse evidence into accurate gene structure annotations. These tools include Combiner, JIGSAW, GLEAN, and Exogean, among others. These algorithms employ statistical or rule-based methods to combine evidence into a most probable correct gene structure.
We present a utility called EVidenceModeler (EVM), an extension of methods that led to the original Combiner development [34, 38], using a nonstochastic weighted evidence combining technique that accounts for both the type and abundance of evidence to compute weighted consensus gene structures. EVM was heavily utilized for the genome analysis of the mosquito Aedes aegypti, and used partially or exclusively to generate the preliminary annotation for recently sequenced genomes of the blood fluke Schistosoma mansoni, the protozoan oyster parasite Perkinsus marinus, the human body louse Pediculus humanus, and another mosquito, Culex pipiens. The evidence utilized by EVM corresponds primarily to ab initio gene predictions and protein and transcript alignments, generated via any of the various methods described above. The intuitive framework provided by EVM is shown to be highly effective, exploiting high quality evidence where available and providing consensus gene structure prediction accuracy that approaches that of manual annotation. EVM source code and documentation are freely available from the EVM website.
Results and discussion
In the subsequent sections, we demonstrate EVM as an automated gene structure annotation tool using rice and human genome sequences and related evidence. First, using the rice genome, we develop the concepts that underlie the algorithm of EVM as a tool that incorporates weighted evidence into consensus gene structure predictions. We then turn our attention to the human genome, in which we examine the role of EVM in concert with PASA to annotate protein-coding genes and alternatively spliced isoforms automatically. In each scenario, we include comparisons with alternative annotation methods.
Evaluation of ab initio gene prediction in rice
Consensus ab initio exon prediction accuracy
Consensus gene prediction by EVM
To demonstrate the simplest application of EVM, we combine only the three ab initio gene predictions and weight each prediction type equally. Figures 1 and 2 display the results in comparison with the ab initio prediction accuracies; we demonstrate that, by incorporating shared exons and introns into consensus gene structures, complete gene prediction accuracy is improved by at least 10%. Exon prediction accuracy is increased by about 6%, and exon prediction accuracies for each exon type are mostly improved, with the exception of the initial exon type, for which GeneMark.hmm alone is slightly superior.
Consensus gene prediction accuracy using varied evidence types and associated weights
Although this represents just a minute number of possible random weight combinations, it demonstrates the effect of the weight settings and the inclusion of different evidence types on our consensus prediction accuracy. By including evidence based on sequence homology, our prediction accuracy improves greatly, doubling to tripling complete gene prediction accuracy of ab initio programs alone or in combination. Also, very different weight settings can still lead to similar levels of performance, particularly in the presence of sequence homology data.
EVM consensus prediction accuracy using trained evidence weights
Given the variability in consensus gene prediction accuracy observed using different combinations of weight values, finding the single combination of weights that provides the best consensus prediction accuracy is an important goal. Searching all possible weight combinations to find the single best scoring combination is not tractable, given the computational effort needed to explore such a vast search space. To estimate a set of high scoring weights, we employed a set of heuristics that use random weight combinations followed by gradient ascent (see Materials and methods, below). For the purpose of choosing high performing weights and evaluating their accuracy, we selected 1,000 of our cDNA-verified gene structures and used half for estimating weights and the other half for evaluating accuracy using these weights (henceforth termed 'trained weights'). In both the training and evaluation process, accuracy statistics were limited to each reference gene and flanking 500 base pairs (bp). However, EVM was applied to regions of the rice genome including the 30 kilobase (kb) region flanking each reference gene, to emulate gene prediction by EVM in a larger genomic context.
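The two-stage search described above can be illustrated with a minimal Python sketch. It is not the EVM training code: the function `train_weights`, its parameters, and the toy objective are hypothetical, and a simple coordinate-wise hill climb stands in for the gradient ascent step, whose details are given in Materials and methods. In a real setting the scoring function would run the consensus predictor on the training genes and return an accuracy value.

```python
import random


def train_weights(score_fn, evidence_types, n_random=20, step=0.1,
                  n_rounds=50, seed=0):
    """Random sampling of weight combinations followed by a local hill climb.

    score_fn maps a {evidence_type: weight} dict to a number to maximize.
    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Stage 1: sample random weight combinations and keep the best one.
    best, best_score = None, float("-inf")
    for _ in range(n_random):
        candidate = {e: rng.random() for e in evidence_types}
        score = score_fn(candidate)
        if score > best_score:
            best, best_score = candidate, score

    # Stage 2: nudge one weight at a time, keeping only improvements.
    for _ in range(n_rounds):
        improved = False
        for e in evidence_types:
            for delta in (step, -step):
                trial = dict(best)
                trial[e] = max(0.0, trial[e] + delta)
                score = score_fn(trial)
                if score > best_score:
                    best, best_score = trial, score
                    improved = True
        if not improved:
            break  # local optimum reached
    return best, best_score


# Toy objective standing in for consensus prediction accuracy.
TARGET = {"ab_initio": 0.3, "protein": 0.9}


def toy_score(weights):
    return -sum((weights[k] - TARGET[k]) ** 2 for k in TARGET)


if __name__ == "__main__":
    print(train_weights(toy_score, list(TARGET)))
```

Because each evaluation of a real scoring function requires rerunning the consensus prediction, the number of random samples and climbing rounds would be chosen to fit the available compute budget.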
Intuitive versus trained weights
Although we can computationally address the problem of finding a set of weights that yields optimal performance, it is clear from our analysis of randomly selected weights that there could be numerous weight combinations that provide reasonable accuracy. In general, we find that combinations of assigned weightings in the following form provide adequate consensus prediction accuracy:
(ab initio predictions) ≤ (protein alignments, EST alignments) < (GeneWise) < (PASA)
Using such a weight combination (gene predictions = 0.3, proteins and other plant ESTs = 1, GeneWise = 5, PASA = 10), we find that our consensus exon and complete gene prediction accuracy is quite comparable, with our intuitive weights providing performance levels that in most cases are just slightly lower than those of our trained weights (Additional data file 1 [Figure S1]). In each case, accuracy measurements with intuitive weight settings were within 3% of the results from trained weights. The ability to tune EVM's evidence weights intuitively provides a flexibility that is not as easily afforded by current software systems based on a strict probabilistic framework.
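As a small illustration, the intuitive weights quoted above can be written down and checked against the stated ordering. The weight values come from the text; the dictionary keys and the helper `follows_intuitive_ordering` are hypothetical labels chosen for this sketch and are not EVM configuration syntax.

```python
# Weight values quoted in the text; the key names are illustrative only.
INTUITIVE_WEIGHTS = {
    "ab_initio": 0.3,
    "protein_alignment": 1.0,
    "other_plant_est": 1.0,
    "genewise": 5.0,
    "pasa": 10.0,
}


def follows_intuitive_ordering(weights):
    """Check (ab initio) <= (protein, EST alignments) < (GeneWise) < (PASA)."""
    alignments = (weights["protein_alignment"], weights["other_plant_est"])
    return (weights["ab_initio"] <= min(alignments)
            and max(alignments) < weights["genewise"] < weights["pasa"])


if __name__ == "__main__":
    print(follows_intuitive_ordering(INTUITIVE_WEIGHTS))
```

A check of this kind could serve as a sanity test before running a consensus prediction with hand-set weights.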
EVM versus alternative annotation tools: Glean and JIGSAW
The prediction accuracies of JIGSAW and EVM are strikingly similar for two of the evidence combining scenarios examined: combining gene predictions with other plant EST alignments (gap2), and when all alignment data are included minus the rice PASA evidence (all). We further examined the latter case, in which both JIGSAW and EVM predicted more than 60% of the complete genes accurately, to determine the similarity of their gene predictions. Of the 500 reference genes tested, there are 310 predictions generated identically between EVM and JIGSAW, of which 260 were correct. Therefore, although their prediction accuracies can be strikingly similar, overall the gene structures predicted are quite different.
A strength of EVM is its ability to utilize heavily trusted forms of evidence, such as gene structures inferred from alignments of cognate FL-cDNAs and ESTs. Each of the three programs was trained in the presence of cDNA-supported gene structures as provided by PASA (long open reading frame [ORF] structures within PASA alignment assemblies), a subset of which defines the correct gene structures (see Materials and methods, below). All three tools demonstrated the greatest prediction accuracy in the presence of PASA evidence. Although each tool is effectively provided with evidence containing all complete introns and exons that define the correct gene structure, only EVM is found to be capable of nearly perfect prediction accuracy. Of the 500 evaluated reference genes, EVM predicted only six incorrectly when supplied with PASA evidence along with the competing evidence types (ab initio predictions, and protein and other plant EST alignments). These six incorrect predictions involved three cases in which neighboring genes were merged into single predictions, two cases in which improper gene termini were chosen, and a single case that was confounded by a large degenerate retrotransposon insertion within an intron of a gene, an element that was not masked and excluded from the gene prediction effort.
Comparison with manual annotation
It is expected and reassuring that EVM provides nearly perfect complete gene accuracy in the presence of high quality and reliable complete gene structure data, as provided in the form of the PASA alignment assemblies. The importance of such ESTs and FL-cDNAs for gene structure annotation is well known [42–45], and software such as PASA can annotate gene structures based solely on these data in the absence of pre-existing gene annotations or ab initio gene predictions. A greater challenge is to achieve maximal consensus gene prediction accuracy in the absence of these data, which is the typical scenario with newly sequenced genomes that lack extensive EST or FL-cDNA sequences as companion resources. In such cases we must rely on the accuracy of ab initio gene predictors and homologies to sequences from other organisms, and it is here that, in lieu of an equivalent automated annotation method, we expect to have the greatest gains from expert scientists directly evaluating and modeling complete gene structures based on these sources of evidence.
In our application of EVM thus far, the relevant set of input evidence is that which contains the ab initio gene predictions, protein alignments, GeneWise predictions based on protein homology, and the alignments to ESTs derived from other plants (Figure 7; entry 'EVM:All(-PASA)', read as EVM with all evidence minus PASA evidence). Using trained weights, EVM correctly predicted, on average, 92% of the known exons and 62% of the 500 cDNA-verified genes. If the subset of the native cDNA data that defines the correct gene structure is not supplied as evidence, and if components of such known gene structures are not available as candidate introns and exons, then EVM will be unable to predict the gene correctly. In an effort to establish the upper limit of gene prediction accuracy in the absence of cDNA evidence, we propose use of the accuracy of manual annotation on the same dataset. The accuracy of human annotation has never been adequately measured, although it is widely assumed that human annotation is the 'gold standard' for genome projects. For our study, a group of human annotators was asked to evaluate these data in the absence of cognate rice cDNA alignments and was instructed to manually model a gene structure that best reflected the available evidence. In the absence of the rice cDNAs, manual annotation achieved 96% eSn and 96% eSp (exon sensitivity and specificity), and 81% gSn and 81% gSp (complete gene sensitivity and specificity) (Figure 7). In light of these statistics, we consider the accuracy provided by EVM on the identical dataset to be demonstrably effective as an automated annotation system, and approaching the better accuracy obtained through manual curation efforts, particularly when compared with the accuracy of individual ab initio gene predictors on the same dataset.
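For readers unfamiliar with these statistics, the following Python sketch computes exon-level sensitivity and specificity under the definitions conventionally used in gene prediction assessments: sensitivity is the fraction of reference exons predicted exactly, and specificity is the fraction of predicted exons that exactly match a reference exon. These definitions are an assumption of the sketch, and the function name `exon_accuracy` and the example coordinates are hypothetical.

```python
def exon_accuracy(reference_exons, predicted_exons):
    """Return (sensitivity, specificity) for exact exon matches.

    Exons are (start, end) coordinate pairs; an exon counts as correct
    only when both boundaries agree with a reference exon.
    """
    reference = set(reference_exons)
    predicted = set(predicted_exons)
    true_positives = len(reference & predicted)
    sensitivity = true_positives / len(reference) if reference else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    reference = [(1, 100), (200, 300), (400, 500), (600, 700)]
    predicted = [(1, 100), (200, 300), (400, 510)]
    # Two of four reference exons are found; two of three predictions match.
    print(exon_accuracy(reference, predicted))
```

The same calculation applies at the complete gene level if each gene is represented by the full tuple of its exon coordinates, so that a gene counts as correct only when every exon matches.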
Application of EVM and PASA to the ENCODE regions of the human genome
The ENCyclopedia of DNA Elements (ENCODE) project was initiated shortly after the sequencing of the human genome with the aim of identifying all functional elements, including all protein-coding genes, in the human genome sequence. The pilot phase of the project focused on only 1% (about 30 megabases spread across 44 regions) of the genome, termed the ENCODE regions. The GENCODE (encyclopedia of genes and gene variants) consortium was formed to provide high quality manual annotation and experimental verification of protein-coding genes in these regions. The human ENCODE Genome Annotation Assessment Project (EGASP) was established to evaluate the accuracy of automated genome annotation methods by comparing automated annotations of the ENCODE regions with the GENCODE annotations. Participants in the EGASP competition were allowed access to 13 ENCODE regions along with their corresponding GENCODE annotations, which could be used for training purposes. Groups submitted their automated annotations for the remaining 31 regions, after which time the corresponding GENCODE annotations were released and the automated annotation methods were evaluated based on a rigorous comparison with the GENCODE annotations.
The sequences, gene predictions, and annotations involved in EGASP additionally serve as a resource for evaluating current and future annotation methods. Similarly to our application of EVM to the rice genome using cDNA-verified gene structures for training and evaluation purposes, we applied EVM to the ENCODE regions using the GENCODE annotations for training and evaluation purposes, analogous to the original EGASP competition. Evidence used by EVM included the evidence tracks provided by University of California at Santa Cruz: TWINSCAN, SGP2, GENEID, GENSCAN, CCDSGene, KNOWNGene, ENSEMBL (ENSGene), and MGCGene. Additional evidence generated in our study included AAT alignments of nonhuman proteins, GeneWise predictions based on the nonhuman protein homologies, AAT nucleotide alignments of select animal gene indices, and PASA alignment assemblies generated from GMAP alignments of human ESTs and FL-cDNAs. The GlimmerHMM predictions used by EVM were those generated as part of the EGASP competition, and were obtained separately.
There are several notable differences between the training and evaluation of EVM on the ENCODE regions as compared with the earlier application to rice. The cDNA-verified rice genes used for training and evaluation were restricted to a single splicing isoform. In addition, each gene was complete, containing the protein-coding region from start to stop codon. The GENCODE protein-coding annotations, in contrast, include alternative splicing isoforms and several partial gene structures. Accuracy measurements computed for rice genes included each cDNA-verified gene and the flanking 500 bases, whereas accuracy measurements on the ENCODE regions included these sequence regions in their entirety and all corresponding protein-coding gene annotations.
Post-EVM application of PASA to annotate alternatively spliced isoforms
EVM is not designed to model alternative splicing isoforms directly. This is, however, a primary function of our companion annotation tool PASA, which contributes to the automated annotation of gene structures in several ways. PASA, like EVM, is made freely available as open source from the PASA website. Above, PASA alignment assemblies were used as one source of gene structure components by EVM. Alternatively, PASA can generate complete gene structures based on full-length alignment assemblies (alignment assemblies containing at least one FL-cDNA) by locating the longest ORF within each alignment assembly, and annotate gene structures and alternatively spliced isoforms restricted to the transcriptome. A third application of PASA is to perform a retroactive processing of a set of pre-existing gene structure annotations, whereby alignment assemblies are used to add untranslated region annotations, modify exons, correctly split or merge predicted gene structures, and model alternative splicing isoforms.
To demonstrate the effect of applying PASA as a postprocess to integrate transcript data into an existing set of gene structure annotations (which we refer to as 'PASAu', for PASA updates), we applied PASA separately to the ab initio predictions, the various University of California at Santa Cruz gene prediction tracks (which we refer to as 'other predictions'), and to the EVM-generated datasets that either utilized or excluded the other predictions. The change in prediction accuracy as a result of applying PASA's annotation updates is illustrated in Additional data file 1 (Figure S2). PASAu can yield relatively large improvements (increases from 23% to 33% in gSn and from 7% to 32% in gSp) to the accuracy of the various ab initio predictions by incorporating transcript alignment assembly-based updates. PASAu-resulting changes to the accuracies of the other original predictions were more variable, mostly involving small increases in transcript sensitivity and larger decreases in transcript specificity; more GENCODE transcripts predicted correctly, but additional PASA-based transcripts not represented in the GENCODE dataset were also identified. The EVM gene sets were affected similarly.
Although it is useful to compare the accuracies of these various tools based on their ability to recreate the GENCODE annotation for the ENCODE regions, direct comparisons between methods based on these data are not strictly valid. In the case of ab initio gene prediction tools that require only the genome sequence as input, direct comparisons between the results of the gene predictors are fully justified, because the inputs are exactly identical. The focus of EGASP was to examine the accuracy of diverse automated annotation methods and not necessarily to perform head-to-head comparisons between each method. Therefore, groups were allowed to use any evidence available to them to assist in their annotation efforts, and so, for example, the additional evidence used by JIGSAW was not exactly the same as the inputs utilized by Exogean, or by EVM as described here. The analogous experiments we conducted in rice were more tightly controlled, given that each software tool was trained and executed using identical inputs. Even so, although alternative methods examined as part of the EGASP competition are shown to exceed EVM's accuracy, even if only slightly, EVM does fare well as an automated annotation system, especially when it is compared with the individual ab initio predictions.
We have demonstrated that EVM is an effective automated gene structure annotation tool that leverages ab initio gene predictions and sequence homologies to generate weighted consensus gene predictions. The gene prediction accuracy of EVM is influenced by the types of evidence provided and associated weight values. Although a training system is provided to assist the search for optimal evidence weights, a manually set weighting scheme can perform similarly. We demonstrated the general utility of EVM as an automated annotation utility using both rice and human genome sequences. We also showed how to use PASA to provide an effective postprocessing step to discover and annotate alternatively spliced isoforms. EVM, especially when combined with PASA, provides an intuitive and flexible automated eukaryotic gene structure annotation framework, reducing the manual effort required to produce a high quality and reliable gene set to support the earliest efforts of furthering our scientific understanding of the genome biology of eukaryotes. Both EVM and PASA are fully documented and freely available as open source from their respective websites.
Materials and methods
Generating evidence for gene structures
The ab initio gene prediction programs Fgenesh, GeneMark.hmm, and GlimmerHMM were applied to the rice genome sequences. Fgenesh and GlimmerHMM were applied to repeat-masked genome sequences. Repeats were masked using RepeatMasker and the rice repeat library. GeneMark.hmm was applied to the unmasked genome sequence; software problems prevented us from running GeneMark.hmm on all repeat-masked genome sequences, and so we chose instead to use the unmasked genome in this case. The AAT software was used to generate spliced protein and transcript alignments. For generating spliced protein alignments, AAT was used to search a comprehensive and nonredundant protein database from which rice protein sequences had first been removed. A database of other plant transcript sequences was compiled by downloading and joining all plant gene indices provided by The Gene Index at the Dana Farber Cancer Institute, excepting the rice gene indices. Rice ESTs and FL-cDNAs were aligned to the rice genome and assembled into gene structures as described previously, with the exception that high quality single-exon transcript alignments were included here along with spliced alignments.
Compiling a reference rice gene set
We extracted PASA assemblies encoding a complete ORF exceeding 100 amino acids and considered these as candidates for high confidence complete gene structures, first requiring manual verification. For the purpose of training and evaluating EVM, we sought approximately 1,000 total high confidence gene structures, half to be used for training and the remainder for evaluation. In an effort to select this subset of genes, we manually examined the candidate PASA-based structures in the context of the available evidence using the TkGFF3 graphical genome viewing utility provided in the EVM software distribution. We then selected PASA-based structures that appeared to provide the best gene structure as the reference gene structures, yielding 1,058 such genes. We excluded PASA assemblies found to harbor rare AT-AC introns, to encode less than full-length ORFs, or to represent splicing variants that did not best represent the additional evidence. These excluded assemblies comprised approximately 10% of the total. To simplify training and evaluation of EVM, we extracted each high confidence gene and flanking 30 kb region from the complete rice genome and prepared these as independent and individual datasets.
All sequences, gene structures, and evidence are available for download. A comparison of the distribution of coding exon counts among the gene structures in the training set as compared with all candidates and the release-4 gene structure annotations (non-TE set) is provided in Additional data file 1 (Figure S3). Although our verified set of known gene structures is notably deficient in single-exon genes, overall it is consistent with the other selections of rice genes and deemed suitable for our purposes herein.
GENCODE annotations for ENCODE regions
We obtained the ENCODE region sequences, GENCODE annotations, and the various EGASP annotation datasets from the EGASP ftp site. We encountered some difficulties working with the downloaded data files because of inconsistent file formats, inconsistent annotation of stop codons, and annotation features extending out of the sequence range. We therefore converted each data file to a stricter GTF format, clipping annotations at the bounds of the ENCODE regions and adding stop codons where they were obviously lacking. Prediction accuracies of the EGASP datasets were recomputed (Additional data file 1 [Figure S4]) and were found to agree with the previously reported values; small differences between our recomputed values and previously published values are likely due to slight differences in the implementation of our accuracy evaluation software and to changes introduced by our file conversions. Our refined versions of the EGASP datasets are available from the EVM software website.
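The clipping step mentioned above can be sketched in a few lines of Python. This is an illustration only, not the conversion script used in the study; the function name `clip_feature` and the convention of 1-based inclusive coordinates relative to the region are assumptions.

```python
def clip_feature(start, end, region_length):
    """Clip a feature to the bounds [1, region_length] of a sequence region.

    Coordinates are assumed to be 1-based and inclusive. Returns None when
    the feature lies entirely outside the region.
    """
    if end < 1 or start > region_length:
        return None
    return max(1, start), min(end, region_length)


if __name__ == "__main__":
    print(clip_feature(-5, 40, 100))    # overhangs the left boundary
    print(clip_feature(90, 120, 100))   # overhangs the right boundary
    print(clip_feature(150, 200, 100))  # entirely outside the region
```

Note that clipping a coding exon in this way can leave a partial gene structure, which is one reason partial genes appear in the evaluation set.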
Additional evidence compiled for the GENCODE annotations included homologies to nonhuman proteins using AAT-nap and GeneWise, alignments to assembled animal ESTs downloaded from the Gene Index using AAT-gap2, and PASA alignment assemblies. This additional evidence is also available from the EVM software site.
EVM reports consensus gene structures as high scoring paths through a directed acyclic graph containing complete intron, exon, and intergenic region features as vertices. Each of the possible features is computed based on the evidence provided in the form of the genome sequence, ab initio gene predictions, and the transcript and protein alignments. Each type of evidence, such as the name of the gene prediction program or the combination of alignment method and sequence database searched, has an associated numeric weight value. This weight value is either set by hand or by the training process described below. The evidence and corresponding weights are used to score the exon, intron, and intergenic region features. Consensus gene structures reported by EVM are computed by connecting exons, introns, and intergenic regions across the complete genome sequence such that the series of connected components provides the highest cumulative score. An example of EVM applied to a section of the rice genome, including components of the scoring system and feature set, is illustrated in Figure 5. For large genome sequences (>1 megabase), the data are partitioned into overlapping segments, and the EVM predictions from the separate partitions are subsequently joined into a single nonredundant set of predictions.
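The idea of selecting the highest scoring chain of connected features can be illustrated with a simplified dynamic programming sketch in Python. It makes several assumptions that the EVM implementation need not share: features are supplied with precomputed scores, two features may be connected only when the second begins at the base immediately after the first ends, and constraints such as exon type and reading frame compatibility are ignored. The names `Feature` and `best_path` are hypothetical.

```python
from collections import namedtuple

# A scored candidate feature; coordinates are 1-based and inclusive.
Feature = namedtuple("Feature", "start end kind score")


def best_path(features, seq_length):
    """Return (score, path) for the best chain of features tiling 1..seq_length.

    Features are processed in order of start coordinate, so every possible
    predecessor (a feature ending at start - 1) has already been handled.
    Returns None if no chain covers the whole sequence.
    """
    # best_at[end] holds the best (cumulative score, path) ending at 'end'.
    best_at = {0: (0.0, [])}
    for feat in sorted(features, key=lambda f: (f.start, f.end)):
        previous = best_at.get(feat.start - 1)
        if previous is None:
            continue  # nothing reaches the base before this feature
        total = previous[0] + feat.score
        if feat.end not in best_at or total > best_at[feat.end][0]:
            best_at[feat.end] = (total, previous[1] + [feat])
    return best_at.get(seq_length)


if __name__ == "__main__":
    candidates = [
        Feature(1, 50, "intergenic", 3),   # alternative: no gene at all
        Feature(1, 10, "intergenic", 1),
        Feature(11, 20, "exon", 5),
        Feature(21, 30, "intron", 2),
        Feature(31, 40, "exon", 4),
        Feature(41, 50, "intergenic", 1),
    ]
    score, path = best_path(candidates, 50)
    print(score, [f.kind for f in path])
```

In this toy example the two-exon gene model outscores the single intergenic feature because the summed scores of its components are higher, which mirrors the cumulative scoring described above.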
Dismantling predictions and alignments into exons and introns
Exons of eukaryotic gene structures are commonly treated as four distinct types: initial exon, including the start codon to a donor splice junction; internal exon, including an acceptor splice junction to a donor splice junction; terminal exon, including the acceptor splice junction to the stop codon; and the single exon, which corresponds to an intronless gene from start codon to stop codon. These are the four types of exons considered by EVM. The ab initio gene predictions provided as inputs to EVM are dismantled into their component exons and introns and added to a nonredundant corresponding exon or intron feature set. Each exon of a given type is stored by EVM with its coordinates, the codon position of its leading base, and a list of all evidence types that perfectly support it. Introns are likewise stored as discrete features based on unique coordinate pairs and their supporting evidence. Only the consensus GT or GC donor and AG acceptor dinucleotide splice sites are treated as valid by EVM; the more rare AT-AC consensus introns, although accepted by PASA, are currently disallowed by EVM. No maximum intron length is enforced by EVM, but a minimum intron length of 20 bp is set and can be tuned as required.
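The intron acceptance rules stated above (GT or GC donor, AG acceptor, minimum length of 20 bp) can be expressed as a short Python check. This is a sketch for the forward strand only; the function name `is_valid_intron` and the 1-based inclusive coordinate convention are assumptions, and reverse-strand introns would first need to be reverse complemented.

```python
MIN_INTRON_LENGTH = 20          # default minimum stated in the text
VALID_DONORS = {"GT", "GC"}     # first two bases of the intron
VALID_ACCEPTOR = "AG"           # last two bases of the intron


def is_valid_intron(genome, start, end, min_length=MIN_INTRON_LENGTH):
    """Check a forward-strand intron given 1-based inclusive coordinates."""
    if end - start + 1 < min_length:
        return False
    intron = genome[start - 1:end].upper()
    return intron[:2] in VALID_DONORS and intron[-2:] == VALID_ACCEPTOR


if __name__ == "__main__":
    # Toy sequence: 6 exonic bases, a 22 bp GT...AG intron, 6 exonic bases.
    genome = "ATGAAA" + "GT" + "A" * 18 + "AG" + "CCCTAA"
    print(is_valid_intron(genome, 7, 28))
```

An AT-AC intron of the same length would be rejected by this check, in line with the behavior described for EVM.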
Protein and transcript spliced alignment inputs to EVM, by default, are only capable of contributing internal exons and introns to EVM's feature set. Spliced alignments contribute internal exons to the feature set for those internal alignment segments that have consensus splice sites and encode an ORF in at least one of the three reading frames. An internal exon is added to the feature set for each incident codon position that provides an ORF on that strand. A final way for alignment data to contribute initial, terminal, or single exons to the feature set is by explicitly providing such candidate exons to EVM a priori. This is one mechanism that allows EVM to better exploit gene structures provided by PASA. PASA includes functions to provide the longest ORF within each PASA assembly, and EVM includes a utility that extracts initial, terminal, and single exons from gene structures corresponding to the longest ORF within each PASA assembly. This list of PASA-based exon candidates can be provided directly to EVM. Internal exons provided by PASA alignment assemblies are included in the feature set exactly as other forms of spliced alignment data described above.
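The reading frame test applied to internal alignment segments can be sketched as follows. The Python function below reports which of the three frame offsets are free of in-frame stop codons for a forward-strand segment; the name `open_frames` and the representation of the codon position as an offset of 0, 1, or 2 from the first base are assumptions of this sketch rather than EVM's internal convention.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_frames(segment):
    """Return the frame offsets (0, 1, 2) with no in-frame stop codon.

    Only complete codons are examined; trailing bases that do not form a
    full codon are ignored.
    """
    seq = segment.upper()
    frames = []
    for offset in range(3):
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(offset)
    return frames


if __name__ == "__main__":
    # Frame 0 reads CTG ATA GCT AAC; frames 1 and 2 contain TGA and TAA.
    print(open_frames("CTGATAGCTAAC"))
```

Under the scheme described above, one candidate internal exon would be added to the feature set for each offset returned.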
Experiments performed on the rice genome utilizing PASA evidence as input instead included the structure of the longest ORF (minimum length of 50 amino acids) within each PASA alignment assembly in place of the alignment assemblies themselves supplemented with the terminal exon candidates, as described above. These PASA longest ORF structures were provided to EVM as an OTHER_PREDICTION evidence class. Utilization of the PASA data in this way was necessary to allow provision of identical PASA-based evidence to the alternative annotation tools Glean and JIGSAW as part of the rice combiner accuracy comparison.
Scoring genome features
EVM scoring mechanism based on feature class and type
As each gene prediction or spliced alignment is dismantled into its component parts, the parts contribute the weight of that evidence to the scoring scheme. For example, a single spliced protein alignment is dismantled into the protein alignment segments and intervening gaps, possibly contributing to feature types exon and intron of feature class PROTEIN. Those 'perfect' complete introns and exons yielded by dismantling of this protein alignment chain are added to the candidate exon and intron feature set if those features do not already exist. Each protein alignment segment contributes its corresponding evidence weight to each overlapping nucleotide position in the exon feature type scoring vector. Those protein alignment gaps that correspond to complete introns in our feature set contribute a value of (weight × length) to the feature-specific score of each corresponding intron.
The abundance of evidence is reflected in both the feature-specific and vectored scores. For example, often many protein homologies will exist at a given locus. Each protein database match (accession) at a given locus is scored separately, and so exons and introns supported by vast quantities of evidence will have scores that reflect both the weight and abundance of that evidence.
For the purpose of scoring exons and introns while minimizing the memory required to store the scoring vectors, each strand and associated set of evidence is initially examined separately; note that our final gene prediction examines both strands simultaneously. During the initial strand-based analysis, distinct exons and introns are collected from the evidence restricted to the strand being analyzed and scored accordingly. After collecting properly scored gene structure components from each strand, they are grouped together as a single collection of features from both DNA strands.
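The way a spliced protein alignment contributes to the scores, and the way repeated matches accumulate, can be sketched as follows. The function and variable names are hypothetical, coordinates are 0-based and end-exclusive, and the sketch covers a single strand and a single evidence type only.

```python
from collections import defaultdict

# Illustrative sketch only; all names here are assumptions, not EVM identifiers.


def score_protein_alignment(segments, weight, exon_vector, intron_scores,
                            known_introns):
    """Add one spliced protein alignment chain to the scores.

    segments      -- sorted list of (start, end) alignment segments
    weight        -- evidence weight for this protein evidence type
    exon_vector   -- per-nucleotide exon score vector (list of floats)
    intron_scores -- feature-specific intron scores keyed by (start, end)
    known_introns -- intron features already in the candidate feature set
    """
    # Each segment adds its weight to every nucleotide it overlaps.
    for start, end in segments:
        for pos in range(start, end):
            exon_vector[pos] += weight
    # Each gap that matches a complete intron adds weight x length.
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        gap = (prev_end, next_start)
        if gap in known_introns:
            intron_scores[gap] += weight * (next_start - prev_end)


exon_vector = [0.0] * 30
intron_scores = defaultdict(float)
known_introns = {(5, 15)}

# Two separate protein matches at the same locus are scored separately,
# so the totals reflect both the weight and the abundance of evidence.
for _ in range(2):
    score_protein_alignment([(0, 5), (15, 20)], 2.0, exon_vector,
                            intron_scores, known_introns)
print(exon_vector[0], intron_scores[(5, 15)])
```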
Dynamic programming is used to find the highest scoring set of connected exons, introns, and intergenic regions across the entire genome sequence (see Figure 5). Unlike exon and intron features, the intergenic features are not precomputed and are instead scored during the dynamic programming stage; scores for intergenic regions are computed when attempting to connect candidate gene termini while building the directed acyclic graph of connectable feature components (also referred to as the feature trellis). The highest scoring path of connected features is extracted from the feature trellis and separated into the individual gene predictions. A primary restriction within our feature trellis is that the introns connecting exons must exist as explicit components of our feature set; EVM will not connect two otherwise compatible exons unless the required intron exists within the inputted evidence, such as provided by a gene prediction, or spliced protein or transcript alignment.
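A heavily simplified sketch of a highest-scoring path search through a feature trellis is given below. It is a toy under stated assumptions, not EVM's implementation: it ignores strand and reading frame, scores intergenic regions with a single hypothetical per-base value, and all class and function names are invented for illustration. It does preserve the restriction that two exons are joined only through an intron that exists explicitly in the feature set.

```python
from dataclasses import dataclass

# Toy sketch only; names, scoring, and the single-strand model are assumptions.


@dataclass(frozen=True)
class Feature:
    kind: str     # "initial", "internal", "terminal", "single", or "intron"
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive
    score: float


STARTS_GENE = {"initial", "single"}
ENDS_GENE = {"terminal", "single"}


def link_score(a, b, intergenic_per_base):
    """Score for connecting a -> b, or None if they are not connectable."""
    if a.kind in ENDS_GENE and b.kind in STARTS_GENE:
        # Intergenic regions are scored here, while connecting gene termini.
        if b.start >= a.end:
            return intergenic_per_base * (b.start - a.end)
        return None
    if b.start != a.end:
        return None
    if a.kind in ("initial", "internal") and b.kind == "intron":
        return 0.0
    if a.kind == "intron" and b.kind in ("internal", "terminal"):
        return 0.0
    return None


def best_path(features, intergenic_per_base=0.0):
    """Return (score, path) for the highest scoring chain of complete genes."""
    feats = sorted(features, key=lambda f: (f.start, f.end))
    neg_inf = float("-inf")
    # A path may only begin at a feature that can start a gene.
    best = [f.score if f.kind in STARTS_GENE else neg_inf for f in feats]
    prev = [None] * len(feats)
    for j, b in enumerate(feats):
        for i in range(j):
            link = link_score(feats[i], b, intergenic_per_base)
            if link is None:
                continue
            candidate = best[i] + link + b.score
            if candidate > best[j]:
                best[j], prev[j] = candidate, i
    # A path may only finish at a feature that can end a gene.
    ends = [j for j, f in enumerate(feats)
            if f.kind in ENDS_GENE and best[j] > neg_inf]
    if not ends:
        return 0.0, []
    j = max(ends, key=lambda k: best[k])
    total, path = best[j], []
    while j is not None:
        path.append(feats[j])
        j = prev[j]
    return total, path[::-1]
```

In this sketch, removing the intron feature from the input leaves the initial and terminal exons unconnectable, so a competing single-exon candidate would be selected instead.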
Note that, by default, EVM will re-examine long introns to identify candidate nested genes. Although we find this functionality extraordinarily useful for automated annotation, especially for insect genomes, this function was not employed in any analysis described here. Although improvements in sensitivity can result from the nested gene search, there are associated costs in specificity (data not shown).
Augmenting intergenic scores from approximate beginnings and ends of genes
Because the ABINITIO_PREDICTION class of evidence is the only class that contributes explicitly to the prediction of intergenic regions, coping with cases in which the consensus of ab initio predictions merges multiple adjacent genes into a single gene structure is particularly problematic. To split the merged consensus into separate individual predictions, the true intergenic region would need a score that is suitable to offset the alternative, typically involving a predicted intron that joins what should be distinct loci. To encourage the selection of separate complete gene structures supported by protein homologies instead of the merged gene, EVM augments the scores of intergenic regions supported indirectly by protein evidence, as elaborated below.
The approximate boundaries of candidate intergenic regions supported by protein homologies are localized by examining the boundaries of protein alignment chains. The beginnings and ends of all PROTEIN evidence structures (the far bounds of all spliced alignment chains, not the individual segments) are tallied. A sliding window of 300 nucleotides is applied to each strand, and all peaks of beginnings and ends are separately tallied. In addition to the protein alignment chains, the terminal exons provided by the extraction of long ORFs from PASA alignment assemblies also contribute to the tally of candidate beginnings and ends of genes.
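The text does not define exactly how peaks are called within the 300-nucleotide window, so the sketch below adopts one plausible reading as an assumption: a tallied position is a peak if no other tallied position inside its window has more support, with ties resolved toward the lowest coordinate. The function name and the tie-breaking rule are illustrative only.

```python
from bisect import bisect_left

# Illustrative sketch only; the peak definition used here is an assumption.


def boundary_peaks(positions, window=300):
    """Return [(position, support)] for peaks of tallied chain boundaries.

    positions -- coordinates of alignment chain beginnings (or ends) on one
                 strand; duplicates are allowed and add to the tally
    window    -- window width in nucleotides, centred on each position
    """
    pts = sorted(positions)
    half = window // 2

    def support(center):
        # Number of tallied boundaries within [center - half, center + half).
        return bisect_left(pts, center + half) - bisect_left(pts, center - half)

    peaks = []
    for p in sorted(set(pts)):
        s = support(p)
        lo, hi = bisect_left(pts, p - half), bisect_left(pts, p + half)
        rivals = set(pts[lo:hi]) - {p}
        # Keep p only if it beats every rival, breaking ties by coordinate.
        if all(s > support(q) or (s == support(q) and p < q) for q in rivals):
            peaks.append((p, s))
    return peaks


# Beginnings and ends would be tallied separately, one call for each.
print(boundary_peaks([100, 110, 120, 1000, 1005]))
```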
From each begin peak, a corresponding initial exon is located from the feature set. The intergenic score for each nucleotide from the candidate initial exon upstream to the preceding gene is set to the maximal intergenic score, corresponding to the sum of the weights for ABINITIO_PREDICTION evidence classes. Likewise, from each candidate gene end, a terminal exon is located from the feature set, and the genome region downstream to the next gene is set to the maximal intergenic score. Note that single exon genes are treated in the same way as initial or terminal exons in the search for the next possible adjacent gene structure.
Although this search for gene boundaries is not very precise, the heuristic employed here tends to work acceptably well in practice. Choosing the proper boundaries of a gene structure is critical for predicting the entire gene correctly, as demonstrated by the greater variability in initial and terminal exon prediction among the various ab initio gene prediction programs.
Filtering EVM predictions with low support
Instead of reporting the single best scoring gene structure at each locus, EVM reports the set of gene structures that, when connected together with the intervening intergenic regions, provides an optimal cumulative score. There are sometimes cases in which low scoring adventitious genes are included in the preliminary EVM gene set, largely as a consequence of ABINITIO_PREDICTION introns called on either strand in what are really intergenic regions. To remove these adventitious genes from the EVM gene set, the score of each EVM prediction is re-examined in the context of ab initio predicted introns being scored as if they were intergenic regions. An alternative noncoding score is computed for each EVM gene prediction by summing the predicted intergenic regions with the ab initio predicted intron regions. This noncoding score is then compared with the initial EVM prediction score, and those EVM predictions with a coding/noncoding score ratio below 0.75 are eliminated. An example of a low scoring EVM prediction removed during this postprocessing stage is illustrated in Additional data file 1 (Figure S5). An option is available in the EVM software to report these eliminated genes. In those cases in which all predictions agree, predictions lack introns, and the corresponding intergenic score is zero, the score ratio is set to an arbitrary high value and reported accordingly.
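The postprocessing filter described above reduces to a ratio test, sketched below. The class, the function names, and the particular placeholder used for the arbitrary high value are assumptions; only the 0.75 threshold and the zero-score rule come from the text.

```python
from dataclasses import dataclass

# Illustrative sketch only; names and the placeholder value are assumptions.
ARBITRARY_HIGH_RATIO = 1e6   # stands in for the "arbitrary high value"


@dataclass
class Prediction:
    name: str
    coding_score: float      # initial EVM prediction score
    noncoding_score: float   # intergenic plus ab initio intron regions


def score_ratio(pred: Prediction) -> float:
    """Coding/noncoding score ratio, with the zero-denominator rule."""
    if pred.noncoding_score == 0:
        return ARBITRARY_HIGH_RATIO
    return pred.coding_score / pred.noncoding_score


def filter_low_support(preds, min_ratio=0.75):
    """Split predictions into (kept, eliminated) by the score ratio."""
    kept, eliminated = [], []
    for pred in preds:
        (kept if score_ratio(pred) >= min_ratio else eliminated).append(pred)
    return kept, eliminated


kept, eliminated = filter_low_support([
    Prediction("geneA", 10.0, 5.0),   # ratio 2.0, kept
    Prediction("geneB", 3.0, 6.0),    # ratio 0.5, eliminated
    Prediction("geneC", 4.0, 0.0),    # zero noncoding score, kept
])
print([p.name for p in kept], [p.name for p in eliminated])
```

Returning the eliminated predictions alongside the kept ones mirrors the option to report eliminated genes.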
Evaluating prediction accuracy
Gene prediction accuracy (sensitivity and specificity) was computed at the level of nucleotides, exons, transcripts, and complete genes, as described previously but with slight modifications. Although some gene structures include untranslated region annotations, only the protein-coding portions of each exon were considered when we computed accuracy.
In our evaluation of the reference gene structures in rice, alternative splicing was ignored, and no attempt was made to generate a reference gene set for rice that included alternatively spliced transcripts. Therefore, given the one transcript per gene in the rice dataset, gene prediction accuracy calculations would necessarily be identical to the transcript accuracy calculations, and so only the gene prediction accuracy was reported. Although each reference gene region was provided as input to EVM in the context of the flanking 30 kb of genome sequence and corresponding evidence, all accuracy calculations were based on the gene predictions isolated from the reference gene region including 500 bp of flanking sequence. In our comparison of the accuracy of EVM to the annotation tools Glean and JIGSAW, we obtained the most current versions of the software available from their respective sites, namely version 3.2.9 for JIGSAW and version 1.0.1 for Glean, downloaded directly from the subversion source repository.
Accuracy calculations on the human ENCODE genome regions included these regions and corresponding predictions in their entirety. Given that the GENCODE annotations included alternatively spliced transcripts, the prediction of alternatively spliced genes was a major component of our analysis, and so transcript prediction accuracy calculations were reported along with complete gene, exon, and nucleotide prediction accuracies.
Estimating optimal evidence weights
The EVM training process is divided into three phases described below:
Initially optimized PREDICTION weights
In the first stage, optimal weights are explored for the ABINITIO_PREDICTION class in isolation from evidence of the other classes. The proper balance between the evidence weights applied to exons, introns, and intergenic regions is explored to optimize gene prediction accuracy. Weights are randomly chosen for each ab initio gene prediction type and normalized so that they sum to one. EVM is applied to each reference gene together with a specified length of flanking region. EVM prediction accuracy is measured, and a conglomerate accuracy score is computed as follows:
AccuracyScore = F + gSn + eSn
where F = (2 × nSn × nSp)/(nSn + nSp), Sn = TP/(TP + FN), and Sp = TP/(TP + FP). TP, FP, and FN correspond to true positives, false positives, and false negatives, respectively. nSn and nSp indicate nucleotide sensitivity and specificity, respectively; by the same convention, gSn and eSn indicate gene and exon sensitivity.
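The conglomerate score can be computed directly from the definitions above. In the sketch below the function names are hypothetical, and gSn and eSn are taken to be gene-level and exon-level sensitivity, which is an interpretation by analogy with nSn rather than something stated explicitly.

```python
# Illustrative sketch only; function names are assumptions.


def sensitivity(tp: int, fn: int) -> float:
    """Sn = TP / (TP + FN); defined as 0.0 when there are no positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp: int, fp: int) -> float:
    """Sp = TP / (TP + FP), the definition used in the text."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy_score(n_tp: int, n_fp: int, n_fn: int,
                   g_sn: float, e_sn: float) -> float:
    """AccuracyScore = F + gSn + eSn, with F computed at nucleotide level."""
    n_sn = sensitivity(n_tp, n_fn)
    n_sp = specificity(n_tp, n_fp)
    f = (2 * n_sn * n_sp) / (n_sn + n_sp) if (n_sn + n_sp) else 0.0
    return f + g_sn + e_sn


# nSn = nSp = 0.8 gives F = 0.8; adding gSn = 0.5 and eSn = 0.7 gives 2.0.
print(round(accuracy_score(80, 20, 20, g_sn=0.5, e_sn=0.7), 6))
```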
Twenty random trials are performed. The weight combination that yielded the greatest AccuracyScore is chosen. These weight values are gradually adjusted while applying gradient ascent to find weight values that improve performance.
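The first training phase can be sketched as a random search followed by local refinement. The refinement below is a simple coordinate-wise hill climb with a fixed step, used here as a stand-in for the gradient ascent described; the step size, the function names, and the `evaluate` callback (which would run EVM on the training genes and return the AccuracyScore) are all assumptions.

```python
import random

# Illustrative sketch only; hill climbing stands in for gradient ascent.


def random_weights(names, rng):
    """Draw random weights and normalize them so that they sum to one."""
    raw = [rng.random() for _ in names]
    total = sum(raw)
    return {name: value / total for name, value in zip(names, raw)}


def train_weights(names, evaluate, trials=20, step=0.05, seed=0,
                  max_rounds=1000):
    """Pick the best of `trials` random weightings, then refine it.

    evaluate -- callable mapping a {name: weight} dict to a score to maximize
    """
    rng = random.Random(seed)
    candidates = [random_weights(names, rng) for _ in range(trials)]
    best = max(candidates, key=evaluate)
    best_score = evaluate(best)
    for _ in range(max_rounds):
        improved = False
        for name in names:
            for delta in (step, -step):
                trial = dict(best)
                trial[name] = max(0.0, trial[name] + delta)
                score = evaluate(trial)
                if score > best_score:
                    best, best_score, improved = trial, score, True
        if not improved:
            break
    return best, best_score


# Toy objective with a known optimum, in place of a real EVM evaluation.
target = {"predictorA": 0.7, "predictorB": 0.3}
toy = lambda w: -sum((w[k] - target[k]) ** 2 for k in target)
weights, score = train_weights(list(target), toy)
print({k: round(v, 3) for k, v in weights.items()})
```

Fixing the seed makes the random trials reproducible, which is convenient when comparing training runs.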
Initially optimized best individual evidence weights
Using the combination of weights now temporarily fixed for the ABINITIO_PREDICTION evidence, each other evidence type is introduced separately to find the minimum corresponding weight that provides the greatest AccuracyScore in the context of the ABINITIO_PREDICTION types. The weight for the other evidence type is first set to zero and evaluated. Next, the weight is set to the average weight value of the ABINITIO_PREDICTION types and evaluated. Gradient ascent is performed to explore adjusted weight values and a higher scoring weight. The minimum weight value that yielded the highest AccuracyScore is initially assigned to the other evidence type.
Simultaneous application of all evidence and relative weight refinements
The weight values for all evidence types are adjusted to find weight combinations that demonstrate improved prediction accuracies when all evidence is examined simultaneously. Evidence types are examined in descending order of their initially set weight values computed from phase 1 (ABINITIO_PREDICTION) or phase 2 (other) above. Weight values are gradually adjusted and gradient ascent is applied to explore better performing weight values in the context of the other evidence types. Cycling through the evidence types in this manner continues until no appreciable improvement in performance is observed, at which point the training process ceases and the final weight values are reported.
Evidence weights and EVM prediction accuracies encountered during the training process using the rice data are illustrated in Additional data file 1 (Figure S6).
Manual annotation of gene structures
The genome sequence, ab initio gene predictions, protein alignments, GeneWise predictions, and other plant EST alignments were examined using the Neomorphic/Affymetrix Annotation Station software (described by Haas and coworkers). No rice transcript alignments, either alone or in the context of PASA assemblies, were made available to users, so that we could reasonably estimate optimal gene structure annotation accuracy in the context of ab initio gene predictions and homologies to sequences derived from other organisms. A group of annotators were provided with the same data sets evaluated by EVM, only in graphical form. Annotators were instructed to model a gene structure in the targeted region that best reflected the available evidence using the Annotation Station software. Annotators were not allowed to examine the data deeper than the visual display provided. The sequence alignments themselves were not available except in the context of the glyphs highlighting their end points, and no additional sequence analyses such as running BLAST were allowed. The focus of this effort was not to measure the maximal accuracy of manual gene annotation in general, but only to measure the maximal possible accuracy of an automated annotation tool such as EVM given the restricted inputs.
Additional data files
Abbreviations
EGASP: ENCODE Genome Annotation Assessment Project
ENCODE: ENCyclopedia of DNA Elements
EST: expressed sequence tag
HMM: hidden Markov model
ORF: open reading frame
PASA: Program to Assemble Spliced Alignments
We thank Linda Hannick, Rama Maiti, Vinita Joardar, Mathangi Thiagarajan, Qi Zhao, Hernan Lorenzi, Natalie Federova, and Shu Ouyang for participating in our experiment to assess the accuracy of manual annotation in rice in the absence of rice ESTs and FL-cDNAs. We thank Bill Majoros for edification on the intricacies of gene finding. We thank Bob Zimmerman, Alan Kwan, and Matt Campbell for critiquing the manuscript. We give many thanks to Aaron Mackey and Jason Stajich for providing help and advice on using the Glean software. Work on the rice genome annotation was supported by a National Science Foundation Plant Genome Research Program grant to CRB (DBI-0321538). SLS, JEA, and MP were supported in part by NIH grant R01-LM006845 to SLS. BJH, JRW, JO, and OW were supported by MSC contract NIH-N01-AI-30071.
- Brent MR: Genome annotation past, present, and future: how to define an ORF at each locus. Genome Res. 2005, 15: 1777-1786. 10.1101/gr.3866105.
- Zhang MQ: Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet. 2002, 3: 698-709. 10.1038/nrg890.
- Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997, 268: 78-94. 10.1006/jmbi.1997.0951.
- Majoros WH, Pertea M, Salzberg SL: TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics. 2004, 20: 2878-2879. 10.1093/bioinformatics/bth315.
- Salamov AA, Solovyev VV: Ab initio gene finding in Drosophila genomic DNA. Genome Res. 2000, 10: 516-522. 10.1101/gr.10.4.516.
- Lukashin AV, Borodovsky M: GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 1998, 26: 1107-1115. 10.1093/nar/26.4.1107.
- Pavy N, Rombauts S, Dehais P, Mathe C, Ramana DV, Leroy P, Rouze P: Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics. 1999, 15: 887-899. 10.1093/bioinformatics/15.11.887.
- Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics. 1996, 34: 353-367. 10.1006/geno.1996.0298.
- Guigo R, Agarwal P, Abril JF, Burset M, Fickett JW: An assessment of gene prediction accuracy in large DNA sequences. Genome Res. 2000, 10: 1631-1642. 10.1101/gr.122800.
- Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG: EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006, S21-S31. Suppl 1
- Mott R: EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput Appl Biosci. 1997, 13: 477-478.
- Huang X, Adams MD, Zhou H, Kerlavage AR: A tool for analyzing and annotating genomic sequences. Genomics. 1997, 46: 37-45. 10.1006/geno.1997.4984.
- Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998, 8: 967-974.
- Usuka J, Zhu W, Brendel V: Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics. 2000, 16: 203-211. 10.1093/bioinformatics/16.3.203.
- Kent WJ: BLAT: the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before March 2002.
- Wu TD, Watanabe CK: GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005, 21: 1859-1875. 10.1093/bioinformatics/bti310.
- Slater GS, Birney E: Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005, 6: 31. 10.1186/1471-2105-6-31.
- Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res. 2004, 14: 988-995. 10.1101/gr.1865504.
- Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, Clarke L, Coates G, Cuff J, Curwen V, Cutts T, Down T, Eyras E, Fernandez-Suarez XM, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz HR, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker G, Melsopp C, Meidl P, Mongin E, Pettett R, et al: An overview of Ensembl. Genome Res. 2004, 14: 925-928. 10.1101/gr.1860604.
- Korf I, Flicek P, Duan D, Brent MR: Integrating genomic homology into gene structure prediction. Bioinformatics. 2001, S140-S148. Suppl 1
- Wei C, Brent MR: Using ESTs to improve the accuracy of de novo gene prediction. BMC Bioinformatics. 2006, 7: 327. 10.1186/1471-2105-7-327.
- Gross SS, Brent MR: Using multiple alignments to improve gene prediction. J Comput Biol. 2006, 13: 379-393. 10.1089/cmb.2006.13.379.
- Stanke M, Schoffmann O, Morgenstern B, Waack S: Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics. 2006, 7: 62. 10.1186/1471-2105-7-62.
- Kulp D, Haussler D, Reese MG, Eeckman FH: Integrating database homology in a probabilistic gene structure model. Pac Symp Biocomput. 1997, 232-244.
- Brejova B, Brown DG, Li M, Vinar T: ExonHunter: a comprehensive approach to gene finding. Bioinformatics. 2005, i57-i65. 10.1093/bioinformatics/bti1040. Suppl 1
- Alexandersson M, Cawley S, Pachter L: SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res. 2003, 13: 496-502. 10.1101/gr.424203.
- Majoros WH, Pertea M, Salzberg SL: Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. Bioinformatics. 2005, 21: 1782-1788. 10.1093/bioinformatics/bti297.
- Haas BJ, Wortman JR, Ronning CM, Hannick LI, Smith RK, Maiti R, Chan AP, Yu C, Farzad M, Wu D, White O, Town CD: Complete reannotation of the Arabidopsis genome: methods, tools, protocols and the final release. BMC Biol. 2005, 3: 7. 10.1186/1741-7007-3-7.
- Misra S, Crosby MA, Mungall CJ, Matthews BB, Campbell KS, Hradecky P, Huang Y, Kaminker JS, Millburn GH, Prochnik SE, Smith CD, Tupy JL, Whitfied EJ, Bayraktaroglu L, Berman BP, Bettencourt BR, Celniker SE, de Grey AD, Drysdale RA, Harris NL, Richter J, Russo S, Schroeder AJ, Shu SQ, Stapleton M, Yamada C, Ashburner M, Gelbart WM, Rubin GM, Lewis SE: Annotation of the Drosophila melanogaster euchromatic genome: a systematic review. Genome Biol. 2002, 3: RESEARCH0083. 10.1186/gb-2002-3-12-research0083.
- Loveland J: VEGA, the genome browser with a difference. Brief Bioinform. 2005, 6: 189-193. 10.1093/bib/6.2.189.
- Lewis SE, Searle SM, Harris N, Gibson M, Lyer V, Richter J, Wiel C, Bayraktaroglir L, Birney E, Crosby MA, Kaminker JS, Matthews BB, Prochnik SE, Smithy CD, Tupy JL, Rubin GM, Misra S, Mungall CJ, Clamp ME: Apollo: a sequence annotation editor. Genome Biol. 2002, 3: RESEARCH0082. 10.1186/gb-2002-3-12-research0082.
- Berriman M, Rutherford K: Viewing and annotating sequence data with Artemis. Brief Bioinform. 2003, 4: 124-132. 10.1093/bib/4.2.124.
- Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O: Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003, 31: 5654-5666. 10.1093/nar/gkg770.
- Allen JE, Pertea M, Salzberg SL: Computational gene prediction using multiple sources of evidence. Genome Res. 2004, 14: 142-148. 10.1101/gr.1562804.
- Allen JE, Salzberg SL: JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics. 2005, 21: 3596-3603. 10.1093/bioinformatics/bti609.
- Elsik CG, Mackey AJ, Reese JT, Milshina NV, Roos DS, Weinstock GM: Creating a honey bee consensus gene set. Genome Biol. 2007, 8: R13. 10.1186/gb-2007-8-1-r13.
- Djebali S, Delaplace F, Crollius HR: Exogean: a framework for annotating protein-coding genes in eukaryotic genomic DNA. Genome Biol. 2006, S7.1-S7.10. Suppl 1
- Pertea M: Gene finding in eukaryotes. PhD thesis. 2001, Johns Hopkins University, Baltimore, MA, USA
- Nene V, Wortman JR, Lawson D, Haas B, Kodira C, Tu ZJ, Loftus B, Xi Z, Megy K, Grabherr M, Ren Q, Zdobnov EM, Lobo NF, Campbell KS, Brown SE, Bonaldo MF, Zhu J, Sinkins SP, Hogenkamp DG, Amedeo P, Arensburger P, Atkinson PW, Bidwell S, Biedler J, Birney E, Bruggner RV, Costas J, Coy MR, et al: Genome sequence of Aedes aegypti, a major arbovirus vector. Science. 2007, 316: 1718-1723. 10.1126/science.1138878.
- Haas BJ, Berriman M, Hirai H, Cerqueira GG, Loverde PT, El-Sayed NM: Schistosoma mansoni genome: closing in on a final gene set. Exp Parasitol. 2007, 117: 225-228. 10.1016/j.exppara.2007.06.005.
- EVidenceModeler (EVM). [http://evidencemodeler.sourceforge.net]
- Haas BJ, Volfovsky N, Town CD, Troukhan M, Alexandrov N, Feldmann KA, Flavell RB, White O, Salzberg SL: Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol. 2002, 3: RESEARCH0029. 10.1186/gb-2002-3-6-research0029.
- Zavolan M, Kondo S, Schonbach C, Adachi J, Hume DA, Hayashizaki Y, Gaasterland T: Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. Genome Res. 2003, 13: 1290-1300. 10.1101/gr.1017303.
- Alexandrov NN, Troukhan ME, Brover VV, Tatarinova T, Flavell RB, Feldmann KA: Features of Arabidopsis genes and genome discovered using full-length cDNAs. Plant Mol Biol. 2006, 60: 69-85. 10.1007/s11103-005-2564-9.
- Takeda J, Suzuki Y, Nakao M, Barrero RA, Koyanagi KO, Jin L, Motono C, Hata H, Isogai T, Nagai K, Otsuki T, Kuryshev V, Shionyu M, Yura K, Go M, Thierry-Mieg J, Thierry-Mieg D, Wiemann S, Nomura N, Sugano S, Gojobori T, Imanishi T: Large-scale identification and characterization of alternative splicing variants of human gene transcripts using 56,419 completely sequenced and manually annotated full-length cDNAs. Nucleic Acids Res. 2006, 34: 3917-3928. 10.1093/nar/gkl507.
- ENCODE Project Consortium: The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004, 306: 636-640. 10.1126/science.1105136.
- Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D, Rossier C, Ucla C, Hubbard T, Antonarakis SE, Guigo R: GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006, S4.1-S4.9. Suppl 1
- Reese MG, Guigo R: EGASP: introduction. Genome Biol. 2006, S1.1-S1.3. Suppl 1
- Gene Structure Annotation and Analysis Using PASA. [http://pasa.sourceforge.net]
- RepeatMasker Open-3.0. [http://www.repeatmasker.org]
- Ouyang S, Buell CR: The TIGR Plant Repeat Databases: a collective resource for the identification of repetitive sequences in plants. Nucleic Acids Res. 2004, D360-D363. 10.1093/nar/gkh099. 32 Database
- DFCI - Gene Indices. [http://compbio.dfci.harvard.edu/tgi/tgipage.html]
- Campbell MA, Haas BJ, Hamilton JP, Mount SM, Buell CR: Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis. BMC Genomics. 2006, 7: 327. 10.1186/1471-2164-7-327.
- EGASP Project FTP Site. [ftp://genome.imim.es/pub/projects/gencode/data/egasp05/]
- The JIGSAW Home Page. [http://www.cbcb.umd.edu/software/jigsaw/]
- SourceForge.net: GLEAN. [http://sourceforge.net/projects/glean-gene]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.