Identifying protein-coding genes in genomic sequences
© BioMed Central Ltd 2009
Published: 30 January 2009
The vast majority of the biology of a newly sequenced genome is inferred from the set of encoded proteins. Predicting this set is therefore invariably the first step after the completion of the genome DNA sequence. Here we review the main computational pipelines used to generate the human reference protein-coding gene sets.
The genome sequence is an organism's blueprint: the set of instructions dictating its biological traits. The unfolding of these instructions is initiated by the transcription of the DNA into RNA sequences. According to the standard model, the majority of RNA sequences originate from protein-coding genes; that is, they are processed into messenger RNAs (mRNAs) which, after their export to the cytosol, are translated into proteins. While the importance of noncoding RNAs has come to the fore over the past ten years [1–5], proteins are still assumed to be the main functional and structural players in the cell. The delineation of the complete set of protein-coding genes and their alternative splice forms is, therefore, essential to the task of translating the information in the sequence of the genome into biologically relevant knowledge. This is not a trivial task, as illustrated by the fact that many years after the first drafts of the human genome sequence became available [6–8], uncertainty remains regarding the exact number of protein-coding genes, a number that might actually vary between individuals - and even between cells within the same individual - as extensive structural variation has been reported in the human genome [10–12].
Even the concept of a 'gene' is under revision. Genes have long been regarded as discrete entities located linearly along chromosomes, but recent investigations have demonstrated extensive transcriptional overlap between different genes. Specifically, genomic regions from otherwise distinct and apparently well characterized protein-coding loci (which may be very far apart in linear genomic space) often appear to combine to produce transcripts with the potential for encoding novel protein species [13, 14].
Another genome browser supplying sequence and annotation data for a large number of genomes is the University of California, Santa Cruz (UCSC) genome browser database. In April 2007, UCSC released an improved version of their 'Known Gene Set' for the human genome and included putative noncoding RNAs as well as protein-coding genes. Each entry in this set requires the support of a GenBank entry and at least one other line of evidence, except for curated cDNAs, which require no other evidence.
Manual annotation still plays a significant part in annotating high-quality finished genomes. Currently, the National Center for Biotechnology Information (NCBI) reference sequences (RefSeq) collection provides a highly (manually) curated resource of multi-species transcripts, including plant, viral, vertebrate and invertebrate sequences [21, 22]. These are, as their name indicates, transcript-oriented and usually rely on full-length cDNAs for reliable curation, although the dataset also contains predictions using expressed sequence tags (ESTs) and partial cDNAs aligned against genomic sequence using the Gnomon prediction program. Manually reviewed RefSeq nucleotide sequences begin with the reference NM identifier whereas unreviewed predictions have the XM identifier. When a new genome is initially sequenced, researchers usually use the RefSeq data set to identify missing genes or genomic rearrangements within genes, as RefSeq is used internationally as a standard for genome annotation. RefSeq is a very reliable, but also conservative, gene reference set. Other reference sets usually include RefSeq, but extend it substantially. For instance, the UCSC 'Known Genes' has 10% more protein-coding genes, approximately five times as many putative noncoding genes and twice as many splice variants as RefSeq.
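The NM/XM naming convention described above lends itself to a simple programmatic check. The following is a minimal sketch in Python; the function name `refseq_status` and the labels it returns are hypothetical, and only the two prefixes mentioned in the text are handled (every other prefix is lumped into "other").

```python
def refseq_status(accession: str) -> str:
    """Classify a RefSeq nucleotide accession by its two-letter prefix.

    Only the NM (manually reviewed) and XM (unreviewed prediction)
    prefixes discussed in the text are recognized; the returned
    labels are illustrative, not official RefSeq terminology.
    """
    prefix = accession.strip().upper().split("_", 1)[0]
    if prefix == "NM":
        return "reviewed"
    if prefix == "XM":
        return "predicted"
    return "other"


# Example accessions are made up for illustration.
for acc in ["NM_000001.1", "XM_000002.1", "AB000003.1"]:
    print(acc, refseq_status(acc))
```

A filter of this kind would let a pipeline keep only reviewed transcripts when a conservative reference set is wanted.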
A different approach to manual gene annotation is to annotate transcripts aligned to the genome and take the genomic sequences as the reference rather than the cDNAs. This is how the HAVANA group at the Wellcome Trust Sanger Institute produces its annotation on vertebrate sequence. Currently, only three vertebrate genomes - human, mouse and zebrafish - are being fully finished and sequenced to a quality that merits manual annotation. The finished genomic sequence is analyzed using a modified Ensembl pipeline, and BLAST results of cDNAs/ESTs and proteins, along with various ab initio predictions, can be analyzed manually in the annotation browser tool Otterlace. The advantage of genomic annotation compared with cDNA annotation is that more alternatively spliced variants can be predicted, as partial EST evidence and protein evidence can be used, whereas cDNA annotation is limited by the availability of full-length transcripts. Moreover, genomic annotation produces a more comprehensive analysis of pseudogenes. One disadvantage, however, is that if a polymorphism occurs in the reference sequence, a coding transcript cannot be annotated, whereas cDNA annotation can select the major haplotypic form and is, therefore, not limited by a reference sequence.
In 2006, the groups mentioned above (NCBI (RefSeq), UCSC, the Wellcome Trust Sanger Institute (HAVANA) and Ensembl) identified a need to collaborate and produce a consensus gene set for the human reference genome, as there was still no official agreement between the different databases on the human protein-coding genes. Referred to as the Consensus Coding Sequence Set (CCDS), it currently contains only those coding transcripts that are equivalent in each database's gene build from start codon to stop codon. The latest human CCDS release (May 2008) contains 20,151 consensus coding sequences representing 17,052 genes. For the first time, this provides researchers with a consistent, reliable gene set that has been derived independently from a combination of manual and automated annotation by three groups (Ensembl, NCBI and HAVANA) and quality checked at UCSC. The protein-coding genes that differ between the gene sets of the different groups and cannot be merged automatically will be re-examined manually and either rejected or added to the consensus set if they get a unanimous vote from the groups at NCBI, UCSC and HAVANA.
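The inclusion rule described above, keeping only coding transcripts whose coding region is identical from start codon to stop codon in every gene build, amounts to a set intersection. The sketch below illustrates the idea in Python under simplifying assumptions: each coding sequence is represented as a hypothetical tuple of chromosome, strand and coding-segment coordinates, and the source names and coordinates are invented. It is not the actual CCDS pipeline, which also involves quality checks and manual review.

```python
from typing import Dict, Iterable, Set, Tuple

# A coding sequence is modelled as (chromosome, strand, coding segments),
# where the segments are (start, end) pairs running from the start codon
# to the stop codon. This representation is an assumption for illustration.
Cds = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def consensus_cds(gene_sets: Dict[str, Iterable[Cds]]) -> Set[Cds]:
    """Return the coding sequences that are identical in every gene set."""
    per_source = [set(cds_list) for cds_list in gene_sets.values()]
    if not per_source:
        return set()
    return set.intersection(*per_source)


# Made-up coordinates: only `a` has the same structure in all three sets.
a = ("chr1", "+", ((100, 200), (300, 400)))
b = ("chr1", "+", ((100, 200), (300, 450)))  # differs at the last segment
c = ("chr2", "-", ((50, 90),))

builds = {"ensembl": [a, b, c], "ncbi": [a, c], "havana": [a, b]}
print(consensus_cds(builds))
```

Requiring exact coordinate identity is deliberately strict: a single shifted splice site, as in `b`, is enough to exclude a transcript from the consensus.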
Complementary to the CCDS project is the GENCODE project. The GENCODE consortium was initially formed to identify and map all protein-coding genes within the regions selected in the framework of the ENCODE project [29, 30], representing 1% of human genome sequence. This was achieved by a combination of initial manual annotation by HAVANA, computational predictions and experimental validation, and the consequent refinement of the annotation on the basis of these experimental results. The project was funded in 2008 to annotate the whole reference human genome sequence and experimentally verify a number of putative loci. The scaled-up annotation includes identification of pseudogenes and noncoding loci supported by transcript evidence. The initial manual annotation is compared with automated predictions to highlight inconsistencies based on comparative analysis or new transcript data. It is expected that, upon completion in 2011, this gene set will become the standard human gene reference set.
The issue obviously arises of the reliability of the reference sets. Usually, the experimentally verified manual annotations, such as those produced by GENCODE, are considered the most exhaustive and reliable reference human gene sets. Based on 'bona fide' cDNA sequences, the annotated gene models are, in these cases, generally correct - although issues still remain because, on occasion, the same cDNA sequence can be mapped into the human genome through alternative exonic structures. Completeness is more difficult to assess, because it is unclear how representative of the complete human transcriptome the current set of cDNA sequences is.
To assess the completeness of GENCODE, the EGASP community experiment was organized in 2005. In this experiment a number of computational predictions were evaluated against the GENCODE annotation. Then, a subset of high-confidence computational predictions that were not present in the annotation was tested by reverse transcription-polymerase chain reaction (RT-PCR) on a panel of human tissues. Only a handful of these predictions could be verified, strongly suggesting completeness of the GENCODE annotation (with respect to computational predictions of protein-coding genes). A second goal of EGASP was to assess to what extent purely computational methods can reproduce the slow and expensive manual annotations. In this regard, EGASP results indicated that although computational methods are quite accurate in identifying protein-coding exons with an overall accuracy of more than 80% (in terms of both the fraction of real exons correctly identified and the fraction of predicted exons that are real), finding the complete transcript structure is more challenging, with the most accurate methods correctly predicting only about 60% of the annotated protein-coding transcripts. This indicates that computational methods cannot yet totally replace human expertise in gene annotation.
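The two exon-level measures mentioned above (the fraction of real exons correctly identified and the fraction of predicted exons that are real) can be computed directly once annotated and predicted exons are expressed as coordinates. The following Python sketch assumes that an exon counts as correct only when both of its boundaries match exactly; the function name and the coordinates are hypothetical, and the actual EGASP evaluation protocol may differ in its details.

```python
from typing import Iterable, Tuple

Exon = Tuple[str, int, int]  # (chromosome, start, end); illustrative format


def exon_accuracy(annotated: Iterable[Exon],
                  predicted: Iterable[Exon]) -> Tuple[float, float]:
    """Return (fraction of annotated exons recovered,
               fraction of predicted exons that are annotated).

    An exon is counted as correct only if both boundaries match exactly.
    """
    annotated_set, predicted_set = set(annotated), set(predicted)
    correct = annotated_set & predicted_set
    recovered = len(correct) / len(annotated_set) if annotated_set else 0.0
    real = len(correct) / len(predicted_set) if predicted_set else 0.0
    return recovered, real


# Made-up example: 3 of 4 annotated exons are predicted exactly,
# and 3 of the 5 predictions are annotated.
annotation = [("chr1", 100, 200), ("chr1", 300, 400),
              ("chr1", 500, 600), ("chr1", 700, 800)]
prediction = [("chr1", 100, 200), ("chr1", 300, 400), ("chr1", 500, 600),
              ("chr1", 705, 800), ("chr1", 900, 950)]
print(exon_accuracy(annotation, prediction))  # (0.75, 0.6)
```

The same function applied to whole transcript structures instead of single exons would give the stricter transcript-level measure, which is why transcript accuracy is typically lower than exon accuracy.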
After mapping a cDNA to the genome, the protein-coding status of the transcript needs to be assessed, and the boundaries of the eventual coding regions precisely delimited - so that it is possible to identify the correct amino acid sequence of the protein, from which most of the biology of the transcript will be inferred. As direct evidence of the existence of the protein is generally absent, the criterion often used to annotate a transcript as protein-coding is the existence of an open reading frame (ORF). However, this criterion has recently been called into question by a number of methods developed to assess the quality of protein-coding gene annotations. These are based on the principle that gene models that conflict with our current knowledge about functional protein-coding genes are likely to be incorrect. Thus, the rationale of the method of Clamp et al. is that functional protein-coding genes are subject to purifying selection, and are therefore expected to show evolutionary conservation. The authors used two types of measures for the assessment of evolutionary conservation of predicted human genes: reading frame conservation (RFC; based on the observation that indels do not significantly affect the size of functional proteins) and codon substitution frequency (CSF; based on the observation that the pattern of nucleotide substitution in functional protein-coding genes is different from that observed in random DNA). In their analysis of a number of human gene reference sets, Clamp et al. identified around 1,200 human 'orphans': ORFs that lack homology with known genes. Both RFC and CSF analysis revealed that the behavior of many of these human orphans is essentially indistinguishable from that of matched random controls, and is very different from that of non-orphan protein-coding genes. From these results, the authors concluded that, overall, about 15% of the entries in the gene catalogs investigated are not valid protein-coding genes.
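The idea behind reading frame conservation can be illustrated with a small Python sketch. It assumes a simplified scoring scheme: given a pairwise alignment of a candidate coding sequence and its ortholog, the score is the fraction of nucleotides in the first sequence that stay in the same relative reading frame, taking the best of the three possible frame offsets. This is an assumption made for illustration and not necessarily the exact measure used by Clamp et al.; the function name and the toy alignments are hypothetical.

```python
from collections import Counter


def reading_frame_conservation(aligned_a: str, aligned_b: str,
                               gap: str = "-") -> float:
    """Fraction of nucleotides of sequence A whose reading frame is
    conserved in sequence B, for the best of the three frame offsets.

    Both inputs are rows of a pairwise alignment (equal length,
    `gap` marks an indel).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pos_a = pos_b = 0          # ungapped positions reached so far
    offsets = Counter()        # relative frame -> number of aligned bases
    for char_a, char_b in zip(aligned_a, aligned_b):
        if char_a != gap and char_b != gap:
            offsets[(pos_a - pos_b) % 3] += 1
        if char_a != gap:
            pos_a += 1
        if char_b != gap:
            pos_b += 1
    if pos_a == 0:
        return 0.0
    return max(offsets.values(), default=0) / pos_a


# A 1-bp deletion shifts the frame downstream and lowers the score;
# a 3-bp deletion removes a codon but leaves the frame intact.
print(reading_frame_conservation("ATGAAACCCGGG", "ATGAAA-CCGGG"))  # 0.5
print(reading_frame_conservation("ATGAAACCCGGG", "ATGAAA---GGG"))  # 0.75
```

Under this scheme, sequences under purifying selection for a protein product are expected to score high, because indels that are not multiples of three tend to be eliminated, whereas random noncoding sequence accumulates frame-shifting indels freely.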
While the quality-control method of Clamp et al. can distinguish protein-coding genes from non-coding sequences, it is less suitable for identifying gene predictions that are only partially correct. Indeed, if an annotated gene misses one or more exons, or a fraction of one exon, it may still display the expected evolutionary characteristics of protein-coding genes. To find errors in the annotated protein-coding genes, the MisPred approach [32–35] uses several criteria that hold for different subsets of correctly folded, correctly localized, functionally competent protein molecules. Hypothetical proteins that violate any of these rules are judged to be nonfunctional and the corresponding coding regions to be misidentified. For example, one of the quality-control tools of this approach is based on the observation that the number of residues in closely related members of a globular protein domain family usually falls within a relatively narrow range. Accordingly, proteins containing domains that consist of significantly larger or smaller numbers of residues than closely related members of the same family may be suspected to be nonviable and the corresponding genes to be mispredicted. Several quality-control tools in MisPred address the issue of whether the predicted protein is able to reach the cellular compartment where it could be properly folded, stable and functional. The rationale of these tools is that mislocalized proteins are usually misfolded, unstable and nonfunctional. For example, predicted proteins that contain extracellular domains but lack sequence signals that could direct these domains to the extracellular space are likely to be misfolded and nonfunctional.
Analyses of predicted human sequences with MisPred tools revealed that 2.3% of Ensembl entries (v41) and 3.4% of proteins predicted by the NCBI's Gnomon pipeline are likely to be mislocalized and/or misfolded as they lack appropriate sequence signals or they contain domains that deviate from normal size.
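The domain-size criterion described above can be sketched as a simple outlier test. The Python example below flags a predicted domain whose length deviates from the median length of related family members by more than a chosen fraction. The 30% tolerance, the function name and the example lengths are all assumptions for illustration; they are not the thresholds or statistics actually used by MisPred.

```python
from statistics import median
from typing import Sequence


def domain_length_is_suspect(length: int,
                             family_lengths: Sequence[int],
                             tolerance: float = 0.3) -> bool:
    """Return True if a domain is much longer or shorter than its family.

    `tolerance` is the allowed relative deviation from the family
    median; the default of 0.3 is an arbitrary illustrative value.
    """
    if not family_lengths:
        raise ValueError("at least one family member length is required")
    typical = median(family_lengths)
    return abs(length - typical) > tolerance * typical


# Made-up residue counts for closely related members of one domain family.
family = [108, 110, 112, 115, 109]
print(domain_length_is_suspect(111, family))  # False: within the usual range
print(domain_length_is_suspect(60, family))   # True: domain looks truncated
```

A truncated domain of this kind is what one would expect when a gene model misses an exon, which is the type of partial error that conservation-based checks alone tend to overlook.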
In a similar spirit, the EPipe pipeline of the BioSapiens consortium incorporates a variety of tools to assess the structural and functional properties of hypothetical proteins. Analysis of the GENCODE peptides with these tools revealed that many of the potential alternative gene products have markedly different structure and function from their constitutively spliced counterparts. For the vast majority of these alternatively spliced forms, there is little evidence that they have a role as functional proteins, and many splice variants encode abnormal proteins that are mislocalized and/or misfolded.
Alternative splicing is common in mammalian genomes, and it has been suggested to be a means of increasing protein complexity from a limited number of genes. Therefore, any complete gene set should include annotation of the protein-coding variants. Detailed cDNA mapping into the genome, as in the GENCODE annotation, reveals that alternative splicing is widespread, affecting more than 86% of multi-exon gene loci with an average of 5.7 transcript variants per locus. While this is a proportion of alternative-splicing events much larger than that in other human reference gene sets, the use of novel high-throughput methods that concatenate and sequence the 5' tags of transcripts (cap analysis gene expression (CAGE)) or sequence paired 5' and 3' cDNA ends (5' paired-end ditags (5' PETs)) has revealed that traditional methods based on cDNA clone sequencing were not fully surveying the complexity of mammalian transcriptomes. Similarly, the (re)analysis of gene-trapping sequences has unveiled thousands of novel transcripts. Tiling oligonucleotide arrays that monitor the expression of the non-repeated fraction of the genome have consistently identified many more transcribed fragments than previously anticipated [38, 39]. The combination of all these experimental approaches in the frame of the ENCODE project surprisingly showed that more than 90% of the genome is transcribed as primary RNA, with at least 15% being incorporated into processed transcripts. Many such novel transcripts map to protein-coding loci, as revealed by experiments in which RACE (rapid amplification of cDNA ends) products originating in these loci were hybridized onto tiling arrays. When applied to the ENCODE regions, these experiments yielded as many novel as annotated exons. Often these exons corresponded to tissue-specific 5' distal transcription start sites (TSSs).
These distal TSSs map hundreds of kilobases upstream of the currently annotated TSS and often overlap with a 5'-positioned gene, suggesting extensive overlap between protein-coding loci (Additional data file 2). Next-generation sequencing will further enhance our capacity to sequence the transcriptome of the cell (RNAseq). Indeed, preliminary results demonstrate that RNAseq can detect 25% more genes than microarrays can and that a third of the sequences emanate from unannotated regions [40–45].
Interestingly, only a small fraction of these novel transcripts seem to have protein-coding capacity - often through transcription-induced chimeras that fuse two different ORFs that may be encoded by genes far apart in the genome [13, 46, 47]. Instead, the majority correspond to 'novel' noncoding RNA classes, such as transcribed pseudogenes [48–50], antisense transcripts [51–53] and structured RNAs [54, 55], that might regulate transcription and/or translation. For example, Watanabe et al. recently described precursor transcripts of small interfering RNAs (siRNAs) that are derived from transcribed pseudogenes. Other yet-unannotated RNAs appear to be processed into short RNAs, some of which, like the 'promoter-associated sRNAs' (PASRs) and 'termini-associated sRNAs' (TASRs), are coupled to the expression state of protein-coding genes [2, 57]. Finally, it was postulated that some of these novel transcripts might be the outcome of interchromosomal transcript chimerism: that is, chimeric transcripts resulting from the proximity of active genes in so-called transcription factories.
In summary, recent technological developments and large-scale whole-genome analyses have shown that mammalian transcriptomes are composed of a swarming mass of different overlapping transcripts, sometimes originating from both strands, and suggest that only a small fraction of the transcriptional complexity has been discovered. Little evidence exists, however, that the majority of this transcript complexity leads to protein complexity. In fact, all evidence suggests otherwise - that the human protein-coding gene set is near consolidation. Thus, the 5.7 average transcripts per coding locus annotated in GENCODE translates to only 1.7 proteins per locus (because a large fraction of transcript variation corresponds to noncoding transcripts or accumulates in the untranslated regions of coding transcripts). Moreover, if the GENCODE proteins flagged as problematic by the protein-assessment methods discussed above are ignored, there are barely 1.3 annotated proteins per locus - a somewhat unexpected return to one of the founding axioms of molecular biology: Beadle and Tatum's 'one gene one protein' principle. The discrepancy between a complex, variable and largely unexplored population of RNA molecules and a relatively small, stable, and well defined population of proteins constitutes one of the challenges that molecular biology needs to address to fully elucidate cellular function.
Additional data file 1 contains a table listing software used for gene prediction and annotation. The programs are categorized according to the sources of information utilized and each listing includes a literature reference and URL where the software may be obtained. This list is meant to be representative rather than comprehensive. Additional data file 2 contains a figure showing novel transcripts discovered through a combination of directed RACE and hybridization onto tiling arrays.
This work was carried out as part of the BioSapiens project. The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT-2003-503265. AR, SA and RG also acknowledge support from grants U01HG003150 and U01HG003147 from the National Human Genome Research Institute, NIH. RG acknowledges support from grant BIO2006-03380 from the Spanish Ministry of Education and Science. AR and SA acknowledge support from the EU AnEUploidy project, and the NCCR Frontiers in Genetics. LP thanks the Hungarian National Office for Research and Technology for partial support under grant no. RET14/2005. JA's work is supported by the Wellcome Trust.