Finding genes in bacteria is relatively easy, in large part because bacterial genomes are approximately 90% protein-coding, with relatively short intergenic stretches between genes. The gene-finding problem is mostly about deciding which of the six possible reading frames (three in each direction) contains the protein, and computational gene finders take advantage of this to produce highly accurate results. Thus, although we still don’t know the functions of many bacterial genes, at least we can be confident that we have their amino acid sequences correct.
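To make the reading-frame idea concrete, here is a minimal Python sketch (the function names are my own, for illustration only, and not taken from any real gene finder) that enumerates the six frames of a DNA sequence and measures the longest ATG-to-stop open reading frame in each:

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def six_frames(seq):
    """The six reading frames: offsets 0-2 on each strand."""
    rc = revcomp(seq)
    return [(strand, offset, s[offset:])
            for strand, s in (("+", seq), ("-", rc))
            for offset in range(3)]

def longest_orf(frame_seq):
    """Length (in codons, including start and stop) of the longest
    ATG..stop open reading frame within a single frame."""
    stops = {"TAA", "TAG", "TGA"}
    best, start = 0, None
    for i in range(0, len(frame_seq) - 2, 3):
        codon = frame_seq[i:i + 3]
        if codon == "ATG" and start is None:
            start = i  # first start codon of a candidate ORF
        elif codon in stops and start is not None:
            best = max(best, (i + 3 - start) // 3)
            start = None  # ORF closed; look for the next one
    return best
```

A real bacterial gene finder does far more than this (it scores codon usage, ribosome-binding sites, and so on), but this frame-by-frame scan is the structural core of the problem, and it hints at why the dense, intron-free layout of bacterial genomes makes the task so tractable.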
In eukaryotes, by contrast, the gene-finding problem is far more difficult, because (i) genes are few and far between, and (ii) genes are interrupted by introns. Thus, while 90% of a typical bacterial genome is covered by protein-coding sequences, only about 1.3% of the human genome (40.2 Mb in the CHESS 2.2 database [2]) comprises protein-coding exons. The percentage is even lower in larger genomes, such as the mega-genomes of pine trees and other conifers. For this reason and others, the best automated gene finders are far less accurate on eukaryotes. Manual curation will not solve this quandary, for the obvious reason that it does not scale, and the less-obvious reason that even careful human analysis does not always provide a clear answer. To illustrate the latter point: in a recent comparison of all the protein-coding and lncRNA transcripts in the RefSeq and Gencode human gene databases, only 27.5% of the Gencode transcripts had exactly the same introns as the corresponding RefSeq genes [2]. Thus, even after 18 years of effort, the precise exon–intron structure of many human protein-coding genes is not settled. The annotation of most other eukaryotes—with the exception of small, intensively studied model organisms like yeast, fruit fly and Arabidopsis—is in worse shape than human annotation.
One high-throughput technology provides at least a partial solution to this problem: RNA sequencing (RNA-seq). Prior to the invention of RNA-seq, scientists worked hard to generate full-length transcripts that could provide a “gold standard” annotation for a species. The idea was that if we had the full-length messenger RNA sequence for a gene, we could simply align it to the genome to reveal the gene’s exon–intron structure. The Mammalian Gene Collection, an effort to obtain these RNAs for humans and a few other species, concluded in 2009 with the announcement that 92% of human protein-coding genes had been captured [3]. That project, though extremely useful, was very expensive, not easily scalable, and still not comprehensive. (Notably, the Mammalian Gene Collection only attempted to capture a single isoform of each gene. We now know that most human genes have multiple isoforms.) RNA-seq technology, in contrast, provides a rapid way to capture most of the expressed genes for any species. By aligning RNA-seq reads to a genome and then assembling those reads, we can construct a reasonably good approximation (including alternative isoforms) of the complete gene content of a species, as my colleagues and I have done for the human genome [2].
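The alignment step can be illustrated with a small sketch. Spliced aligners report introns as long gaps in a read’s alignment (the “N” operation in a CIGAR string); collecting gaps that are supported by multiple independent reads gives a first approximation of a gene’s intron coordinates. This is a toy illustration with made-up function names, not the algorithm of any real aligner or assembler:

```python
import re
from collections import Counter

def introns_from_alignment(start, cigar):
    """Extract (intron_start, intron_end) spans from one spliced
    alignment, using simplified CIGAR ops: M (match) and N (intron)."""
    pos = start
    introns = []
    for length, op in re.findall(r"(\d+)([MN])", cigar):
        length = int(length)
        if op == "N":  # a skipped region on the genome = an intron
            introns.append((pos, pos + length - 1))
        pos += length
    return introns

def supported_introns(alignments, min_reads=2):
    """Keep introns observed in at least min_reads spliced alignments;
    alignments is a list of (start_position, cigar_string) pairs."""
    counts = Counter()
    for start, cigar in alignments:
        counts.update(introns_from_alignment(start, cigar))
    return sorted(i for i, n in counts.items() if n >= min_reads)
```

For example, two reads starting at positions 100 and 120, each spliced across the same gap, both yield the intron (150, 349), so it survives the support filter. Requiring multiple supporting reads is one simple defense against the spurious splice junctions that inevitably appear in noisy RNA-seq data.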
Thus, a modern annotation pipeline such as MAKER [4] can use RNA-seq data, combined with alignments to databases of known proteins and other inputs, to do a passably good job of finding all genes and even assigning names to many of them.
This solution comes with several major caveats. First, RNA-seq does not precisely capture all of the genes in a genome. Some genes are expressed at low levels or in only a few tissues, and they might be missed entirely unless the RNA sequencing data are truly comprehensive. In addition, many of the transcripts expressed in a tissue sample are not genes: they might represent incompletely spliced transcripts, or they might simply be noise. Therefore, we need independent verification before we can be certain that any expressed region is a functional gene. Even for genes that are repeatedly expressed at high levels, determining whether they encode proteins or instead represent noncoding RNAs is a still-unsolved problem. The current Gencode human annotation (version 30), for example, contains more RNA genes than protein-coding genes [5], but no one knows what most of those RNA genes do.
Another caveat is that because draft genomes may contain thousands of disconnected contigs, many genes will be broken up among several contigs (or scaffolds) whose order and orientation are unknown. The problem occurs in all species, but it is much worse for draft genomes where the average contig size is smaller than the span of a typical gene. This makes it virtually impossible for annotation software to put genes together correctly; instead, the software will tend to annotate many gene fragments (residing on different contigs) with the same descriptions, and the total gene count might be vastly inflated. Even where they don’t have gaps, some draft genomes have high error rates that may introduce erroneous stop codons or frame shifts in the middle of genes. There is no way that annotation software can easily fix these problems: the only solution is to improve the assemblies and re-annotate.