Between a chicken and a grape: estimating the number of human genes
© BioMed Central Ltd 2010
Published: 5 May 2010
Many people expected the question 'How many genes in the human genome?' to be resolved with the publication of the genome sequence in 2001, but estimates continue to fluctuate.
Ever since the discovery of the genetic code, scientists have been trying to catalog all the genes in the human genome. Over the years, the best estimate of the number of human genes has steadily decreased, but we still do not have an accurate count. Here we review the history of efforts to establish the human gene count and present the current best estimates.
The first attempt to estimate the number of genes in the human genome appeared more than 45 years ago, while the genetic code was still being deciphered. Friedrich Vogel published his 'preliminary estimate' in 1964, based on the number of amino acids in the alpha- and beta-chains of hemoglobin (141 and 146, respectively). Knowing that three nucleotides corresponded to each amino acid, he extrapolated to compute the molecular weight of the DNA comprising these genes. He then made several assumptions in order to produce his estimate: that these proteins were typical in size (they are actually smaller than average); that nucleotide sequences were uninterrupted on the chromosomes (introns were discovered more than 10 years later [2, 3]); and that the entire genome was protein coding. All these assumptions were reasonable at the time, but later discoveries would reveal that none of them was correct. Vogel then used the molecular weight of the human haploid chromosomes to correctly calculate the genome size as 3 × 10⁹ nucleotides, and dividing that by the size of a 'typical' gene, came up with an estimate of 6.7 million genes.
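Vogel's arithmetic is easy to reproduce. The short Python sketch below is purely illustrative: it takes an 'average' globin chain of roughly 144 amino acids as the typical protein size and uses the round figures quoted above, so it recovers his published number approximately rather than exactly.

    # Reproduce Vogel's 1964 back-of-envelope estimate (illustrative only).
    aa_per_protein = (141 + 146) / 2            # average of the hemoglobin alpha and beta chains
    nt_per_gene = aa_per_protein * 3            # three nucleotides per codon; no introns assumed
    genome_size_nt = 3e9                        # haploid genome size in nucleotides
    gene_count = genome_size_nt / nt_per_gene   # assumes the entire genome is protein coding
    print(f"{gene_count / 1e6:.1f} million genes")  # about 7.0 million, close to Vogel's 6.7 million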
Many people, including many geneticists, expected that we would have a definitive gene count when the human genome was finally completed, and indeed one of the main surprises upon the initial publication of the human genome in February 2001 [5, 6] was that the number had again dropped, quite precipitously. However, as we shall see, the publication of the human genome did not come anywhere close to producing a precise gene list or even a gene count, and in the years since the number has continued to fluctuate. As a result, even today's best estimates still have a large amount of uncertainty associated with them.
In order to count genes, we need to define what we mean by a 'gene', a term whose meaning has changed dramatically over the past century. For our discussion, we will restrict the definition of gene to a region of the genome that is transcribed into messenger RNA and translated into one or more proteins. When multiple proteins are translated from the same region due to alternative mRNA splicing, we will consider this collection of alternative isoforms to be a single gene. In this respect, our definition of a gene is equivalent to what may also be called a chromosomal locus. We will exclude non-protein-coding RNA genes (such as microRNAs (miRNAs) and small nuclear RNAs (snRNAs)), in part because of the even greater uncertainty surrounding their numbers. In recent years, as a result of the dramatic breakthroughs in our understanding of RNA interference and miRNAs, the number and variety of known RNA genes have grown rapidly, and we expect that it will be many more years before we have a clear picture of how many of these non-coding genes exist in the human genome.
With the advent of automated DNA sequencing, it became possible to estimate the number of human genes more accurately. The most promising approach, which was used by many groups in the 1990s, was to capture mRNA transcripts in a cell by making use of their polyadenylated (poly(A)) 3' ends. Using poly(T) sequences as primers, researchers could use reverse transcription-polymerase chain reaction (RT-PCR) to capture and sequence large numbers of expressed genes in a cell. At a time when the human genome project was just getting under way, these expressed sequence tags (ESTs) represented a shortcut to capturing the protein-coding genes in the genome. In 1995, one of the first large-scale surveys of human genes used this approach to construct 300 complementary DNA (cDNA) libraries from 37 distinct organs and tissues, yielding 87,983 distinct sequences, many of them assembled from multiple overlapping ESTs. This result was consistent with the NIH/DOE estimate of 100,000 genes in the human genome.
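To give a concrete, if greatly simplified, picture of how raw EST reads were reduced to distinct sequences, the Python sketch below clusters hypothetical reads that share a long exact overlap. Real EST assemblers of that era used full alignment with error tolerance, so this is only a toy illustration; the function names and the overlap threshold are our own.

    # Toy single-linkage clustering of ESTs by exact overlap (illustration only;
    # real assemblers use alignment-based overlaps that tolerate sequencing errors).
    def overlaps(a, b, min_len=20):
        """True if a suffix of one read exactly matches a prefix of the other."""
        for x, y in ((a, b), (b, a)):
            for k in range(min(len(x), len(y)), min_len - 1, -1):
                if x.endswith(y[:k]):
                    return True
        return False

    def cluster_ests(reads, min_len=20):
        """Group reads into clusters; each cluster approximates one distinct transcript."""
        clusters = []
        for read in reads:
            hits = [c for c in clusters if any(overlaps(read, r, min_len) for r in c)]
            merged = [read] + [r for c in hits for r in c]
            clusters = [c for c in clusters if c not in hits] + [merged]
        return clusters

Counting len(cluster_ests(reads)) over a full EST collection gives the number of distinct sequences, analogous to the 87,983 reported in that survey.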
These two estimates, 64,000 and 80,000, reduced the expected gene count somewhat, but even in 1994 there was little agreement on which number was closer to the truth. In a study that unified physical maps, genetic maps, and the sequence data available at the time, Schuler et al. reported in 1996 that the genome held 50,000 to 100,000 genes, although their mapping effort only captured 16,000.
In 2000, shortly before the human genome was published, several additional estimates appeared: Roest et al. estimated 28,000 to 34,000 genes using alignments to pufferfish, and two new EST-based estimates reported 35,000 and 57,000 genes. This set the stage for the human genome paper, which was soon to appear.
To better understand the source of this continuing uncertainty about the gene count, it is instructive to mention a few of the most significant advances in computational gene prediction. (For a more comprehensive review of gene structure prediction methods, the interested reader can consult several recent reviews [19–21].)
One of the oldest and most reliable ways to identify a gene in a newly sequenced genome is by locating a highly similar protein-coding sequence in another organism. Together with EST and cDNA alignments, gene finding by homology is the first step in all the major annotation pipelines. But even the most thorough EST sequencing projects fail to capture many exons and genes. The discovery of these genes is still dependent, at least in part, on de novo gene finders that only require information inherent in the DNA sequence itself.
Computational gene recognition began about 30 years ago, when it was observed that statistical analysis could detect differences between protein-coding and non-coding nucleotide sequences [22–24]. Early gene-prediction programs attempted to identify relatively few properties of genes, such as the signals around splice sites, and they made simplifying assumptions to make the problem more tractable. Automated methods took a significant step forward with the development of gene-finding systems designed to predict any number of complete gene structures transcribed from either strand of the genome. The most successful framework for these systems was the generalized hidden Markov model (GHMM) approach. Thanks to their modularity and their capability to model variable-length features, GHMMs are well suited to modeling the statistical properties of genes. Genscan was one of the first of these, in 1997, and it was also the first de novo gene predictor to reach 80% exon-level accuracy on a human benchmark set. Despite its performance on coding exons, Genscan's gene-level accuracy (the proportion of genes for which it correctly predicts every exon) on the human genome was only about 10%. One reason for the low gene-level accuracy is that typical human genes contain 5 to 10 exons, and even at 80% accuracy per exon, the likelihood of getting all the exons correct for any particular gene is low.
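The arithmetic behind that last observation is simple. Assuming, for illustration, that exons are predicted independently at 80% accuracy each, the chance of getting an entire gene exactly right drops quickly with exon count:

    # Probability of predicting every exon of a gene correctly, assuming
    # independent 80% per-exon accuracy (a simplification for illustration).
    per_exon_accuracy = 0.80
    for n_exons in (5, 7, 10):
        print(n_exons, "exons:", round(per_exon_accuracy ** n_exons, 2))
    # 5 exons: 0.33, 7 exons: 0.21, 10 exons: 0.11

which is consistent with Genscan's roughly 10% gene-level accuracy on typical multi-exon human genes.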
Although later gene finders would improve on Genscan's results, the next real leap in accuracy came with the development of comparative gene finders. Comparative gene finders use patterns of conservation between two related species, such as human and mouse, to predict the location and structure of protein-coding genes. They can also use the GHMM framework. The biggest effect of using two genomes at once was to reduce the number of false-positive predictions: using human-mouse alignments, Twinscan, a dual-genome gene finder, predicted 25,600 human genes versus 45,000 predicted by Genscan.
Until 2007, GHMMs were the dominant framework for de novo gene finders, but this changed when conditional random fields (CRFs), a new class of discriminative models, were introduced as a means of using more than two genomes simultaneously. Unlike GHMMs, which are trained by maximum likelihood to generate sequences statistically similar to actual DNA sequences, CRFs are trained to discriminate between genomic elements of interest in order to maximize annotation accuracy. In addition, they are capable of utilizing external evidence and submodels that are not inherently probabilistic. Through the use of 11 informant genomes, CONTRAST predicted the exact exon-intron structure of 59% of known human protein-coding genes, compared with 25 to 35% for the best previous methods. This is a very strict measure of accuracy: if even one splice site of a multi-exon gene is incorrect, the entire gene is considered to be wrong. Note, however, that all de novo methods have a significant false-positive rate, predicting many exons (and genes) that do not appear to be genuine. Pseudogenes are one source of false predictions, although the precise reasons for the high false-positive rates have never been fully determined.
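To make the strictness of this measure concrete, the sketch below shows one way such gene-level accuracy might be computed; the data structures (dictionaries of exon coordinate lists keyed by gene) are hypothetical and not taken from any particular evaluation pipeline.

    # A gene counts as correct only if every exon boundary matches the reference
    # annotation exactly (hypothetical data structures, for illustration).
    def exact_match(predicted_exons, reference_exons):
        """Each argument is a list of (start, end) coordinates for one gene."""
        return sorted(predicted_exons) == sorted(reference_exons)

    def gene_level_accuracy(predictions, references):
        """Fraction of reference genes whose exon-intron structure is predicted exactly."""
        correct = sum(1 for gene_id, exons in references.items()
                      if exact_match(predictions.get(gene_id, []), exons))
        return correct / len(references)

A single mispredicted splice site changes one (start, end) pair, so the whole gene is scored as wrong, which is why even accurate exon-level predictors score poorly by this criterion.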
Despite a steady increase in accuracy over the years, de novo gene predictors are still not accurate enough to rely on for the definitive human gene list. Much greater gains in accuracy have been made through advances at the level of integrative evidence-based methods, such as those employed by JIGSAW. By effectively combining multiple forms of evidence generated from a diverse set of sources, including gene finders, protein sequence alignments, EST and cDNA alignments, and splice-site predictions, JIGSAW's predictions are exactly correct for approximately 75% and partially correct for 97% of human genes. Similar integrated methods are used to generate the gene lists at Ensembl and the National Center for Biotechnology Information (NCBI), which uses the Gnomon system.
The release of the draft human genome sequence in 2001 revealed a much lower human gene count than expected [6, 34]. The paper published by the public consortium estimated 30,000 to 40,000 protein-coding genes. This number was in rough agreement with the count in the private consortium's paper, which reported 26,588 protein-coding genes with 'strong' evidence, and an additional 12,000 computationally predicted genes with weaker evidence. Strong evidence included similarity to previously known proteins, homology to another mammal, and EST evidence. Weak genes were those with homology to mouse but no other supporting evidence. After 3 years of detailed finishing work, a much more complete draft genome was published in 2004, and along with this more complete sequence, the public consortium announced a new, much lower, estimate of human protein-coding genes, only 20,000 to 25,000. This low number - lower even than that of the model plant Arabidopsis thaliana - was surprising to scientists across a wide range of fields, who had expected the number of genes to be a measure of organismal complexity. Furthermore, the imprecision of the estimate raised questions about the validity of many predicted genes.
Although the near-finished human genome sequence now covers 99% of the euchromatic (or gene-containing) genome at 99.999% accuracy, the exact number of human genes is still unknown. The two leading repositories of genome annotation, relied on by most researchers looking for genes, are the databases at Ensembl and NCBI. At present, Ensembl lists 22,619 human protein-coding genes, 286 more than the 22,333 protein-coding genes in NCBI's RefSeq database. This Ensembl total excludes 1,002 genes mapped onto alternative MHC regions on chromosome 6. The gene count from NCBI includes all protein-coding genes in RefSeq that either have been manually curated or have supporting cDNA evidence, and that map onto the current human reference assembly (GRCh37). Another popular resource, the University of California at Santa Cruz (UCSC) genome browser, lists 21,814 'known' protein-coding genes. The 'known' genes list was created by mapping human RefSeq mRNA sequences to the genome.
In an effort to identify a core set of human genes that are universally agreed upon, the collaborative consensus coding sequence (CCDS) project tracks identical protein annotations that are consistently represented at NCBI, Ensembl, and the UCSC Genome Browser. As of January 2010, CCDS contained 18,173 human genes shared by all three (counting alternative splice variants, where one gene may be represented by two or more entries, it lists 23,739 protein-coding entries). Because CCDS takes an extremely conservative strategy, its gene list represents a lower bound on the total number of human genes. Indeed, in its original incarnation in 2005, it listed only 13,142 genes, and the total has steadily grown since then.
Currently, the average number of genes listed in the human gene catalogs appears to be somewhere around 22,500, with an uncertainty of around 2,000 genes. One recent report claims that this number is much too high: Clamp et al. used a conservation-based method, relying on similarity to the mouse and dog genomes as well as other techniques, to reduce it to about 20,500 'valid' protein-coding genes. They discarded as invalid any genes that appeared to be retroposons, pseudogenes, or other miscellaneous artifacts, as well as 'orphan' DNA sequences. These orphans have many features of protein-coding genes, but are not conserved in other mammalian genomes, including those of chimpanzees and macaques. Because there were a relatively large number of orphans compared with the otherwise very small gene differences between humans and chimps, Clamp et al. rejected as implausible the alternative hypothesis that the orphans are human-specific genes.
Recently, the Mammalian Gene Collection (MGC), a multi-year effort to produce full-length cDNA clones for all human genes, reported the completion of its work. This report describes 18,877 human protein-coding genes 'with curated RefSeq transcripts', of which the MGC has produced clones for 17,421 (92%). The same report noted that recent efforts using comparative sequence data and computational gene finding, followed by confirmation with RT-PCR, had confirmed 563 distinct genes that were missing from the cDNA-based RefSeq and Vega collections at the time. The MGC also excluded the transcripts of many single-exon genes and genes encoding proteins shorter than 100 amino acids, in order to avoid including pseudogenes, although its own report found that out of a set of 351 'likely' single-exon genes, 198 (56%) were confirmed via RT-PCR. Thus, although the 18,877 figure is substantially lower than the totals in Ensembl and RefSeq, at least some of the discrepancy is due to the conservative strategy used by the MGC to identify protein-coding genes.
Comparative genome analysis suggests that the number of protein-coding genes does not differ very much from mammal to mammal. When new genes arise in a species, most are the result of duplications of previously existing genes, followed by neofunctionalization. However, entirely novel genes must arise at some point, although the rate of gene 'birth' is not precisely known. Interestingly, a recent study provides the first evidence for the de novo origin of human protein-coding genes, which evolved from non-coding DNA after the divergence of humans and chimpanzees. In this study, Knowles and McLysaght identified three entirely novel genes, all of which have strong mRNA expression evidence supporting transcription and peptide matches from proteomics databases supporting translation. The orthologous DNA sequence exists in other primate genomes - chimp, macaque, gorilla, gibbon, and orangutan - but in the other primates the DNA has disabling mutations that disrupt the reading frame. By extrapolating their findings to the whole human genome, the authors estimate that 18 genes are likely to have arisen de novo in humans since our divergence from chimps.
In addition to the ongoing uncertainty about the precise number of protein-coding genes, recent evidence has made it clear that different humans have slightly different individual gene sets. A major source of such differences is variation in the number of segmental duplications scattered across the genome. Sebat et al. surveyed 20 individuals for copy-number polymorphisms and found 70 different genes included in regions with variable copy numbers. Iafrate et al. found more than 100 gene-containing regions that varied in copy number among individuals. Most recently, Alkan et al. estimated, on the basis of three sequenced human genomes, that gene counts vary by 73 to 87 genes between any two individuals.
In another recent study, Li et al. sequenced and assembled two human genomes, one from Africa and one from Asia, and compared them with the reference human genome at NCBI. They identified around 5 Mb of novel sequence in each of the new genomes, and they estimate that the human 'pangenome', which would include all the DNA of every individual human, should contain up to 40 Mb of sequence additional to the reference genome, including an unknown number of genes. This additional sequence amounts to about 1.3% of the genome, which suggests that the eventual gene count might grow by roughly the same proportion.
We aligned all human genes from NCBI's RefSeq database to the Ensembl gene set in an attempt to explain the differences, but although the total counts differ by fewer than 300, there are several thousand genes in each set that do not map cleanly onto the other, many of them representing genes of unknown function. Our personal best guess for the total number of human genes is 22,333, which corresponds to the current gene total at NCBI. We prefer this to the slightly higher Ensembl gene count both because the NCBI annotation is slightly more conservative and because recent compelling arguments support an even lower gene total [41, 42]. This number could easily shrink or grow by 1,000 genes in the near future. However, recent analyses make it clear that even if we agree on a complete list of human genes, any particular individual might be missing some of the genes on that list. The genome sequence is complete enough now (although it is not yet finished) that few new genes are likely to be discovered in the gaps, but it seems likely that more genes remain to be discovered by sequencing more individuals. Additional discoveries are likely to keep our best estimate of this basic fact about the human genome moving up and down for many years to come.
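A rough version of the catalog comparison described above can be expressed as interval overlap between two gene sets; the sketch below assumes each catalog is simply a list of (chromosome, start, end) gene coordinates and applies an arbitrary 50% overlap rule, whereas our actual comparison relied on sequence alignment of the annotated genes.

    # Crude comparison of two gene catalogs by coordinate overlap (illustration only;
    # the analysis described in the text used sequence alignment, not interval overlap).
    def overlap_fraction(gene_a, gene_b):
        """Fraction of the shorter gene covered by the overlap between the two intervals."""
        chrom_a, start_a, end_a = gene_a
        chrom_b, start_b, end_b = gene_b
        if chrom_a != chrom_b:
            return 0.0
        shared = max(0, min(end_a, end_b) - max(start_a, start_b))
        return shared / min(end_a - start_a, end_b - start_b)

    def unmapped_genes(catalog_a, catalog_b, min_frac=0.5):
        """Genes in catalog_a with no gene in catalog_b covering at least min_frac of the shorter."""
        return [g for g in catalog_a
                if not any(overlap_fraction(g, h) >= min_frac for h in catalog_b)]

Running such a comparison in both directions is one way to flag candidate genes that lack a clear counterpart in the other annotation.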
We thank Carl Kingsford for helpful comments and suggestions on the manuscript. MP and SLS were supported in part by grants R01-LM006845 and R01-GM083873 from the US National Institutes of Health.