How many genes in a genome?
© BioMed Central Ltd 2003
Published: 22 December 2003
Despite the current good level of annotation, the Drosophila genome still holds surprises. A recent study has added perhaps 2,000 genes to the predicted total, and raises a number of questions about how genome annotation data should be stored and presented.
As sequenced and assembled whole genomes first began to appear in earnest, there was much discussion about the number of genes in those genomes, usually accompanied by comments about the surprisingly low numbers of genes. Just how fuzzy those numbers are is not generally appreciated. The well-annotated Drosophila genome [1, 2] is a blessing for Drosophila scientists, who make choices every day on the basis of predicted genes - from picking the exons to sequence in the hunt for the genetic lesion in a favorite mutant, to designing elements for a microarray. As good as this annotation is, Hild et al. [3] show, in this issue of Genome Biology, that we still have no clear idea how many genes there are in Drosophila. This should be a little sobering, as the picture for most other sequenced genomes is even less clear.
The goal of annotation is to map features on the genome, initially focusing on developing models for genes that encode proteins. Good annotation requires an assembled sequence and a repository of the evidence for important genome features such as transcripts and sequence homologies to known genes. The annotation itself adds critical and explanatory notes to the genome. Thus, annotation is an executive decision about the relevancy, accuracy, and quality of the evidence, and by definition exposes the curator's point of view. The current Drosophila genome annotation (Release 3.1, housed at FlyBase [4]) is conservative. The Hild et al. [3] annotation is not.
Hild et al. [3] used a more loosely tuned gene-finding algorithm than previous annotations, and in total this generated around 22,000 gene models, including nearly all of the approximately 14,000 Release 3.1 genes. The price one must pay for exposing more of the genes is the generation of more false gene models, in a classical sensitivity/specificity tradeoff. In order to test the more loosely generated models systematically, Hild et al. amplified a genomic region corresponding to each model and used the amplicons as elements on an array to probe for expressed RNAs. They then asked how many of the predicted genes produce transcripts. Microarrays are not sufficiently sensitive to detect every real transcript, and detection of a signal is not always definitive, but detection is very strong evidence in support of RNA synthesis directed by the genome segment in question. Using this metric, around 75% of the predicted genes common to Release 3.1 and to the study by Hild et al., and around 50% of the predicted genes unique to Hild et al., are transcribed at some point in the Drosophila life cycle. Spot-checking by reverse-transcriptase-coupled PCR and in situ hybridization suggests that there are no systematic problems with the array results. Thus, these data strongly suggest that there are many transcribed regions of the genome that fall outside of the Release 3.1 predictions. The lower detection frequency in the Hild et al. unique set than in the set shared with Release 3.1 also indicates that there is more 'chaff' as one loosens the gene calling.
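The rounded figures quoted above can be turned into a back-of-the-envelope count. The sketch below is only an illustration of that arithmetic, not the calculation made by Hild et al.: it assumes that all 14,000 Release 3.1 genes are recovered among the 22,000 models (the text says "nearly all"), and the function name `detected_counts` is hypothetical.

```python
# Rounded figures quoted in the text; all are approximate.
TOTAL_MODELS = 22_000     # gene models in the Hild et al. annotation
SHARED_MODELS = 14_000    # Release 3.1 genes, assumed here to be fully recovered
DETECTED_SHARED = 0.75    # fraction of shared models detected on the array
DETECTED_UNIQUE = 0.50    # fraction of Hild et al.-only models detected


def detected_counts(total, shared, frac_shared, frac_unique):
    """Return (unique models, detected shared, detected unique) as integers."""
    unique = total - shared
    return unique, round(shared * frac_shared), round(unique * frac_unique)


unique, hits_shared, hits_unique = detected_counts(
    TOTAL_MODELS, SHARED_MODELS, DETECTED_SHARED, DETECTED_UNIQUE
)
print(f"models unique to Hild et al.: {unique}")
print(f"shared models with detected transcripts: {hits_shared}")
print(f"unique models with detected transcripts: {hits_unique}")
```

Under these assumptions roughly 8,000 models fall outside Release 3.1, and about half of them give an array signal. A detected transcript is evidence for a gene rather than proof of one, so a count of this kind is best read as an upper bound on the number of new genes supported by expression data alone.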
While finding a transcript is good evidence for the presence of a gene, not all transcripts are from genes - depending on what you call a gene [5], the range of transcriptional noise, and a host of other debatable points. While deciding what qualifies as a gene is non-trivial, there are a number of ways to assay for functional importance. A particularly stringent phenotypic test involves asking whether a given transcript is required for cell viability. The amplicons used in the Hild et al. microarray form a core set of reagents for genome-wide assays for phenotypes by RNA interference (RNAi) at a newly opened screening center [6]. RNAi is a powerful method for dramatically downregulating the steady-state levels of a given transcript [7]. Systematic RNAi experiments on tissue-culture cells show that transcripts from about 3% of Release 3.1 predicted genes and approximately 1% of the transcripts from the Hild et al. predicted genes are required for Drosophila cell viability. Thus, there are genes required for the viability of tissue-culture cells that evaded annotation in Release 3.1. Clearly, gene models with supporting evidence for transcription, regulated expression in space and time, and genetic function are worth annotating. On the basis of this extensive set of tests, Hild et al. [3] make some rough calculations and suggest that there are at least 2,000 new genes to add to the Drosophila total.
Finding genes without simultaneously collecting large amounts of useless information is hard. Are more genomes the solution to gene finding? The highly anticipated sequenced genomes of many related Drosophila species [8] will certainly be extremely important for informing the annotation of Drosophila melanogaster [9]. Sequence similarities and the relative ease of determining sequence quality will make comparative genomics evidence strong. But, as is pointed out by Hild et al., it may not be a panacea: most of the novel predictions of Hild et al. do not show good sequence conservation between Drosophila melanogaster and other genomes, including those of insects. There are probably several reasons for this. Not all the genes in a genome evolve at the same rate or have the same sequence constraints. One can also imagine situations where the act of transcription itself carries the genetic function (to promote or block the access of transcription factors to DNA sites, for example). More genomes alone are not enough.
The biology of the organism drives the annotation of its genome. The work by Hild et al. on Drosophila and recent work on mammalian genomes clearly point out the value of experimental data in making the distinction between genes and chaff [3, 10, 11]. We should build on the work of Hild et al. and tackle the genome head-on. We should be using a Drosophila tiling-path resource (covering the whole genome with amplicons or oligonucleotides rather than sampling only the gene models) for mapping transcripts and for systematically surveying the genome for function via RNAi experiments. We can also use tiling-path arrays to map the 'chromatin code' of DNA-associated proteins and the in vivo occupancy of transcription factors, via procedures such as chromatin immunoprecipitation, as well as to map the replication origins. This need for more data has been recognized by the NIH, which has launched a project called the Encyclopedia of DNA Elements (ENCODE [12]) for the human genome. The main idea behind ENCODE is to develop and validate new computational and experimental means for finding genes and other important features in the human genome. The tremendous effort that goes into sequencing genomes justifies similarly large-scale efforts to map features onto the sequenced genomes.
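To make the tiling-path idea concrete, the following minimal sketch lays out probe coordinates that cover a sequence end to end, with no reference to gene models. The function name `tiling_path`, the tile size, and the step are hypothetical values chosen for illustration; they do not describe any existing Drosophila resource.

```python
def tiling_path(seq_length, tile_size, step):
    """Yield (start, end) half-open intervals that cover [0, seq_length)."""
    if tile_size <= 0 or step <= 0 or step > tile_size:
        raise ValueError("need 0 < step <= tile_size to leave no gaps")
    start = 0
    while start < seq_length:
        # The last tile is clipped to the end of the sequence.
        yield start, min(start + tile_size, seq_length)
        if start + tile_size >= seq_length:
            break
        start += step


# Toy example: a 1,000 bp region tiled with 400 bp amplicons every 300 bp,
# so that neighbouring tiles overlap by 100 bp.
for start, end in tiling_path(1000, 400, 300):
    print(start, end)
```

Because the tiles are placed by coordinate alone, every base is represented on the array whether or not a gene model covers it, which is the property that distinguishes a tiling path from an array built only from predicted genes.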
- Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al: The genome sequence of Drosophila melanogaster. Science. 2000, 287: 2185-2195. 10.1126/science.287.5461.2185.
- Misra S, Crosby MA, Mungall CJ, Matthews BB, Campbell KS, Hradecky P, Huang Y, Kaminker JS, Millburn GH, Prochnik SE, et al: Annotation of the Drosophila melanogaster euchromatic genome: a systematic review. Genome Biol. 2002, 3: research0083.1-0083.22. 10.1186/gb-2002-3-12-research0083.
- Hild M, Beckman B, Haas SA, Koch B, Solovyev V, Busold C, Fellenberg K, Boutros M, Vingron M, Sauer F, et al: An integrated gene annotation and transcriptional profiling approach towards the full gene content of the Drosophila genome. Genome Biol. 2003, 5: R3. 10.1186/gb-2003-5-1-r3.
- FlyBase, a database of the Drosophila genome. [http://flybase.bio.indiana.edu/]
- Snyder M, Gerstein M: Genomics. Defining genes in the genomics era. Science. 2003, 300: 258-260. 10.1126/science.1084354.
- Drosophila RNAi Screening Center. [http://flyrnai.org/]
- Weitzman JB: RNAi and the shape of things to come. J Biol. 2003, 2: 23.
- NHGRI Genome Sequencing Proposals. [http://www.genome.gov/10002154]
- Bergman CM, Pfeiffer BD, Rincon-Limas DE, Hoskins RA, Gnirke A, Mungall CJ, Wang AM, Kronmiller B, Pacleb J, Park S, et al: Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome. Genome Biol. 2002, 3: research0086.1-0086.20. 10.1186/gb-2002-3-12-research0086.
- Rinn JL, Euskirchen G, Bertone P, Martone R, Luscombe NM, Hartman S, Harrison PM, Nelson FK, Miller P, Gerstein M, et al: The transcriptional activity of human chromosome 22. Genes Dev. 2003, 17: 529-540. 10.1101/gad.1055203.
- Guigo R, Dermitzakis ET, Agarwal P, Ponting CP, Parra G, Reymond A, Abril JF, Keibler E, Lyle R, Ucla C, et al: Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci USA. 2003, 100: 1140-1145. 10.1073/pnas.0337561100.
- The ENCODE Project: ENCyclopedia Of DNA Elements. [http://www.genome.gov/10005107]
- The Heidelberg Flyarray. [http://hdflyarray.zmbh.uni-heidelberg.de/]
- Third Party Annotation Sequence Database. [http://www.ncbi.nlm.nih.gov/Genbank/tpa.html]