Volume 11 Supplement 1

Beyond the Genome: The true gene count, human evolution and disease genomics

Open Access

From identification to validation to gene count

  • Clara Amid1,
  • Adam Frankish1,
  • HAVANA1,
  • Bronwen Aken1,
  • Iakes Ezkurdia2,
  • Felix Kokocinsk1,
  • James Gilbert1,
  • Simon White1,
  • Piero Carninci3,
  • Thomas Gingeras4,
  • Roderic Guigo5,
  • Steve Searle1,
  • Michael L Tress2,
  • Jennifer Harrow1 and
  • Tim Hubbard1
Genome Biology201011(Suppl 1):O1

https://doi.org/10.1186/gb-2010-11-s1-o1

Published: 11 October 2010

Background

The current GENCODE gene count of ~ 30,000, including 21,727 protein-coding and 8,483 RNA genes, is significantly lower than the 100,000 genes anticipated by early estimates. Accurate annotation of protein-coding and non-coding genes and pseudogenes is essential in calculating the true gene count and gaining insight into human evolution.

As part of the GENCODE Consortium, the HAVANA team produces high quality manual gene annotation, which forms the basis for the reference gene set being used by the ENCODE project and provides a rich annotation of alternative splice variants and assignment of functional potential. However, the protein-coding potential of some splice variants is uncertain and valid splice variants can remain unannotated if they are absent from current cDNA libraries. Recent technological developments in sequencing and mass spectrometry have created a vast amount of new transcript and protein data that facilitate the identification and validation of new and existing transcripts, while harboring their own limitations and problems.

Results

Historically, all gene models have been built based on support from mRNA, EST and protein evidence. The recent integration of RNA-seq data into our annotation pipeline has allowed us to identify new splice variants that were previously either unannotated or supported only by non-human transcript evidence. Owing to their short read length, however, mapping them to the genome is problematic as is their use in recapitulating full-length transcript models. In order to assess different computational methods to map, assemble and quantify human RNA-seq data and improve this pipeline, we have been involved in the RNA-seq Genome Annotation Assessment project (RGASP), which seeks to address these questions.

We will also present the use of CAGE and ditag data produced by the ENCODE transcriptome group to identify and verify the use of alternative transcription start and termination sites and describe their impact on the interpretation of coding potential. Finally, we will show how mass spectrometry data can validate annotated gene models, identify novel splice variants and lead us to change our interpretation of the functional potential of a locus or variant.

Conclusions

We believe that an understanding of complete gene sets (i.e. the total gene number and the number of alternative splice variants allied to accurate functional interpretation) is crucial for understanding the genome. We demonstrate the value of the integration of new data types into our annotation pipeline in helping to identify and validate loci and variants to reach this aim.

Authors’ Affiliations

(1)
The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus
(2)
Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO)
(3)
Omics Science Center, RIKEN Yokohama Institute
(4)
Cold Spring Harbor Laboratory
(5)
Center for Genomic Regulation, Universitat Pompeu Fabra

Copyright

© Amid et al; licensee BioMed Central Ltd. 2010

This article is published under license to BioMed Central Ltd.

Advertisement