From identification to validation to gene count

Amid, Clara; Frankish, Adam; Aken, Bronwen; Ezkurdia, Iakes; Kokocinsk, Felix; Gilbert, James; White, Simon; Carninci, Piero; Gingeras, Thomas; Guigo, Roderic; Searle, Steve; Tress, Michael L; Harrow, Jennifer; Hubbard, Tim

doi:10.1186/gb-2010-11-s1-o1

Volume 11 Supplement 1

Beyond the Genome: The true gene count, human evolution and disease genomics

Selected oral presentation
Published: 11 October 2010

Clara Amid¹, Adam Frankish¹, HAVANA, Bronwen Aken¹, Iakes Ezkurdia², Felix Kokocinsk¹, James Gilbert¹, Simon White¹, Piero Carninci³, Thomas Gingeras⁴, Roderic Guigo⁵, Steve Searle¹, Michael L Tress², Jennifer Harrow¹ & Tim Hubbard¹

Genome Biology volume 11, Article number: O1 (2010)

Background

The current GENCODE gene count of ~30,000, including 21,727 protein-coding and 8,483 RNA genes, is far lower than the 100,000 genes anticipated by early estimates. Accurate annotation of protein-coding genes, non-coding genes and pseudogenes is essential for calculating the true gene count and for gaining insight into human evolution.

As part of the GENCODE Consortium, the HAVANA team produces high-quality manual gene annotation, which forms the basis of the reference gene set used by the ENCODE project and provides a rich annotation of alternative splice variants and an assignment of functional potential. However, the protein-coding potential of some splice variants is uncertain, and valid splice variants can remain unannotated if they are absent from current cDNA libraries. Recent technological developments in sequencing and mass spectrometry have generated a vast amount of new transcript and protein data that facilitate the identification and validation of new and existing transcripts, although these data have limitations and problems of their own.

Results

Historically, all gene models have been built on support from mRNA, EST and protein evidence. The recent integration of RNA-seq data into our annotation pipeline has allowed us to identify new splice variants that were previously either unannotated or supported only by non-human transcript evidence. Owing to the short length of RNA-seq reads, however, mapping them to the genome is problematic, as is their use in recapitulating full-length transcript models. To assess different computational methods for mapping, assembling and quantifying human RNA-seq data, and to improve this pipeline, we have been involved in the RNA-seq Genome Annotation Assessment Project (RGASP), which seeks to address these questions.

We will also present the use of CAGE and ditag data produced by the ENCODE transcriptome group to identify and verify the use of alternative transcription start and termination sites and describe their impact on the interpretation of coding potential. Finally, we will show how mass spectrometry data can validate annotated gene models, identify novel splice variants and lead us to change our interpretation of the functional potential of a locus or variant.

Conclusions

We believe that an understanding of complete gene sets (i.e. the total gene number and the number of alternative splice variants allied to accurate functional interpretation) is crucial for understanding the genome. We demonstrate the value of the integration of new data types into our annotation pipeline in helping to identify and validate loci and variants to reach this aim.

Author information

Authors and Affiliations

The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
Clara Amid, Adam Frankish, Bronwen Aken, Felix Kokocinsk, James Gilbert, Simon White, Steve Searle, Jennifer Harrow & Tim Hubbard
Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
Iakes Ezkurdia & Michael L Tress
Omics Science Center, RIKEN Yokohama Institute, Kanagawa, Japan
Piero Carninci
Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York, 11724, USA
Thomas Gingeras
Center for Genomic Regulation, Universitat Pompeu Fabra, Barcelona, Catalonia, Spain
Roderic Guigo

Consortia

HAVANA

About this article

Cite this article

Amid, C., Frankish, A., HAVANA. et al. From identification to validation to gene count. Genome Biol 11 (Suppl 1), O1 (2010). https://doi.org/10.1186/gb-2010-11-s1-o1
