A standard variation file format for human genome sequences
© Reese et al; licensee BioMed Central Ltd. 2010
Received: 29 April 2010
Accepted: 26 August 2010
Published: 26 August 2010
Here we describe the Genome Variation Format (GVF) and the 10Gen dataset. GVF, an extension of Generic Feature Format version 3 (GFF3), is a simple tab-delimited format for DNA variant files, which uses Sequence Ontology to describe genome variation data. The 10Gen dataset, ten human genomes in GVF format, is freely available for community analysis from the Sequence Ontology website and from an Amazon elastic block storage (EBS) snapshot for use in Amazon's EC2 cloud computing environment.
With the advent of personalized genomics we have seen the first examples of fully sequenced individuals [1–9]. Now, next generation sequencing technologies promise to radically increase the number of human sequences in the public domain. These data will come not just from large sequencing centers, but also from individual laboratories. For reasons of resource economy, 'variant files' rather than raw sequence reads or assembled genomes are rapidly emerging as the common currency for exchange and analysis of next generation whole genome re-sequencing data. Several data formats have emerged recently for sequencing reads (SRF), read alignments (SAM/BAM), genotype likelihoods/posterior SNP probabilities (GLF), and variant calling (VCF). However, the resulting variant files of single nucleotide variants (SNVs) and structural variants (SVs) are still distributed as non-standardized tabular text files, with each sequence provider producing its own idiomatic data files [1–9]. The lack of a standard format complicates comparisons of data from multiple sources and across projects and sequencing platforms, tremendously slowing the progress of comparative personal genome analysis. In response we have developed GVF, the Genome Variation Format.
GVF is an extension of the widely used Generic Feature Format version 3 (GFF3) standard for describing genome annotation data. The GFF3 format was developed to permit the exchange and comparison of gene annotations between different model organism databases. GFF3 is based on the General Feature Format (GFF), which was originally developed during the human genome project to compare human genome annotations. Importantly, GFF3, unlike GFF, is typed using an ontology. This means that the terminology being used to describe the data is standardized, and organized by pre-specified relationships. The attribute specification structure of GFF3 files allows extensibility in specifying feature-specific data for different types of features and it is this extensibility that GVF capitalizes on in defining sequence alteration specific data types. Annotation databases have historically developed different in-house schemas; thus, such standardization is required to ensure interoperability between databases and for comparative analyses.
While there are richer ways of representing genomic features using XML (Extensible Markup Language) and relational database schemas, simple text-based, tab-delimited files have persisted in bioinformatics because they balance human with computer readability. Since its adoption as the basic exchange format, two aspects of GFF3 have emerged as essential for success. First, it must be simple for software to produce and parse; second, its contents need to be typed using terms drawn from an ontology. The first aspect means that humans can easily read and edit files with a text editor and perform simple analyses with command-line software tools. The second aspect not only constrains different database curators to use the same terminologies, but also, because of the formal structure of the ontology, allows automated reasoning on the contents of such a file. It therefore prevents ambiguities and conflicting terminologies. GVF builds upon these strengths of GFF3, adopting GFF3's simple, tab-delimited format, and like GFF3, the contents of GVF files are described using the Sequence Ontology (SO) - an ontology developed by the Gene Ontology Consortium to describe the parts of genomic annotations, and how these parts relate to each other [19, 20]. Using SO to type both the features and the consequences of a variation gives GVF files the flexibility necessary to capture a wide variety of variation data, while still maintaining unified semantics and a simple file format. For example, GVF files can contain both re-sequencing and DNA genotyping microarray experiment data. In addition, GVF capitalizes on the extensibility of GFF3 to specify a rich set of attributes specific to sequence alterations in a structured way. An added benefit of GVF's compliance with GFF3 is that existing parsers, visualization and validation software, such as those developed by the Generic Model Organism Database (GMOD) project to operate on GFF3 files, can be used to manipulate and view GVF files.
Thus, GVF complements existing gene and variant nomenclature efforts, and provides a simple ontology-based sequence-centric genome file format linking variants to genome positions and genome annotations.
Below we describe the GVF standard and the various additions we have made to GFF3 and SO to support it. We also briefly describe the conversion of the first ten publicly available personal genomes into GVF format. These GVF files are available for download and for cloud computation. We will refer to these data as the 10Gen dataset. This is provided as a service to the biomedical community as a reference dataset for whole genome comparative analyses and software development. This dataset will hopefully foster the development of new tools for the analyses of personal genome sequences.
Table 1. A summary of the tag-value pairs, and their requirement for GVF
Tag: ID
Description: While the GFF3 specification considers the ID tag to be optional, GVF requires it. As in GFF3, this ID must be unique within the file and is not required to have meaning outside of the file.
Example: ID = chr1:Soap:SNP:12345;
Example: ID = rs10399749;
Tag: Variant_seq
Description: All sequences found in this individual (or group of individuals) at a variant location are given with the Variant_seq tag. If the sequence is longer than 50 nucleotides, the sequence may be abbreviated as '~'. In the case where the variant represents a deletion of sequence relative to the reference, the Variant_seq is given as '-'.
Example: Variant_seq = A,T;
Tag: Reference_seq
Description: The reference sequence corresponding to the start and end coordinates of this feature.
Example: Reference_seq = G;
Tag: Variant_reads
Description: The number of reads supporting each variant at this location.
Example: Variant_reads = 34, 23;
Tag: Total_reads
Description: The total number of reads covering a variant.
Example: Total_reads = 57;
Tag: Genotype
Description: The genotype of this variant, either heterozygous, homozygous, or hemizygous.
Example: Genotype = heterozygous;
Tag: Variant_freq
Value: Real number between 0 and 1
Description: A real number describing the frequency of the variant in a population. The details of the source of the frequency should be described in an attribute-method pragma as discussed above. The order of the values given must be in the same order that the corresponding sequences occur in the Variant_seq tag.
Example: Variant_freq = 0.05;
Tag: Variant_effect
Value: String: SO term sequence_variant
Value: String: SO sequence_feature
Value: String feature ID
Description: The effect of a variant on sequence features that overlap it. It is a four-part, space-delimited tag. The sequence_variant describes the effect of the alteration on the sequence features that follow. Both are typed by SO. The 0-based index corresponds to the causative sequence in the Variant_seq tag. The feature ID lists the IDs of affected features. A variant may have more than one variant effect depending on the intersected features.
Example: Variant_effect = sequence_variant 0 mRNA NM_012345, NM_098765;
Tag: Variant_copy_number
Description: For regions on the variant genome that exist in multiple copies, this tag represents the copy number of the region as an integer value.
Example: Variant_copy_number = 7;
Tag: Reference_copy_number
Description: For regions on the reference genome that exist in multiple copies, this tag represents the copy number of the region as an integer value.
Example: Reference_copy_number = 5;
Tag: Nomenclature
Description: A tag to capture the given nomenclature of the variant, as described by an authority such as the Human Genome Variation Society.
Example: Nomenclature = HGVS: p.Trp26Cys;
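The four-part layout of the Variant_effect value described above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function name parse_variant_effect and the dictionary keys it returns are hypothetical, and the input values are the placeholder examples from the tag descriptions rather than real data.

```python
def parse_variant_effect(value, variant_seqs):
    """Split a Variant_effect value into its four space-delimited parts.

    Assumed layout (taken from the tag description):
    <sequence_variant> <0-based index> <sequence_feature> <feature IDs>
    """
    # The first three parts contain no spaces; the remainder is the ID list.
    effect, index, feature_type, ids = value.rstrip(";").split(None, 3)
    index = int(index)
    return {
        "effect": effect,                      # SO sequence_variant term
        "index": index,                        # position in Variant_seq
        "causative_seq": variant_seqs[index],  # sequence the index points to
        "feature_type": feature_type,          # SO sequence_feature term
        "feature_ids": [i.strip() for i in ids.split(",")],
    }


# Placeholder values copied from the Variant_seq and Variant_effect examples.
variant_seqs = "A,T".split(",")
parsed = parse_variant_effect(
    "sequence_variant 0 mRNA NM_012345, NM_098765;", variant_seqs
)
print(parsed)
```

The index lookup is the point of the sketch: it ties each reported effect back to one specific sequence listed in Variant_seq.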
GVF: a specification for genome variant description
Table 2. The pragmas defined by GVF, in addition to those already defined by GFF3 (gff-version, sequence-region, feature-ontology, attribute-ontology, source-ontology, species, genome-build)
Pragma: ##file-version
Description: This allows the specification of the version of a specific file. What exactly the version means is left undefined, but the tag is provided for the case when an individual's variants are described in GVF and then, at a later date, changes to the data or the software require an update to the file. An increment of the file-version could signify such a change. Any numeric version of file-version is allowed.
Pragma: ##file-date
Description: The file-date pragma is included as a method to describe the date when the file was created. The ISO 8601 standard for dates in the form YYYY-MM-DD is required for the value.
Pragma: ##individual-id
Tags: Dbxref, Gender, Population, Comment
Description: This pragma provides details about the individual whose variants are described in the file.
Example: ##individual-id Dbxref = Coriell:NA18507;Gender = male;Ethnicity = Yoruba; Comment = Yoruba from Ibadan
Pragma: ##source-method
Tags: Seqid, Source, Type, Dbxref, Comment
Description: This pragma provides details about the algorithms or methodologies used to generate data for a given source in the file. This is used, for example, to document how a particular type of variant was called. A typical use would be to provide a Dbxref link to a journal article describing software used for calling the variant data with the given source tag.
Example: ##source-method Seqid = chr1;Source = MAQ;Type = SNV;Dbxref = PMID:18714091;Comment = MAQ SNV calls;
Pragma: ##attribute-method
Tags: Seqid, Source, Type, Attribute, Dbxref, Comment
Description: This pragma provides details about algorithms or methodologies for a given attribute tag in the file. This is used to document how a particular type of attribute value (that is, Genotype, Variant_effect) was calculated.
Example: ##attribute-method Source = SOLiD;Type = SNV;Attribute = Genotype;Comment = Genotype is reported here as determined in the original study
Pragma: ##technology-platform
Tags: Seqid, Source, Type, Read_length, Read_type, Read_pair_span, Platform_class, Platform_name, Average_coverage, Comment, Dbxref
Description: This pragma provides details about the technologies (that is, sequencing or DNA microarray) used to generate the primary data.
Example: ##technology-platform Seqid = chr1;Source = AFFY_SNP_6;Type = SNV;Dbxref = URI:http://www.affymetrix.com; Platform_class = SNP_Array;Platform_name = Affymetrix Human SNP Array 6.0;
Pragma: ##data-source
Tags: Seqid, Source, Type, Dbxref, Data_type, Comment
Description: This pragma provides details about the source data for the variants contained in this file. This could be links to the actual sequence reads in a trace archive, or links to a variant file in another format that has been converted to GVF.
Example: ##data-source Source = MAQ;Type = SNV;Dbxref = SRA:SRA008175;Data_type = DNA sequence;Comment = NCBI Short Read Archive http://www.ncbi.nlm.nih.gov/Traces/sra;
Pragma: ##phenotype-description
Tags: Ontology, Term, Comment
Description: A description of the phenotype of the individual. This pragma can contain either ontology constrained terms, or a free text description of the individual's phenotype, or both.
Example: ##phenotype-description Ontology = http://www.human-phenotype-ontology.org/human-phenotype-ontology.obo.gz;Term = acute myeloid leukemia;Comment = AML relapse;
Pragma: ##ploidy
Tags: Ontology, Term, Comment
Description: This pragma defines the ploidy for a given genome. This pragma can contain either ontology constrained terms, or a free text description of the individual's ploidy. It is suggested that ontology constrained terms use a subtype of the term PATO:0001374, which includes haploid, diploid, polyploid, triploid, etc.
Example: ##ploidy chr22 1 49691432 diploid
Example: ##ploidy chrY 1 57772954 haploid
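To make the pragma layout above more concrete, here is a minimal Python sketch that reads one structured pragma line into a name and a dictionary of tags, and checks a file-date value against the YYYY-MM-DD form. The function names parse_pragma and is_valid_file_date are hypothetical, and the sketch assumes the 'Tag = value;' layout seen in the examples; it does not handle pragmas such as ploidy whose examples use whitespace-separated fields.

```python
import re
from datetime import date


def parse_pragma(line):
    """Parse '##name Tag = value;Tag = value;' into (name, {tag: value})."""
    name, _, body = line[2:].strip().partition(" ")
    tags = {}
    for pair in body.split(";"):
        if not pair.strip():
            continue  # skip the empty piece after a trailing semicolon
        tag, _, value = pair.partition("=")
        tags[tag.strip()] = value.strip()
    return name, tags


def is_valid_file_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


# Example line copied from the source-method pragma above.
name, tags = parse_pragma(
    "##source-method Seqid = chr1;Source = MAQ;Type = SNV;"
    "Dbxref = PMID:18714091;Comment = MAQ SNV calls;"
)
print(name, tags)
print(is_valid_file_date("2010-08-26"), is_valid_file_date("26/08/2010"))
```

Splitting on the first '=' only keeps values such as 'PMID:18714091' or a URL intact, since those contain colons but no further '=' or ';' characters in the examples shown.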
Each of the rows in a GVF file describes a single variant from an individual or population. Each such variant is typed using the SO terms that can describe SNVs, any size of nucleotide insertion or deletion, copy number variations, large structural variations or any of the 38 terms currently related to sequence alterations in SO. In the case of a seemingly complex variation, such as an SNV located within a translocation, each sequence alteration is annotated relative to its location on the reference genome, on a separate line in the file.
The most flexible part of a feature description in GFF3 is the ninth column, where attributes of a feature are given as tag-value pairs (Table 1). It is here that GVF provides additional structure specific to sequence alteration features. Like GFF3, the attribute tag-value pairs in GVF can come in any order. Multiple tag-value pairs are separated from each other by semicolons, tags are separated from values by '=', and multiple values are comma delimited. GVF includes the tags specified by the GFF3 specification, such as ID, Name, Alias, and so on, and adds 11 tags that allow for the annotation of sequence alteration features and constrain the values for some of those attributes to portions of the SO. For example, the sequence of the variant as well as the reference sequence at that position are specified by the Variant_seq and Reference_seq tags, respectively. In the case of sequence-based variant calling methods, the number of reads supporting the variant can be given by the Variant_reads tag. The genotype at the variant locus is specified with the Genotype tag. Other features annotated on the genome (gene, mRNA, exon, splice site, transcription start site, and so on) that intersect the variant, along with the effect that the variant has on the feature, are annotated with the Variant_effect tag. For variant sequences that involve deletion or duplication of large regions of the reference sequence, the copy number of the region may be given with the Variant_copy_number tag. Table 1 provides the details for the tags discussed here and the allowed values.
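The delimiter rules just described (semicolons between pairs, '=' between tag and value, commas between multiple values) can be sketched in a few lines of Python. This is a simplified illustration, not a reference parser: the function names are hypothetical, the feature line is invented for the example using values from Table 1, and GFF3 percent-escaping of reserved characters is ignored. Because values are split on every comma, a Variant_effect value with a comma-separated feature ID list would need separate handling.

```python
# The nine tab-delimited columns that GVF inherits from GFF3.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_attributes(column9):
    """Turn 'Tag=v1,v2;Tag=v' into {tag: [values]}; tag order is irrelevant."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        attrs[tag.strip()] = [v.strip() for v in value.split(",")]
    return attrs


def parse_gvf_line(line):
    """Parse one feature line into a dictionary keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attributes"] = parse_attributes(record["attributes"])
    return record


# Hypothetical feature line assembled from the Table 1 example values.
line = ("chr1\tMAQ\tSNV\t12345\t12345\t.\t+\t.\t"
        "ID=chr1:MAQ:SNV:12345;Variant_seq=A,T;Reference_seq=G;"
        "Variant_reads=34,23;Total_reads=57;Genotype=heterozygous")
record = parse_gvf_line(line)
print(record["type"], record["start"], record["attributes"])
```

Storing every value as a list keeps single-valued and multi-valued tags uniform, so that, for example, the entries of Variant_reads can be paired by position with those of Variant_seq.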
While a great deal of personal genome variation data today comes from next generation sequencing technologies, the GVF standard can also be used to describe variant data from any source creating DNA variation data with nucleotide resolution, including genotyping DNA microarrays, comparative genomic hybridization (CGH) arrays, and others.
Because GVF is a fully compliant extension of GFF3, GVF files provide a basis for exploration and analysis of personal genome sequences with the widely used Bioperl and GMOD toolkits; variant annotations can be viewed by browsers such as GBrowse, JBrowse, and Apollo, and analyzed, for example, using the Comparative Genomics Library (CGL). This means that a GVF file can be passed through a series of analyses, each step adding various attributes to the file, allowing a GVF file to grow progressively richer with each analysis. Complete documentation is available from the website.
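The idea of a file growing richer as it passes through successive analyses can be sketched as a small Python helper that appends one tag-value pair to the ninth column of a feature line. The function name add_attribute and the example line are hypothetical, and the refusal to overwrite an existing tag is a design choice of this sketch rather than something the specification states.

```python
def add_attribute(line, tag, values):
    """Return a copy of a feature line with one more tag-value pair."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-delimited columns")
    existing = fields[8].rstrip(";")
    present = {pair.partition("=")[0].strip() for pair in existing.split(";")}
    if tag in present:
        raise ValueError(f"tag {tag!r} is already present")
    new_pair = f"{tag}={','.join(str(v) for v in values)}"
    # '.' is the GFF3 placeholder for an empty column.
    fields[8] = new_pair if existing in ("", ".") else f"{existing};{new_pair}"
    return "\t".join(fields)


# Hypothetical line; a later analysis step annotates it with a frequency.
line = ("chr1\tMAQ\tSNV\t12345\t12345\t.\t+\t.\t"
        "ID=chr1:MAQ:SNV:12345;Variant_seq=A;Reference_seq=G")
annotated = add_attribute(line, "Variant_freq", [0.05])
print(annotated)
```

Each analysis step can call such a helper and write the line back out, leaving the first eight columns and all earlier attributes untouched.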
A reference personal genomes dataset - '10Gen'
A reference GVF dataset for public use
To fulfill the promise of personal whole genome sequencing it will be critical to compare individual genomes to the reference genome and to one another. One lesson learned from comparative genomics analyses [31–34, 37] is that accurate and easy comparisons require a standardized data format. Without a data standard, ambiguities and misunderstandings poison comparative analyses. The GFF3 standard has been widely embraced by the model organism community as a solution to these problems. GVF will provide the same benefits for personal genomics. Although some of the variant file formats currently in use [1–8] and VCF are GFF3-like in spirit, none is a formal extension of GFF3, meaning that their terminologies (tags) are not formally defined, versioned, maintained or OBO compliant. GVF also differs from existing formats in matters of scope. First, GVF is not limited to re-sequencing applications; it can also be used to describe DNA genotyping chip experiments, and re-sequencing and DNA-chip data can even be combined in a single file. Second, GVF provides more than just a means to describe how and why a variant was called; it provides an extensive terminology with which to describe a variant's relationship to - and impact upon - other features annotated on a genome.
Rigorously grounding GVF upon the GFF3 specification has many other benefits as well. Because both file formats are typed using the SO, GFF3 and GVF files can be used together in a synergistic fashion. Moreover, because GVF is a formal extension of the GFF3 standard, existing parsers, visualization tools and validation software, such as those developed by the GMOD project to operate on GFF3 files, can be used to manipulate and view GVF files. This will provide enormous benefits for those seeking to analyze personal human genomics data.
In order to jumpstart such analyses, we have also assembled a reference dataset of variants from ten personal genomes, the 10Gen dataset. These genomes represent a diverse assortment of ethnicities, and were produced using a variety of sequencing platforms. Our hope is that the 10Gen dataset will be used as a benchmark for personal genomics software development, following in the footsteps of other successful benchmark datasets, such as those used by CASP [32, 33] for protein structures, GASP/EGASP/NGASP [34, 35, 37] for gene structures, and Eisen/MIAME (Minimum Information about a Microarray Experiment) [38–40] for gene expression, to name just a few. Moreover, the simplicity of the GVF file format combined with the rigor of its formal specification makes GVF ideal for adoption by technology providers, genome centers, population geneticists, computational biologists, evolutionary biologists, health care providers, and clinical testing laboratories.
Materials and methods
Extensions to the Sequence Ontology
Using OBO-Edit, the SO was extended in three areas: sequence_alteration, sequence_feature and sequence_variant. There are 38 terms to represent the kinds of sequence alteration, 1,283 terms to represent features intersected by the alteration and 100 terms to represent the variant caused by a sequence alteration, such as intergenic_variant and non_synonymous_codon (see the MISO Sequence Ontology Browser on the SO website for complete details).
Variant files for ten genomes
The variant files from the ten genomes were downloaded from web sites indicated in the references listed in Table 3. These files were converted to GVF format and were manually spot checked for consistency with annotations on the UCSC Genome Browser. They were then analyzed with a genome variation software pipeline that provided additional quality and consistency checks with respect to the NCBI build 36 of the human genome assembly and with data in the dbSNP and OMIM (Online Mendelian Inheritance in Man) databases.
The GVF standard can also be used to describe genotyping DNA microarray-based variant calls. This flexibility means that a single parser can process variant files from both sequencing and DNA genotyping microarray experiments; moreover, because these fields are attributes of the variant, not the file, a single GVF file can contain variants from heterogeneous sets of sequencing and microarray platforms.
The 10Gen dataset is available for download. Each variant file is named as denoted in Table 3 and additional details are documented in a README file within the download directory. In addition, a cloud compatible version of the data is available as an Amazon elastic block storage (EBS) snapshot. Details for using the snapshot are available from the 10Gen website. This set provides a standard reference dataset and a means to benchmark new analysis procedures. Variant data from Ensembl are also available for download as GVF files.
CASP: Competitive Assessment of Protein fold recognition
EGASP: ENCODE Genome Annotation Assessment Project
GASP: Genome Annotation Assessment in Drosophila melanogaster
GFF: General Feature Format
GFF3: Generic Feature Format version 3
GMOD: Generic Model Organism Database
GVF: Genome Variation Format
NCBI: National Center for Biotechnology Information
NGASP: Nematode Genome Annotation Assessment Project
OBO: Open Biological and Biomedical Ontologies
SNP: single nucleotide polymorphism
SNV: single nucleotide variation
VCF: Variant Call Format.
We thank Francisco De La Vega and Kevin McKernan of Life Technologies for providing early data access. We also acknowledge the 1000 Genomes Project for making data publicly available. This work is supported by NIH/NHGRI grants 5R01HG004341 and P41HG002273 (KE), 1RC2HG005619 (MY and MGR), 2R44HG002991 (MGR) and 2R44HG003667 (MGR and MY).
- Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, et al: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5: e254. 10.1371/journal.pbio.0050254.
- Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008, 452: 872-876. 10.1038/nature06884.
- McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, Tsung EF, Clouser CR, Duncan C, Ichikawa JK, Lee CC, Zhang Z, Ranade SS, Dimalanta ET, Hyland FC, Sokolsky TD, Zhang L, Sheridan A, Fu H, Hendrickson CL, Li B, Kotler L, Stuart JR, Malek JA, Manning JM, Antipova AA, Perez DS, Moore MP, Hayashibara KC, Lyons MR, Beaudoin RE, et al: Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 2009, 19: 1527-1541. 10.1101/gr.091868.109.
- Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456: 53-59. 10.1038/nature07517.
- Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, Ma L, et al: The diploid genome sequence of an Asian individual. Nature. 2008, 456: 60-65. 10.1038/nature07484.
- Ahn SM, Kim TH, Lee S, Kim D, Ghang H, Kim DS, Kim BC, Kim SY, Kim WY, Kim C, Park D, Lee YS, Kim S, Reja R, Jho S, Kim CG, Cha JY, Kim KH, Lee B, Bhak J, Kim SJ: The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group. Genome Res. 2009, 19: 1622-1629. 10.1101/gr.092197.109.
- Pushkarev D, Neff NF, Quake SR: Single-molecule sequencing of an individual human genome. Nat Biotechnol. 2009, 27: 847-852. 10.1038/nbt.1561.
- Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, Carnevali P, Nazarenko I, Nilsen GB, Yeung G, Dahl F, Fernandez A, Staker B, Pant KP, Baccash J, Borcherding AP, Brownley A, Cedeno R, Chen L, Chernikoff D, Cheung A, Chirita R, Curson B, Ebert JC, Hacker CR, Hartlage R, Hauser B, Huang S, Jiang Y, Karpinchyk V, et al: Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010, 327: 78-81. 10.1126/science.1181498.
- 1000 Genomes Project. [http://www.1000genomes.org]
- Sequence Read Format. [http://srf.sourceforge.net]
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.
- Genotype Likelihood Format. [http://maq.sourceforge.net/glfProgs.shtml]
- Variant Call Format. [http://vcftools.sourceforge.net]
- Genome Variation Format. [http://www.sequenceontology.org/gvf.html]
- Generic Feature Format version 3. [http://www.sequenceontology.org/resources/gff3.html]
- Generic Model Organism Database. [http://www.gmod.org]
- GFF. [http://www.sanger.ac.uk/resources/software/gff/spec.html]
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- Eilbeck K, Lewis SE: Sequence ontology annotation guide. Comp Funct Genomics. 2004, 5: 642-647. 10.1002/cfg.446.
- Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M: The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 2005, 6: R44. 10.1186/gb-2005-6-5-r44.
- Oetting WS: Clinical genetics and human genome variation: the 2008 Human Genome Variation Society scientific meeting. Hum Mutat. 2009, 30: 852-856. 10.1002/humu.20987.
- Sprague J, Bayraktaroglu L, Bradford Y, Conlin T, Dunn N, Fashena D, Frazer K, Haendel M, Howe DG, Knight J, Mani P, Moxon SA, Pich C, Ramachandran S, Schaper K, Segerdell E, Shao X, Singer A, Song P, Sprunger B, Van Slyke CE, Westerfield M: The Zebrafish Information Network: the zebrafish model organism database provides expanded support for genotypes and phenotypes. Nucleic Acids Res. 2008, 36: D768-772. 10.1093/nar/gkm956.
- Robinson PN, Mundlos S: The human phenotype ontology. Clin Genet. 2010, 77: 525-534. 10.1111/j.1399-0004.2010.01436.x.
- The Open Biological and Biomedical Ontologies. [http://www.obofoundry.org/]
- Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A, Mungall CJ, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone SA, Scheuermann RH, Shah N, Whetzel PL, Lewis S: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol. 2007, 25: 1251-1255. 10.1038/nbt1346.
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The bioperl toolkit: perl modules for the life sciences. Genome Res. 2002, 12: 1611-1618. 10.1101/gr.361602.
- O'Connor BD, Day A, Cain S, Arnaiz O, Sperling L, Stein LD: GMODWeb: a web framework for the Generic Model Organism Database. Genome Biol. 2008, 9: R102. 10.1186/gb-2008-9-6-r102.
- Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S: The generic genome browser: a building block for a model organism system database. Genome Res. 2002, 12: 1599-1610. 10.1101/gr.403602.
- Skinner ME, Uzilov AV, Stein LD, Mungall CJ, Holmes IH: JBrowse: a next-generation genome browser. Genome Res. 2009, 19: 1630-1638. 10.1101/gr.094607.109.
- Lewis SE, Searle SM, Harris N, Gibson M, Iyer V, Richter J, Wiel C, Bayraktaroglu L, Birney E, Crosby MA, Kaminker JS, Matthews BB, Prochnik SE, Smith CD, Tupy JL, Rubin GM, Misra S, Mungall CJ, Clamp ME: Apollo: a sequence annotation editor. Genome Biol. 2002, 3: RESEARCH0082. 10.1186/gb-2002-3-12-research0082.
- Yandell M, Mungall CJ, Smith C, Prochnik S, Kaminker J, Hartzell G, Lewis S, Rubin GM: Large-scale trends in the evolution of gene structures within 11 animal genomes. PLoS Comput Biol. 2006, 2: e15. 10.1371/journal.pcbi.0020015.
- Levitt M: Competitive assessment of protein fold recognition and alignment accuracy. Proteins. 1997, Suppl 1: 92-104. 10.1002/(SICI)1097-0134(1997)1+<92::AID-PROT13>3.0.CO;2-M.
- Moult J, Hubbard T, Bryant SH, Fidelis K, Pedersen JT: Critical assessment of methods of protein structure prediction (CASP): round II. Proteins. 1997, Suppl 1: 2-6. 10.1002/(SICI)1097-0134(1997)1+<2::AID-PROT2>3.0.CO;2-T.
- Reese MG, Hartzell G, Harris NL, Ohler U, Abril JF, Lewis SE: Genome annotation assessment in Drosophila melanogaster. Genome Res. 2000, 10: 483-501. 10.1101/gr.10.4.483.
- Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG: EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006, 7 Suppl 1: S2.1-S2.31. 10.1186/gb-2006-7-s1-s2.
- Reese MG, Guigo R: EGASP: Introduction. Genome Biol. 2006, 7 Suppl 1: S1.1-S1.3. 10.1186/gb-2006-7-s1-s1.
- Coghlan A, Fiedler TJ, McKay SJ, Flicek P, Harris TW, Blasiar D, Stein LD: nGASP - the nematode genome annotation assessment project. BMC Bioinformatics. 2008, 9: 549. 10.1186/1471-2105-9-549.
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.
- Brazma A: Minimum Information About a Microarray Experiment (MIAME) - successes, failures, challenges. ScientificWorldJournal. 2009, 9: 420-423. 10.1100/tsw.2009.57.
- Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M: Minimum information about a microarray experiment (MIAME) - toward standards for microarray data. Nat Genet. 2001, 29: 365-371. 10.1038/ng1201-365.
- Day-Richter J, Harris MA, Haendel M, Lewis S: OBO-Edit - an ontology editor for biologists. Bioinformatics. 2007, 23: 2198-2200. 10.1093/bioinformatics/btm112.
- MISO Sequence Ontology Browser. [http://www.sequenceontology.org/miso]
- 10Gen at Sequence Ontology. [http://www.sequenceontology.org/resources/10Gen.html]
- 10Gen at Amazon. [http://10gen-gvf.s3.amazonaws.com/list.html]
- Database list. [ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs]
- Database list details. [ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs_spec]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.