Genomic information infrastructure after the deluge
© BioMed Central Ltd 2010
Published: 26 July 2010
Maintaining up-to-date annotation on reference genomes is becoming more important, not less, as the ability to rapidly and cheaply resequence genomes expands.
The advent of next-generation sequencing technology has led to a profound shift in the economics of genomics. Sequencing costs have fallen more than a hundredfold over the past four years, and this rate of reduction is likely to continue for the foreseeable future. The availability of cheap DNA sequencing has changed the cost of a variety of experiments: obtaining a near-complete bacterial sequence costs a few hundred dollars in consumables, whereas mid-size genomes are within the scope of a single grant proposal. A number of large genomes, such as those of vertebrates (for example, the turkey), have been sequenced by small consortia of interested laboratories. In addition, there are a variety of novel assays, such as RNA sequencing (RNA-seq), transposon mutagenesis, and chromatin immunoprecipitation and sequencing (ChIP-seq), in which low-cost sequencing has replaced other readout platforms such as nucleic acid hybridization. Understanding these data rests fundamentally on well curated, up-to-date annotation for reference genomes, which can be leveraged for other species. However, the ability of the scientific community to maintain such resources is failing as a result of the onslaught of new data and the disconnect between the archival DNA databases and the new types of information and analysis being reported in the scientific literature. In this article, we propose a new structure for genomic information resources to address this problem.
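To put the hundredfold fall in perspective, the figure can be converted into an implied yearly factor and a halving time. The sketch below assumes, purely for illustration, that the decline was a constant exponential over the four years; the function names are our own and not part of any published analysis.

```python
import math


def annual_fold_reduction(total_fold: float, years: float) -> float:
    """Per-year fold reduction implied by a total fold reduction over `years`,
    assuming a constant exponential decline (an illustrative assumption)."""
    return total_fold ** (1.0 / years)


def halving_time_months(total_fold: float, years: float) -> float:
    """Time in months for the cost to halve under the same assumption."""
    return 12.0 * years * math.log(2) / math.log(total_fold)


if __name__ == "__main__":
    # "More than a hundredfold over the past four years" (figure from the text).
    fold, years = 100.0, 4.0
    print(f"implied fold reduction per year: {annual_fold_reduction(fold, years):.2f}")
    print(f"implied halving time (months):   {halving_time_months(fold, years):.1f}")
```

Because the text says "more than a hundredfold", these values are bounds under the stated assumption: costs fell by at least a factor of about 3.2 per year, and halved at least every 7.2 months or so.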
Dramatic falls in the consumable costs of DNA sequencing have not fundamentally changed the need for computational analysis to process and interpret the information produced. Indeed, the need has increased as the volume and complexity of the data have risen. There has, therefore, been a profound shift towards a higher intensity of informatics in biological research, with bioinformatics becoming a necessary component of many, if not most, molecular biology groups. The analysis of new genome-wide experiments typically requires the presence of a robust, accurate information infrastructure, including a reasonable assembly of the genome sequence, a set of accurate gene predictions and a description of their biological function. When genome sequence determination was expensive, and thus both relatively uncommon and concentrated in areas of intensive experimental research, considerable resources could be focused on individual genomes, often in intensively managed and curated model organism databases (such as FlyBase, WormBase and the Saccharomyces Genome Database).
However, the model of relatively independent, large consortia focused on a small set of genomes seems ill equipped to handle the flood of new genomes. Without such support, annotations created for many genomes have not been kept up-to-date since their initial submission to the public databases, as sequencing groups have moved on to new targets and experimental data have accumulated in the literature. Although there has been considerable success in creating portable software components for genome curation, such as the GMOD tools (for example, Apollo and Chado), Artemis and others, their application happens in an ad hoc manner, often focusing on solving a particular problem specific to one group, rather than systematically. This leads to the duplication of effort between groups and inconsistency between the annotations they produce. Even when experimental data are well organized in a structured resource, their volume is a further impediment to their successful exploitation by the wider community, as network bandwidth is often a constraining factor when attempting to download large datasets for analysis. There are, therefore, at least two challenges facing the post-deluge community. The first is ensuring that bioinformatics resources are kept up-to-date and operate in a stable and reliable funding environment. The second is creating mechanisms to give end users access to the raw datasets, which are now so massive that they cannot easily be transferred across the Internet. Both are weighty issues, and this article focuses on the first one.
The International Nucleotide Sequence Database Collaboration (INSDC), implemented as GenBank at the US National Center for Biotechnology Information (NCBI), the European Nucleotide Archive (ENA) at the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan at the National Institute of Genetics, has archived DNA sequence information submitted by experimentalists since its establishment in 1984. However, even before the advent of the new technology there was an increasing disconnection between the genome annotation in the archive and the more complex functional information that had accumulated in the laboratories of the scientific community, and in the literature. In response to this, the Ensembl project in Europe and the RefSeq project at NCBI were developed partly to capture, and partly to provide, high-quality annotation, in particular on protein-coding genes, on important genomes. For some species (such as Drosophila, yeast and worm) these resources mirrored information from the well funded model organism databases already established for these species. In most other cases, however, the new resources were derived from a selection from the submitted archival records, without significant manual updates. Finally, in cases such as human and other mammals, there was direct creation of added-value datasets on the genome, often through collaborations with other groups (for example, the UCSC Genome Browser group for vertebrate genomes). More generally, NCBI and EBI act as major providers of bioinformatics services across a broad range of domains, of which genome-centric resources form just one part.
The current situation is therefore a patchwork of different resources, with different funding models and different communication lines. There are benefits to this diversity: funding streams usually involve a good connection to the scientists working directly on a species (whose involvement is required to justify investment), no single group has a monopoly on the information flow, innovation in added-value services can be explored, and small additional components can often be funded rapidly. However, there are some major disadvantages as well: ineffective (or in some cases nonexistent) communication between diverse groups hampers the propagation of the best annotation through the system, while the diversity and ad hoc nature of the tools requires large investments by individual laboratories in just gathering, organizing and reformatting data before conducting any pan-domain analysis. Finally, the heterogeneous structure is very confusing for funding agencies to engage with; it is unclear what resources will appear without intervention, unclear whether a particular resource is good value for money (especially when it partially duplicates other resources) and unclear how any particular information resource will survive beyond a single funding cycle. In addition, like many other scientific endeavors, these activities occur in an international context with a geographic diversity of participating groups and a matching diversity of funding agencies, whose goals may be more or less well aligned.
The absence of a structure for funding and data can lead to the loss of valuable scientific content when a particular episode of funding concludes. Among the most striking current demonstrations of this is the funding crisis faced by The Arabidopsis Information Resource (TAIR), which has curated the genome of the model plant Arabidopsis thaliana, but which faces closure in 2013 if new funding cannot be secured. For smaller resources, the threat of effective closure is ever present, as funding is usually linked to specific research-oriented grants. To give just one example, the COGEME database for plant pathogen expressed sequence tags (ESTs) was updated regularly between 2001 and 2007 but (in the absence of longer-term funding) not since.
Over the past five years this patchwork of resources has improved through communication and software reuse. Examples include the development of open-source software by groups such as GMOD (for example, the GBrowse genome browser), Ensembl and GeneDB that can be reused by others; better communication between model organism databases and EBI/NCBI; and improved coordination of funding in adjacent areas (for example, the Bioinformatics Resource Centers (BRCs) [19–21] funded by the US National Institute of Allergy and Infectious Diseases (NIAID), which each cover a portfolio of related species where NIAID is also funding experimental work). However, there is still a fundamental need for a stable, sustainable and comprehensive configuration of resources that can handle the growing influx of genomic data from all sources. In the remainder of this article we outline a proposed structure that formalizes aspects of current best practice and proposes a clear model for data management for both scientists and funding agencies.
Attributes of each of the tiers

Role
  Tier 1: Explore and analyze new areas of biology
  Tier 2: Organize an appropriate area of biology
  Tier 3: Aggregate across all biology, provide information infrastructures

Main style of funding
  Tier 1: Response-mode and strategic grants for specific key datasets
  Tier 2: Strategic grants for an area of biology, with portions of response-mode grants for specific datasets
  Tier 3: Infrastructure funds, coupled to portions of strategic grants for specific biological areas

Time horizon of group
  Tier 1: Grant-driven, 3-5 years
  Tier 2: Strategic grant driven, 5-10 years
  Tier 3: Infrastructure driven, 10-20 years

Examples
  Tier 1: Many response-mode laboratories in universities and academic institutions
  Tier 2: Bioinformatics resource centers (BRCs), model organism databases
  Tier 3: EBI (Ensembl, Ensembl Genomes), NCBI (RefSeq)
These three tiers are not proposed to replace the primary data archives such as the INSDC (for nucleotide sequence), GEO and ArrayExpress (for expression data), but rather to exist in parallel, providing biological context to the archived data, which remains a record of experiments that have been carried out. In contrast, the tiered stream of information represents the scientific community's best current understanding of these species. The specialization in terms of biology decreases from Tier 1 to Tier 3, whereas the sophistication in engineering and computation increases from Tier 1 to Tier 3. This structure provides for a diversity of datasets and approaches (in particular in Tier 1 and to some extent in Tier 2) while ensuring consistency and the preservation of high-value datasets within Tier 3. Importantly, it captures the enthusiasm and expertise of specialized scientific groups around Tier 2 databases to keep information on specific genomes up to date, and provides a direct route for this information into the Tier 3 databases that are used by the wider scientific community. As in all scientific endeavors, openness and discussions between all participants need to be encouraged, but this structure places particular emphasis on the communication between adjacent tiers.
For this structure to work, the different components need to be funded efficiently, minimizing unproductive overlap and maximizing the overall utility of the information. As the inter-tier communication is critical for this, we believe that creating funding schemes that deliberately span two tiers (that is, Tier 1 to Tier 2 or Tier 2 to Tier 3) is optimal. Such funding schemes guarantee the communication lines and promote the transfer of information into the higher, longer-lived tiers.
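The tier structure and the idea of 'spanning' funding can be made concrete as a small data model. The sketch below is illustrative only: the class, field and function names are hypothetical, the roles and time horizons are those listed under "Attributes of each of the tiers", and the rule that information moves up one adjacent tier at a time is a simplifying assumption of the sketch rather than a requirement stated in the proposal.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Tier:
    """One tier of the proposed structure (names here are hypothetical)."""
    level: int
    role: str
    horizon_years: Tuple[int, int]  # typical time horizon of a group, in years


# Roles and time horizons as listed in "Attributes of each of the tiers".
TIERS: Dict[int, Tier] = {
    1: Tier(1, "Explore and analyze new areas of biology", (3, 5)),
    2: Tier(2, "Organize an appropriate area of biology", (5, 10)),
    3: Tier(3, "Aggregate across all biology, provide information infrastructures", (10, 20)),
}


def is_spanning_pair(lower: int, upper: int) -> bool:
    """True if a funding scheme covering these two tiers spans adjacent tiers
    (Tier 1 to Tier 2, or Tier 2 to Tier 3)."""
    return lower in TIERS and upper in TIERS and upper - lower == 1


def route_upwards(start: int) -> List[int]:
    """Tiers that information visits on its way up to Tier 3, moving one
    adjacent tier at a time (a simplifying assumption of this sketch)."""
    if start not in TIERS:
        raise ValueError(f"unknown tier: {start}")
    return list(range(start, max(TIERS) + 1))


if __name__ == "__main__":
    print(is_spanning_pair(1, 2))
    print(is_spanning_pair(1, 3))
    print(route_upwards(1))
```

In this model a scheme covering Tiers 1 and 2 counts as spanning, one covering Tiers 1 and 3 does not, and a dataset produced in a Tier 1 laboratory reaches the Tier 3 infrastructure by way of a Tier 2 resource.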
There are well developed funding streams from a variety of agencies for Tier 1 groups, primarily from 'responsive-mode schemes' that encourage the submission of proposals within a broad area of scientific research. It is important to realize that the Tier 1 groups require an increasing intensity of bioinformatics to perform the primary analysis of their own data, and that the presence of the other tiers, and the investment of informatics in these tiers, does not fundamentally change the need for bioinformatics at this level. In addition, funding agencies should support grants that deliberately couple the transfer of information to Tier 2, in some cases by having joint funding episodes with the appropriate Tier 2 group. This sort of 'spanning' funding is particularly appropriate when the generation of a specific dataset is the major focus of a grant: for example, a program to expand a specific phylogenetic domain in terms of genomes sequenced or to generate population genomics resources for a particular species.
There are a variety of existing mechanisms for Tier 2 resources, such as the Biological and Bioinformatics Resources (BBR) of the Biotechnology and Biological Sciences Research Council (BBSRC) in the United Kingdom and, in the United States, the model organism database funds of the National Human Genome Research Institute (NHGRI) and the BRCs of NIAID. The focus of a Tier 2 resource is ideally a specific area of biology, led by scientists practicing in this area. However, it is best sited in, or allied to, an institutional context with existing commitment to suitable infrastructure. This tier is currently the least well defined, and there are areas of biology with no obvious Tier 2 'aggregator' capable of providing a good feed of information into Tier 3. As with the Tier 1/Tier 2 interface, we see funding that spans Tier 2 and Tier 3 being a successful way to ensure transfer of information up into the next tier. Such 'spanning' funds exist now in a number of areas (for example, the grants supporting VectorBase and PomBase, both Tier 2 resources, each of which defines a relationship with a Tier 3 resource).
Schemes such as the BRCs and BBRs are welcome because they offer the possibility of continuity of funding, and partnership with Tier 3 resources provides the possibility of data persistence even beyond funding episodes. Indeed, the BBSRC is now addressing the needs of plant pathogens within this framework. The model-organism funding stream from NHGRI is also clearly targeted at this area. There are also initiatives under way to coordinate global funding for important Tier 2 resources, such as recent workshops held in the United Kingdom and the United States to develop a framework to secure funding for the ongoing needs of the Arabidopsis community. However, given the large number of species with sequenced genomes expected over the next decade, overall we believe that Tier 2 is the least well understood by funding agencies and research communities, and that this is the area that most needs clarifying and developing by funding agencies.
A Tier 3 resource is fundamentally an information infrastructure, and must be provided by institutions with a core commitment to infrastructure provision. For much biomolecular data, two obvious centers are the NCBI and EBI, although it is vital that these develop clear interfaces, not just with Tier 2 resources, but also with other infrastructure providers in adjacent domains (such as medical informatics, crop informatics and bioengineering). This area of funding is becoming better defined, with increasingly sophisticated links between institutes of the National Institutes of Health (NIH) and NCBI in the United States; the ELIXIR process led by the EBI to coordinate bioinformatics infrastructure funding in Europe; and increasing collaboration between EBI and NCBI on a number of Tier 2 and Tier 3 projects (for example, the Common Coding Sequence Initiative in human and mouse to establish a universal set of reference transcripts for these species). Set against this, a number of 'aggregator' resources, such as the UCSC genome browser, are so widely used that, despite their different institutional contexts, they are likely to be very long lasting and thus have characteristics of Tier 3 resources. Despite this progress, however, it is still unclear how these new funding streams will mature as the volume and diversity of underlying data continue to grow. This discussion needs to be considered in the context of the broader infrastructure challenges in bioinformatics and medical informatics.
To sum up, the structure proposed here is in many ways a formalization of current best practice, particularly in the model organism databases. However, by expanding and codifying the structure, and emphasizing the importance of information transfer between the tiers, it should go some way towards closing the loop between the public archival databases and the scientific literature, and ensuring that the latest functional information is propagated to relevant genome databases, where it can form an effective foundation for subsequent research from high-throughput analysis to individual hypothesis-based approaches.
We are grateful to Pat Goodwin and the Wellcome Trust for their encouragement, and for supporting a workshop in November 2008 in which aspects of this model were discussed.