DAVID: Database for Annotation, Visualization, and Integrated Discovery
© Dennis et al.; licensee BioMed Central Ltd. 2003
- Received: 4 April 2003
- Accepted: 4 July 2003
- Published: 14 August 2003
The distributed nature of biological knowledge poses a major challenge to the interpretation of genome-scale datasets, including those derived from microarray and proteomic studies. This report describes DAVID, a web-accessible program that integrates functional genomic annotations with intuitive graphical summaries. Lists of gene or protein identifiers are rapidly annotated and summarized according to shared categorical data for Gene Ontology, protein domain, and biochemical pathway membership. DAVID assists in the interpretation of genome-scale datasets by facilitating the transition from data collection to biological meaning.
- Gene Ontology
- Functional Annotation
- Annotation Tool
- Structure Query Language
- Accession Type
The post-genomic era has introduced high-throughput methodologies that generate experimental data at rates that exceed knowledge growth. In particular, high-density biochips, including complementary deoxyribonucleic acid (cDNA) microarrays, oligonucleotide microarrays, and rapidly evolving proteomics platforms, represent modern tools able to interrogate biology on a genome-wide scale and generate tens of thousands of data points simultaneously. While researchers are beginning to appreciate the statistical rigor required for the analysis of genome-scale datasets, a rate-limiting step in knowledge growth occurs at the transition from statistical significance to biological discovery.
A number of public efforts are currently focusing on the annotation and curation of gene-specific functional data, including LocusLink, Protein Information Resource (PIR), GeneCards, Proteome, Kyoto Encyclopedia of Genes and Genomes (KEGG), Ensembl, and Swiss-Prot, to name but a few [2–8]. These resources provide exceptional depth and coverage of the functional data available for a given gene, but are not designed to explore the biological knowledge associated with hundreds or thousands of genes in parallel. To facilitate the functional annotation and analysis of large lists of genes, we have developed a Database for Annotation, Visualization, and Integrated Discovery (DAVID), which provides a set of data-mining tools that systematically combine functionally descriptive data with intuitive graphical displays. DAVID provides exploratory visualization tools that promote discovery through functional classification, biochemical pathway maps, and conserved protein domain architectures, while simultaneously remaining linked to rich sources of biological annotation. DAVID expedites the functional annotation and analysis of any list of genes encoded by human, mouse, rat, or fly genomes. DAVID's functionality is demonstrated using the Affymetrix GeneChip data of Cicala et al.
Sources of annotation data integrated into DAVID
University of Michigan
Options provided by the Annotation Tool
- Accession number corresponding to the nucleotide sequence
- Cluster containing sequences that represent a unique gene
- Unique and stable identifier for curated genetic loci
- Reference sequence standards for mRNAs
- Official gene symbol included in the Locus Report provided by NCBI
- Official gene name included in the Locus Report provided by NCBI
- Catalog of human genes and genetic disorders
- Probe set description provided by Affymetrix
- Functional summaries included in the Locus Report provided by NCBI
- Controlled vocabulary applied to the functions of genes and proteins. Functional classifications used here are those included in the Locus Report provided by NCBI
The GoCharts module graphically displays the distribution of differentially expressed genes among functional categories using the controlled vocabulary of the Gene Ontology Consortium (GO), which provides a structured language that can be applied to the functions of genes and proteins in all organisms even as knowledge continues to accumulate and change. The language is structured as a directed acyclic graph (DAG), wherein term specificity increases and genome coverage decreases as one moves down the hierarchy. In contrast with a true hierarchy, a child term in a DAG may have more than one parent term and may have a different class of relationship with each of its parents. The structure of GO starts with three main categories: Biological Process, Molecular Function, and Cellular Component. Biological Process includes broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions. Molecular Function describes the tasks performed by individual gene products; examples are transcription factor and DNA helicase. The Cellular Component classification type involves subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex. After choosing a classification type, the user selects a level, which determines list coverage and term specificity, using the appropriate radio button. Level 1 provides the highest list coverage with the least term specificity. With each increasing level, coverage decreases while specificity increases, so that level 5 provides the lowest coverage with the highest term specificity.
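The relationship between level, coverage, and specificity can be illustrated with a toy DAG. The following is a minimal sketch under stated assumptions and is not DAVID's implementation: the DAG fragment, the gene annotations, and the helpers `term_levels`, `ancestors_at_level`, and `coverage` are hypothetical, and a term's level is assumed here to be the length of a path from the root category, so that a term with several parents may sit at several levels.

```python
# Hypothetical fragment of a GO-like DAG: child term -> list of parent terms.
PARENTS = {
    "biological process": [],
    "death": ["biological process"],
    "cell communication": ["biological process"],
    "cell death": ["death"],
    "signal transduction": ["cell communication"],
    "apoptosis": ["cell death"],
    # A DAG node may have more than one parent.
    "induction of apoptosis by signals": ["apoptosis", "signal transduction"],
}

# Hypothetical gene annotations: gene -> most specific terms assigned to it.
GENE_TERMS = {
    "geneA": ["induction of apoptosis by signals"],
    "geneB": ["cell death"],
    "geneC": [],  # unannotated
}

def term_levels(term, parents=PARENTS):
    """Return every depth at which `term` sits, counting the root as level 0."""
    if not parents[term]:
        return {0}
    return {lvl + 1 for p in parents[term] for lvl in term_levels(p, parents)}

def ancestors_at_level(term, level, parents=PARENTS):
    """Return `term` and/or those of its ancestors that sit at `level`."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if level in term_levels(current, parents):
            found.add(current)
        stack.extend(parents[current])
    return found

def coverage(level, gene_terms=GENE_TERMS):
    """Count genes that receive at least one term at `level`."""
    return sum(
        1
        for terms in gene_terms.values()
        if any(ancestors_at_level(t, level) for t in terms)
    )

for level in (1, 2, 3):
    print(level, coverage(level))
```

In this toy example the doubly parented term maps to two different level-1 terms, and the number of genes covered shrinks as the level increases, which mirrors the trade-off described above.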
Classification data is displayed as a bar chart, where the length of the bar represents the number of gene identifiers in each category. The user can set visualization parameters for sorting output data and displaying categories that contain at least a minimum number of genes. Selecting an individual bar opens a new HTML table displaying the gene identifier, LocusLink number, gene name, the current classification, and other classifications for each gene in that category. A 'Show All' button opens a new HTML table displaying all classification data and a 'Show Chart Data' button opens an HTML table containing the underlying chart data, thus allowing users to recreate customized chart graphics in a spreadsheet program. A new chart can be displayed for any subset of genes by selecting the classification type and level using the checkboxes and radio buttons available within the user's current page that allow for drill-down capabilities. A count of the number of genes annotated is included in the output, and unannotated genes are binned into the 'unclassified' category, thus providing users with an automated tracking system for genes not annotated.
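The chart data described above amounts to a per-category tally with a minimum-count filter and an 'unclassified' bin. The sketch below shows one way such a table could be computed; the function `chart_data`, its `min_genes` parameter, and the example gene identifiers are hypothetical and are not taken from DAVID itself.

```python
from collections import Counter

def chart_data(gene_categories, min_genes=1):
    """Tally genes per category.

    gene_categories maps a gene identifier to an iterable of category names.
    Genes without any category are counted under 'unclassified'.
    Returns (rows, n_annotated): rows is a list of (category, count) pairs
    with count >= min_genes, sorted by descending count and then by name.
    """
    counts = Counter()
    n_annotated = 0
    for categories in gene_categories.values():
        categories = set(categories)  # count each gene once per category
        if categories:
            n_annotated += 1
            counts.update(categories)
        else:
            counts["unclassified"] += 1
    rows = [(name, n) for name, n in counts.items() if n >= min_genes]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows, n_annotated

# Hypothetical input list.
genes = {
    "g1": ["stress response", "cell communication"],
    "g2": ["stress response"],
    "g3": [],
    "g4": ["cell communication"],
    "g5": ["death"],
}

rows, n_annotated = chart_data(genes, min_genes=2)
for name, n in rows:
    print(f"{name:<20} {'#' * n} {n}")  # text stand-in for a bar
print(f"{n_annotated} of {len(genes)} genes annotated")
```

Because a gene may belong to several categories, the bar lengths need not sum to the number of genes, which is one reason to report the annotated count separately.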
KeggCharts graphically display the distribution of differentially expressed genes among KEGG biochemical pathways. Each pathway is linked to the KEGG pathway map, wherein differentially expressed genes from the original list are highlighted in red. In this view, genes are further linked to additional annotations available through KEGG's DBGET retrieval system. As with GoCharts, the user can set visualization parameters for sorting output data and for displaying only categories that contain at least a minimum number of genes. The KeggCharts visualization inherits all of the dynamic features of GoCharts.
DomainCharts display the distribution of differentially expressed genes among PFAM protein domains. Each domain designation is linked to the Conserved Domain Database (CDD) of the National Center for Biotechnology Information (NCBI), where details regarding domain function, structure, and sequence are readily available. As with GoCharts and KeggCharts, the user can set visualization parameters for sorting output data and for displaying only categories that contain at least a minimum number of genes. The DomainCharts visualization inherits all of the dynamic features of GoCharts and KeggCharts. For further information regarding the functionality of DAVID, visit the FAQ section of the DAVID website.
Using DAVID to mine functional annotation
To demonstrate the functionality of DAVID, we analyzed a list of genes differentially expressed in human peripheral blood mononuclear cells (PBMCs) after incubation with HIV-1 envelope proteins. Details of the experimental, RNA preparation, and GeneChip hybridization procedures, along with details of the chip-to-chip normalizations and statistical analysis of differential gene expression, are provided in Cicala et al. Briefly, primary human PBMCs and monocyte-derived macrophages were incubated for 16 hours with HIV-1 envelope protein (gp120). High-density oligonucleotide microarrays (Affymetrix HU-95A GeneChip) were used to monitor gp120-induced transcriptional events. This analysis resulted in the identification of 402 differentially expressed genes.
Whereas 16 genes modulated by HIV-1 gp120 have previously been associated with HIV replication and/or envelope signaling, the remaining genes are of unknown function or have never been associated with HIV-1 or gp120. Converting this list of genes into biological meaning requires the gathering of pertinent information from several data repositories. For many researchers this process consists of iterative browsing through several databases for each gene, manually gathering gene-specific information regarding sequence, function, pathway, and disease association. In contrast, the systematic approach of DAVID simultaneously adds biologically rich information derived from several public data sources to lists of genes in parallel. Selecting DAVID's Annotation Tool and uploading the list of 402 differentially expressed genes initiates the functional annotation and analysis of the entire dataset. Once submitted, the gene list is stored for the entire analysis session, allowing users to switch between modules without having to resubmit data.
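The contrast between gene-by-gene browsing and list-wide annotation can be sketched as a single lookup pass over several annotation tables. This is only an illustration: the table names, the placeholder identifiers and values, and the function `annotate` are hypothetical and do not reflect DAVID's actual schema.

```python
# Hypothetical annotation tables keyed by gene identifier (placeholder values).
SOURCES = {
    "gene_name": {"id_1": "example gene one", "id_2": "example gene two"},
    "pathway": {"id_1": ["example pathway A"]},
    "domain": {"id_2": ["example domain X"]},
}

def annotate(gene_ids, sources=SOURCES):
    """Return one record per identifier with one field per annotation source.

    Missing annotations are reported as None rather than dropped, so that
    unannotated genes remain visible in the output.
    """
    return [
        {"id": gene_id, **{name: table.get(gene_id) for name, table in sources.items()}}
        for gene_id in gene_ids
    ]

records = annotate(["id_1", "id_2", "id_3"])
for record in records:
    print(record)
```

Keeping a record for every submitted identifier, including those with no annotation, is what allows the unannotated genes to be tracked later on.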
Choosing the GoCharts module opens a new window with a variety of options. Users choose between three general types of classification (biological process, molecular function, and cellular component) and five levels of annotation that represent term coverage and specificity (see Analysis Modules section). Any combination of classification and coverage level can be specified. Also included are options to annotate gene lists with all GO terms available or only the most specific terms, which are referred to as terminal nodes. The option to choose different levels of term specificity provides needed flexibility and thus allows researchers to determine dynamically which level of coverage and specificity best suits their data and stage of analysis. For instance, early-stage analyses may consist of annotating gene lists with very general terms in order to gain a broad understanding of the data. In this case, selecting biological process and level 1 classifies genes using general terms such as 'death' and 'cell communication'. Using increased term specificity facilitates the extraction of more detailed functional information. In this case, selecting biological process and level 5 classifies genes using terms such as 'apoptotic mitochondrial changes' and 'chemosensory perception'.
Because HIV-1 has a major impact on the function of cells of the immune system and their ability to carry out stress responses, we selected the histogram bar representing the number of genes involved in stress response, which opens an HTML table containing the Affymetrix identifier, LocusLink number, gene name, the current classification, and other classifications for all 35 genes (Figure 4b). Having reduced our gene list to those genes involved in stress responses, we further characterized this subset by repeating the GoCharts procedure available at the top of the stress-response HTML table. Choosing molecular function, level 3 produces a new histogram that quickly reveals that nearly half (16/35) of the stress-response genes possess cytokine activity (Figure 4c). Indeed, cytokines have been shown to play an important part in the HIV-1 life cycle, and the results obtained here suggest that treatment of PBMCs with HIV-1 envelope proteins significantly modulates the transcription of numerous cytokine genes. The efficiency with which GoCharts systematically summarized this large dataset with graphic visualizations, while remaining linked to primary data and external resources, drastically improved the discovery process.
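The drill-down step described above, in which a subset is selected under one classification and then re-charted under another, can be sketched as a filter followed by a tally. The annotations below are toy data, not the 402-gene dataset, and the helpers `select` and `tally` are hypothetical names introduced for illustration.

```python
from collections import Counter

# Hypothetical annotations: gene -> {classification type -> set of terms}.
ANNOTATIONS = {
    "g1": {"biological process": {"stress response"},
           "molecular function": {"cytokine activity"}},
    "g2": {"biological process": {"stress response"},
           "molecular function": {"kinase activity"}},
    "g3": {"biological process": {"cell communication"},
           "molecular function": {"cytokine activity"}},
    "g4": {"biological process": {"stress response", "cell communication"},
           "molecular function": {"cytokine activity"}},
}

def select(genes, classification, term, annotations=ANNOTATIONS):
    """Keep the genes annotated with `term` under `classification`."""
    return [g for g in genes if term in annotations[g].get(classification, set())]

def tally(genes, classification, annotations=ANNOTATIONS):
    """Count genes per term of `classification` within the given subset."""
    counts = Counter()
    for g in genes:
        counts.update(annotations[g].get(classification, set()))
    return counts

# Step 1: select the bar of interest under one classification type.
stress_genes = select(ANNOTATIONS, "biological process", "stress response")
# Step 2: re-chart only that subset under a different classification type.
function_counts = tally(stress_genes, "molecular function")

for term, n in function_counts.most_common():
    print(f"{term}: {n}/{len(stress_genes)}")
```

The fraction printed for each term is computed relative to the selected subset rather than the full list, which corresponds to the way the 16/35 figure above is expressed.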
In conclusion, the development of any complete, in-silico discovery system requires full, query-based access to an integrated, up-to-date view of all relevant information, regardless of its physical location and content structure. Still in its infancy, DAVID represents the foundation of our continued development efforts that aim to integrate information-rich data sources and provide quantitative summaries and analysis methods. In addition to the functionality reported here, the methods of Hosack et al. have been incorporated into a DAVID analysis module called EASEonline, which allows users to identify statistically over-represented functional categories within a given list of genes. Committed to maintaining a system able to coevolve with technological advances and the new forms of data that are sure to follow, DAVID's current design elements provide automated solutions that enable researchers to rapidly discover biological themes in large datasets consisting of lists of genes.
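As a rough illustration of what testing for over-represented categories involves, the sketch below computes a generic hypergeometric upper-tail probability. It is not the EASE score and makes no claim about how EASEonline is implemented; the function `upper_tail_p`, its parameter names, and the example counts are hypothetical.

```python
from math import comb

def upper_tail_p(list_hits, list_size, population_hits, population_size):
    """Probability of seeing at least `list_hits` category members when
    `list_size` genes are drawn without replacement from `population_size`
    genes, of which `population_hits` belong to the category."""
    total = comb(population_size, list_size)
    upper = min(list_size, population_hits)
    favourable = sum(
        comb(population_hits, i)
        * comb(population_size - population_hits, list_size - i)
        for i in range(list_hits, upper + 1)
    )
    return favourable / total

# Hypothetical counts: 2 of 2 listed genes fall in a category that holds
# 5 of the 10 genes in the background population.
p = upper_tail_p(list_hits=2, list_size=2, population_hits=5, population_size=10)
print(f"p = {p:.4f}")
```

A small probability indicates that the category is more heavily represented in the list than chance alone would suggest, given the assumed background population.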
The authors are grateful to the referees for their constructive comments and thank Bill Wilton and Mike Tartakovsky for information technology and network support. The project has been funded with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, under Contract No. NO1-C0-56000. The contents of this tool do not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products or organizations imply endorsement by the United States government.
- Quackenbush J: Computational analysis of microarray data. Nat Rev Genet. 2001, 2: 418-427. 10.1038/35076576.
- Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, Tatusova TA, Rapp BA: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2000, 28: 10-14. 10.1093/nar/28.1.10.
- Wu CH, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu Z-Z, Ledley RS, Lewis KC, Mewes H-W, Orcutt BC, et al: The Protein Information Resource: an integrated public resource of functional annotation of proteins. Nucleic Acids Res. 2002, 30: 35-37. 10.1093/nar/30.1.35.
- Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D: GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics. 1998, 14: 656-664. 10.1093/bioinformatics/14.8.656.
- Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P, Robertson LS, Skrzypek MS, Braun BR, Hopkins KL, Kondu P, et al: YPD™, PombePD™, and WormPD™: model organism volumes of the BioKnowledge™ library, an integrated resource for protein information. Nucleic Acids Res. 2001, 29: 75-79. 10.1093/nar/29.1.75.
- Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28: 27-30. 10.1093/nar/28.1.27.
- Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, et al: The Ensembl genome database project. Nucleic Acids Res. 2002, 30: 38-41. 10.1093/nar/30.1.38.
- Gasteiger E, Jung E, Bairoch A: SWISS-PROT: connecting biomolecular knowledge via a protein database. Curr Issues Mol Biol. 2001, 3: 47-55.
- DAVID. [http://www.DAVID.niaid.nih.gov]
- Cicala C, Arthos J, Selig SM, Dennis G, Hosack DA, Van Ryk D, Spangler ML, Steenbeke TD, Khazanie P, Gupta N, et al: HIV envelope induces a cascade of cell signals in non-proliferating target cells that favor virus replication. Proc Natl Acad Sci. 2002, 99: 9380-9385. 10.1073/pnas.142287999.
- Unigene annotation for Affy chips. [http://dot.ped.med.umich.edu:2000/ourimage/pub/shared/JMR_pub_affyannot.html]
- NetAffx Analysis Center. [http://www.affymetrix.com/analysis/index.affx]
- The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nature Genet. 2000, 25: 25-29. 10.1038/75556.
- Sonnhammer ELL, Eddy SR, Durbin R: Pfam: A comprehensive database of protein domain families based on seed alignments. Proteins. 1997, 28: 405-420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L.
- QuickGO. [http://www.ebi.ac.uk/ego/]
- ENSMART. [http://www.ensembl.org/EnsMart]
- FatiGO. [http://fatigo.bioinfo.cnio.es]
- GeneLynx. [http://www.genelynx.org]
- GoMiner. [http://discover.nci.nih.gov/gominer/index.jsp]
- GenMAPP including MAPPFinder. [http://www.genmapp.org]
- MatchMiner. [http://discover.nci.nih.gov/matchminer/html/index.jsp]
- Resourcerer. [http://pga.tigr.org/tigr-scripts/magic/r1.pl]
- Source. [http://source.stanford.edu/cgi-bin/SourceSearch]
- Hosack DA, Dennis G, Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE. Genome Biol. 2003, 4: P4. 10.1186/gb-2003-4-6-p4.
- Searching GenBank. [http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html]
- UniGene. [http://www.ncbi.nlm.nih.gov/UniGene]
- RefSeq. [http://www.ncbi.nlm.nih.gov/RefSeq/]
- LocusLink. [http://www.ncbi.nlm.nih.gov/LocusLink]
- KEGG. [http://www.genome.ad.jp/kegg]
- OMIM. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM]
- Gene Ontology. [http://www.geneontology.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.