DAVID: Database for Annotation, Visualization, and Integrated Discovery
© BioMed Central Ltd 2003
Received: 28 March 2003
Published: 3 April 2003
Functional annotation of differentially expressed genes is a necessary and critical step in the analysis of microarray data. The distributed nature of biological knowledge frequently requires researchers to navigate through numerous web-accessible databases gathering information one gene at a time. A more judicious approach is to provide query-based access to an integrated database that disseminates biologically rich information across large datasets and displays graphic summaries of functional information.
The Database for Annotation, Visualization, and Integrated Discovery (DAVID; http://www.david.niaid.nih.gov) addresses this need via four web-based analysis modules: 1) Annotation Tool - rapidly appends descriptive data from several public databases to lists of genes; 2) GoCharts - assigns genes to Gene Ontology functional categories based on user-selected classifications and term specificity level; 3) KeggCharts - assigns genes to KEGG metabolic processes and enables users to view genes in the context of biochemical pathway maps; and 4) DomainCharts - groups genes according to PFAM conserved protein domains.
Analysis results and graphical displays remain dynamically linked to primary data and external data repositories, thereby furnishing in-depth as well as broad-based data coverage. The functionality provided by DAVID accelerates the analysis of genome-scale datasets by facilitating the transition from data collection to biological meaning.
The post-genomic era has introduced high-throughput methodologies that generate experimental data at rates that exceed knowledge growth. In particular, high-density biochips, including complementary deoxyribonucleic acid (cDNA) microarrays, oligonucleotide microarrays, and rapidly evolving proteomics platforms, represent modern tools able to interrogate biology on a genome-wide scale and generate tens of thousands of data points simultaneously. While researchers are beginning to appreciate the statistical rigor required for the analysis of genome-scale datasets, a rate-limiting step in knowledge growth lies at the transition from statistical significance to biological discovery.
There are currently a number of public efforts focusing on the annotation and curation of gene-specific functional data, including LocusLink, Protein Information Resource (PIR), GeneCards, Proteome, Kyoto Encyclopedia of Genes and Genomes (KEGG), Ensembl, and Swiss-Prot, to name but a few [1–8]. These resources provide exceptional depth and coverage of the functional data available for a given gene, but are not designed to effectively aggregate biological knowledge for hundreds or thousands of genes in parallel. In order to facilitate the functional annotation and analysis of large lists of genes, we have developed the Database for Annotation, Visualization, and Integrated Discovery (DAVID), which provides a set of data mining tools that systematically combine functionally descriptive data with intuitive graphical displays. DAVID provides exploratory visualization tools that promote discovery through functional classification, biochemical pathway maps, and conserved protein domain architectures, while simultaneously remaining linked to rich sources of biological annotation. DAVID's functionality is demonstrated here using the Affymetrix GeneChip data of Cicala et al. However, DAVID expedites the functional annotation and analysis of any list of genes encoded by the human, mouse, rat, or fly genomes.
Materials and Methods
Details of the experimental, RNA preparation, and GeneChip hybridization procedures, along with details of the chip-to-chip normalizations and statistical analysis of differential gene expression, are provided in Cicala et al. Briefly, primary human peripheral blood mononuclear cells (PBMCs) and monocyte-derived macrophages were incubated for 16 hours with HIV-1 envelope protein (gp120). High-density oligonucleotide microarrays (Affymetrix HU-95A GeneChip) were used to monitor gp120-induced transcriptional events.
System Architecture and Maintenance
Sources of Annotation Data Integrated into DAVID
Univ. of Michigan
Format, Submit, and Save Files
Uploading or pasting a list of gene identifiers into DAVID initiates the data mining process. Uploaded files must be tab-delimited text files and can contain one or two columns. Gene identifiers must be in the first column; an optional second column can contain any other type of information (e.g. fold change, p-value, cluster number, etc.). Genes separated by any whitespace character can also be copied and pasted into a textbox and uploaded to DAVID. Gene identifiers can be in the form of Affymetrix probe set identifiers, GenBank (and RefSeq) accession numbers, UniGene cluster numbers, or LocusLink identifiers. HTML tables containing analysis results can be saved in Microsoft Excel format by choosing File > Save As > 'filename.xls', where the '.xls' extension allows Microsoft Excel to directly import the tab-delimited data. HTML table results can also be copied and pasted directly into Microsoft Word and Excel.
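The input format described above can be illustrated with a minimal Python sketch. This is not DAVID's own code: the function names `parse_gene_list` and `parse_pasted_identifiers` are hypothetical, the identifier strings in the usage note are illustrative only, and the choice to reject files with more than two columns is an assumption about how a strict reader of this format might behave.

```python
def parse_gene_list(text):
    """Parse tab-delimited text: identifier in column 1, optional value in column 2.

    Returns a list of (identifier, extra) tuples, where extra is None
    when the second column is absent. Hypothetical helper, not part of DAVID.
    """
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) > 2:
            # Assumption: anything beyond two columns is treated as an error.
            raise ValueError(
                f"line {line_no}: expected 1 or 2 columns, got {len(fields)}"
            )
        identifier = fields[0].strip()
        if not identifier:
            raise ValueError(f"line {line_no}: missing gene identifier")
        extra = fields[1].strip() if len(fields) == 2 else None
        rows.append((identifier, extra))
    return rows


def parse_pasted_identifiers(text):
    """Split pasted text on any run of whitespace (spaces, tabs, newlines)."""
    return text.split()
```

For example, `parse_gene_list("1000_at\t2.5\nNM_000546\n")` yields one row with a second-column value and one row without, and `parse_pasted_identifiers("1000_at NM_000546\nHs.1234")` yields three identifiers.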
Options Provided by the Annotation Tool
Accession number corresponding to the nucleotide sequence.
Cluster containing sequences that represent a unique gene.
Unique and stable identifier for curated genetic loci.
Reference sequence standards for mRNAs.
Official gene symbol included in the Locus Report provided by NCBI.
Official gene name included in the Locus Report provided by NCBI.
Catalog of human genes and genetic disorders.
Probe set description provided by Affymetrix.
Functional summaries included in the Locus Report provided by NCBI.
Controlled vocabulary applied to the functions of genes and proteins. Functional classifications used here are those included in the Locus Report provided by NCBI.
The GoCharts tool graphically displays the distribution of differentially expressed genes among functional categories using the controlled vocabulary of the Gene Ontology Consortium (GO), which provides a structured language that can be applied to the functions of genes and proteins in all organisms even as knowledge continues to accumulate and change. The language is structured as a directed acyclic graph (DAG), wherein term specificity increases and genome coverage decreases as one moves down the hierarchy. In contrast with a true hierarchy, a child term in a DAG may have more than one parent term and may have a different class of relationship with each of its parents. The structure of GO starts with three main categories: Biological Process, Molecular Function, and Cellular Component. Biological Process includes broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions. Molecular Function describes the tasks performed by individual gene products; examples are transcription factor and DNA helicase. The Cellular Component classification type involves subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex. After choosing a classification type, the user selects a level, which determines list coverage and term specificity, using the appropriate radio button. Level 1 provides the highest list coverage with the least term specificity. With each increasing level, coverage decreases while specificity increases, so that level 5 provides the least coverage with the highest term specificity. Classification data are displayed as a bar chart, where the length of each bar represents the number of gene identifiers in that category.
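The relationship between level, coverage, and specificity in a DAG can be sketched in Python as follows. The term names and parent links are a toy graph invented for illustration, not real GO terms, and the rule used here (a term sits at every depth at which it can be reached from the root, with the root at depth 0) is an assumption; the text does not state how DAVID resolves the level of a term that has parents at different depths.

```python
from functools import lru_cache

# Toy DAG (hypothetical terms): each term maps to its parent terms.
PARENTS = {
    "root": (),
    "a": ("root",),
    "b": ("root",),
    "a1": ("a",),
    "b1": ("b",),
    "shared": ("a1", "b"),  # two parents at different depths
}


@lru_cache(maxsize=None)
def levels(term):
    """All depths at which `term` can be reached from the root (root = 0)."""
    parents = PARENTS[term]
    if not parents:
        return frozenset({0})
    return frozenset(d + 1 for p in parents for d in levels(p))


def ancestors(term):
    """The term itself plus every term reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def terms_at_level(term, level):
    """Categories a gene annotated with `term` would fall into at `level`."""
    return {t for t in ancestors(term) if level in levels(t)}
```

In this toy graph, a gene annotated with `shared` maps to both `a` and `b` at level 1, while a gene annotated with `b1` maps to nothing at level 3, which mirrors the loss of coverage at higher levels described above.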
The user can set visualization parameters for sorting output data and for displaying only categories that contain at least a minimum number of genes. Selecting an individual bar opens a new HTML table displaying the gene identifier, LocusLink number, Gene Name, the current classification, and other classifications for each gene in that category. A 'Show All' button opens a new HTML table displaying all classification data, and a 'Show Chart Data' button opens an HTML table containing the underlying chart data, thus allowing users to recreate customized chart graphics in a spreadsheet program. A new chart can be displayed for any subset of genes by selecting the classification type and level using the checkboxes and radio buttons available within the user's current page, allowing for drill-down capabilities. A count of the number of genes annotated is included in the output, and unannotated genes are binned into the 'unclassified' category, thus providing users with an automated tracking system for genes not annotated.
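The chart data behind such a display amount to a per-category gene count with a minimum-size filter and an 'unclassified' bin. A minimal Python sketch follows; the function name `chart_data`, the gene and category names, and the decision to always keep the 'unclassified' row regardless of the threshold are assumptions made for illustration, not a description of DAVID's implementation.

```python
from collections import Counter


def chart_data(gene_to_categories, gene_list, min_genes=1):
    """Count genes per category, binning unannotated genes as 'unclassified'.

    gene_to_categories: dict mapping gene identifier -> iterable of categories.
    Returns (category, count) rows sorted by descending count, then by name.
    """
    counts = Counter()
    for gene in gene_list:
        # A gene that is missing or has no categories goes to 'unclassified'.
        categories = set(gene_to_categories.get(gene) or ["unclassified"])
        for category in categories:
            counts[category] += 1
    rows = [
        (category, n)
        for category, n in counts.items()
        # Assumption: 'unclassified' is always shown so no gene is lost.
        if n >= min_genes or category == "unclassified"
    ]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical annotations for a four-gene list.
annotations = {"g1": ["cat_x", "cat_y"], "g2": ["cat_x"], "g3": []}
rows = chart_data(annotations, ["g1", "g2", "g3", "g4"], min_genes=2)
```

Here `cat_y` is dropped because it holds a single gene, while `g3` (no annotations) and `g4` (absent from the annotation source) are both counted under 'unclassified'.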
The KeggCharts tool graphically displays the distribution of differentially expressed genes among KEGG biochemical pathways. Each pathway is linked to the KEGG pathway map, wherein differentially expressed genes from the original list are highlighted in red. In this view, genes are further linked to additional annotations available through KEGG's DBGET retrieval system. As with GoCharts, the user can set visualization parameters for sorting output data and for displaying only categories that contain at least a minimum number of genes. Selecting an individual bar opens an HTML table displaying the gene identifier, LocusLink number, Gene Name, the pathway, and other classification data. A 'Show All' button opens a new HTML table displaying all classification data. A 'Show Chart Data' button opens an HTML table containing the underlying chart data. Genes not classified by KeggCharts are handled in the same manner as with GoCharts.
The DomainCharts tool graphically displays the distribution of differentially expressed genes among PFAM protein domains. Each domain designation is linked to the Conserved Domain Database (CDD) of NCBI, where details regarding domain function, structure, and sequence are readily available. As with GoCharts and KeggCharts, the user can set visualization parameters for sorting output data and for displaying only categories that contain at least a minimum number of genes. Selecting an individual bar opens an HTML table displaying the gene identifier, LocusLink number, Gene Name, the pathway, and other classification data, and a 'Show All' button opens a new HTML table displaying all classification data. The 'Show Chart Data' button opens an HTML table containing the underlying chart data. Genes not classified by DomainCharts are handled in the same manner as with GoCharts and KeggCharts.
Incubation of primary human PBMCs with HIV-1 gp120 resulted in the differential expression of 402 genes. While 16 genes modulated by HIV-1 gp120 have previously been associated with HIV replication and/or envelope signaling, the remaining genes are of unknown function or have never been associated with HIV-1 or gp120. Converting this list of genes into biological meaning requires the gathering of pertinent information from several data repositories. For many researchers this process consists of iterative browsing through several databases for each gene, manually gathering gene-specific information regarding sequence, function, pathway, and disease association. In contrast, the systematic approach of DAVID adds biologically rich information derived from several public data sources to lists of genes in parallel. Selecting DAVID's Annotation Tool and uploading the list of 402 differentially expressed genes initiates the functional annotation and analysis of the entire dataset. Once submitted, the gene list is stored for the entire analysis session, allowing users to switch between modules without having to resubmit data.
Choosing the GoCharts module opens a new window with a variety of options. Users choose among three general types of classification (biological process, molecular function, and cellular component) and five levels of annotation that represent term coverage and specificity (see Materials and Methods). Any combination of classification and coverage level can be specified. Also included are options to annotate gene lists with all GO terms available or only the most specific terms, which are referred to as terminal nodes. The option to choose different levels of term specificity provides needed flexibility and thus allows researchers to dynamically determine which level of coverage and specificity best suits their data and stage of analysis. For instance, early stage analyses may consist of annotating gene lists with very general terms in order to gain a broad understanding of the data. In this case, selecting biological process and level 1 classifies genes using general terms such as 'death' and 'cell communication'. Using increased term specificity facilitates the extraction of more detailed functional information. In this case, selecting biological process and level 5 classifies genes using terms such as 'apoptotic mitochondrial changes' and 'chemosensory perception'.
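The distinction between annotating with all available terms and with only the most specific terms can be sketched in Python. This sketch assumes that "most specific" means a term with no descendant among the same gene's annotations; the text does not define terminal nodes more precisely, so this reading, the toy graph, and the function names are all hypothetical.

```python
# Toy DAG (hypothetical terms): each term maps to its parent terms.
PARENTS = {
    "root": (),
    "a": ("root",),
    "b": ("root",),
    "a1": ("a",),
    "a2": ("a",),
}


def proper_ancestors(term):
    """Every term reachable from `term` via parent links, excluding itself."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def most_specific(terms):
    """Keep only terms that are not an ancestor of another term in the set."""
    terms = set(terms)
    covered = set()
    for term in terms:
        covered |= proper_ancestors(term)
    return terms - covered
```

For a gene annotated with `root`, `a`, `a1`, and `b`, `most_specific` keeps `a1` and `b`: `a` and `root` are dropped because a more specific descendant is already present.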
Choosing molecular function, level 3, and selecting the 'Chart Values' button produces a new histogram that quickly reveals that nearly half (16/35) of the stress response genes possess cytokine activity (Figure 4C). Indeed, cytokines have been shown to play an important role in the HIV-1 lifecycle, and the results obtained here suggest that treatment of PBMCs with HIV-1 envelope proteins significantly modulates the transcription of numerous cytokine genes. The efficiency with which GoCharts systematically summarized this large dataset with graphic visualizations, while remaining linked to primary data and external resources, drastically improved the discovery process.
DAVID's Annotation Tool, GoCharts, KeggCharts, and DomainCharts combine to provide high-throughput methods for functional annotation and biological discovery, all of which can be accessed via the internet at http://www.david.niaid.nih.gov. The Annotation Tool appended annotations to 402 genes in less than twelve seconds and provided functional summaries and links to external data sources, all of which could be downloaded to a user's personal workstation for further analysis. Complementary features, including graphic visualizations of functional categories, conserved protein domains, and biochemical pathways, were provided by GoCharts, DomainCharts, and KeggCharts and quickly led to the identification of stress response cytokines and protein kinases as major functional categories modulated by HIV-1 envelope proteins. This analysis supports the findings reported by the original authors and illustrates the utility of DAVID in the rapid annotation and analysis of large datasets commonly generated by high-throughput expression profiling.
The development of any complete in silico discovery system requires full, query-based access to an integrated, up-to-date view of all relevant information, regardless of its physical location and content structure. Still in its infancy, DAVID represents the foundation of our continued development efforts that aim to integrate information-rich data sources and provide quantitative analysis methods that promote biological discovery and knowledge growth. We have immediate plans to add new data for identifying relationships among receptor/ligand interaction networks. The incorporation of data that link transcription factors to their respective binding sites within promoter regions is of equal priority. Quantitative methods able to identify enriched functional categories in a list of genes are also under development (Hosack DA et al., manuscript in preparation). While we remain committed to maintaining a system able to co-evolve with technological advancement and the novel forms of data that are sure to follow, DAVID's current design elements provide automated solutions that enable researchers to rapidly discover biological themes in large datasets consisting of lists of genes.
We thank Bill Wilton for information technology and network support. The project has been funded with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, under Contract No. NO1-C0-56000. The content of this tool does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the United States government.
- Quackenbush J: Computational analysis of microarray data. Nat Rev Genet. 2001, 2: 418-427. 10.1038/35076576.
- Wheeler DL, Church DM, Lash AE, Leipe DD, Madden TL, Pontius JU, Schuler GD, Schriml LM, Tatusova TA, Wagner L, Rapp BA: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2000, 28: 10-14. 10.1093/nar/28.1.10.
- Wu CH, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu ZZ, Ledley RS, Lewis KC, Mewes HW, Orcutt BC, Suzek BE, Tsugita A, Vinayaka CR, Yeh LS, Zhang J, Barker WC: The Protein Information Resource: an integrated public resource of functional annotation of proteins. Nucleic Acids Res. 2002, 30: 35-37. 10.1093/nar/30.1.35.
- Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D: GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics. 1998, 14: 656-664. 10.1093/bioinformatics/14.8.656.
- Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P, Robertson LS, Skrzypek MS, Braun BR, Hopkins KL, Kondu P, Lengieza C, Lew-Smith JE, Tillberg M, Garrels JI: YPD™, PombePD™, and WormPD™: model organism volumes of the BioKnowledge™ library, an integrated resource for protein information. Nucleic Acids Res. 2001, 29: 75-79. 10.1093/nar/29.1.75.
- Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28: 27-30. 10.1093/nar/28.1.27.
- Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Clamp M: The Ensembl genome database project. Nucleic Acids Res. 2002, 30: 38-41. 10.1093/nar/30.1.38.
- Gasteiger E, Jung E, Bairoch A: SWISS-PROT: connecting biomolecular knowledge via a protein database. Curr Issues Mol Biol. 2001, 3: 47-55.
- Cicala C, Arthos J, Selig SM, Dennis G, Hosack DA, Van Ryk D, Spangler ML, Steenbeke TD, Khazanie P, Gupta N, Yang J, Daucher M, Lempicki RA, Fauci AS: HIV envelope induces a cascade of cell signals in non-proliferating target cells that favor virus replication. Proc Natl Acad Sci. 2002, 99: 9380-9385. 10.1073/pnas.142287999.
- Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, Kulp D, Siani-Rose MA: NetAffx: Affymetrix probesets and annotations. Nucleic Acids Res. 2003, 31: 82-86. 10.1093/nar/gkg121.
- University of Michigan Annotation data. [http://dot.ped.med.umich.edu:2000/ourimage/pub/shared/JMR_pub_affyannot.html]
- The Gene Ontology Consortium: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- Sonnhammer ELL, Eddy SR, Durbin R: Pfam: A Comprehensive Database of Protein Domain Families Based on Seed Alignments. Proteins. 1997, 28: 405-420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L.