Identifying biological themes within lists of genes with EASE
© BioMed Central Ltd 2003
Received: 17 April 2003
Published: 25 April 2003
EASE is a customizable software application for rapid biological interpretation of gene lists that result from the analysis of microarray, proteomics, SAGE, and other high-throughput genomic data. The biological themes returned by EASE recapitulate manually determined themes in previously published gene lists and are robust to varying methods of normalization, intensity calculation and statistical selection of genes. EASE is a powerful tool for rapidly converting the results of functional genomics studies from "genes to themes."
Biological relevance within lists of genes
High-density microarray and proteomic technologies have enabled the discovery of global patterns of biological responses with respect to experimental or natural perturbations. Much work has addressed the issues of data normalization and statistical selection of genes significantly modulated or clustered based upon expression profiles. The net result of these efforts is one or more lists of genes. Unfortunately, little work has addressed the issue of rapidly identifying biological themes in such lists. Most investigators currently annotate genes one at a time using internet-based databases or manual literature searches. Following this tedious process, many researchers struggle to identify the most salient biological themes in order to make sense of their results and have no systematic way to prioritize these themes for further analysis. A parallel issue in interpreting such data is how to leverage the ever-expanding flood of functional genomic data and tools. We developed the Expression Analysis Systematic Explorer (EASE) to automate the process of biological theme determination for lists of genes and to serve as a customizable gateway to online analysis tools. This is the first report to show that the highest-ranking themes derived by a computational method can recapitulate manually derived themes in previously published results, and that these themes are stable to varying methods of gene selection.
EASE performs three basic functions with any list of genes: 1) over-representation analysis of functional gene categories, 2) customizable linking to online tools, and 3) creation of descriptive annotation tables. Each of these functions uses a system of tab-delimited text files that are easy to customize and update. EASE is an easy-to-use, customizable tool that allows investigators to systematically mine the mass of functional information associated with data generated by microarray, proteomics or SAGE studies.
EASE uses customizable text files for theme discovery, annotation, and linking to online tools
To analyze a gene list, EASE first maps the gene identifiers to a standardized gene accession (SGA) system via a simple text file in the \Data\Convert\ directory. The default SGA system used by EASE is LocusLink numbers. Upon conversion to the SGA system, EASE maps the genes to biological categories within various classification systems. Each system is specified in a text file in the \Data\Class\ directory that maps many-to-many relationships between genes and gene categories within the classification system. Similarly, EASE maps genes to annotation fields specified in files of the \Data\ directory. Users can therefore utilize any system of identifying genes with any custom annotation fields or categorical systems by creating the associated text files in the appropriate directory, as outlined in the help files of EASE. EASE comes equipped with an automated update routine that downloads and parses public annotation data sources and installs a LocusLink-based system of files, thereby allowing researchers to use EASE with the most up-to-date annotation information.
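The two-step mapping described above (identifier to SGA, then SGA to categories) can be sketched in a few lines of Python. This is an illustration only: the two-column tab-delimited layout, the function names, and the sample identifiers are assumptions made for the example and are not taken from the EASE distribution or its help files.

```python
import csv
import io
from collections import defaultdict


def read_pairs(handle):
    """Yield (key, value) pairs from a two-column tab-delimited file."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2 and row[0].strip():
            yield row[0].strip(), row[1].strip()


def load_conversion(handle):
    """Map each input identifier to one standardized gene accession (SGA)."""
    return dict(read_pairs(handle))


def load_classification(handle):
    """Map each SGA to the set of categories it belongs to (many-to-many)."""
    categories = defaultdict(set)
    for sga, category in read_pairs(handle):
        categories[sga].add(category)
    return categories


def to_sga(identifiers, conversion):
    """Convert identifiers to a set of SGAs; unknown identifiers are skipped
    and several identifiers for the same gene collapse to one entry."""
    return {conversion[i] for i in identifiers if i in conversion}


# Hypothetical file contents standing in for files under \Data\Convert\
# and \Data\Class\ (assumed layout: one pair per line, tab-separated).
convert_file = io.StringIO("ACC_A\t101\nACC_B\t101\nACC_C\t202\n")
class_file = io.StringIO(
    "101\timmune response\n101\tsignal transduction\n202\timmune response\n"
)

conversion = load_conversion(convert_file)
classification = load_classification(class_file)
genes = to_sga(["ACC_A", "ACC_B", "ACC_C", "ACC_UNKNOWN"], conversion)
```

Here `ACC_A` and `ACC_B` both resolve to the same hypothetical accession, so `genes` holds two entries rather than three.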
EASE constructs hyperlinks to definitions for various categorical systems and the gene categories therein with configuration files in the \Data\Class\URL data\ directory. EASE is also capable of loading the genes in the current gene list into various online tools by using simple URL configuration text files in the \Links\ directory. Both types of configuration files are text files that are simple to create or modify to facilitate the addition of new links to online tools and definitions for new categorical systems added by the user.
For over-representation analysis, EASE can utilize any number of systems of categorizing genes simultaneously. EASE calculates over-representation with respect to the total number of genes assayed and annotated within each system to allow for side-by-side comparisons of categories from categorization systems with varying levels of annotation. The conversion of gene identifiers to an SGA system such as LocusLink numbers is essential to the over-representation analysis to ensure that a single gene represented by more than one identifier (typical of GenBank) receives only one "vote" for each of its categories.
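As a rough sketch of the counting step described above, the following Python function tallies the four numbers an over-representation test needs for each category, restricting both the list and the population to genes annotated in the given system. The function name, the dictionary layout, and the sample accessions are hypothetical; EASE's internal data structures are not described in this article.

```python
from collections import Counter


def category_counts(gene_list, population, classification):
    """Return {category: (list_hits, list_total, pop_hits, pop_total)}.

    Only genes annotated in this classification system are counted, so the
    totals differ between systems with different annotation coverage.
    Working with sets gives each gene a single "vote" per category.
    """
    annotated_pop = {g for g in population if classification.get(g)}
    annotated_list = set(gene_list) & annotated_pop
    pop_hits = Counter(c for g in annotated_pop for c in classification[g])
    list_hits = Counter(c for g in annotated_list for c in classification[g])
    return {
        c: (list_hits[c], len(annotated_list), pop_hits[c], len(annotated_pop))
        for c in list_hits
    }


# Hypothetical accessions and categories, for illustration only.
classification = {
    "101": {"immune response"},
    "202": {"immune response", "apoptosis"},
    "303": {"apoptosis"},
    "404": set(),  # assayed, but not annotated in this system
}
population = {"101", "202", "303", "404", "505"}
gene_list = ["101", "101", "202", "505"]  # "101" appears twice, counted once

counts = category_counts(gene_list, population, classification)
```

In this example the unannotated genes `404` and `505` drop out of both totals, leaving three annotated genes in the population and two in the list.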
EASE uses the three systems of the Gene Ontology as default categorization systems; however, any set of custom or public systems can be analyzed simultaneously, including SwissProt and PIR keywords, transcription factor regulation, protein domains, pathway membership, chromosomal location, and MeSH headings or keywords extracted from gene-associated literature.
The user has a choice of two statistical measures of over-representation: the one-tailed Fisher exact probability, or a variant thereof referred to as the "EASE score". The EASE score is calculated by penalizing (removing) one gene within the given category from the list and calculating the resulting Fisher exact probability for that category. Because removing a gene that belongs to the category can only make the category look less enriched, the EASE score represents the upper bound (the most conservative value) of all possible jackknife probabilities, and it has the advantage of penalizing the significance of categories supported by few genes. The EASE score thus favors more robust categories than the Fisher exact probability.
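The two measures can be illustrated with a short Python sketch that uses only the standard library. It rests on one reading of the description above: the penalized gene is removed from the list only, so both the category count and the list size drop by one while the population counts stay unchanged. That reading, and the function and parameter names, are assumptions made for the example and may not match the EASE implementation.

```python
from math import comb


def fisher_one_tailed(list_hits, list_total, pop_hits, pop_total):
    """One-tailed Fisher exact (hypergeometric) probability of seeing at
    least `list_hits` category members in a list of `list_total` genes drawn
    from `pop_total` genes, of which `pop_hits` belong to the category."""
    upper = min(list_total, pop_hits)
    tail = sum(
        comb(pop_hits, x) * comb(pop_total - pop_hits, list_total - x)
        for x in range(list_hits, upper + 1)
    )
    return tail / comb(pop_total, list_total)


def ease_score(list_hits, list_total, pop_hits, pop_total):
    """Fisher probability after removing one category member from the list
    (assumed reading: list_hits - 1 and list_total - 1, population unchanged)."""
    if list_hits < 1:
        return 1.0
    return fisher_one_tailed(list_hits - 1, list_total - 1, pop_hits, pop_total)


# Hypothetical counts: 3 of 10 listed genes fall in a category that holds
# 5 of the 100 genes in the population.
p_fisher = fisher_one_tailed(3, 10, 5, 100)
p_ease = ease_score(3, 10, 5, 100)

# A category supported by a single gene cannot be significant under this score.
p_single = ease_score(1, 10, 5, 100)
```

Under these assumptions `p_ease` is never smaller than `p_fisher`, and a category supported by only one gene receives a score of 1, which is how the penalty discounts thinly supported categories.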
Exploring a gene list with EASE
The core function of EASE is to annotate or analyze a list of genes input as gene identifiers, and to display the result in the system web browser or save it in tab-delimited text or Microsoft Excel format. The identifiers can be loaded from a text file or pasted into EASE from another application. Upon input of identifiers, the user can generate an annotation table by clicking the "Annotate Genes" button (Figure 1). The user can also link to any number of online tools such as DAVID via the "Link to:" list box; this function automatically loads the information specific to the current gene list into the online tool, thereby allowing EASE to serve as a convenient interface to these resources.
The identification of biological themes in the gene list is initiated by clicking the "Find over-represented gene categories" button. This function returns an output of all gene categories ranked by over-representation, with associated probabilities, counts used in the probability calculation, associated genes from the original list and links to various online tools for these genes. The most significantly over-represented categories that result from this analysis are deemed "biological themes" of the gene list. The user can optionally limit these analyses to any particular set of gene categories to answer questions such as "what is special about the mitochondrial genes on my list compared to all mitochondrial genes on the microarray?" The user can further use the "Refine" functionality of EASE to remove specific genes from the original list and enable an over-representation analysis of the remaining genes exclusively. These two functions can be applied repeatedly until the gene list is thoroughly characterized. EASE also allows for comparisons of gene lists at a thematic level, wherein the results are expressed in terms of gene categories over-represented in one list compared to all lists combined.
Calculating statistics on thousands of gene categories can lead to a few seemingly significant probabilities due simply to random chance. To address this multiple comparison issue, EASE can apply a variety of probability corrections, including Bonferroni-type methods and bootstrap methods. The bootstrap methods iteratively run over-representation analyses on random gene lists to determine more accurately the true probability of observing a given categorical enrichment. Nevertheless, EASE is most appropriately viewed as an exploratory tool that directs the attention of the researcher to enriched biological themes by prioritizing functional categories based on the significance of over-representation.
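The two families of correction mentioned above can be sketched as follows. The Bonferroni step is standard. The resampling step is only one plausible scheme, chosen for illustration: it draws random lists of the same size from the population and compares each observed probability with the smallest probability found in each random list. The article does not specify EASE's exact procedure, so the function names, the minimum-probability rule, and the sample data are all assumptions.

```python
import random
from math import comb


def upper_tail(k, n, K, N):
    """One-tailed hypergeometric probability P(X >= k)."""
    total = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(n, K) + 1))
    return total / comb(N, n)


def bonferroni(p_values):
    """Multiply each probability by the number of tests, capped at 1."""
    m = len(p_values)
    return {c: min(1.0, p * m) for c, p in p_values.items()}


def category_p_values(gene_list, population, members):
    """Score every category; `members` maps category -> set of member genes."""
    genes = set(gene_list)
    n, N = len(genes), len(population)
    return {c: upper_tail(len(genes & m), n, len(m), N) for c, m in members.items()}


def resampled_p_values(gene_list, population, members, iterations=1000, seed=0):
    """Fraction of random lists whose best category is at least as extreme
    as each observed category (an assumed min-p resampling scheme)."""
    rng = random.Random(seed)
    pool = sorted(population)
    size = len(set(gene_list))
    observed = category_p_values(gene_list, population, members)
    null_minima = []
    for _ in range(iterations):
        random_list = rng.sample(pool, size)
        scores = category_p_values(random_list, population, members)
        null_minima.append(min(scores.values()))
    return {
        c: sum(m <= p for m in null_minima) / iterations
        for c, p in observed.items()
    }


# Hypothetical population of 30 genes with two 5-gene categories.
population = {f"g{i:02d}" for i in range(30)}
members = {
    "x": {f"g{i:02d}" for i in range(0, 5)},
    "y": {f"g{i:02d}" for i in range(5, 10)},
}
gene_list = sorted(members["x"])  # every listed gene belongs to category "x"

raw = category_p_values(gene_list, population, members)
corrected = resampled_p_values(gene_list, population, members, iterations=200)
```

A fixed seed keeps the sketch reproducible; a larger number of iterations gives a finer-grained estimate at a proportional cost in run time.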
EASE themes recapitulate manually-determined themes
The published gene lists of Kayo et al. were re-analyzed with EASE to test the ability of EASE to generate themes comparable to manually determined themes. In the Kayo study, the authors generated four gene lists corresponding to genes up- and down-regulated in primate muscle in response to aging or caloric restriction. These gene lists were analyzed with the categorical over-representation function of EASE using EASE scores that were corrected for multiplicity using 10,000 bootstrap iterations. All significant (p < 0.05) categories resulting from each list were compared to the themes manually determined and published by Kayo et al. (Figure 2).
EASE themes are robust
Figure 3a demonstrates the instability of the size and overlap of the gene lists that result from varying gene selection methods. The percentage of genes overlapping in any two lists was highly variable, and ranged from 7% to 60%. In spite of this striking variation, the top five biological themes returned by EASE for each of the eight gene lists were virtually the same; all derived from a group of six categories that implicate a vigorous interferon-induced immune response in patients with rebounding HIV viral loads (Figure 3b). The conversion of genes to themes with EASE allowed the "biological result" of the experiment to be determined despite substantial differences in gene list content resulting from the use of various normalization, gene intensity and statistical selection methods.
"Genes To Themes" with EASE: Possible uses of the EASE Method
EASE rapidly converts a list of genes into an ordered table of robust biological themes that summarize the biological result of the experiment. This method has immediate utility for finding themes that most differentiate lists of genes, e.g. up-regulated versus down-regulated in a single experiment, but could potentially be applied to compare the results of different experiments, even involving different species and/or microarray platforms. The EASE method has proven useful for a SAGE analysis of cancer (W.D. Stein, manuscript in preparation) and for microarray analyses of cancer (A. Domkowski, manuscript in preparation; K. Akagi, personal communication), cataracts (M. Kantorow, manuscript in preparation) and immune function in HIV disease [9, 10]. The EASE method also enables a rapid assay for overlap between gene clusters identified in any number of experiments when the user creates gene classification schema based upon these clusters. EASE can potentially be used to facilitate the development of data normalization and gene selection criteria by observing the highest enrichment attained for EASE themes within a particular experiment in which the biological phenomenon is well characterized and confirmed. EASE allows investigators to fully leverage the potential of high-throughput functional genomics technologies to infer biological themes. A full-featured version of EASE is freely available to non-profit researchers for use on Windows operating systems (http://david.niaid.nih.gov/david/ease.htm), and a limited online version of the EASE over-representation function is available on the DAVID website.
1. Heller MJ: DNA microarray technology: devices, systems, and applications. Annu Rev Biomed Eng. 2002, 4: 129-153. 10.1146/annurev.bioeng.4.020702.153438.
2. Quackenbush J: Microarray data normalization and transformation. Nat Genet. 2002, 32 Suppl: 496-501. 10.1038/ng1032.
3. Slonim DK: From patterns to pathways: gene expression data analysis comes of age. Nat Genet. 2002, 32 Suppl: 502-508. 10.1038/ng1033.
4. Database for Annotation, Visualization and Integrated Discovery. [http://david.niaid.nih.gov/]
5. Kayo T, Allison DB, Weindruch R, Prolla TA: Influences of aging and caloric restriction on the transcriptional profile of skeletal muscle from rhesus monkeys. Proc Natl Acad Sci USA. 2001, 98: 5093-5098. 10.1073/pnas.081061898.
6. Li C, Wong WH: Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001, 98: 31-36. 10.1073/pnas.011404098.
7. Sidorov IA, Hosack DA, Gee D, Yang J, Cam MC, Lempicki RA, Dimitrov DS: Oligonucleotide microarray data distribution and normalization. Information Sciences. 2002, 146: 65-71. 10.1016/S0020-0255(02)00215-3.
8. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.
9. Cicala C, Arthos J, Selig SM, Dennis G, Hosack DA, Van Ryk D, Spangler ML, Steenbeke TD, Khazanie P, Gupta N, et al: HIV envelope induces a cascade of cell signals in non-proliferating target cells that favor virus replication. Proc Natl Acad Sci USA. 2002, 99: 9380-9385. 10.1073/pnas.142287999.
10. Chun TW, Justement JS, Lempicki RA, Yang J, Dennis G, Hallahan CW, Sanford C, Pandya P, Liu S, McLaughlin M, et al: Gene expression and viral production in latently infected, resting CD4+ T cells in viremic versus aviremic HIV-infected individuals. Proc Natl Acad Sci USA. 2003, 100: 1908-1913. 10.1073/pnas.0437640100.