- Open Access
L2L: a simple tool for discovering the hidden significance in microarray expression data
© Newman and Weiner; licensee BioMed Central Ltd. 2005
- Received: 5 April 2005
- Accepted: 26 July 2005
- Published: 31 August 2005
L2L is a database consisting of lists of differentially expressed genes compiled from published mammalian microarray studies, along with an easy-to-use application for mining the database with the user's own microarray data. As illustrated by re-analysis of a recent study of diabetic nephropathy, L2L identifies novel biological patterns in microarray data, providing insights into the underlying nature of biological processes and disease. L2L is available online at the authors' website [http://depts.washington.edu/l2l/].
- Gene Ontology
- Diabetic Nephropathy
- Glomerular Cell
- Coexpression Relationship
- Ageing Human Brain
In only a few years since their development, high-throughput, whole-genome DNA microarrays have become an invaluable tool throughout biology. The appeal of microarrays seems most irresistible when the biological problem is most intractable; microarrays have become perhaps the most popular contemporary tool for hypothesis generation. Yet interpreting the mountain of data produced by a microarray experiment can be a frustrating chore. The most common outcome of such an experiment is a list of genes, or many such lists: genes that are induced or repressed under one condition or another, at one time point or another, in one cluster or another. The daunting task is to extract some meaning from these lists, either by identifying 'critical genes' which might single-handedly produce a biological effect, or by finding patterns in the list that point to an underlying biological process. The latter universally involves annotating each gene on the list and looking for groups of genes that share a particular characteristic. Until recently, this was done entirely by hand. Each gene was assigned, after a laborious literature search, to an arbitrary functional category like 'DNA repair' or 'metabolism'. A hypothesis might be based on which arbitrary categories appeared most often. Like any non-systematic approach, this one is vulnerable to our very human knack of seeing whatever pattern we wish in a noisy field. The Gene Ontology (GO) consortium has brought systematic order to the field of gene annotation by pre-categorizing genes by biological process, molecular function, and cell component - thus eliminating the pattern-creating risk of post hoc annotation. A number of software tools now exist to automate the process of annotating a list of genes with GO categories. Several of these, including EASE, GOMiner, Onto-Express and GO::TermFinder, also calculate the over-abundance of each category in the list, along with its statistical significance.
However, even after functional annotation of the list of genes, uncertainty remains as to whether the results advance understanding of the biology at work in the system, and, if the system is a complex disease, whether the results help explain why the gene expression changes occurred. An alternative approach to interpreting gene expression data is to compare it with other related (or potentially related) gene expression data. The motivation is that microarray experiments exhibiting common changes in gene expression are likely to share one or more underlying molecular mechanisms. Furthermore, in some experiments, the underlying cause of the gene expression changes is well-defined: a specific gene deletion, for example, or treatment with a single receptor ligand. In such cases, the ability to connect the user's experiment with gene expression changes caused by a well-defined perturbation may lead immediately to a hypothesis regarding the underlying mechanism in the system under study.
The need for a standardized format for presenting and storing microarray data from disparate platforms has been recognized for several years. A consortium of researchers has detailed a standardized format for presenting microarray data (MIAME) as well as a markup language in which to encode those now-standardized data (MAGE-ML). The data can be deposited in any of a number of large public repositories, including CIBEX, ArrayExpress, Oncomine and the NIH's Gene Expression Omnibus (GEO) [10–13]. All of these include web-accessible data-mining tools for browsing experiments and searching for the expression results associated with a particular gene. The sheer volume of deposited data is staggering, and represents a gold mine for bioinformaticians. Yet it all remains remarkably inaccessible to lay biologists. Although we can search GEO, for example, for microarray-identified genes one-by-one, there is no simple way to compare our data en masse with any other data in the repository, much less against all the data in the repository. Furthermore, repositories can make it difficult to extract the original results from the mass of deposited data; an interested user is often required to essentially re-analyze the data, with little knowledge of the original data analysis protocol or, in some cases, without access to all of the relevant data (for instance, GEO submissions do not usually include Affymetrix test-statistic data, a qualitative 'change call' which can be more accurate than the quantitative fold-change for detecting differential expression).
The L2L Microarray Database collects an interesting subset of this public data in its most essential and accessible form - simple, well-annotated lists of genes, using a universal identifier, which were found to be either upregulated or downregulated under a particular condition. It is not intended to be an alternative to the public repositories, but an accessible and utilitarian supplement. The database can be easily applied to the global analysis of any gene expression experiment, producing insights that go well beyond gene-by-gene annotation. The development of L2L was inspired by our efforts to extract meaning from our own microarray analysis of the progeroid Cockayne syndrome (Newman JC, Bailey AD, Weiner AM, unpublished data), so the publications included in the database initially reflected topics thought to be related to this disease - ageing, cancer and DNA damage. Since then, the scope of the publications we included has expanded considerably to include chromatin structure, immune and inflammatory mediators, the hypoxic response, adipogenesis, growth factors, cell cycle regulators, and others. In spite of the parochial origins of the database, the wide range of topics now covered will make L2L of general interest to any investigator using microarrays to study human (and more generally, mammalian) biology. We demonstrate the breadth of L2L's utility below, by re-analyzing a published microarray dataset from a study of diabetic nephropathy - a subject completely unrelated to our original interests.
We faced two major challenges in the creation of L2L, one philosophical and one practical. The philosophical problem, which has prevented any significant effort in this direction to date, is that no two microarray experiments are ever perfectly comparable. There is an almost infinite combinatorial complexity of organism, tissue type or cell line, RNA isolation technique, microarray platform, scanning instrument, experimental design, and data analysis technique - even if the question being asked is identical. To make a tool like L2L even possible, it is essential to exclude any incomparable information from each experiment, and convert the remainder to a common language that can be shared by all included experiments. We therefore removed all references to platform-specific probe identifiers, primarily because these would limit L2L to comparing experiments performed on identical platforms, but also because many manuscripts do not report probe IDs. Instead, we converted the probe IDs to the HUGO-approved symbols of the genes they each represent, according to the manufacturer's annotations, and ignored those that have no gene association because these cannot be reliably compared across platforms. We also excluded the reported magnitude of expression changes, because fold-changes are often not comparable across platforms. Furthermore, fold-change can be a misleading indicator of the significance of expression changes, especially for platforms like Affymetrix GeneChips that use an independent, and more robust, change call calculation. Finally, ignoring fold-changes vastly simplifies the computational task of comparing hundreds or thousands of lists.
The practical challenge was the extraction of published data and conversion to HUGO gene symbols. This was by far the most time-consuming of the tasks required to create L2L, despite the liberal use of automated tools. The first hurdle was the difficulty of extracting data from published papers in a usable form. Many tables of genes are published as graphical figures rather than textual tables. Supplemental data is often in the form of HTML tables, rather than text files. In both cases, the data are easy to view, but difficult to extract for other uses. More willful is the use of digital-rights management by certain journals to frustrate copying of any information from the electronic (PDF) version of the paper. In all of these situations, laborious manual transcription was required, instead of simple keystrokes to cut-and-paste the data. Repositories like GEO are only a partial solution to this presentation problem; the repositories contain all the raw data, but often lack information about the data analysis used to define a robust change, as well as the actual lists of robustly changed genes.
The second hurdle was actually identifying the genes on published lists. Many publications do not provide an unambiguous reference for each gene - only a common name and/or description. Those that do provide unambiguous references do so in a variety of forms - a HUGO name, LocusLink ID, GenBank accession, or (rarely) commercial probe ID. Online tools exist to interconvert many of these [17, 18] and were used whenever possible to convert each list to HUGO names. Ambiguous references were hand-converted by finding the proper match in LocusLink or EntrezGene. Some lists in the L2L Microarray Database are derived from mouse experiments; these were first converted to standard mouse gene names, then mapped to the corresponding HUGO gene name using the HomoloGene database  with an ad hoc tool. Any genes without HomoloGene entries were matched by hand in EntrezGene to the proper human homolog. Any gene reference, mouse or human, which could not be unambiguously mapped to a HUGO name was ignored. Duplicates within a list were also ignored. The fraction of the original data that could eventually be mapped to a HUGO name varied with the quality of the gene reference, the proportion of expressed sequence tags (ESTs), and whether mouse-human conversion was required. Most datasets with unambiguous human references have greater than 90% of non-EST, non-duplicate gene references represented in the L2L list of HUGO names. Mouse-human conversion reduced this proportion somewhat (largely due to immunity-related genes), as did descriptive gene references (due to ambiguity). Each list in the database is annotated with a meaningful short name, a longer description, the platform used to generate the list (for example, Affymetrix U95Av2), one or more keywords, and the PubMed ID of the source publication.
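The conversion rules described here (translate each reference, map mouse genes to their human homologs, ignore anything that cannot be mapped, ignore duplicates) can be sketched in a few lines. This is an illustrative sketch only, not the authors' ad hoc tool: the function name, the probe identifiers and both lookup tables are hypothetical, and the reported fraction simply counts mapped references rather than reproducing the non-EST, non-duplicate accounting used in the text.

```python
def to_human_symbols(gene_refs, translator, homologs=None):
    """Convert gene references to a de-duplicated list of human symbols.

    translator: reference (e.g. probe ID or accession) -> gene symbol
    homologs:   optional mouse symbol -> human symbol table
    Returns (symbols, fraction_mapped); unmappable references are ignored.
    """
    symbols = []
    seen = set()
    mapped = 0
    for ref in gene_refs:
        symbol = translator.get(ref)
        if symbol and homologs is not None:
            symbol = homologs.get(symbol)  # mouse -> human step
        if not symbol:
            continue  # ambiguous or unmapped reference: ignore it
        mapped += 1
        if symbol not in seen:  # ignore duplicates within a list
            seen.add(symbol)
            symbols.append(symbol)
    fraction = mapped / len(gene_refs) if gene_refs else 0.0
    return symbols, fraction


# Hypothetical mouse data: probe IDs and tables are made up for illustration.
translator = {
    "m_probe_1": "Trp53",
    "m_probe_2": "Cdkn1a",
    "m_probe_3": "Trp53",    # second probe for the same gene
    "m_probe_4": "Gm00000",  # placeholder gene with no human homolog entry
}
homologs = {"Trp53": "TP53", "Cdkn1a": "CDKN1A"}
refs = ["m_probe_1", "m_probe_2", "m_probe_3", "m_probe_4", "m_probe_5"]

symbols, fraction = to_human_symbols(refs, translator, homologs)
print(symbols, fraction)
```

For human data the `homologs` argument would simply be omitted, so the same routine covers both cases.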
In addition to the L2L Microarray Database, L2L includes a set of lists for each of the three organizing principles of Gene Ontology - biological process, molecular function and cell component. These lists were compiled from the July 2004 GO association tables, which include associations between UNIPROT names and GO terms. UNIPROT's flat-files associate many human UNIPROT entries with a HUGO alias; an ad hoc tool was used to extract these relationships and convert the UNIPROT GO term assignments to unique HUGO GO term assignments. Another ad hoc tool then created a list for each GO term that contained every HUGO name associated with either that term or any of its descendants. Any lists with fewer than five genes were discarded because comparison to such a small list is unlikely to be informative. In all, there remained 2,169 GO-derived lists with a total of about 240,000 annotations, divided among the three organizing principles. A more detailed description of how the GO lists were compiled, along with downloadable versions of the ad hoc tools, is available on the L2L website.
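The step that builds one list per GO term, containing every gene annotated to that term or to any of its descendants, amounts to pushing each direct annotation up to all ancestors of its term. The sketch below shows one way this might be done; it is not the authors' tool, and the term identifiers, gene names and function names are hypothetical. Only the five-gene minimum comes from the text.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Return every ancestor of a term in a GO-like directed acyclic graph."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def build_go_lists(parents, direct_annotations, min_size=5):
    """Build term -> gene set, including genes annotated to descendants.

    Lists with fewer than min_size genes are discarded.
    """
    lists = defaultdict(set)
    for term, genes in direct_annotations.items():
        for target in {term} | ancestors(term, parents):
            lists[target] |= set(genes)
    return {term: genes for term, genes in lists.items() if len(genes) >= min_size}


# Hypothetical three-level hierarchy: grandchild -> child -> root.
parents = {"GO:child": ["GO:root"], "GO:grandchild": ["GO:child"]}
direct = {
    "GO:root": {"A"},
    "GO:child": {"B", "C", "D"},
    "GO:grandchild": {"E", "F"},
}
go_lists = build_go_lists(parents, direct)
for term in sorted(go_lists):
    print(term, sorted(go_lists[term]))
```

In this toy hierarchy the grandchild term has only two genes and is dropped, while its genes still count toward the child and root lists.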
Finally, L2L is not limited to using the four included sets of lists: L2L Microarray Database, GO: Biological Process, GO: Molecular Function, and GO: Cell Component. The modular nature of the tool means that new sets of lists can be created from any source of gene annotations. Some examples include protein-protein interaction databases like DIP, BRITE or BIND [20–22]; pathway annotations from KEGG, BioCarta or GenMAPP [23, 24]; experimental gene expression modules ; or the associations of gene names with literature keywords that can be compiled using tools like PubGene and TXTGate [26, 27]. Any source of gene annotation that can be represented as a set of lists, each specifying a group of genes that share some characteristic, can be easily used with L2L. We hope that the simple and open file formats will encourage others to contribute their own sets of lists to augment L2L or to create similar platform-independent resources.
Although we designed L2L for the lay biologist, we hope that the L2L Microarray Database will prove to be a valuable resource for the bioinformatician as well. For example, many investigators are interested in mapping networks of gene coexpression relationships with the goal of inferring previously unknown functional relationships, or even physical interactions, from shared expression profiles [28–30]. The L2L database is a significant source of primary data for such coexpression analyses. It currently contains 28,026 data points derived from microarray experiments, each of which represents a significant gene expression change. These data points encompass 10,151 gene names - a substantial fraction of the 33,000 HUGO names that had been assigned at the time of writing - and 6,009 of these genes occur at least twice in the database. Among these genes, there are 258,461 unique positive coexpression relationships (a pair of genes found together on different lists) that are found on at least two, and in some cases as many as 16, different lists. There are 20,338 negative coexpression relationships (pairs of genes that are inversely regulated, that is, one appearing on the 'up' and the other on the 'down' list for the same condition) that are found in at least two, and as many as ten, different conditions. We believe the L2L database's catalog of co-expression relationships is one of the largest yet available for human genes, and is based on more robust expression changes and a broader set of experimental conditions than other, albeit more sophisticated, efforts .
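Counting positive coexpression relationships of this kind reduces to counting, for every pair of genes, the number of lists on which both appear. A minimal sketch follows, assuming each list is just a collection of gene symbols; the list names and their contents are invented for illustration and are not taken from the L2L database.

```python
from collections import Counter
from itertools import combinations


def coexpression_pairs(gene_lists, min_lists=2):
    """Return gene pairs that appear together on at least min_lists lists."""
    counts = Counter()
    for genes in gene_lists.values():
        # Sort so that each pair has one canonical (alphabetical) order.
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_lists}


# Hypothetical lists, for illustration only.
gene_lists = {
    "hypoxia_up": ["VEGFA", "ADM", "BNIP3"],
    "ischemia_up": ["VEGFA", "ADM", "HK2"],
    "interferon_up": ["STAT1", "IRF7"],
}
pairs = coexpression_pairs(gene_lists)
print(pairs)
```

Negative relationships could be counted the same way by pairing each gene on the 'up' list for a condition with each gene on the matching 'down' list.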
The modular design of L2L means that there are a variety of ways to interact with the L2L application, depending on the user's needs. The simplest is through the web interface. In addition to the four-step form described above, there is a 'More Options' page that allows the user to upload a custom translator library for microarray platforms that are not on the menu. Thus, while L2L is intended primarily for use with whole-genome expression microarrays, it can be used with data from any genomic or proteomic analysis. Alternatively, the L2L application itself can be downloaded and run from the command line on any computer with Perl and a UNIX-like command shell. This is ideal for users who want to use a custom set of lists or who need to rapidly process many different data files in a batch mode. L2L includes a basic textual interface that prompts the user for the location of the three necessary inputs: data file, translator library and set of lists. A batch mode bypasses the interface and allows the processing of any number of data files, each from a different microarray platform, against any or all sets of lists with a single command. Users are also free to download the entire L2L website and run it on their own web server.
L2L is remarkably fast because all of the potentially billions of search-for-match operations are implemented as hash-table lookups in Perl. Since relatively few data are stored in memory at any one time, performance is processor-bound on modern machines, and scales linearly only with the combined size of the lists - not with the size of the data file. A comparison of virtually any size data file to all 357 lists in the database, along with the creation of all output files, takes only about 15 seconds on a 1.4 GHz PowerPC. All files associated with L2L, including data, translator library and list, are in a simple tab-delimited, flat-file format. A detailed description of each file type is available on the L2L website; users can create their own files from any text editor.
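The scaling behaviour described here follows from the lookup strategy: once the user's data are held in a hash-based structure, each probe on a database list costs one constant-time membership test on average, so the work grows with the combined size of the lists. The sketch below illustrates the idea with a Python set standing in for a Perl hash. It is not the L2L source code, and the probe and list names are hypothetical.

```python
def count_matches(data_probes, lists_by_name):
    """Count, for each list, how many of its probes occur in the user's data."""
    data = set(data_probes)  # built once; membership tests are O(1) on average
    return {
        name: sum(1 for probe in probes if probe in data)
        for name, probes in lists_by_name.items()
    }


# Hypothetical inputs, for illustration only.
data_probes = ["p1", "p2", "p3", "p4"]
lists_by_name = {
    "list_a": ["p1", "p4", "p9"],
    "list_b": ["p7", "p8"],
}
matches = count_matches(data_probes, lists_by_name)
print(matches)
```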
The ultimate test of a utility like L2L is whether it can produce novel biological insights from real-world microarray data. With this objective in mind, we downloaded several publicly available datasets and analyzed their lists of gene expression changes with L2L (the sample datasets and all results are available at the L2L website ). Diabetic nephropathy (DN) is one of the most common, and most devastating, complications of type 2 diabetes mellitus (T2DM) but its molecular etiology remains poorly understood. To generate new hypotheses, Baelde and colleagues examined gene expression patterns in human kidney glomeruli isolated either from normal kidneys or from kidneys afflicted with DN . Several hundred genes were found to be significantly changed in DN, and these were then classified according to GO category using MAPPFinder . The primary hypothesis that ultimately emerged from the experiment, however, relied entirely on an analysis of 'critical genes' - a handful of genes with biological functions that seemed likely to be relevant. Specifically, dysregulation of several tissue repair genes and repression of the growth factor VEGF led the authors to suggest diminished repair capacity in capillary endothelium as a possible etiology for DN. They also suggested, based on MAPPfinder's list of overabundant GO categories, that DN kidneys suffer from reduced nucleotide metabolism and disturbed cytoskeleton formation.
Three novel themes emerged from the comparison with the L2L Microarray Database of genes downregulated in DN. Firstly, many of these genes are induced by interferon - nine lists related to interferon and the viral response overlap very significantly with the list of genes repressed by DN (p values from 2e-4 to 2e-14). Perhaps related to this, genes downregulated in DN also significantly overlap with genes induced by tumor necrosis factor (TNF)α (p = 5e-5). Secondly, hypoxia-induced genes are repressed in DN - five lists have p values from 8e-3 to 8e-6. Thirdly, and most surprisingly, five lists of genes upregulated in adipocyte differentiation and function overlap with genes repressed by DN (p values from 2e-3 to 2e-7), whereas two lists of genes downregulated during adipocyte differentiation correlate with genes upregulated in DN (p = 0.002 and 0.0008).
The relationship between genes repressed in DN and genes induced by interferon (IFN) illustrates an important caveat regarding tissue-based microarray experiments: the complexity of the tissue itself makes it difficult to determine whether the results reflect changes in expression within glomerular cells, a different degree of leukocyte contamination, or even changing gene expression within those leukocytes. The latter two scenarios are consistent with previous findings of dysfunctional cell-mediated immunity in diabetes [38–41]. The association of genes repressed by DN with those induced by TNFα may be interpreted in this context as well, because at least one study suggested poor response to TNFα as one reason for the immune deficiency in T2DM . Since no cytokines appear on the list of differentially expressed genes, these data suggest - supposing the gene expression changes reflect contaminating leukocytes - that a poor transcriptional response of leukocytes to cytokines may cause the immune deficiency in T2DM.
The most widely accepted theory of pancreatic β-islet cell dysfunction in T2DM is that a variety of inflammatory signals from diet, adipocytes and the immune system combine to trigger apoptosis in those cells [42, 43]. Two of the most important signals are thought to be TNFα from adipocytes and IFNγ from leukocytes. It is intriguing, therefore, that while the L2L analysis found downregulation of IFNγ- and TNFα-induced genes in DN, the GO:Biological Process analysis specifically identified the downstream apoptotic effectors of these two cytokines (JAK/STAT for IFNγ, IκK/NFκB for TNFα) as also downregulated in DN. So rather than being an artifact of leukocyte contamination, these results could reflect reduced sensitivity to the blood-borne inflammatory signals that, in sensitive pancreatic islets, trigger β-islet cell apoptosis - the hallmark of the underlying disease.
The second theme - a poor hypoxic response - suggests a transcriptional defect more specific to glomerular cells. At first glance, the direction of this correlation is surprising: DN kidneys should already be under hypoxic stress if poor angiogenesis and endothelial dysfunction are partially responsible for DN. However, this effect is apparently swamped by the ischemia experienced by all kidneys following extraction, before RNA is harvested. Although all kidneys were handled identically, hypoxia-response genes were more strongly induced in the normal controls. This could suggest that DN glomeruli are already stressed, and unable to respond fully to further stress. The result could be a downward spiral of increasing damage and reduced function.
Adipogenesis, the third theme, also seems puzzling at first. Why would adipocyte differentiation genes be differentially regulated in kidney glomeruli? Another hallmark of diabetes is deranged adipocyte function - adipocytes are insulin-resistant, have diminished capacity to store fat, and secrete excessive amounts of inflammatory cytokines and free fatty acids . Such dysfunctional adipocytes may be primarily responsible for creating the chronic inflammatory state that brings about overt disease . Adipocytes are also one of the primary targets of the most widely used class of antidiabetic drugs. Thiazolidinediones (TZDs) are agonists of PPARγ, a transcription factor required for early adipocyte differentiation. TZDs can help restore normal adipocyte function in diabetics . The dysregulation of adipocyte differentiation genes, therefore, may be another fingerprint of the underlying disease, indicating either the dysfunction of contaminating adipocytes in the glomeruli preparations, or a surprising sensitivity of glomerular cells to the same dyslipidemic signals that perturb adipocyte function in diabetics. Interestingly, a microarray analysis of a mouse model of DN, contemporary with this human study, found deregulation of a number of lipid homeostasis genes .
Taken together, the L2L results demonstrate the importance of considering T2DM and its complications as part of a single, integrated disease process. The fingerprints of the underlying disease - inflammatory factors and adipocyte dysfunction - are readily detectable in kidney glomeruli, and suggest that the same factors that cause β-islet cell and adipocyte dysfunction are responsible for glomerular dysfunction as well. In fact, PPARγ is expressed in rodent glomeruli [48, 49] and treatment with a TZD enhances renal function in both rats and humans [50–52]. It would be interesting to determine which dyslipidemic signals affect DN glomeruli; how those signals are transduced in glomerular cells; and whether the result is abnormal intracellular lipid accumulation , or direct inhibition of glomerular function by activation of specific intracellular signaling pathways  - either of which might prevent glomerular cells from responding to normal growth and stress signals.
Deregulation of gene expression is now thought to underlie many of the effects of ageing in a variety of organisms, including humans. There is a well-defined link between human ageing and disruption of normal DNA methylation patterns [53–55]. A 'unified theory of ageing' has even been proposed, which asserts that 'the progressive and patterned alteration of chromosome structure is the primary cause of ageing' . Other investigators have suggested that such transcriptional deregulation is a programmed response to stresses that increase with age , the stochastic result of failed genome maintenance , or the specific result of the disruption of some critical (but unknown) cellular function [59, 60].
We analyzed two recent gene expression studies of the ageing human brain, to see if there were common patterns in the transcriptional deregulation. Lu and colleagues  found significant gene expression changes in the frontal cortex of individuals from 26 to 106 years of age. Genes involved in synaptic plasticity, vesicular transport and mitochondrial function were downregulated, while stress-response, antioxidant and DNA repair genes were upregulated. They found increased DNA damage at the promoters of downregulated genes, leading them to suggest that 'DNA damage may reduce the expression of selectively vulnerable genes involved in learning, memory and neuronal survival, initiating a programme of brain ageing that starts early in adult life'. Blalock and colleagues  correlated hippocampal gene expression with histological and clinical markers of Alzheimer's disease (AD). They found a large number of genes whose expression changes correlate with either or both incipient and overt disease, and suggest that the pathogenesis of AD is 'genomically orchestrated'. EASE analysis  showed that growth, differentiation and tumor suppressor pathways are upregulated early in the disease process, while protein-processing pathways are downregulated.
Using Gene Ontology lists, L2L quickly replicated the EASE results of Blalock et al. (the complete analysis is available on the L2L website ). Using the L2L Microarray Database, L2L also revealed a novel link between AD and the hypoxia response. Genes upregulated with overt AD overlapped significantly with two lists of genes upregulated in myocardium during heart failure (p values 2e-5 and 8e-10) and three lists of genes specifically induced by hypoxic stress (p values 0.002 to 0.005). Moreover, genes downregulated with overt AD overlapped with two lists of genes downregulated in heart failure (p values 0.004 and 5e-5).
Although patterns of related gene expression changes were easily found in a variety of ageing models, we could not clearly define a set of age-regulated genes. A small group of genes was commonly regulated in the two human studies we examined, but none was also consistently regulated in studies of mouse or monkey models, or even in human studies of other tissue types. Indeed, when only those genes that are commonly regulated in human brain were queried against the L2L Microarray Database, no significant overlaps were found except with the studies from which they were derived. Taken together, these data suggest that while transcriptional deregulation is a fundamental feature of cellular ageing phenotypes, the detailed transcriptional profiles are tissue-specific and perhaps, to some degree, stochastic. Thus, ageing-related gene expression changes in different tissues and models are sufficiently similar to suggest a common underlying mechanism, perhaps DNA damage to sensitive promoters  or failure to maintain chromatin structure ; however, differences between the profiles suggest that the specific genes deregulated in each situation must be drawn from a larger pool of genes exhibiting varying degrees of vulnerability to deregulation. This illustrates both the danger of relying too heavily on a 'critical genes' approach to explain ageing phenotypes, as well as the hope that there may well be a common underlying mechanism of transcriptional dysregulation waiting to be elucidated.
The question remains as to whether the results of an L2L analysis can be trusted. These concerns fall into two major categories, which might be described as qualitative and quantitative. The qualitative concern is whether the lists of differentially expressed genes in the database are trustworthy, and if comparison to a user's data can be meaningful. The quantitative concern is whether the statistics we use to judge the significance of the overlaps between a user's data and lists from the database provide a useful metric of biological meaning.
Could a small amount of poorly analyzed or biased data in the L2L database poison the well for all who drink? Much like the scientific process as a whole, L2L takes a distributed-competence approach, augmented by independent replication and careful statistical analysis, to mitigate this concern. Our working assumption is that investigators themselves are best qualified to judge the quality of their own data, and that published lists usually include only those genes for which a change call can be assigned with a reasonable probability. We augment this assumption by including in the database, whenever possible, microarray datasets generated by independent groups that have addressed the same or a closely related question. Given the noise inherent in any microarray experiment, a user can feel much more secure interpreting results which reflect overlap with several related database lists from different sources, rather than idiosyncratic overlap with just one list. Finally, L2L calculates a p value for each comparison that provides a quantitative assessment of the significance of an overlap. If an experiment is contaminated with random data due to experimental error or systematic bias, the likelihood of the L2L list derived from that experiment overlapping significantly with any other experimental data would be purely stochastic - unless both experiments suffer from a common systematic bias. For example, we performed a 10,891-trial simulation with randomized data to help validate our sample analysis of diabetic nephropathy. The probability of achieving a p value below 0.05 with random data was no greater than 0.05 for any list in the database, and as low as 0.001 (see supplemental data on the L2L website). In the absence of common systematic bias, therefore, random data are very unlikely to produce spuriously significant L2L results.
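A random-data check of this kind can be sketched as follows: draw random probe sets of the same size as the user's data, score each against a list with the binomial test, and record how often the p value falls below 0.05. This is an assumption-laden illustration rather than the authors' simulation. The array size, list size, data size, trial count and function names are all hypothetical, and n and p are taken to be the number of probes on the list and the proportion of the array's probes present in the data.

```python
import random
from math import comb


def binomial_tail(m, n, p):
    """Probability of m or more successes in n trials with success rate p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


def false_positive_rate(array_probes, list_probes, data_size,
                        trials=2000, alpha=0.05, seed=0):
    """Estimate how often random data give a binomial p value below alpha."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    list_set = set(list_probes)
    n = len(list_set)
    p = data_size / len(array_probes)
    hits = 0
    for _ in range(trials):
        random_data = rng.sample(array_probes, data_size)
        m = sum(1 for probe in random_data if probe in list_set)
        if binomial_tail(m, n, p) < alpha:
            hits += 1
    return hits / trials


# Hypothetical setup: a 1,000-probe array, a 50-probe list, 100 probes of data.
array_probes = list(range(1000))
list_probes = list(range(50))
rate = false_positive_rate(array_probes, list_probes, data_size=100)
print(f"fraction of random trials with p < 0.05: {rate:.4f}")
```

Because the number of matches is discrete, the observed fraction will generally sit below the nominal 0.05 rather than exactly at it.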
There are two major potential sources of systematic bias: genes that are considered a priori to be 'interesting' or 'critical' based on previous data or theory, and platform-specific bias. Certain often-studied, well-understood genes - the very kind that lend themselves to 'critical gene' hypotheses - are represented on virtually all microarray platforms, and thus could be more likely to be found in random data acquired with any platform. Certain genes may also be more likely to be flagged as differentially expressed on a particular type of chip, perhaps because the chip is more sensitive to small variations at particular expression levels or because of probe-specific effects. If any systematic bias exists, it could only represent a higher likelihood of a random change in signal for that gene or probe - the chip does not know whether the control or experimental RNA is washed onto it, or with which dye color. So proper experimental design and data analysis should eliminate these false-positives before a user turns to L2L. The same applies to the published data from which the L2L lists are derived. If any false-positive genes do persist on database lists, the fact that L2L separately analyzes 'up' and 'down' lists mitigates their impact, because they will be randomly distributed between the two lists. These separate lists also provide great potential assurance for the user, if the 'up' and 'down' lists in the user's data both correlate significantly and respectively with the 'up' and 'down' lists (or vice versa) for a particular condition in the database (see Figure 4b, diabetic nephropathy and adipogenesis). The inclusion of data from independent groups can provide further assurance, because the same set of randomly changing genes is unlikely to be found in independent datasets from different platforms. Still, both sources of systematic bias can be directly addressed in a future release of L2L by more sophisticated statistical analysis algorithms. 
Each list in the database is annotated with the platform that produced it, so the frequency of occurrence of genes among lists from a given platform (platform-specific bias) as well as the overall occurrence of genes in the database (bias toward 'interesting genes') could be used to weight the contribution of each gene match to the overall significance of the overlap between two lists.
Table 1. Sample data subjected to p value adjustment by Bonferroni correction or random-data simulation
Column headings, actual diabetic nephropathy (downregulated) data: Name of L2L database list; Total U95Av2 probes on list; Binomial p value; Hypergeometric p value; Poisson p value; Bonferroni-adjusted binomial p value
Column headings, random-data simulation (10,891 trials): Median binomial p value; p value (list-specific) of actual binomial p value; p value (all lists) of actual binomial p value; FDR of actual binomial p value
The essential task for a statistical test in over-abundance analysis is to quantify how surprised we should be to see a particular degree of overlap or, conversely, how likely it is that the overlap occurred by chance. If the likelihood of success in a trial is p, and we perform n trials, what are the odds that we will see m or more successes? In the case of L2L, n is the number of probes that map to a list in the database, and p is the likelihood that any one of them will be found in the data by chance - the proportion of probes in the user's data out of all the probes on the microarray. A 'trial' tests whether one of the n probes derived from a database list is found in the user's data; success is a match. The binomial distribution permits the exact calculation of the odds of achieving a particular number of matches out of n trials. The cumulative probability of achieving m or more matches is found as follows:
\[ P(X \geq m) = \sum_{k=m}^{n} \binom{n}{k} \, p^{k} (1 - p)^{n-k} \]
L2L uses the Double Precision Cumulative Distribution Function Library (DCDFLIB), implemented in the Math::CDF Perl module, to compute binomial probabilities. The binomial distribution performs trials with replacement - the odds of scoring a success remain constant for all trials. In reality, a probe can only be selected once, so the hypergeometric distribution, which calculates probabilities without replacement, is more accurate. However, it is more difficult to calculate than the binomial distribution, and in any event approaches the binomial distribution at large values of n and m, where replacement has little impact on the odds of the next trial. Alternatively, the Poisson distribution is easier to calculate than the binomial distribution, and approaches it where values of n are large and p small (as in most L2L analyses). In our sample dataset of genes upregulated in diabetic nephropathy, the p values calculated from the hypergeometric distribution or Poisson distribution closely followed those calculated from the binomial distribution (Table 1; compare columns 5, 6 and 7). We therefore chose to use the binomial distribution as a reasonable compromise between accuracy and computational requirements.
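The three distributions compared above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the DCDFLIB/Math::CDF code that L2L itself uses; the function names and the example list size (50 probes, 8 matches) are hypothetical, while the microarray size (10,877 probes) and data size (513 probes) are taken from the sample analysis described in the text.

```python
from math import comb, exp, factorial


def binomial_upper_tail(m, n, p):
    """P(X >= m) for n independent trials with success probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


def hypergeometric_upper_tail(m, n, data_size, total_probes):
    """P(X >= m) when the n list probes are matched without replacement
    against data_size probes out of total_probes on the microarray."""
    upper = min(n, data_size)
    favourable = sum(
        comb(data_size, k) * comb(total_probes - data_size, n - k)
        for k in range(m, upper + 1)
    )
    return favourable / comb(total_probes, n)


def poisson_upper_tail(m, n, p):
    """P(X >= m) under a Poisson approximation with mean n * p.
    Computed as 1 - P(X < m), so very small tails lose precision."""
    lam = n * p
    return 1.0 - sum(exp(-lam) * lam**k / factorial(k) for k in range(m))


if __name__ == "__main__":
    total_probes = 10877   # probes on the U95Av2 microarray (from the text)
    data_size = 513        # probes in the sample dataset (from the text)
    n, m = 50, 8           # hypothetical list size and number of matches
    p = data_size / total_probes
    print("binomial      ", binomial_upper_tail(m, n, p))
    print("hypergeometric", hypergeometric_upper_tail(m, n, data_size, total_probes))
    print("Poisson       ", poisson_upper_tail(m, n, p))
```

With n small relative to the size of the microarray and p small, the three values should be close to one another, which is the behaviour the comparison in Table 1 relies on.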
The multiple-hypothesis problem is that when testing a large number of hypotheses simultaneously - here, that each of the hundreds of lists in the L2L database might overlap significantly with the user's data - the odds of producing a low p value by chance become substantial. For example, with 357 lists in the L2L database, we might expect purely random data to produce about 18 'significant' overlaps with p values <0.05 (357 * 0.05). There are two common approaches that either reduce the odds of seeing any such false-positive p values, or mitigate their effect. The former approach is to control the family-wise error rate, usually by applying some adjustment to the calculated p values. This adjustment can be the same for all p values (termed 'single-step') or can vary as we evaluate each p value in order ('step-down' or 'step-up'). The single-step Bonferroni is the most common adjustment, and is simply the multiplication of the p value by the number of hypotheses (p * n, n being 357 in this case). We found the Bonferroni adjustment to be excessively conservative, based on the simulation-adjusted p values and false discovery rate (see below, and Table 1). The single-step Šidák, which uses the adjustment (1 - (1 - p)^n), produced near-identical results to the Bonferroni for low p values. Since n has a large initial value, step-down procedures for these two adjustments - where n is decremented by 1 as we adjust each p value in ascending order - did not produce substantially different adjusted p values.
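The adjustments described above are simple enough to sketch directly. The following Python is an illustrative sketch, not L2L code: the function names are hypothetical, and the step-down routine also enforces monotonicity of the adjusted values (a running maximum), which is the usual convention for Holm-style procedures but is an assumption here rather than something stated in the text.

```python
def bonferroni_step(p, n):
    """Bonferroni adjustment for one p value among n hypotheses."""
    return min(1.0, p * n)


def sidak_step(p, n):
    """Sidak adjustment for one p value among n hypotheses."""
    return 1.0 - (1.0 - p) ** n


def single_step(pvals, adjust):
    """Apply the same adjustment, with the full n, to every p value."""
    n = len(pvals)
    return [adjust(p, n) for p in pvals]


def step_down(pvals, adjust):
    """Adjust p values in ascending order, decrementing n by 1 each time.
    A running maximum keeps the adjusted values in the same order as the
    raw p values (assumed convention, not taken from the text)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, adjust(pvals[i], n - rank))
        adjusted[i] = min(1.0, running_max)
    return adjusted


if __name__ == "__main__":
    n_lists = 357  # number of database lists quoted in the text
    print("expected chance hits at p < 0.05:", n_lists * 0.05)
    raw = [1e-6, 1e-4, 0.002] + [0.5] * (n_lists - 3)  # hypothetical p values
    for name, adj in (("Bonferroni", bonferroni_step), ("Sidak", sidak_step)):
        print(name, single_step(raw, adj)[:3], step_down(raw, adj)[:3])
```

Because n starts at 357, decrementing it by one per step changes the multiplier very little for the smallest p values, which is why the step-down variants differ so little from their single-step counterparts.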
An attractive alternative to simple adjustments based on the number of hypotheses is to perform simulations with random data, and adjust p values based on their frequency of occurrence among the random results. We therefore undertook a 10,891-trial simulation using datasets of the same size as our diabetic nephropathy sample (513 probes), drawn randomly from all the probes on the U95Av2 microarray (10,877 probes). We used true random numbers from Random.org for all simulations. As expected, the median binomial p value calculated from these random data was not significant for any list (Table 1, column 9). We compared each p value from the actual sample data to the simulation-generated p values for that specific list, and for all lists together. In both cases, the frequency of occurrence of a p value equal to or less than the actual p value (that is, the simulation-adjusted p value) was generally lower than the actual p value (Table 1). This shows that, at least for the diabetic nephropathy dataset on the U95Av2 platform, a simple calculation of p values based on the binomial distribution gives a good approximation of the actual likelihood of seeing an overlap by chance. The capability to perform a simulation analysis will be included in a future release of the downloadable L2L application. However, the utility of a simulation analysis is proportional to the number of trials run, because an adjusted p value cannot be lower than (1/number of trials). Each 'trial' is a full-fledged L2L analysis, so a 10,000-trial simulation takes four orders of magnitude longer to run than a single analysis, not considering the time required to create random datasets. The computational requirements are therefore daunting, and preclude simulation from being practical in a web-based tool.
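A random-data simulation of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce Table 1: it uses Python's pseudo-random generator rather than true random numbers from Random.org, represents probes as integers, handles a single hypothetical database list, and reports a floor of 1/trials when no simulated p value is as low as the actual one. All function names are hypothetical.

```python
import random
from math import comb


def binomial_upper_tail(m, n, p):
    """P(X >= m) for n independent trials with success probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


def simulate_pvalues(list_probes, total_probes, data_size, trials, rng):
    """Binomial p values for one database list against random datasets.
    Probes are the integers 0 .. total_probes - 1; list_probes is a set."""
    p = data_size / total_probes
    n = len(list_probes)
    results = []
    for _ in range(trials):
        random_data = set(rng.sample(range(total_probes), data_size))
        matches = len(list_probes & random_data)
        results.append(binomial_upper_tail(matches, n, p))
    return results


def simulation_adjusted_p(actual_p, simulated):
    """Fraction of simulated p values <= the actual p value, with a floor of
    1/trials because the simulation cannot resolve anything smaller."""
    hits = sum(1 for s in simulated if s <= actual_p)
    return max(hits, 1) / len(simulated)


if __name__ == "__main__":
    rng = random.Random(2005)                         # arbitrary seed
    list_probes = set(rng.sample(range(10877), 40))   # hypothetical 40-probe list
    simulated = simulate_pvalues(list_probes, 10877, 513, 200, rng)
    print(simulation_adjusted_p(0.01, simulated))
```

The demonstration runs only 200 trials to stay fast; as the text notes, the resolution of the adjusted p value is limited by the number of trials, so a real analysis would need far more.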
All such p value adjustments, however they are made, aim to reduce the chances of seeing any false positives. They can therefore be too conservative if, as in most biological questions, permitting a few false positives is a reasonable trade-off for seeing more true data. The false-discovery rate (FDR) is an increasingly popular approach to the multiple-hypothesis problem that mitigates the effect of false positives by estimating how many there are at a given level of significance, rather than trying to eradicate them. It can therefore be substantially more powerful than controlling the family-wise error rate. We used our random-data simulation to calculate the FDR at all levels of significance by dividing the average number of random occurrences of a p value less than or equal to a given number by the number of occurrences in the actual data of a p value less than or equal to that number. Column 12 of Table 1 shows that if we use the least significant binomial p value of our 22 sample lists (0.0075) as a cutoff, only 2% of the lists with equal or lower p values are expected to be false positives. Overall, a binomial p value of 0.05 corresponded to an FDR of about 10%, and 0.01 to 2.5%. The capability to calculate FDR from simulation data will be included in a future version of the downloadable L2L application, but these sample data suggest that the simple and economical binomial calculation of L2L, with a rough p value threshold of 0.05-0.01, strikes a reasonable balance between stringency and power.
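The FDR calculation described above reduces to a ratio of two counts. The sketch below is illustrative only: the function name and the toy p values are hypothetical, and capping the estimate at 1.0 is an added convention rather than something stated in the text.

```python
def simulation_fdr(threshold, actual_pvals, simulated_trials):
    """Estimate the FDR at a p value threshold.

    actual_pvals: one p value per database list from the real data.
    simulated_trials: one list of p values (all database lists) per
    random-data trial.
    """
    observed = sum(1 for p in actual_pvals if p <= threshold)
    if observed == 0:
        raise ValueError("no actual p values at or below the threshold")
    # Average number of lists per random trial that reach the threshold.
    expected_false = sum(
        sum(1 for p in trial if p <= threshold) for trial in simulated_trials
    ) / len(simulated_trials)
    return min(1.0, expected_false / observed)


if __name__ == "__main__":
    # Toy numbers for illustration only; not data from Table 1.
    actual = [0.001, 0.004, 0.03, 0.2]
    simulated = [[0.5, 0.02, 0.9, 0.7], [0.6, 0.8, 0.3, 0.04]]
    print(simulation_fdr(0.05, actual, simulated))
```

In the toy example, three actual p values fall at or below 0.05 while each random trial contributes one on average, so roughly a third of the 'discoveries' at that threshold would be expected to be false.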
Table 2. Sample data subjected to permutation analysis or comparison by gene symbol instead of probe ID
Column headings: Name of L2L database list; Binomial p value (actual)
Column headings, 10% data permutation (10,891 trials): Median permutation binomial p value; p value (list-specific) of actual binomial p value; p value (all lists) of actual binomial p value; FDR of actual binomial p value
Column headings, comparison by gene symbol: Total gene symbols on list; Binomial p value
The second potential source of p value inflation arises from the universal nature of the database. The common language, HUGO symbols, must be translated to platform-specific probe identifiers for the user's microarray. If only a handful of genes in a database list are represented on the microarray, and one of those genes happens to be represented by several probes, all of which are differentially expressed in the user's experiment, the list will generate a highly significant p value on the questionably narrow basis of that single gene. A user can see on a Listmatch page exactly which genes or probes created a small but significant overlap, and judge if it appears to be an artifact of translation. Users should be particularly wary of genes used as hybridization controls. We re-analyzed our diabetic nephropathy sample data without probe translation, using only gene symbols (Table 2). Several of the 22 sample lists dropped out of statistical significance; most of these were due to STAT1, an Affymetrix hybridization control, being represented by six probes in the data. Users may wish to remove control probes from their data before analyzing it with L2L. A future release of L2L will incorporate a directed-permutation algorithm into the statistical analysis to ensure that a reported p value is not overly reliant on a single gene.
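The effect of probe translation on the match count can be illustrated with a small sketch. Everything in it is hypothetical: the probe identifiers are invented, the mapping is a toy dictionary rather than a real annotation file, and the functions are not part of L2L. It simply shows how six probes for one gene inflate a probe-level overlap that collapses to a single match at the gene-symbol level.

```python
def matches_by_probe(list_genes, data_probes, probe_to_gene):
    """Probes in the user's data whose gene symbol is on the database list."""
    return [pr for pr in data_probes if probe_to_gene.get(pr) in list_genes]


def matches_by_gene(list_genes, data_probes, probe_to_gene):
    """Distinct gene symbols on the database list found in the user's data."""
    return {
        probe_to_gene[pr]
        for pr in data_probes
        if probe_to_gene.get(pr) in list_genes
    }


if __name__ == "__main__":
    # Invented probe IDs: six probes for STAT1, one for another gene.
    probe_to_gene = {f"probe_{i}": "STAT1" for i in range(1, 7)}
    probe_to_gene["probe_7"] = "GENE_B"
    probe_to_gene["probe_8"] = "GENE_C"
    list_genes = {"STAT1", "GENE_B"}           # hypothetical database list
    data_probes = [f"probe_{i}" for i in range(1, 9)]
    print(len(matches_by_probe(list_genes, data_probes, probe_to_gene)))
    print(sorted(matches_by_gene(list_genes, data_probes, probe_to_gene)))
```

Feeding the probe-level count into the binomial calculation treats the six STAT1 probes as six independent successes, whereas the gene-level count treats them as one, which is the distinction the gene-symbol comparison in Table 2 is meant to expose.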
The idea of finding the overlap between two lists of differentially expressed genes, like the idea of a central repository of microarray data, dates to the earliest microarray experiments. One of its earliest expressions was through Venn diagrams that compare differentially expressed genes within a single series of experiments. Global clustering of microarrays is a more sophisticated, and more popular, example of this sort of comparative analysis, and has proven its worth for class discovery - for example, defining new, and potentially biologically relevant, subspecies of tumors [75, 76]; and for class prediction - for example, predicting the behavior and susceptibility to therapy of a tumor by comparison to tumors with known outcomes [77, 78]. However, the simpler pair-wise approach of L2L has the advantages of extending well across different platforms and not requiring access to raw data - only to lists of differentially expressed genes. It is well suited, therefore, to its task of finding common patterns between diverse gene expression studies, and enabling biological inferences to be drawn from the commonalities it finds.
VennMapper, created by Smid et al., is one recent attempt in this direction. It is a software tool that identifies overlaps in lists of differentially expressed genes (defined by an arbitrary fold-change cutoff) from user-supplied heterologous datasets, and calculates the statistical significance of the results using a z-value derived from a normal binomial distribution. The statistical approach is similar to that used by a variety of data mining tools that examine a list of genes for over-representation of GO categories, like GOMiner, EASE, Onto-Express and GO::TermFinder [2–5]. VennMapper and EASE, like the L2L Microarray Analysis Tool, are really general-purpose tools for comparing any given list of genes with any other list of genes. The authors of both tools suggest extending their use to comparing a user's data with 'previously published gene lists', or 'comparing microarray data studying apoptosis, hypoxia, etc. with microarray data focusing on clinical backgrounds, like cancer, (viral) infections or neurological disease'. L2L was conceived and developed independently of either of these tools, but fills the need that their authors, and others, have identified. Moreover, it does so in a way that is at once flexible, powerful, and extensible, yet simple enough to be accessible to every user of microarrays.
We are indebted to Roger Bumgarner of the University of Washington Center for Expression Arrays for generous support, suggestions and critiques throughout. We are also grateful for the support of Peter Rabinovich and the Nathan Shock Center of Excellence for the Basic Biology of Aging, at the University of Washington. This work was supported by the NIGMS Medical Scientist Training Program, a fellowship from the Cora May Poncin Foundation (J.C.N.), and by NIH GM41624 (A.M.W.).
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- Hosack DA, Dennis G, Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE. Genome Biol. 2003, 4: R70. 10.1186/gb-2003-4-10-r70.
- Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, et al: GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol. 2003, 4: R28. 10.1186/gb-2003-4-4-r28.
- Khatri P, Draghici S, Ostermeier GC, Krawetz SA: Profiling gene expression using onto-express. Genomics. 2002, 79: 266-270. 10.1006/geno.2002.6698.
- Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics. 2004, 20: 3710-3715. 10.1093/bioinformatics/bth123.
- L2L Microarray Analysis Tool. [http://depts.washington.edu/l2l/]
- Microarray Gene Expression Data Society - MGED Society. [http://www.mged.org]
- Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, et al: Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 2001, 29: 365-371. 10.1038/ng1201-365.
- Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball C, Lepage M, et al: Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol. 2002, 3: RESEARCH0046. 10.1186/gb-2002-3-9-research0046.
- Ikeo K, Ishi-i J, Tamura T, Gojobori T, Tateno Y: CIBEX: center for information biology gene expression database. C R Biol. 2003, 326: 1079-1082.
- Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, et al: ArrayExpress - a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2003, 31: 68-71. 10.1093/nar/gkg091.
- Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia. 2004, 6: 1-6.
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30: 207-210. 10.1093/nar/30.1.207.
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003, 31: e15. 10.1093/nar/gng015.
- Wain HM, Lush MJ, Ducluzeau F, Khodiyar VK, Povey S: Genew: the Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res. 2004, 32 (Database issue): D255-D257. 10.1093/nar/gkh072.
- Tan PK, Downey TJ, Spitznagel EL, Xu P, Fu D, Dimitrov DS, Lempicki RA, Raaka BM, Cam MC: Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res. 2003, 31: 5676-5684. 10.1093/nar/gkg763.
- The Cancer Genome Anatomy Project Batch Gene Finder. [http://cgap.nci.nih.gov/Genes/BatchGeneFinder]
- MatchMiner. [http://discover.nci.nih.gov/matchminer/index.jsp]
- NCBI HomoloGene. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=homologene]
- Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30: 303-305. 10.1093/nar/30.1.303.
- KEGG BRITE Database. [http://www.genome.jp/brite/]
- Bader GD, Betel D, Hogue CW: BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 2003, 31: 248-250. 10.1093/nar/gkg056.
- The Cancer Genome Anatomy Project - Pathways. [http://cgap.nci.nih.gov/Pathways]
- Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR: GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet. 2002, 31: 19-20. 10.1038/ng0502-19.
- Segal E, Friedman N, Koller D, Regev A: A module map showing conditional activity of expression modules in cancer. Nat Genet. 2004, 36: 1090-1098.
- Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001, 28: 21-28. 10.1038/88213.
- Glenisson P, Coessens B, Van Vooren S, Mathys J, Moreau Y, De Moor B: TXTGate: profiling gene groups with text-based information. Genome Biol. 2004, 5: R43. 10.1186/gb-2004-5-6-r43.
- Ge H, Liu Z, Church GM, Vidal M: Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet. 2001, 29: 482-486. 10.1038/ng776.
- Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res. 2002, 12: 37-46. 10.1101/gr.205602.
- Kemmeren P, van Berkum NL, Vilo J, Bijma T, Donders R, Brazma A, Holstege FC: Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol Cell. 2002, 9: 1133-1143. 10.1016/S1097-2765(02)00531-2.
- Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets. Genome Res. 2004, 14: 1085-1094. 10.1101/gr.1910904.
- Smid M, Dorssers LC, Jenster G: Venn Mapping: clustering of heterologous microarray data based on the number of co-occurring differentially expressed genes. Bioinformatics. 2003, 19: 2065-2071. 10.1093/bioinformatics/btg282.
- AmiGO. [http://www.godatabase.org]
- GeneCards. [http://bioinfo.weizmann.ac.il/cards/index.shtml]
- Entrez Gene. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=gene]
- Baelde HJ, Eikmans M, Doran PP, Lappin DW, de Heer E, Bruijn JA: Gene expression profiling in glomeruli from human kidneys with diabetic nephropathy. Am J Kidney Dis. 2004, 43: 636-650. 10.1053/j.ajkd.2003.12.028.
- Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol. 2003, 4: R7. 10.1186/gb-2003-4-1-r7.
- Kukreja A, Cost G, Marker J, Zhang C, Sun Z, Lin-Su K, Ten S, Sanz M, Exley M, Wilson B, et al: Multiple immuno-regulatory defects in type-1 diabetes. J Clin Invest. 2002, 109: 131-140. 10.1172/JCI200213605.
- Chang FY, Shaio MF: Decreased cell-mediated immunity in patients with non-insulin-dependent diabetes mellitus. Diabetes Res Clin Pract. 1995, 28: 137-146. 10.1016/0168-8227(95)00168-8.
- Eibl N, Spatz M, Fischer GF, Mayr WR, Samstag A, Wolf HM, Schernthaner G, Eibl MM: Impaired primary immune response in type-1 diabetes: results from a controlled vaccination study. Clin Immunol. 2002, 103: 249-259. 10.1006/clim.2002.5220.
- Attallah AM, Abdelghaffar H, Fawzy A, Alghraoui F, Alijani MR, Mahmoud LA, Ghoneim MA, Helfrich GB: Cell-mediated immunity and biological response modifiers in insulin-dependent diabetes mellitus complicated by end-stage renal disease. Int Arch Allergy Appl Immunol. 1987, 83: 278-283.
- Donath MY, Storling J, Maedler K, Mandrup-Poulsen T: Inflammatory mediators and islet beta-cell failure: a link between type 1 and type 2 diabetes. J Mol Med. 2003, 81: 455-470. 10.1007/s00109-003-0450-y.
- Rhodes CJ: Type 2 diabetes-a matter of beta-cell life and death?. Science. 2005, 307: 380-384. 10.1126/science.1104345.
- Bays H, Mandarino L, DeFronzo RA: Role of the adipocyte, free fatty acids, and ectopic fat in pathogenesis of type 2 diabetes mellitus: peroxisomal proliferator-activated receptor agonists provide a rational therapeutic approach. J Clin Endocrinol Metab. 2004, 89: 463-478. 10.1210/jc.2003-030723.
- Lazar MA: How obesity causes diabetes: not a tall tale. Science. 2005, 307: 373-375. 10.1126/science.1104342.
- Evans RM, Barish GD, Wang YX: PPARs and the complex journey to obesity. Nat Med. 2004, 10: 355-361. 10.1038/nm1025.
- Mishra R, Emancipator SN, Miller C, Kern T, Simonson MS: Adipose differentiation-related protein and regulators of lipid homeostasis identified by gene expression profiling in the murine db/db diabetic kidney. Am J Physiol Renal Physiol. 2004, 286: F913-F921. 10.1152/ajprenal.00323.2003.
- Asano T, Wakisaka M, Yoshinari M, Iino K, Sonoki K, Iwase M, Fujishima M: Peroxisome proliferator-activated receptor gamma1 (PPARgamma1) expresses in rat mesangial cells and PPARgamma agonists modulate its differentiation. Biochim Biophys Acta. 2000, 1497: 148-154. 10.1016/S0167-4889(00)00054-9.
- Guan Y, Zhang Y, Schneider A, Davis L, Breyer RM, Breyer MD: Peroxisome proliferator-activated receptor-gamma activity is associated with renal microvasculature. Am J Physiol Renal Physiol. 2001, 281: F1036-F1046.
- Isshiki K, Haneda M, Koya D, Maeda S, Sugimoto T, Kikkawa R: Thiazolidinedione compounds ameliorate glomerular dysfunction independent of their insulin-sensitizing action in diabetic rats. Diabetes. 2000, 49: 1022-1032.
- Imano E, Kanda T, Nakatani Y, Nishida T, Arai K, Motomura M, Kajimoto Y, Yamasaki Y, Hori M: Effect of troglitazone on microalbuminuria in patients with incipient diabetic nephropathy. Diabetes Care. 1998, 21: 2135-2139.
- Bakris G, Viberti G, Weston WM, Heise M, Porter LE, Freed MI: Rosiglitazone reduces urinary albumin excretion in type II diabetes. J Hum Hypertens. 2003, 17: 7-12. 10.1038/sj.jhh.1001444.
- Issa JP: Epigenetic variation and human disease. J Nutr. 2002, 132 (8 Suppl): 2388S-2392S.
- Imai S, Kitano H: Heterochromatin islands and their dynamic reorganization: a hypothesis for three distinctive features of cellular aging. Exp Gerontol. 1998, 33: 555-570. 10.1016/S0531-5565(98)00037-0.
- Richardson B: Impact of aging on DNA methylation. Ageing Res Rev. 2003, 2: 245-261. 10.1016/S1568-1637(03)00010-2.
- Jameson CW: Towards a unified and interdisciplinary model of ageing. Med Hypotheses. 2004, 63: 83-86. 10.1016/j.mehy.2004.01.021.
- Roy AK, Oh T, Rivera O, Mubiru J, Song CS, Chatterjee B: Impacts of transcriptional regulation on aging and senescence. Ageing Res Rev. 2002, 1: 367-380. 10.1016/S1568-1637(02)00006-5.
- Hasty P, Campisi J, Hoeijmakers J, van Steeg H, Vijg J: Aging and genome maintenance: lessons from the mouse?. Science. 2003, 299: 1355-1359. 10.1126/science.1079161.
- Vijg J, Calder RB: Transcripts of aging. Trends Genet. 2004, 20: 221-224. 10.1016/j.tig.2004.04.007.
- Kyng KJ, May A, Kolvraa S, Bohr VA: Gene expression profiling in Werner syndrome closely resembles that of normal aging. Proc Natl Acad Sci USA. 2003, 100: 12259-12264. 10.1073/pnas.2130723100.
- Lu T, Pan Y, Kao SY, Li C, Kohane I, Chan J, Yankner BA: Gene regulation and DNA damage in the ageing human brain. Nature. 2004, 429: 883-891. 10.1038/nature02661.
- Blalock EM, Geddes JW, Chen KC, Porter NM, Markesbery WR, Landfield PW: Incipient Alzheimer's disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci USA. 2004, 101: 2173-2178. 10.1073/pnas.0308512100.
- Lee CK, Klopp RG, Weindruch R, Prolla TA: Gene expression profile of aging and its retardation by caloric restriction. Science. 1999, 285: 1390-1393. 10.1126/science.285.5432.1390.
- Kayo T, Allison DB, Weindruch R, Prolla TA: Influences of aging and caloric restriction on the transcriptional profile of skeletal muscle from rhesus monkeys. Proc Natl Acad Sci USA. 2001, 98: 5093-5098. 10.1073/pnas.081061898.
- Jiang CH, Tsien JZ, Schultz PG, Hu Y: The effects of aging on gene expression in the hypothalamus and cortex of mice. Proc Natl Acad Sci USA. 2001, 98: 1930-1934. 10.1073/pnas.98.4.1930.
- Lee CK, Weindruch R, Prolla TA: Gene-expression profile of the ageing brain in mice. Nat Genet. 2000, 25: 294-297. 10.1038/77046.
- Bandyopadhyay D, Medrano EE: The emerging role of epigenetics in cellular and organismal aging. Exp Gerontol. 2003, 38: 1299-1307. 10.1016/j.exger.2003.09.009.
- DCDFLIB. [http://odin.mdacc.tmc.edu/anonftp/#DCDFLIB]
- CPAN - Math-CDF. [http://search.cpan.org/dist/Math-CDF/]
- Ewens WJ, Grant GR: Statistical Methods in Bioinformatics: An Introduction. 2005, New York: Springer Science+Business Media, 2nd edition.
- Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Stat Sci. 2003, 18: 71-103. 10.1214/ss/1056397487.
- Random.org. [http://www.random.org]
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B. 1995, 57: 289-300.
- Quackenbush J: Computational analysis of microarray data. Nat Rev Genet. 2001, 2: 418-427. 10.1038/35076576.
- van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, et al: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415: 530-536. 10.1038/415530a.
- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, et al: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403: 503-511. 10.1038/35000501.
- Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, et al: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA. 2001, 98: 10869-10874. 10.1073/pnas.191367098.
- Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, et al: Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002, 8: 68-74. 10.1038/nm0102-68.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.