SpeCond: a method to detect condition-specific gene expression
© Cavalli et al.; licensee BioMed Central Ltd. 2011
Received: 21 April 2011
Accepted: 18 October 2011
Published: 18 October 2011
Transcriptomic studies routinely measure expression levels across numerous conditions. These datasets allow identification of genes that are specifically expressed in a small number of conditions. However, there are currently no statistically robust methods for identifying such genes. Here we present SpeCond, a method to detect condition-specific genes that outperforms alternative approaches. We apply the method to a dataset of 32 human tissues to determine 2,673 specifically expressed genes. An implementation of SpeCond is freely available as a Bioconductor package at http://www.bioconductor.org/packages/release/bioc/html/SpeCond.html.
Cells sharing the same genomic information are able to express it in different ways to achieve cell-specific functions or respond to different environmental changes. Transcriptional regulation is the first step at which this specificity is determined, as it is the most basic level at which gene expression is controlled. Recent surveys of transcriptomic data across numerous cell types revealed two broad categories of gene expression: ubiquitous expression, and tissue- or cell-type-specific expression [1, 2]. The first category contains genes that are expressed in most tissues at similar levels and are thought to provide core cellular functionality [3, 4]. The second category comprises genes with distinct expression in a few tissues or conditions, which are likely to be important for defining cell-specific functions.
In datasets with only a few conditions, it is possible to compare pairs of conditions using standard or moderated t-tests [5–7]. However, this becomes impractical with large datasets, as the number of pairwise comparisons grows quadratically with the number of conditions studied. An alternative method is the non-standard ANOVA, which tests all possible groups of samples against each other. However, this involves computationally intensive dynamic programming and cannot detect specificity in individual conditions. Moreover, the method requires equal standard deviations between all groups of conditions being compared: this cannot be assumed, as genes might have similar expression levels in some conditions - and thus small standard deviations - and more divergent expression levels in others. A further alternative is the Tukey test, although this method requires independence between groups of conditions and a normal distribution of group means, criteria that are often not met in microarray experiments. Importantly, most of these and other methods assume that expression values follow a single normal distribution. This assumption is generally not satisfied, which means that these methods do not model the data correctly and therefore produce false-positive results.
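As a rough illustration of how quickly the pairwise approach scales, the number of unordered pairs among n conditions is n(n-1)/2, and each pair must be tested for every gene. The short sketch below only counts comparisons; the function name is illustrative and not part of any package discussed here.

```python
from math import comb

def n_pairwise_tests(n_conditions: int) -> int:
    """Number of unordered condition pairs, i.e. n * (n - 1) / 2."""
    return comb(n_conditions, 2)

# Pairwise comparisons needed per gene for increasingly large datasets
for n in (4, 8, 16, 32):
    print(n, "conditions ->", n_pairwise_tests(n), "pairwise tests per gene")
```

For a dataset of 32 tissues this already amounts to 496 tests per gene before any multiple-testing correction across genes.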
An alternative to these approaches is a mixture model-based procedure to model gene expression. EMMIX-GENE and EMMIX-FDR are software packages that apply this technique to cluster genes displaying similar expression patterns. However, these packages were not specifically developed to detect condition-specific expression, and therefore cannot be readily applied for this purpose on large datasets. Moreover, the method is not implemented in commonly used analysis platforms such as Bioconductor, making it difficult to integrate with additional analysis pipelines.
Two additional methods were recently developed with the specific aim of identifying condition-specific gene expression. First, a method called ROKU uses Shannon's information-theoretic entropy followed by an outlier detection method to detect tissue specificity. This method is implemented in the Tissue Specific Genes Analysis (TSGA) R package. It returns a list of conditions in which each gene is specifically expressed. Unfortunately, this method depends on a pre-defined set of ubiquitously expressed genes to model background expression levels, information that is generally not available prior to analysis. Furthermore, the TSGA method produces qualitative outputs (a gene is classified as either condition-specific or not, without ranking genes or conditions), which makes the resulting lists difficult to prioritize for further analysis. Second, Vaquerizas et al. previously used a propensity measure for a given gene to be expressed at a certain level in particular conditions relative to its expression across other conditions. The method provides a ranking of condition-specificity across samples. However, there is no control over the number of conditions in which a gene can be specific and there is no statistically meaningful threshold for specificity. Therefore, to our knowledge there is currently no straightforward and statistically robust method available to detect condition-specific gene expression.
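To illustrate the entropy idea behind approaches such as ROKU, the sketch below computes the Shannon entropy of an expression profile after normalising it to sum to one. It is a simplified illustration only: it omits any preprocessing of the profile and the subsequent outlier detection step, and the function name and toy values are assumptions made for this example.

```python
from math import log2

def shannon_entropy(profile):
    """Entropy in bits of a non-negative expression profile normalised to sum to 1."""
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must contain some positive expression")
    probs = [x / total for x in profile]
    # Terms with p == 0 contribute nothing to the entropy
    return -sum(p * log2(p) for p in probs if p > 0)

# Toy profiles over 8 conditions (made-up values)
uniform_profile = [5.0] * 8            # evenly expressed: maximal entropy
specific_profile = [0.1] * 7 + [50.0]  # dominated by one condition: low entropy

print(shannon_entropy(uniform_profile))
print(shannon_entropy(specific_profile))
```

A profile spread evenly over N conditions reaches the maximum entropy of log2(N) bits, whereas a profile dominated by a single condition approaches zero, which is why low entropy can serve as a signal of specificity.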
Here we present a new method called SpeCond (for Specific Condition) to detect condition-specificity from a dataset of gene expression measurements. The method fits a normal mixture model to the expression profile of each gene, and identifies outlier conditions. We compare SpeCond against several alternative approaches using a gold standard dataset and demonstrate that SpeCond outperforms other methods. Finally, we apply the SpeCond approach to a subset of the Genome Novartis Foundation SymAtlas dataset, and identify specifically expressed genes from 32 human tissue samples. The method is freely available as an R package within the Bioconductor software project [15–17].
SpeCond in a nutshell
Modeling the null distribution
Previous methods have modeled gene-expression values using a Gaussian distribution. However, most datasets do not fit this distribution well, as they often exhibit varying degrees of skewness. To overcome this, we use a mixture model that fits between one and three normal distributions to the expression profile of a given gene (Figure 1c). This is achieved using the mclust package [19–21] in the R software environment [15, 16]. The algorithm performs a hierarchical clustering of a mixture model of normal distributions via expectation-maximization. The best-fitting model is then selected using the Bayesian information criterion.
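The model-selection step can be sketched as follows. This is a minimal, standard-library illustration of fitting one to three univariate normal components by expectation-maximization and keeping the fit with the best Bayesian information criterion; it is not the mclust implementation. The initialisation scheme, the variance floor (min_sd), the function names, and the sign convention (BIC = -2 log L + p log n, lower is better) are all assumptions of this sketch.

```python
import math

def normal_pdf(x, mu, sd):
    """Density of N(mu, sd^2) at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_loglik(xs, weights, means, sds):
    """Log-likelihood of the data under a normal mixture."""
    total = 0.0
    for x in xs:
        dens = sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))
        total += math.log(max(dens, 1e-300))
    return total

def fit_mixture(xs, k, n_iter=200, min_sd=1e-3):
    """Fit a k-component 1-D normal mixture by EM; returns (weights, means, sds, loglik)."""
    n = len(xs)
    lo, hi = min(xs), max(xs)
    mean_all = sum(xs) / n
    sd_all = max(math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n), min_sd)
    # Assumed initialisation: means spread over the data range, common sd, equal weights
    means = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    sds = [sd_all] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            tot = max(sum(dens), 1e-300)
            resp.append([d / tot for d in dens])
        # M-step: update weights, means and standard deviations
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = max(math.sqrt(var), min_sd)  # floor avoids degenerate components
    return weights, means, sds, mixture_loglik(xs, weights, means, sds)

def select_by_bic(xs, max_k=3):
    """Return (k, fit) minimising BIC = -2 logL + (3k - 1) log n over k = 1..max_k."""
    n = len(xs)
    best = None
    for k in range(1, max_k + 1):
        fit = fit_mixture(xs, k)
        bic = -2.0 * fit[3] + (3 * k - 1) * math.log(n)
        if best is None or bic < best[0]:
            best = (bic, k, fit)
    return best[1], best[2]

# Toy profile: 20 conditions with similar expression plus 3 clearly higher values
profile = [4.0 + 0.1 * i for i in range(20)] + [9.8, 10.0, 10.2]
k, (weights, means, sds, loglik) = select_by_bic(profile)
print("components selected:", k)
print("means:", [round(m, 2) for m in means])
```

The parameter count 3k - 1 corresponds to k means, k variances and k - 1 free mixing proportions for an unequal-variance univariate mixture.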
Identifying condition-specific expression values
Next, SpeCond computes a P-value for every expression value to determine whether a gene is specifically expressed. These P-values are based on the null distribution of each gene, and are computed as the sum of the P-values obtained from each mixture component, weighted by the proportion of the component in the mixture model. This procedure is applied to each gene in turn, and the overall set of P-values is corrected for multiple testing (Benjamini and Yekutieli method).
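The weighted sum described above can be sketched as follows. The example assumes a one-sided tail probability from each normal component; whether an upper, lower or two-sided tail is used is not specified in this section, so the tail argument, the function names, and the toy null distribution are all assumptions of this illustration.

```python
import math

def normal_upper_tail(x, mu, sd):
    """P(X >= x) for X ~ N(mu, sd^2), via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sd * math.sqrt(2.0)))

def mixture_p_value(x, weights, means, sds, tail="upper"):
    """Tail probability of x under a normal mixture: sum of per-component
    tail probabilities weighted by the mixing proportions."""
    p = 0.0
    for w, m, s in zip(weights, means, sds):
        upper = normal_upper_tail(x, m, s)
        p += w * (upper if tail == "upper" else 1.0 - upper)
    return p

# Toy null distribution for one gene: two components (made-up values)
null_weights = [0.7, 0.3]
null_means = [5.0, 6.5]
null_sds = [0.4, 0.6]

for value in (5.2, 7.0, 9.5):
    print(value, mixture_p_value(value, null_weights, null_means, null_sds))
```

Values far in the tail of every retained component receive a small P-value, while values close to any well-supported component do not.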
Finally, a gene is determined to be specific if at least one adjusted P-value is below the specified threshold (pv parameter set to 0.05 by default). As a result, SpeCond classifies each gene as either displaying specific expression or not and returns the list of condition(s) in which it is specific (Figure 1e).
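The correction and thresholding steps can be illustrated with a small standard-library sketch. It applies the Benjamini-Yekutieli step-up adjustment to the pooled P-values of all genes and conditions, then reports, for each gene, the conditions whose adjusted P-value falls below pv. The gene and condition names, the input layout (a dictionary of dictionaries), and the function names are hypothetical and chosen only for this example.

```python
def by_adjust(pvalues):
    """Benjamini-Yekutieli adjusted P-values, returned in the input order."""
    m = len(pvalues)
    if m == 0:
        return []
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted

def call_specific(pvalues_by_gene, pv=0.05):
    """Map each gene to the conditions with adjusted P-value below pv.
    pvalues_by_gene: {gene: {condition: raw P-value}} (assumed layout)."""
    keys = [(g, c) for g, conds in pvalues_by_gene.items() for c in conds]
    raw = [pvalues_by_gene[g][c] for g, c in keys]
    adjusted = by_adjust(raw)
    calls = {g: [] for g in pvalues_by_gene}
    for (g, c), adj in zip(keys, adjusted):
        if adj < pv:
            calls[g].append(c)
    return calls

# Toy raw P-values (made-up numbers, hypothetical gene names)
toy = {
    "geneA": {"liver": 1e-6, "heart": 0.4, "brain": 0.7},
    "geneB": {"liver": 0.3, "heart": 0.5, "brain": 0.9},
}
print(call_specific(toy))
```

A gene with an empty list is not called specific; a gene with one or more listed conditions is specific in exactly those conditions.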
SpeCond's behavior is determined by a set of user-defined parameters. These fall into three classes: (i) those controlling the implementation of the normal mixture model (λ and β); (ii) those used to decide which normal distributions are included in the final null distribution (md, per, mlk and rsd); and (iii) a P-value threshold to define a gene as being condition-specific (pv). A more detailed description of the parameters, including our choice for the default parameters, is given in Additional file 1.
Comparison with other approaches
Table 1: Numbers of tissue-specific genes for 32 human tissues
We also performed a Gene Ontology (GO) enrichment analysis using the g:Profiler web-tool and computed overall log-scores to compare the performance of each method from a biological perspective (Additional file 1). SpeCond and TSGA showed similar enrichment levels, outperforming the propensity method (log-scores = 18,316, 17,664, and 15,629 for SpeCond, TSGA and the propensity method, respectively). Therefore, overall, SpeCond displays better sensitivity and specificity than either of the other available methods.
Detecting tissue specificity across the human genome
To assess the biological significance of the results obtained with SpeCond, we performed a GO enrichment analysis for each set of tissue-specific genes. For 28 out of the 32 analyzed tissues, we observed many expected molecular functions and pathways. For example, the GO terms 'contractile fiber' and 'heart morphogenesis' are enriched in heart, 'spermatogenesis' is specifically enriched in testis, and 'T cell activation' is enriched in the thymus. The remaining four tissues show a smaller number of specific genes, which did not allow the identification of significantly enriched functions among the specific genes.
Closer examination of the 287 liver-specific genes detected by SpeCond showed many genes that are important for liver functions, such as amino acid and fatty acid metabolic processes or gluconeogenesis. Among them are genes previously known to have liver-specific expression, such as NR1I3, a key regulator of xenobiotic and endobiotic metabolism, and INSIG1, which takes part in metabolic control. In addition, we found genes that had not been originally assigned to have a liver-specific function. One example is ATF5, which is implicated in differentiation, proliferation and survival in different cell types but whose function in liver had not been annotated. The first indication of its function as a regulator of the hepatic stress response was recently published.
Another example is illustrated by the central nervous system. The brain, fetal brain and spinal cord present the largest lists of tissue-specific genes (511 for brain, 406 for fetal brain and 266 for spinal cord; Table 1) and share 144 specific genes showing neural-related specific expression patterns. Functional profiling of tissue-specific genes shared by the three tissues revealed well-known nervous-tissue functions such as 'generation of neuron', 'axonogenesis', and 'synaptic transmission', as well as the neural cellular component 'neurofilament cytoskeleton'. In addition, we were able to identify EAAT1 (Excitatory amino acid transporter 1) as specific in the three tissues outlined above. This gene is known as a member of a family of high-affinity sodium-dependent transporter molecules that regulate neurotransmitter concentrations at the excitatory glutamatergic synapses of the mammalian central nervous system. Further, we detected many genes with expression profiles specific for these tissues that have not been experimentally associated with any neural function in small-scale studies. Among these we found ZNF365 and ZNF536, two transcription factors previously reported to have brain- and spinal cord-specific expression.
Bioconductor R package
The widespread use of microarrays in biological research over the past few years has generated a flood of data characterizing gene expression across many tissues in different species. Determining tissue- or condition-specific expression from these datasets is an important aspect of genomic analysis. Indeed, genes with a particularly high expression level in few conditions are likely to be involved in cell-specific functions; therefore, such genes could represent good candidates for tissue markers or drug targets. However, this detection is difficult to perform using traditional statistical techniques, and few dedicated methods have been available.
Here we present SpeCond, a new statistical method to detect condition-specific expression from microarray data. We show that SpeCond detects reliable tissue-specific genes, and we evaluate its performance against alternative approaches. In all cases, SpeCond displayed higher sensitivity and a lower false discovery rate. Importantly, the SpeCond package is not a black box; the user is encouraged to test different parameter sets to find those that return meaningful results for the biological question at hand. Indeed, the large set of visualization tools allows the user to examine expression patterns in detail, to verify the fitting of the normal mixture distribution, and to easily compare the overall specific gene sets resulting from the use of different sets of parameters. In addition, the choice of input conditions can alter the results returned by SpeCond; therefore, the user might consider applying standard clustering methods to identify the global variability in expression patterns among the different conditions, before manually selecting the most relevant conditions for the analysis.
A further advantage of SpeCond is its ability to generate ranked lists of genes based on their tissue-specific expression. The ability to rank genes by their contribution to tissue-specificity should be helpful to experimentalists who wish to identify candidate genes for detailed follow-up studies. In addition, these ranked lists can be used in computational approaches, such as the examination of the organization of tissue-specific transcriptional networks or the putative annotation of unknown gene functions based on their expression pattern.
In the future, it will be very interesting to analyze RNA-seq data with the same purpose. However, the model will need to be modified, since a normal distribution-based model would not be the best fit for sequencing data. A negative binomial distribution, as used in the DESeq method, is certainly more appropriate, and therefore a mixture model of negative binomial distributions would need to be created.
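As a purely illustrative sketch of what such an extension might look like, the code below evaluates a mixture of negative binomial distributions in a mean-dispersion parameterisation (variance = mu + dispersion * mu^2, with mu > 0) and computes an upper-tail probability for a read count. It does not fit the mixture and is not part of SpeCond or DESeq; all function names and toy parameter values are assumptions.

```python
import math

def nb_log_pmf(k, mu, dispersion):
    """Log-probability of count k under a negative binomial with mean mu (> 0)
    and variance mu + dispersion * mu**2."""
    r = 1.0 / dispersion          # "size" parameter
    p = r / (r + mu)              # success probability
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1.0 - p))

def nb_mixture_pmf(k, weights, mus, dispersions):
    """Probability of count k under a weighted mixture of negative binomials."""
    return sum(w * math.exp(nb_log_pmf(k, mu, a))
               for w, mu, a in zip(weights, mus, dispersions))

def nb_mixture_upper_tail(k, weights, mus, dispersions):
    """P(X >= k) under the mixture, by summing the probabilities below k."""
    below = sum(nb_mixture_pmf(i, weights, mus, dispersions) for i in range(k))
    return max(0.0, 1.0 - below)

# Toy null distribution for one gene's counts (made-up parameters)
weights = [0.8, 0.2]
mus = [20.0, 60.0]
dispersions = [0.1, 0.2]

for count in (25, 100, 400):
    print(count, nb_mixture_upper_tail(count, weights, mus, dispersions))
```

The weighted tail probability plays the same role here as the weighted normal P-value in the microarray setting; estimating the component parameters from count data would require a separate fitting procedure.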
SpeCond is a new statistical method to detect condition-specific expression from microarray data. SpeCond does not impose a single normal distribution to estimate the underlying distribution but computes an estimate of the null distribution using a normal mixture model. SpeCond is an ideal choice when no previous data about the organization of the system under study are available, as it is not assumed that the measured expression values follow a single normal distribution. Finally, SpeCond is immediately applicable to many datasets measuring gene expression, including the detection of tissue-specific alternative splicing, in any species.
GNF: Genomics Institute of the Novartis Research Foundation
ROC: receiver operating characteristic
TSGA: Tissue Specific Genes Analysis.
We would like to thank Wolfgang Huber for his advice as well as the Luscombe group for discussions on the method. FMGC is funded by the EMBL PhD programme. This work is supported by the European Molecular Biology Laboratory (EMBL) and the EpiGeneSys Network.
At the proofs stage, a couple of publications came to light that may be useful for readers [30, 31]. The authors developed a method that takes advantage of the large number of publicly available microarray datasets to identify silenced and expressed genes. The use of a large number of samples allows this method to take batch effects into account and so accurately classify the expression of genes in each condition. This resource is very useful as it provides transcription profiles of hundreds of cell types in both human and mouse, including both ubiquitous and condition-specific genes.
- Freilich S, Massingham T, Bhattacharyya S, Ponsting H, Lyons PA, Freeman TC, Thornton JM: Relationship between the tissue-specificity of mouse gene expression and the evolutionary origin and function of the proteins. Genome Biology. 2005, 6: R56. 10.1186/gb-2005-6-7-r56.
- Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM: A census of human transcription factors: function, expression and evolution. Nature reviews. Genetics. 2009, 10: 252-63. 10.1038/nrg2538.
- Warrington JA, Nair A, Mahadevappa M, Tsyganskaya M: Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiological genomics. 2000, 2: 143-7.
- Butte AJ, Dzau VJ, Glueck SB: Further defining housekeeping, or "maintenance," genes Focus on "A compendium of gene expression in normal human tissues". Physiological genomics. 2001, 7: 95-6.
- Smyth GK: Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, New York: Springer, 397-420.
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America. 2001, 98: 5116-21. 10.1073/pnas.091062498.
- Zhang S: A comprehensive evaluation of SAM, the SAM R-package and a simple modification to improve its performance. BMC bioinformatics. 2007, 8: 230. 10.1186/1471-2105-8-230.
- Wang J, Jia M, Zhu L, Yuan Z, Li P, Chang C, Luo J, Liu M, Shi T: Systematical Detection of Significant Genes in Microarray Data by Incorporating Gene Interaction Relationship in Biological Systems. PLoS ONE. 2010, 5 (10): e13721. 10.1371/journal.pone.0013721.
- McLachlan GJ, Bean RW, Peel D: A mixture model-based approach to the clustering of microarray expression data. Bioinformatics. 2002, 18: 413-422. 10.1093/bioinformatics/18.3.413.
- McLachlan GJ, Bean RW, Ben-Tovim Jones L: A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics. 2006, 22: 1608-1615. 10.1093/bioinformatics/btl148.
- Kadota K, Ye J, Nakai Y, Terada T, Shimizu K: ROKU: a novel method for identification of tissue-specific genes. BMC bioinformatics. 2006, 7: 294. 10.1186/1471-2105-7-294.
- Kadota K, Nishimura S-I, Bono H, Nakamura S, Hayashizaki Y, Okazaki Y, Takahashi K: Detection of genes with tissue-specific expression patterns using Akaike's information criterion procedure. Physiological genomics. 2003, 12: 251-9.
- Ye Chengyin WX: TSGA: an R package for tissue specific genes analysis. 2008, [http://www.cab.zju.edu.cn/ics/faculty/zhuj/software/tsga/index.htm]
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB: A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences of the United States of America. 2004, 101: 6062-7. 10.1073/pnas.0400782101.
- Ihaka R, Gentleman R: R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics. 1996, 5: 299-314. 10.2307/1390807.
- R Development Core Team: R: A Language and Environment for Statistical Computing. 2008, Vienna, Austria, [http://www.R-project.org]
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology. 2004, 5: R80. 10.1186/gb-2004-5-10-r80.
- SpeCond Bioconductor package. [http://bioconductor.org/packages/release/bioc/html/SpeCond.html]
- Fraley C, Raftery AE: MCLUST: Software for model-based cluster analysis. Journal of Classification. 1999, 16: 297-306. 10.1007/s003579900058.
- Fraley C, Raftery AE: Enhanced software for model-based clustering, density estimation, and discriminant analysis: MCLUST. Journal of Classification. 2003, 20: 263-286. 10.1007/s00357-003-0015-3.
- Fraley C, Raftery AE: mclust Version 3 for R: Normal Mixture Modeling and Model-based Clustering. Technical Report 504, University of Washington, Department of Statistics. 2006, (revised December 2009)
- Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Annals of Statistics. 2001, 29: 1165-1188. 10.1214/aos/1013699998.
- Reimand J, Kull M, Peterson H, Hansen J, Vilo J: g:Profiler-a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic acids research. 2007, 35: W193-200. 10.1093/nar/gkm226.
- Yamamoto Y, Kawamoto T, Negishi M: The role of the nuclear receptor CAR as a coordinate regulator of hepatic gene expression in defense against chemical toxicity. Archives of biochemistry and biophysics. 2003, 409: 207-11. 10.1016/S0003-9861(02)00456-3.
- Peng Y, Schwarz EJ, Lazar MA, Genin A, Spinner NB, Taub R: Cloning, human chromosomal assignment, and adipose and hepatic expression of the CL-6/INSIG1 gene. Genomics. 1997, 43: 278-84. 10.1006/geno.1997.4821.
- Pascual M, Jose M, Castell V, Jover R: ATF5 Is a Highly Abundant Liver-Enriched Transcription Factor that Cooperates with Constitutive Androstane Receptor in the Transactivation of CYP2B6: Implications in Hepatic Stress Responses. Pharmacology. 2008, 36: 1063-1072.
- Kirschner MA, Arriza JL, Copeland NG, Gilbert DJ, Jenkins NA, Magenis E, Amara SG: The Mouse and Human Excitatory Amino Acid Transporter Gene (EAAT1) Maps to Mouse Chromosome 15 and a Region of Syntenic Homology on Human Chromosome 5. Genomics. 1994, 22: 631-633. 10.1006/geno.1994.1437.
- Lukk M, Kapushesky M, Nikkilä J, Parkinson H, Goncalves A, Huber W, Ukkonen E, Brazma A: A global map of human gene expression. Nature biotechnology. 2010, 28: 322-4. 10.1038/nbt0410-322.
- Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biology. 2010, 11: R106. 10.1186/gb-2010-11-10-r106.
- Zilliox MJ, Irizarry RA: A gene expression bar code for microarray data. Nature Methods. 2007, 4: 911-3. 10.1038/nmeth1102.
- McCall MN, Uppal K, Jaffee HA, Zilliox MJ, Irizarry RA: The Gene Expression Barcode: leveraging public data repositories to begin cataloging the human and murine transcriptomes. Nucleic Acids Research. 2011, 39: D1011-5. 10.1093/nar/gkq1259.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.