Analysis of gene expression in a developmental context emphasizes distinct biological leitmotifs in human cancers
© Naxerova et al.; licensee BioMed Central Ltd. 2008
Received: 4 March 2008
Accepted: 8 July 2008
Published: 08 July 2008
In recent years, the molecular underpinnings of the long-observed resemblance between neoplastic and immature tissue have begun to emerge. Genome-wide transcriptional profiling has revealed similar gene expression signatures in several tumor types and early developmental stages of their tissue of origin. However, it remains unclear whether such a relationship is a universal feature of malignancy, whether heterogeneities exist in the developmental component of different tumor types and to which degree the resemblance between cancer and development is a tissue-specific phenomenon.
We defined a developmental landscape by summarizing the main features of ten developmental time courses and projected gene expression from a variety of human tumor types onto this landscape. This comparison demonstrates a clear imprint of developmental gene expression in a wide range of tumors and with respect to different, even non-cognate developmental backgrounds. Our analysis reveals three classes of cancers with developmentally distinct transcriptional patterns. We characterize the biological processes dominating these classes and validate the class distinction with respect to a new time series of murine embryonic lung development. Finally, we identify a set of genes that are upregulated in most cancers and we show that this signature is active in early development.
This systematic and quantitative overview of the relationship between the neoplastic and developmental transcriptome spanning dozens of tissues provides a reliable outline of global trends in cancer gene expression, reveals potentially clinically relevant differences in the gene expression of different cancer types and represents a reference framework for interpretation of smaller-scale functional studies.
The historical roots of our understanding of the intimate connection between tumorigenesis and developmental processes reach back to 1858, when Rudolf Virchow first suggested that neoplasms arise "in accordance with the same law, which regulates embryonic development" . Since then, his idea has profoundly influenced medicine and still remains highly relevant today. The similarities between cancer and development are evident on many levels of observation: microscopically, cancerous tissues appear as undifferentiated masses, with some tumor types even exhibiting embryonic tissue organization. The increased mobility of malignant cells, leading to invasion of the local environment with the potential for subsequent travel to distant organs (representing one of the most problematic clinical aspects of cancer), is reminiscent of migratory behavior during development. On the molecular level, the shared characteristics between certain malignant tumors and developing tissues with respect to transcription factor activity , regulation of chromatin structure  and signaling  have been documented. In particular, several studies have suggested that part of the cancer transcriptome represents a 'developmental signature', that is, it contains a set of genes that are collectively active during development. For lung cancer [5, 6], liver cancer , Wilms' tumor , colon cancer [9, 10] and medulloblastoma , gene expression patterns resembling early developmental stages of the corresponding organ have been identified in the tumor profile. The results of these transcriptome-scale analyses are important because they offer a glimpse into fundamental biological processes underlying tumorigenesis and provide a natural framework for understanding complex cancer gene expression signatures that are difficult to interpret otherwise. Moreover, developmental signatures harbor a clinical relevance that we are only beginning to discover. For example, lung cancers can be risk-stratified by their similarity to lung development and pluripotency gene signatures can be used to predict outcome in breast cancer [6, 12].
In the present study, we paint a novel picture of the oncological landscape by comparing a variety of human cancers based on their developmental signature. Our analysis was inspired by the following questions: to which extent can the transcriptome of a tumor, which is oftentimes perceived as an aberration, be 'explained' by developmental gene expression? Does the developmental signature represent a feature of most, and possibly all, human cancers or does gene expression in different tumors fall into distinct groups with respect to development? Is recapitulation of developmental gene expression programs a tissue-specific phenomenon or is the developmental signature largely composed of general transcriptional modules that play a ubiquitous role in developmental processes? The answers to these open questions have therapeutic implications . If a broad range of tumors employs primitive developmental mechanisms that are shared across tissues to sustain their growth and survival, a certain drug or class of drugs could be capable of affecting them all. If, on the other hand, highly lineage-specific mechanisms govern malignant growth and behavior, focus has to be put on identifying and targeting tissue-specific regulators.
The results from the integrative analysis of gene expression in cancer and development presented here suggest that the developmental information content of most human cancers indeed is significant. The developmental signature of cancers originating from various tissues exhibits low tissue-specificity, indicating that a large portion of the cancer transcriptome is composed of general developmental modules. Furthermore, we describe three developmentally distinct groups of cancer, validate the class distinction on a new time series of embryonic development in the mouse and show that the behavior of genes in lung development is predictable by their expression across the three groups. We explore the biological themes dominating the expression profiles of these classes and demonstrate that one group recapitulates early developmental gene expression patterns and is characterized by an 'individualistic' signature with upregulation of pluripotency genes and suppression of genes involved in cell-cell communication and signal transduction. A second group exhibits a 'communicative' gene expression signature that is active in late development, is enriched in genes involved in immune response, cell-cell and cell-matrix interactions and resembles a wound healing signature. A third group connects the previous two with a transition phenotype. While social and anti-social aspects of cancer have been widely popularized, this study points out the possibility of a more subtle classification of different cancers that tend to evoke different types of 'survival mechanisms'. Finally, we identify a core program of genes that are upregulated in most cancers and show that these genes are coexpressed in early development.
Placing human cancers on a developmental landscape
Our analysis is based on a large-scale comparison of gene expression in 10 developmental processes and 32 cancer data sets. To paint an unbiased picture of the association between gene expression in development and oncogenesis, we selected data from a wide biological range. Our development database encompasses gene expression time series characterizing processes as diverse as heart development in the mouse, human T cell development and in vitro differentiation of murine embryonic stem cells (see Additional data file 6 for a list of all data sets). Cancer gene expression data include tumors from most commonly affected anatomical locations and corresponding normal tissue as a reference.
In the next step, we determined the relationship of gene expression in cancer to each of the ten DTs. We identified the genes that were up- and downregulated in a cancer relative to its corresponding normal tissue and tracked their position (or the position of their mouse ortholog for murine developmental processes) on the DTs . In the following, we will use two kinds of plots to summarize the resulting distribution: a frequency plot (Figure 1a) for an intuitive overview of where deregulated cancer genes fall on the DT and a probability density plot (Figure 1b) that allows a more accurate quantification of the cancer-development relationship. The frequency plot is divided into two panels: on the left side, the frequency of upregulated genes on the DT is shown; on the right side, the DT is depicted again with the distribution of downregulated genes (Figure 1).
The probability density plot shows how likely genes in different segments of the DT are to be expressed/suppressed in cancer (see the Figure 1 legend for details). If there was no correlation between gene expression in cancer and development, the probability distributions would follow a straight line with slope 1. However, if certain parts of the DT contain genes that are up- or downregulated in cancer with a higher frequency than expected by chance, the slope of the probability density increases. Conversely, if cancer genes are depleted in a particular segment of the DT, the slope becomes flatter. For the deregulated genes in Figure 1b, this results in an 'open eye' shape of the probability density (the legend to Figure 1 details the quantification of this shape).
A variety of cancers have activated a predominantly tissue-independent developmental signature
Perhaps unexpectedly, the specificity of upregulated lung cancer genes for early development (and downregulated genes for late development) can be reproduced on DTs derived from atrial chamber development, ES cell differentiation and T cell development (more examples can be found in Additional data file 1). Apparently, gene expression programs that are exploited during lung tumorigenesis play a ubiquitous role in processes involving differentiation and morphogenesis. This result is in contrast to the prevailing notion that recapitulation of developmental gene expression in cancer is a tissue-specific phenomenon [9, 11].
Examination of the developmental distribution of Wilms' tumor genes suggests that this property is not unique to lung cancers. The segregation of up- and downregulated genes in Wilms' tumor on lung development occurs even more convincingly than the separation of lung cancer genes. A similar result for many other tumor types (Additional data file 1) suggests that this is unlikely to be solely attributable to the embryonal nature of Wilms' tumor. Instead, a general developmental signature that shows very little evidence of tissue-specificity seems to be a hallmark of many cancers. However, there are several notable exceptions.
Upregulated genes in glioblastoma (2c) follow a similar pattern to lung adenocarcinoma and Wilms' tumor in early development, but an additional peak prominently occurs on the late end of the DTs. Beyond expressing early genes, glioblastomas have activated other, distinct transcriptional programs that are characteristic of later developmental stages. The developmental gradient in this case is not capable of 'explaining' the glioblastoma gene expression signature unambiguously. An even more striking example is ovarian cancer (Figure 2d), a tumor that is in many respects the developmental complement of glioblastoma: upregulated genes tend to avoid early and late development, while downregulated genes have a preference for the extremes of the DT. Apparently, transcriptional states in different cancers map to distinct domains of physiological gene expression. These divergent developmental patterns are unlikely to be random fluctuations. First, their recurrence with respect to changing developmental backgrounds suggests a robust association. Second, up- and downregulated genes have complementary patterns; where upregulated genes are abundant on the DT, downregulated genes are infrequent and vice versa. The expression of certain sets of genes seems to be mutually exclusive; if one set is active, the other set is invariably turned off. Third, a limited number of patterns consistently recurs in different data sets.
Finally, Figure 2e shows the developmental profile of a disease that does not directly belong to the cancer family: liver cirrhosis. The developmental timing of deregulated genes in cirrhosis is strikingly different from most cancers. Upregulated genes have a preference for late development, downregulated genes tend to be enriched on the early end of the DTs. This example illustrates that the distribution of deregulated genes in development indeed is a pathophysiology-specific phenomenon.
Three distinct groups of tumors emerge from the developmental landscape
The cases discussed in Figure 2 are a collection of representative examples highlighting some fundamental properties of the association between cancer and development. By visual inspection it is already clear that the developmental profiles of lung adenocarcinoma and Wilms' tumor are more similar to each other than to ovarian cancer, for example. However, if we want to extend this assessment of similarity to a larger number of tumors, a quantitative description of the 'shape' of the developmental profile is required. We realized this quantification by fitting two linear curves to each probability distribution, one curve representing its slope in the early part of the DT and the other one approximating the late slope (Figure 1b). Thus, each combination of cancer and developmental process is summarized by a unique set of four values, consisting of two slopes for upregulated and two slopes for downregulated genes.
Group 2 contains several tumors with an ambiguous correlation with developmental gene expression. Glioblastoma is part of this group, next to several other central nervous system tumors, breast cancer, and the more aggressive forms of papillary renal cell carcinoma (subtypes 1.2A and 2). Examination of the frequency plots and probability distributions for these cancers (Additional data file 1) shows that two types of tumors are found in this group: those that do recapitulate early developmental gene expression, but also exhibit additional transcriptional programs that are not consistent with the developmental gradient (for example, glioblastoma); and tumors that are consistent with the gradient, but whose deregulated genes show a less dramatic preference for the extremes of the DTs (for example, breast carcinoma).
Group 3, featuring several subtypes of ovarian cancer, prostate cancer, two independent data sets of papillary thyroid carcinoma (PTC) and two independent instances of renal cell carcinoma, displays a transcriptional phenotype that is completely distinct from groups 1 and 2. Upregulated genes have no clear preference for early development. In fact, in some instances they accumulate on the late end of the DTs, co-clustering with liver cirrhosis, dysplastic liver and ulcerative colitis. The behavior of downregulated genes varies considerably. In some cases - most notably the ovarian cancers - they complement upregulated genes, but in PTC 3 for example, up- and downregulated genes peak in similar DT segments, hinting at active regulatory mechanisms that are not found in normal developmental processes. It is apparent that group 3 is a much more heterogeneous collection of diseases than groups 1 or 2.
Of note, two data sets in group 3 have counterparts of histologically similar tumors located in group 1. PTC is represented with three, and clear cell renal cell carcinoma (CRCC) with two independent data sets in our database. Two of the PTC data sets belong to group 3; a third data set, which is divided in three histological subtypes of PTC (follicular, tall cell and conventional variant) is part of group 1. Possibly, the lacking histological subclassification of PTCs belonging to group 3 emphasizes a different transcriptional theme in those tumors. Even more likely, the paired experimental design of the two group 3 PTC data sets - in both cases, tissue from the same patient served as a normal control - influences the gene expression signature. We will address this issue in more detail in the discussion.
The CRCC data sets are concordant as far as the top third of differentially expressed genes is concerned. Considering only the 450 most differentially expressed genes reveals a pronounced preference of upregulated genes for the late part of DTs in both data sets (Additional data file 3), making CRCC more similar to diseases like liver cirrhosis and ulcerative colitis and implying that the early peak that places CRCC 1 among the 'early developmental' tumors is a less significant addition to a prominent 'late' transcriptional program.
While groups 1 and 3 are clearly distinct, it is debatable whether group 2 should be treated as its own entity. It is apparent that there is a spectrum of developmental signatures, with most cancer types clustering at its early or late end and a few intermediate cases that cannot be classified unambiguously. Examining the distribution of probability distribution slope values for upregulated genes in the early segment of the DTs (the most distinguishing feature) exemplifies this point (Additional data file 8). The distribution is bimodal, with most cancers falling into the early or late peak and group 2 tumors occupying the middle. To achieve a clear biological separation in subsequent analyses, we decided to treat these intermediate cases as a distinct class; it remains to be determined in more comprehensive studies whether this group can be identified reproducibly.
The contribution of proliferation-related genes to the developmental pattern in cancer
Since early stages of most developmental processes involve massive proliferation, part of the similarity between early development and cancer can most certainly be attributed to cell cycle (CC)-related genes. Also, the clinical behavior of the cancers constituting the three groups raises the question whether a proliferation signature could be driving their developmental profile. Group 1 mostly consists of aggressive tumors with low doubling times (for example, urinary bladder cancer, lung cancer, Wilms' tumor), while group 3 contains more indolent forms. Tumors like ovarian and renal cancer are associated with poor outcome because they metastasize frequently and do not respond well to chemotherapy, but their growth rate tends to be relatively low [14–16]. Also, prostate and thyroid cancers are well-known for their slow growth [17, 18].
Gene expression in groups 1, 2 and 3 is dominated by different biological processes
GO category enrichment
BP - overrepresented
BP - underrepresented
CC - overrepresented
Multicellular organismal process
Antigen processing and presentation
Cytokine and chemokine mediated signaling pathway
Multicellular organismal process
Biopolymer metabolic process
Cell cycle phase
MHC protein complex
Group 1 (16)
DNA repair (15)
Cell cycle (15)
RNA splicing (13)
Multicellular organismal process (16)
G-protein coupled receptor protein signaling pathway (16)
Neurological process (16)
Nuclear part (15)
Multicellular organismal process (15)
Organ development (14)
Cell communication (11)
Primary metabolic process (14)
RNA processing (14)
DNA metabolic process (14)
Plasma membrane (16)
Extracellular region (13)
Voltage-gated potassium channel complex (8)
Group 2 (6)
Cell cycle (6)
DNA replication (6)
Response to DNA damage stimulus (6)
Multicellular organismal development (5)
Anatomical structure development (5)
System development (4)
Protein complex (5)
Replication fork (5)
Monovalent inorganic cation transport (5)
ATP synthesis coupled proton transport (5)
Oxidative phosphorylation (4)
DNA recombination (6)
Immune response (5)
Macromolecule metabolic process (5)
Proton-transporting two-sector ATPase complex (5)
Extracellular matrix (3)
Group 3 (13)
Immune response (10)
Multicellular organismal process (8)
Cell adhesion (6)
Response to wounding (5)
Cellular metabolic process (10)
Nucleobase, nucleoside, nucleotide and nucleic acid metabolic process (9)
RNA metabolic process (8)
Plasma membrane (10)
Extracellular region (10)
Cellular metabolic process (10)
Protein metabolic process (6)
RNA processing (5)
Multicellular organismal process (10)
Immune response (10)
Cell activation (8)
PEN versus ESEN
Cell cycle phase
DNA metabolic process
Generation of precursor metabolites and energy
Lipid metabolic process
Lipid biosynthetic process
Cofactor metabolic process
Macromolecule metabolic process
Intracellular signaling cascade
M phase of mitotic cell cycle
The difference between early and late developmental genes, and consequently genes activated in group 1 versus group 3, is also evident when comparing the cellular localization of their gene products. Proteins that are produced in early development and group 1 are predominantly located in the nucleus. Similarly, upregulated genes in group 2 have products with nuclear localization and specific involvement in the CC. Gene products of lDEV500 and group 3, however, are chiefly membrane-associated or secreted into the extracellular space.
Finally, we compared the PEN to development and cancer. As expected, upregulated genes were mostly CC-related. However, they were not depleted for cell communication or signal transduction genes like eDEV500 and cancers in groups 1 and 2, suggesting that proliferating cells of the endometrium retain a higher level of communication with their surroundings than those in cancer or early development. Downregulated genes were associated with lipid metabolism and showed no enrichment for organogenesis or multicellular processes like lDEV500 and downregulated genes in group 1. Taken together, these results suggest a unique relationship between malignancy and development that is not fully recapitulated in normal proliferating tissues.
Among hundreds of curated gene sets, the developmental signature is the best descriptor of approximately 50% of interrogated tumor types
We next wanted to determine how well our developmental signatures describe the difference between cancer and normal tissue in a direct comparison with other gene sets. We downloaded the C2 database from MSigDB , a collection of gene sets derived from gene expression studies and known pathways, and tested the enrichment of approximately 1,000 gene sets in the up- and downregulated genes of our data sets. Subsequently, we compared the results with the performance of eDEV500, lDEV500 and four smaller gene sets that were defined analogously, eDEV200/lDEV200 and eDEV100/lDEV100.
MSigDB C2 gene sets most significantly enriched in groups 1-3
eDEV500 is less significant in group 2 than in group 1. This is consistent with previous results showing a less pronounced clustering of upregulated genes in early development for group 2. Instead, two independent serum response signatures are enriched in the upregulated genes (SERUM_FIBROBLAST_CORE_UP, CHANG_SERUM_RESPONSE_UP). Besides stimulating proliferation, serum exposure induces a wound healing response in fibroblasts, involving the activation of genes that play a role in intercellular signaling and remodeling of the extracellular matrix . These are both processes that map to late development in our analysis. Indeed, group 2 tumors tend to have both an early and a late peak in the frequency distribution of upregulated genes (Figure 2).
As already noted in the context of GO classification, gene sets enriched in group 3 are a counterpart of group 1. eDEV500 does not rank among the top gene sets, nor do any of the stem cell signatures. Instead, three signatures that are enriched in group 1 downregulated genes are overrepresented in the upregulated genes of group 3 (TARTE_MATURE_PC, SANSOM_APC_5_DN, NAKAJIMA_MCS_UP). The combination of serum-induced cell division (SERUM_FIBROBLAST_CELLCYCLE) and immune response gene sets again suggests an association with wound healing, but the early developmental component that is so prominent in group 1 and also present in group 2 is lacking in group 3.
The class distinction is reproducible on an independent time series
Example genes from the consensus sets of groups 1-3 ordered by their average rank across all DTs
Consensus group 1
Non-metastatic cells 1, protein (NM23A)
Chaperonin containing TCP1, subunit 7 (eta)
Proliferating cell nuclear antigen
H2A histone family, member X
Ribonucleotide reductase M1 polypeptide
Chaperonin containing TCP1, subunit 3 (gamma)
Rac GTPase activating protein 1
CDC28 protein kinase regulatory subunit 2
Consensus group 2
Nudix (nucleoside diphosphate linked moiety X)-type motif 1
Maternal embryonic leucine zipper kinase
Deoxythymidylate kinase (thymidylate kinase)
Denticleless homolog (Drosophila)
Proteasome (prosome, macropain) subunit, beta type, 9 (large multifunctional peptidase 2)
Transporter 1, ATP-binding cassette, sub-family B (MDR/TAP)
Proteasome (prosome, macropain) subunit, beta type, 8 (large multifunctional peptidase 7)
Consensus group 3
Maternal embryonic leucine zipper kinase
Rac GTPase activating protein 1
CDC28 protein kinase regulatory subunit 2
Uncoupling protein 2 (mitochondrial, proton carrier)
Matrix metallopeptidase 7 (matrilysin, uterine)
Protective protein for beta-galactosidase (galactosialidosis)
Interleukin 7 receptor
Transporter 1, ATP-binding cassette, sub-family B (MDR/TAP)
Major histocompatibility complex, class I, C
A core program of genes expressed in most cancers is active in early development
We have presented a comprehensive, tissue-spanning comparison of gene expression in normal development and human cancer. Main conclusions emerging from this analysis are that a large percentage of tumors recapitulate early developmental gene expression and that the developmental signature in these cancers exhibits low tissue-specificity. Furthermore, we have identified three groups of cancers distinguished by disparate developmental signatures. One group has an early developmental phenotype and expresses genes that are characteristic of stem cells. From a developmental perspective, this group presents very homogeneously. This is all the more surprising as it contains cancer types with complex karyotypes, which are currently thought to lead to more 'chaotic' gene expression. A second, more heterogeneous group tends to be more similar to late development and is characterized by an inflammatory signature. A small group of cancers presents as a transition phenotype between these two extremes and displays both characteristics. This group distinction is reproducible with respect to a new time series of embryonic lung development in the mouse. Finally, we have identified a core program of genes that are expressed in most cancers and mapped the activity of this transcriptional program to early development.
An unexpected result is the low tissue-specificity of the developmental signature, contrasting with previous reports [9, 11]. We cannot exclude the possibility that cancer types that were not included in our database recapitulate more tissue-specific developmental patterns. However, our findings suggest that comprehensive comparisons against a diverse set of developmental backgrounds are required before a specific association between a cancer and the development of its cognate tissue can be established on the gene expression level. It is likely that a lineage-specific aspect does exist in cancer gene expression , but its magnitude seems to be small in comparison with more generic developmental modules. Possibly, microRNA profiles might be better suited for detection of such subtle signals because they reflect more specific processes than mRNA profiles .
Given the type of analysis conducted here, intended to reveal broad brush strokes rather than subtleties, the clear segregation of tumors into three groups with distinct expression patterns is surprising. Clearly, the developmental trajectory provides a meaningful background for capturing large-scale differences in gene expression across diverse conditions. That said, we can only speculate as to what the biological determinants of the observed segregation might be as they are potentially as broad as the contexts in which a proliferative response is 'normal' and physiological.
The capability to divide in response to certain conditions is an inherent property of most cells. Epithelia can augment the production of new cells in response to mechanical irritation , fibroblasts divide to reconstitute injured tissue , the microvasculature of the female reproductive system periodically expands , hepatocytes reconstitute liver tissue after hepatectomy  and, of course, cells divide to form a new organism during embryogenesis. The transcriptional programs driving these processes might be as diverse as the contexts that trigger proliferation. Cancers likely exploit endogenous cellular mechanisms to sustain their growth, but our understanding of which of the available paths towards proliferation is chosen in different types of cancers is rudimentary. Our analysis suggests that tumors in group 1 recapitulate an embryonic phenotype: they express early developmental and stem cell genes, suppress genes characteristic of mature tissues and they have downregulated messages required for intercellular communication and signaling. Cell cycle in these tumors might be fueled through the same mechanisms that are employed in rapidly proliferating blastemal cells. Group 3, on the other hand, presents a different picture. Differentially expressed genes imply that proliferation here could occur in the context of wound healing, which is associated with all the processes that are relevant in group 3 (inflammatory reaction, proliferation, tissue remodeling). The conception of cancer as a 'wound that does not heal' has often been cited . Our analysis suggests that it might be more applicable to some tumors than to others. Indeed, the clinical behavior of tumors in group 3 seems to exhibit some special features. Even though ovarian cancer is known as a malignancy with poor prognosis, its growth rate often is slow and patients can live with metastases for years . Renal cell carcinoma also has a poor prognosis when metastatic; however, most renal cell carcinomas have an indolent growth rate . Finally, thyroid cancers are also recognized for their slow growth . It is not clear whether the inflammatory gene expression signature we observe in these tumors is a cause or consequence of this particular behavior, but further investigation of this question has profound clinical implications. If tumors truly rely on distinct programs for proliferation and survival, a classification system that takes these differences into consideration could provide valuable guidelines for therapeutic decisions. Based on our study, for example, we would predict that a drug interfering with the wound healing program might be effective against both ovarian and renal carcinoma, but not against Wilms' tumor or lung adenocarcinoma. Interestingly, a recent paper that examined gene expression in mouse models of colon carcinoma in a developmental context revealed a distinction between Smad3-/- and Tgfb1-/-; Rag2-/- models (both exhibiting a strong inflammatory component and showing similarity to late colon development) and ApcMin/+ and AOM models, which recapitulated early colon development . This result implies that different genetic alterations might underlie the distinct gene expression signatures in group 1 and 3 cancers.
To refine the distinction between the developmental groups of cancers emerging from our analysis, more data - ideally acquired under standardized conditions - are necessary. While the embryonic cancers in group 1 seem to represent a fairly homogeneous population with respect to their developmental component, diseases in group 3 are far more heterogeneous and more reliable data would probably lead to further sub-classification of these cancers. Standardized data would also likely help to resolve the group affiliation of ambiguous cases like thyroid carcinoma. Both PTC data sets mapping to group 3 are paired experiments, with tumor and normal tissue coming from the same patient, while the PTC data set in group 1 is unpaired. Such differences seem to have a larger impact on the genes that are identified as differentially expressed than commonly assumed. A recent study elegantly proves this point by showing an altered gene expression signature in 'normal' tissue adjacent to lung tumors . Other possibly confounding factors like the degree of lymphocyte infiltration in different samples and the already mentioned specification of histological subtype might play important roles in determining the developmental profile of a tumor and should be accounted for in future studies.
To gain a better understanding of the biology underlying different loci on the developmental landscape, it might also be helpful to include more pathological conditions unrelated to cancer in the analysis. For many diseases, we have sufficiently good understanding of etiology and pathophysiology to be able to use them as 'landing lights' on the developmental surface.
The results presented here suggest that there is great potential for better understanding of human disease in a 'macrobiological' approach to analyzing high-throughput data. Shifting our focus from single sets of genes or processes to the biology of aggregates on the order of the entire transcriptome is likely to be useful in establishing highly robust molecular correlations between seemingly unrelated disease phenotypes.
Materials and methods
All gene expression data with the exception of the lung development validation series came from the public domain. Developmental time courses were profiled on several different Affymetrix chips (MG-430 2.0, MG-430A, MG-U74, Mu11K, HG-U133A). To exclude potential platform-related bias, we restricted ourselves to Affymetrix HG-U133A or HG-U133 Plus 2.0 arrays for cancer gene expression profiles. A detailed description of all data sets can be found in Additional data file 6.
When available, .CEL files were downloaded and arrays were normalized and expression measures calculated using the robust multi-array average [36, 37]. When raw data were not available, MAS5 preprocessed expression values were downloaded, quantile-normalized and log2-transformed.
Cross-platform comparison and homology mapping
On Affymetrix arrays, a gene is often assayed by several probe sets. We first reduced each platform to unique Entrez Gene IDs. To avoid artifacts in downstream analyses caused by biased probe set selection, we randomly chose the probe set that would represent a gene on a particular platform. Probe sets with no Entrez ID were removed. In the next step, we used the homologene database (NCBI) to define orthologs between the human and the mouse. Entrez IDs with no ortholog were removed from all platforms. Finally, we matched orthologous genes across platforms. The resulting 'consensus' between the different platforms consists of 5,166 unique genes.
Construction of the developmental timeline: principal components analysis
We used principal components analysis to construct a DT for each developmental time series . Principal components analysis is a linear data transformation technique that allows representation of the original data in a new coordinate system in which the axes (principal components (PCs)) are chosen to capture as much variation in the data as possible in a decreasing order, that is, PC1 accounts for x% variability, PC2 for y%, PC3 for z% and so on, with x > y > z. We first normalized the expression values of each gene across conditions (time points) to mean 0 and standard deviation 1. Principal components analysis was carried out on the normalized data using the R language and environment for statistical computing , with genes representing objects and time points representing the features whose dimensionality is to be reduced. For the purposes of our analysis, we were interested in the PC that is most significantly associated with time. For each developmental time series M (5,166 rows/genes and k columns/time points), we therefore computed the correlation between a time vector (1,2,3...,k) and the component loadings for each of the k PCs. For all time series with the exception of liver regeneration, PC1 was most significantly correlated with the time vector (>0.8). For liver regeneration, the highest correlation (~0.6) was found for PC3, indicating that the major changes in gene expression during liver development do not occur in a continuum from time point 1 to k, as in development. The most active stage of hepatocyte regeneration occurs approximately 48 h after hepatectomy, while our time series spans 0-72 h (Seth Karp, personal communication).
The DT of a developmental time series is the original data matrix M (5,166 rows/genes and k columns/time points) after the transformation:
y = v T × d T
where v T denotes the k-dimensional PC of M that is most significantly correlated with time.
Analysis of differential gene expression and construction of developmental profiles
Differential gene expression between tumors and corresponding controls was determined using significance analysis of microarrays (SAM) . All genes with a q-value <0.05 were considered differentially expressed. For purposes of consistency, SAM based on an unpaired t-test was used for all data sets, even though paired data were available in four cases.
Frequency plots and probability distributions
Frequency plots were constructed by dividing the DTs in 13 equally sized (approximately 400 genes) compartments and plotting the compartment index against the number of upregulated and downregulated genes mapping to that compartment.
We then quantified the shape of the distribution by fitting two independent linear models to the data, one regression line representing the probability distribution on the early end of the DT and another one approximating the distribution on the late end (illustrated in Figure 1b and top right corner of Figure 3). The goal was to find two regression lines that best approximate the probability distribution and use their slopes as a two-dimensional summary of how the cancer genes map to the DT. Since each probability distribution has a unique shape, it has to be determined in each individual case at which point on the DT the breakpoint between the early and late model should occur to achieve an optimal approximation. We computed a series of F statistics (Chow test) for each potential breakpoint in the probability distribution; that is, we tested how different the coefficients of the two regression lines are if we choose the breakpoint at DEV[i] for i = 774,775,...4,391 (this excludes the earliest and latest parts of the DT because the linear model should represent a sufficiently large segment). The optimal breakpoint is defined as the maximal value in the series of F statistics. All computations were done using the strucchange package available at . The fit of the regression lines to all probability distributions (blue lines) can be viewed in Additional data files 9 (before cell cycle subtraction) and 10 (after cell cycle subtraction). For each combination of cancer and DT, this approach yields four regression lines: two models representing early and late probabilities for upregulated genes and two models for downregulated genes. For each cancer, we can thus summarize the relationship to the 10 DTs in a 40-dimensional vector (4 regression line slopes × 10 DTs). These vectors were for used for clustering using Euclidian distance and Ward's linkage (Figure 3).
Cell cycle subtraction
We downloaded cell cycle scores for 38,578 probes . The scaled Fourier periodicity scores ranged from 0.1-58; the cutoff for being considered cell cycle regulated in the original publication was 3.2. In our CC subtracted data sets, we allowed only genes with a defined score <1.
Meta-developmental signatures, consensus gene sets and GO characterization
The meta-developmental signatures were determined by computing the average rank of each gene across all ten DTs and selecting the x genes with earliest (eDEVx) and latest (lDEVx) expression. Enrichment of GO categories in meta-developmental signatures and deregulated genes in cancer was determined against the background of all 5,166 genes in our analysis using Bioconductor's GOstats package . Consensus gene sets for tumors in groups 1-3 were defined as those genes that are upregulated (downregulated) in at least 60% of datasets belonging to a given group, that is, 11/15 for group 1, 5/6 for group 2 and 8/13 for group 3.
C2 gene set enrichment
We downloaded all C2 gene sets from the Broad Institute website , eliminated the fraction that had an overlap of less than 15 genes with our data sets, and augmented the C2 collection with 10 meta-developmental gene signatures (DEV30, 50, 100, 200, 500 and LATEDEV30, 50, 100, 200, 500). We then tested the enrichment of each of these 999 gene signatures in the up- and downregulated genes of our 32 data sets using Fisher's exact test (one-sided). Clustering of all data sets by the p-values for the top 20 enriched gene sets in groups 1-3 was accomplished using Manhattan distance and Ward's linkage.
R scripts for all above-mentioned analyses are provided on the website accompanying this paper .
Validation time series: embryonic lung development
Whole lungs were dissected from C57BL/6J mice at E11.5, E13.5, E14.5, E16.5 and postnatal day 5 and stored in RNAlater (Ambion, Austin, TX, USA). All time points represent gene expression patterns of individuals; only E11.5 was a pooled sample (seven pups). Total RNA was extracted using Ambion's mirVana miRNA isolation kit and tested for quality using a bioanalyzer (Agilent, Santa Clara, CA, USA). RNA integrity numbers ranged from 9.2-9.7. The samples were prepared for hybridization to Affymetrix Mouse 430 2.0 arrays according to the manufacturer's instructions. Processed and raw data have been submitted to Gene Expression Omnibus  under accession number GSE11539 and are also available in RMA-normalized form as Additional data file 7.
Additional data files
The following additional data are available. Additional data file 1 contains the frequency plots for all cancer types and all developmental time series. Additional data file 2 contains the frequency plots for all cancers and all time series after CC subtraction. Additional data file 3 shows the frequency plots for the top 450 differentially expressed genes in CRCC1 and CRCC2. Additional data file 4 contains a heatmap of probability distribution slopes after CC subtraction (analogously to Figure 3). Additional data file 5 is a heatmap of enrichment p-values for gene sets that ranked among the 20 most enriched in the downregulated genes of either group 1, 2 or 3. Additional data file 6 is a spreadsheet containing detailed annotation for all data sets used in this study. Additional data file 7 contains the raw data for the lung development validation time series (after RMA-normalization). Additional data file 8 contains a smooth histogram of early upregulated (UpE) probability distribution slopes for all cancer data sets. Additional data file 9 contains the probability distribution plots and linear regression fits for all cancers and all time series. Additional data file 10 shows the same data as additional data file 9, but after CC subtraction.
clear cell renal cell carcinoma
early secretory endometrium
papillary thyroid carcinoma
significance analysis of microarrays.
We thank Zoltan Szallasi, Vania Nose, Iris Eisenberg, Wilhelmine DeVries, Judah Folkman, Michael Galdzicki and Alal Eran for fruitful discussions, valuable comments on the manuscript and technical assistance.
- Virchow RLK: Cellular Pathology. 1859, BerlinGoogle Scholar
- Hartwell KA, Muir B, Reinhardt F, Carpenter AE, Sgroi DC, Weinberg RA: The Spemann organizer gene, Goosecoid, promotes tumor metastasis. Proc Natl Acad Sci USA. 2006, 103: 18969-18974. 10.1073/pnas.0608636103.PubMedPubMed CentralView ArticleGoogle Scholar
- Sparmann A, van Lohuizen M: Polycomb silencers control cell fate, development and cancer. Nat Rev Cancer. 2006, 6: 846-856. 10.1038/nrc1991.PubMedView ArticleGoogle Scholar
- Liu S, Dontu G, Mantle ID, Patel S, Ahn NS, Jackson KW, Suri P, Wicha MS: Hedgehog signaling and Bmi-1 regulate self-renewal of normal and malignant human mammary stem cells. Cancer Res. 2006, 66: 6063-6071. 10.1158/0008-5472.CAN-06-0054.PubMedPubMed CentralView ArticleGoogle Scholar
- Borczuk AC, Gorenstein L, Walter KL, Assaad AA, Wang L, Powell CA: Non-small-cell lung cancer molecular signatures recapitulate lung developmental pathways. Am J Pathol. 2003, 163: 1949-1960.PubMedPubMed CentralView ArticleGoogle Scholar
- Liu H, Kho AT, Kohane IS, Sun Y: Predicting survival within the lung cancer histopathological hierarchy using a multi-scale genomic model of development. PLoS Med. 2006, 3: e232-10.1371/journal.pmed.0030232.PubMedPubMed CentralView ArticleGoogle Scholar
- Coulouarn C, Derambure C, Lefebvre G, Daveau R, Hiron M, Scotte M, Francois A, Daveau M, Salier JP: Global gene repression in hepatocellular carcinoma and fetal liver, and suppression of dudulin-2 mRNA as a possible marker for the cirrhosis-to-tumor transition. J Hepatol. 2005, 42: 860-869. 10.1016/j.jhep.2005.01.027.PubMedView ArticleGoogle Scholar
- Dekel B, Metsuyanim S, Schmidt-Ott KM, Fridman E, Jacob-Hirsch J, Simon A, Pinthus J, Mor Y, Barasch J, Amariglio N, Reisner Y, Kaminski N, Rechavi G: Multiple imprinted and stemness genes provide a link between normal and tumor progenitor cells of the developing human kidney. Cancer Res. 2006, 66: 6040-6049. 10.1158/0008-5472.CAN-05-4528.PubMedView ArticleGoogle Scholar
- Hu M, Shivdasani RA: Overlapping gene expression in fetal mouse intestine development and human colorectal cancer. Cancer Res. 2005, 65: 8715-8722. 10.1158/0008-5472.CAN-05-0700.PubMedView ArticleGoogle Scholar
- Kaiser S, Park YK, Franklin JL, Halberg RB, Yu M, Jessen WJ, Freudenberg J, Chen X, Haigis K, Jegga AG, Kong S, Sakthivel B, Xu H, Reichling T, Azhar M, Boivin GP, Roberts RB, Bissahoyo AC, Gonzales F, Bloom GC, Eschrich S, Carter SL, Aronow JE, Kleimeyer J, Kleimeyer M, Ramaswamy V, Settle SH, Boone B, Levy S, Graff JM, et al: Transcriptional recapitulation and subversion of embryonic colon development by mouse colon tumor models and human colon cancer. Genome Biol. 2007, 8: R131-10.1186/gb-2007-8-7-r131.PubMedPubMed CentralView ArticleGoogle Scholar
- Kho AT, Zhao Q, Cai Z, Butte AJ, Kim JY, Pomeroy SL, Rowitch DH, Kohane IS: Conserved mechanisms across development and tumorigenesis revealed by a mouse development perspective of human cancers. Genes Dev. 2004, 18: 629-640. 10.1101/gad.1182504.PubMedPubMed CentralView ArticleGoogle Scholar
- Liu R, Wang X, Chen GY, Dalerba P, Gurney A, Hoey T, Sherlock G, Lewicki J, Shedden K, Clarke MF: The prognostic role of a gene signature from tumorigenic breast-cancer cells. N Engl J Med. 2007, 356: 217-226. 10.1056/NEJMoa063994.PubMedView ArticleGoogle Scholar
- Garraway LA, Sellers WR: Lineage dependency and lineage-survival oncogenes in human cancer. Nat Rev Cancer. 2006, 6: 593-602. 10.1038/nrc1947.PubMedView ArticleGoogle Scholar
- Berek JS: Interval debulking of ovarian cancer — an interim measure. N Engl J Med. 1995, 332: 675-677. 10.1056/NEJM199503093321010.PubMedView ArticleGoogle Scholar
- Oda T, Takahashi A, Miyao N, Yanase M, Masumori N, Itoh N, Sato MA, Kon S, Tsukamoto T: Cell proliferation, apoptosis, angiogenesis and growth rate of incidentally found renal cell carcinoma. Int J Urol. 2003, 10: 13-18. 10.1046/j.1442-2042.2003.00558.x.PubMedView ArticleGoogle Scholar
- Fujimoto N, Sugita A, Terasawa Y, Kato M: Observations on the growth rate of renal cell carcinoma. Int J Urol. 1995, 2: 71-76. 10.1111/j.1442-2042.1995.tb00427.x.PubMedView ArticleGoogle Scholar
- Riesco-Eizaguirre G, Santisteban P: New insights in thyroid follicular cell biology and its impact in thyroid cancer therapy. Endocr Relat Cancer. 2007, 14: 957-977. 10.1677/ERC-07-0085.PubMedView ArticleGoogle Scholar
- Schmid HP, McNeal JE, Stamey TA: Clinical observations on the doubling time of prostate cancer. Eur Urol. 1993, 23 (Suppl 2): 60-63.PubMedGoogle Scholar
- Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell. 2002, 13: 1977-2000. 10.1091/mbc.02-02-0030..PubMedPubMed CentralView ArticleGoogle Scholar
- Jaroudi S, SenGupta S: DNA repair in mammalian embryos. Mutat Res. 2007, 635: 53-77. 10.1016/j.mrrev.2006.09.002.PubMedView ArticleGoogle Scholar
- Karnoub AE, Weinberg RA: Chemokine networks and breast cancer metastasis. Breast Dis. 2007, 26: 75-85.Google Scholar
- Yusuf F, Rehimi R, Morosan-Puopolo G, Dai F, Zhang X, Brand-Saberi B: Inhibitors of CXCR4 affect the migration and fate of CXCR4+ progenitors in the developing limb of chick embryos. Dev Dyn. 2006, 235: 3007-3015. 10.1002/dvdy.20951.PubMedView ArticleGoogle Scholar
- Karnoub AE, Dash AB, Vo AP, Sullivan A, Brooks MW, Bell GW, Richardson AL, Polyak K, Tubo R, Weinberg RA: Mesenchymal stem cells within tumour stroma promote breast cancer metastasis. Nature. 2007, 449: 557-563. 10.1038/nature06188.PubMedView ArticleGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.PubMedPubMed CentralView ArticleGoogle Scholar
- Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, Trent JM, Staudt LM, Hudson J, Boguski MS, Lashkari D, Shalon D, Botstein D, Brown PO: The transcriptional program in the response of human fibroblasts to serum. Science. 1999, 283: 83-87. 10.1126/science.283.5398.83.PubMedView ArticleGoogle Scholar
- Yu P, Huang B, Shen M, Lau C, Chan E, Michel J, Xiong Y, Payan DG, Luo Y: p15(PAF), a novel PCNA associated factor with increased expression in tumor tissues. Oncogene. 2001, 20: 484-489. 10.1038/sj.onc.1204113.PubMedView ArticleGoogle Scholar
- Kufe D, Pollock R, Weichselbaum R, Bast R, Gansler T, Holland J, Frei E: Cancer Medicine. 2003, Hamilton, Canada: BC Decker Inc.Google Scholar
- Andäng M, Hjerling-Leffler J, Moliner A, Lundgren TK, Castelo-Branco G, Nanou E, Pozas E, Bryja V, Halliez S, Nishimaru H, Wilbertz J, Arenas E, Koltzenburg M, Charnay P, El Manira A, Ibañez CF, Ernfors P: Histone H2AX-dependent GABA(A) receptor regulation of stem cell proliferation. Nature. 2008, 451: 460-464. 10.1038/nature06488.PubMedView ArticleGoogle Scholar
- Lu J, Getz G, Miska EA, Alvarez-Saavedra E, Lamb J, Peck D, Sweet-Cordero A, Ebert BL, Mak RH, Ferrando AA, Downing JR, Jacks T, Horvitz HR, Golub TR: MicroRNA expression profiles classify human cancers. Nature. 2005, 435: 834-838. 10.1038/nature03702.PubMedView ArticleGoogle Scholar
- Potten CS, Barthel D, Li YQ, Ohlrich R, Matthé B, Loeffler M: Proliferation in murine epidermis after minor mechanical stimulation. Part 1. Sustained increase in keratinocyte production and migration. Cell Prolif. 2000, 33: 231-246. 10.1046/j.1365-2184.2000.00178.x.PubMedView ArticleGoogle Scholar
- Darby IA, Hewitson TD: Fibroblast differentiation in wound healing and fibrosis. Int Rev Cytol. 2007, 257: 143-179. 10.1016/S0074-7696(07)57004-X.PubMedView ArticleGoogle Scholar
- Reynolds LP, Killilea SD, Redmer DA: Angiogenesis in the female reproductive system. FASEB J. 1992, 6: 886-892.PubMedGoogle Scholar
- Otu HH, Naxerova K, Ho K, Can H, Nesbitt N, Libermann TA, Karp SJ: Restoration of liver mass after injury requires proliferative and not embryonic transcriptional patterns. J Biol Chem. 2007, 282: 11197-11204. 10.1074/jbc.M608441200.PubMedView ArticleGoogle Scholar
- Dvorak HF: Tumors: wounds that do not heal. Similarities between tumor stroma generation and wound healing. N Engl J Med. 1986, 315: 1650-1659.PubMedView ArticleGoogle Scholar
- Stearman RS, Dwyer-Nield L, Grady MC, Malkinson AM, Geraci MW: A macrophage gene expression signature defines a field effect in the lung tumor microenvironment. Cancer Res. 2008, 68: 34-43. 10.1158/0008-5472.CAN-07-0988.PubMedView ArticleGoogle Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264. 10.1093/biostatistics/4.2.249.PubMedView ArticleGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80-10.1186/gb-2004-5-10-r80.PubMedPubMed CentralView ArticleGoogle Scholar
- The R Project for Statistical Computing. [http://www.r-project.org/]
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.PubMedPubMed CentralView ArticleGoogle Scholar
- strucchange. [http://cran.r-project.org/web/packages/strucchange/]
- Identification of Genes Periodically Expressed in the Human Cell Cycle and Their Expression in Tumors. [http://genome-www.stanford.edu/Human-CellCycle/Hela/]
- Bioconductor. [http://www.bioconductor.org]
- C2 Gene Sets. [http://www.broad.mit.edu/gsea/msigdb/genesets.jsp?collection=C2]
- Macrobiology at the Children's Hospital Informatics Program. [http://www.chip.org/macrobiology/]
- Gene Expression Omnibus. [http://www.ncbi.nlm.nih.gov/geo/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.