The molecular portrait of in vitro growth by meta-analysis of gene-expression profiles
© Sandberg and Ernberg; licensee BioMed Central Ltd. 2005
Received: 27 January 2005
Accepted: 21 June 2005
Published: 27 July 2005
Cell lines as model systems of tumors and tissues are essential in molecular biology, although they only approximate the properties of in vivo cells in tissues. Cell lines have been selected under in vitro conditions for a long period of time, affecting many specific cellular pathways and processes.
To identify the transcriptional changes caused by long-term in vitro selection, we performed a gene-expression meta-analysis and compared 60 tumor cell lines (of nine tissue origins) to 135 human tissue and 176 tumor tissue samples. Using significance analysis of microarrays, we demonstrated that cell lines showed statistically significant differential expression of approximately 30% of the approximately 7,000 genes investigated compared to the tissues. Most of the differences were associated with the higher proliferation rate and the disrupted tissue organization in vitro. Thus, genes involved in cell-cycle progression, macromolecule processing and turnover, and energy metabolism were upregulated in cell lines, whereas cell adhesion molecules and membrane signaling proteins were downregulated.
Detailed molecular understanding of how cells adapt to the in vitro environment is important, as it will both increase our understanding of tissue organization and result in a refined molecular portrait of proliferation. It will further indicate when to use immortalized cell lines, or when it is necessary to instead use three-dimensional cultures, primary cell cultures or tissue biopsies.
How different are cells grown in vitro from cells that are part of a tissue? Human tissues and tumors are complex and heterogeneous as they are composed of different cell types that influence each other through paracrine signaling pathways and interactions with extracellular matrix (ECM). Cell lines, on the other hand, consist of more or less clonal cell populations that lack interactions with other cell types and interact with an artificial support such as plastic. Cell adaptation to in vitro microenvironments has probably involved recalibrations of many cellular pathways through genetic alterations, transcriptional alterations, different post-transcriptional regulation and changed signaling networks. Thus, the degree to which cell lines are representative of the specific cell types they were derived from varies [5, 6]. Furthermore, among cell lines established for in vitro growth there is an overwhelming bias for tumor-derived cells. It has been very hard to establish non-transformed cells for long-term in vitro growth. Detailed comparisons of the genotypic and phenotypic characteristics of in vitro grown cells with a panel of normal and tumor tissues may reveal how cell lines have adapted to in vitro environments. Moreover, comparisons of cell lines with both tumors and the normal tissues they were derived from are needed to assess how well they represent their tissue of origin and which of their features may have been acquired in vitro.
Analyses of mRNA expression levels using DNA microarrays have contributed to an increasingly detailed understanding of patterns of gene expression in different tissues [7, 8] and also how in vitro selection and adaptation affect basic cellular processes. So far, these studies have been focused on single cell types. Cell lines of colon, breast, lymphoma, leukemia, and lung origin have been compared to their corresponding in vivo malignancies. These studies have consistently demonstrated that different cell lines of the same tissue origin are more similar to each other than to the tumors they were derived from. From these gene-expression studies, it has also been repeatedly shown that genes associated with proliferation [2, 10, 11] and ribosomal activity are upregulated in cell lines. However, no study so far has addressed the issue of whether the same genes are perturbed by the in vitro environment in cell lines derived from tumors of different tissue origins, that is, if there may be an 'in vitro expression profile'.
Developing meta-analytical tools for comparing gene-expression data generated in different studies and laboratories is important. Some meta-analyses of gene-expression profiles of multiple tumors and normal tissues have been pursued, identifying common upregulated genes in neoplastic transformation and in relation to tumor differentiation status. Moreover, a collection of gene-expression data from different tumor types has been used to identify upregulated or repressed modules of genes with coherent expression profiles in specific tumors. In both these studies, gene-expression data were gathered from multiple platforms and laboratories, although the data were analyzed independently (that is, for each dataset separately). In the first study, the expression levels in each array were normalized independently to unit length (a median expression of zero and a standard deviation of one). In the second study, the mean expression level across the samples in each dataset was subtracted from each gene. Subsequently, genes that were consistently up- or downregulated could be identified in comparisons within multiple datasets.
In this study, we describe a cross-site approach to quantitatively integrate gene-expression profiles from three laboratories [15–17] comprising 60 cell lines and 311 tissue samples. We integrated gene-expression data from cell lines derived from tumors of nine different tissue-origins (NCI60 cell lines) with two large gene-expression datasets of human tissues and human tumors. All these studies used the same platform and array-type (Affymetrix Hu6800). Using a meta-analysis we defined the transcriptional changes observed in all cell lines compared to both normal and tumor tissues independent of tissue origin. The cell lines showed statistically significant differential expression of approximately 30% of the approximately 7,000 genes investigated. Among the upregulated genes we consistently found - not surprisingly - many genes involved in macromolecular turnover, cell-cycle progression, energy metabolism, and histone modifications. Adhesion molecules and membrane signaling proteins were enriched among the downregulated genes, a possible consequence of the disrupted tissue organization in vitro. The origin-independent transcriptional alterations defined in this study are probably the consequence of the in vitro adaptation and selection. As such, our data will be important to improve our understanding of the biological consequences of in vitro growth and thus how well cell lines correspond to the in vivo tissues and tumors.
Normalization of gene-expression profiles from multiple sources
[Table: Sources of gene-expression data. Columns: number of cell lines, number of normal tissue samples, number of tumor samples.]
Validation of the quantitative comparison across datasets
Therefore, we performed the identical analysis of dataset II (the validation dataset) comprising both cell lines and tissue samples within the same study. Using the identical SVD procedure, cell lines were again separated from tissues in their correlation with the first two eigenarrays (Figure 3b). This excluded the possibility that the cell line versus tissues distinction in dataset I was a technical artifact. Moreover, the separation of cell lines from tissue samples was captured by the first eigenarray in both datasets, demonstrating that this difference was the largest in the gene-expression data. Hierarchical clustering of the gene expression in datasets I and II was also found to repeatedly separate all cell lines from normal and tumor tissues (data not shown).
Identification of origin-independent transcriptional alterations in vitro
We next sought to estimate the number of genes that were specifically up- or downregulated in cell lines and responsible for the distinct separation of cell lines from tissue samples. We used significance analysis of microarrays (SAM) to identify the number of genes with statistically significant differential expression as a function of the false discovery rate (FDR). In dataset I, using conservative criteria, we identified 1,500 genes with an estimated FDR of zero, and 2,900 genes at an FDR of 1% (Figure 3c). For example, at an FDR of 1% only 29 false positives are estimated out of the 2,900 genes identified. In dataset II we identified 1,800 genes at an FDR of zero and 3,400 genes at an FDR of 1% (Figure 3d). In total, using an FDR of 1%, we identified 41% of the genes as differentially expressed between cell lines and tissues in dataset I and 29% in dataset II. To investigate the generality of our results, we examined whether the same genes were identified as up- or downregulated in cell lines in datasets I and II despite the sample and platform differences. Of the 2,000 most differentially expressed genes in dataset I, we found corresponding probe sets for 1,476 of the genes on the HGU95A arrays (635 upregulated and 841 downregulated genes) using a recently published map. We confirmed the upregulation of 399 genes (63% of the upregulated genes; p < 4e-70, Fisher's exact test) and the downregulation of 176 genes (21% of the downregulated genes; p < 1e-7, Fisher's exact test) in cell lines by identifying the intersection with the genes with statistically significant differential expression in dataset II (FDR of 1%). The list of genes found to be differentially expressed in both datasets is found in Additional data file 1. We also compared the score of differential expression for all genes in both datasets (Figure 3e). A correlation coefficient of 0.33 between the degree of differential expression in datasets I and II was observed, even though they were generated using two different Affymetrix arrays and the sample origins were diverse. Again, this demonstrated that the results obtained by comparing the cell lines to normal and tumor tissues in dataset I were not due to technical artifacts.
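The percentages quoted above follow directly from the reported counts. As a quick check, a minimal sketch using only the numbers stated in the text (the variable names are illustrative and not taken from the original analysis):

```python
# Counts quoted in the text; variable names are illustrative only.
genes_at_fdr_1pct = 2900                 # dataset I, FDR of 1%
fdr = 0.01

mapped_up, confirmed_up = 635, 399       # upregulated genes mapped to HGU95A / confirmed
mapped_down, confirmed_down = 841, 176   # downregulated genes mapped / confirmed

# Expected number of false positives at a given FDR.
expected_false_positives = fdr * genes_at_fdr_1pct

# Fraction of mapped genes confirmed in dataset II.
frac_up = confirmed_up / mapped_up
frac_down = confirmed_down / mapped_down

print(round(expected_false_positives), round(100 * frac_up), round(100 * frac_down))
```

The Fisher's exact test p-values are not recomputed here, because the background gene counts needed for the contingency tables are not given in this section.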
Classification of samples based upon the in vitro signature
[Table: Classification of cell lines and tissue samples across five datasets. Columns: number of cell lines, number of tissue samples. Rows include Dataset III, Dataset IV and Dataset V.]
Features of the in vitro gene-expression signature
The genes downregulated in cell lines and only expressed in subsets of tissues and tumors were likely to represent tissue-specific genes for which the expression was lost in cell lines (Figure 4a). Indeed, examples of tissue-specific genes that were downregulated in cell lines were identified for blood cells (Figure 4, cluster A, for example, PBXIP1, ISGF3 and IkB-alpha), brain tumors (Figure 4, cluster C and sample cluster 6, for example, CCND2 and APPBP2), renal biopsies (Figure 4, cluster E, for example, hMT-If) and brain normal and tumor biopsies (Figure 4, cluster F, for example, Protocadherin 2).
Leukemias (sample clusters 4 and 5 in Figure 4), lymphomas (sample cluster 3 in Figure 4), and germinal center cells (sample cluster 7 in Figure 4) had gene-expression profiles most similar to those of the cell lines. They had downregulated a large portion of the genes similarly downregulated in cell lines (Figure 4, cluster D). They also showed upregulation of genes associated with replication (cluster G, for example, TOPII, MCM2, MCM3 and MCM6) and metabolism (cluster H). Information on all genes present in Figure 4, along with their membership of the different subclusters, can be found in Additional data file 1. A high-resolution image of Figure 4 with all sample names and gene identifiers can be found in Additional data file 3.
Transcriptional alterations affect multiple biological processes
Biological process upregulated in vitro
Total number of genes
Ribosome biogenesis and assembly
Regulation of translation
tRNA aminoacylation for protein translation
Regulation of translational initiation
Transcription from Pol I promoter
RNA splicing, via transesterification reactions with bulged adenosine as nucleophile
RNA splicing, via transesterification reactions
Nuclear mRNA splicing, via spliceosome
Nucleobase, nucleoside, nucleotide and nucleic acid metabolism
Purine nucleotide metabolism
Purine nucleotide biosynthesis
Purine ribonucleotide metabolism
Purine ribonucleotide biosynthesis
Nucleoside triphosphate metabolism
Ribonucleoside triphosphate metabolism
Ribonucleoside triphosphate biosynthesis
Nucleoside triphosphate biosynthesis
Purine ribonucleoside triphosphate metabolism
Purine ribonucleoside triphosphate biosynthesis
Purine nucleoside triphosphate metabolism
Purine nucleoside triphosphate biosynthesis
Protein modification and degradation
Intracellular protein transport
Amino acid and derivative metabolism
Amino acid metabolism
Ubiquitin-dependent protein catabolism
Modification-dependent protein catabolism
Amino acid activation
Energy derivation by oxidation of organic compounds
Main pathways of carbohydrate metabolism
Coenzyme and prosthetic group metabolism
Coenzyme and prosthetic group biosynthesis
Tricarboxylic acid cycle
ATP synthesis coupled electron transport (sensu Eukarya)
ATP synthesis coupled electron transport
Cell organization and biogenesis
Mitotic cell cycle
Cytoplasm organization and biogenesis
DNA replication and chromosome cycle
Nuclear organization and biogenesis
S phase of mitotic cell cycle
Chromosome organization and biogenesis (sensu Eukarya)
Establishment and/or maintenance of chromatin architecture
M phase of mitotic cell cycle
DNA-dependent DNA replication
Microtubule cytoskeleton organization and biogenesis
G1/S transition of mitotic cell cycle
G2/M transition of mitotic cell cycle
M-phase specific microtubule process
DNA replication initiation
Mitotic spindle assembly
Pre-replicative complex formation and maintenance
Covalent chromatin modification
Response to endogenous stimulus
Response to DNA damage stimulus
Biological process downregulated in vitro
Total number of genes
Membrane signaling and cell adhesion
Cell surface receptor linked signal transduction
G-protein coupled receptor protein signaling pathway
Enzyme linked receptor protein signaling pathway
G-protein signaling, coupled to IP3 second messenger (phospholipase C activating)
Extracellular structure organization and biogenesis
Extracellular matrix organization and biogenesis
The use of immortalized cell lines as model systems of normal and pathological tissues is controversial [5, 26–28]. There are obvious general differences between the environment of cells growing in vitro and that of in vivo tissue cells, including oxidative pressure, nutrient accessibility, cell-cell contact and interactions with ECM, as well as in growth rate. These differences influence the gene expression and the phenotype of the cells grown in vitro. Many gene-expression studies have compared cell lines derived from a specific tumor tissue with the corresponding tumor tissues and primary cultures [2, 10, 12, 29]. These studies are important to assess how cell-line model systems have maintained the gene expression of their tumor origins, that is, their tissue identities. We have previously developed a method to assess how gene expression in individual cell lines relates to tumors of different tissue origins. It is, however, of equal importance to pinpoint the cellular processes affected by long-term in vitro growth irrespective of tissue origin. Therefore we have performed a comprehensive analysis of gene-expression profiles of 60 cell lines and 311 samples from multiple tissue origins. The analyses showed that approximately 30% of the genes investigated were differentially expressed in immortalized cell lines.
We used GO to characterize the cellular processes that were transcriptionally altered in cell lines. This analysis identified the common biological processes that were transcriptionally altered in rapidly dividing cells, that is, a molecular portrait of proliferation. In support of previous findings [2, 10], these data confirmed an upregulation of genes involved in translation, cell-cycle regulation and DNA replication. In addition, this comparison identified many other cellular processes that were upregulated (Table 2). Genes involved in energy metabolism, nucleotide metabolism, splicing, protein modifications and degradation, and chromatin regulation were enriched among the upregulated genes in vitro. As expected, many of the upregulated genes seem to be directly involved in cell division. For example, the maintenance methylation enzyme, DNA methyltransferase 1 (DNMT1), was consistently upregulated in the rapidly dividing cell lines. DNMT1 methylates newly synthesized DNA and is directly involved in the DNA replication process. The de novo DNA methylation enzymes DNMT3A and DNMT3B were not, however, upregulated in cell lines. Therefore, it is tempting to speculate that the list of upregulated genes is enriched in genes directly involved in the essential cellular processes for rapidly dividing cells (for example, DNA replication). The gene list might therefore be used to predict which cellular factors are general and which factors have more specialized regulatory roles. Certain histone-modifying proteins (HDAC1, EZH2, and HP1 beta and gamma subunits) were upregulated in cell lines whereas others were not. Could these factors also be directly involved in DNA replication?
Among the genes downregulated in vitro we detected many involved in cell communication, membrane signaling, and adhesion to ECM. A downregulation of genes involved in ECM interactions was previously found in a serial analysis of gene expression (SAGE) study. Our results confirm this observation. We further demonstrate that additional membrane signaling proteins, working downstream of G-protein-coupled receptors, were downregulated in vitro. The downregulation of many proteins involved in membrane signaling, cell-cell communication and adhesion to ECM probably reflects the altered environment for cells growing in vitro in defined cell-culture media, in contrast to the organization of cells in tissues [6, 26, 27]. Indeed, when transplanting tumor cell lines into immunodeficient mice and analyzing the resulting tumors, genes involved in ECM and cell adhesion were again upregulated. The gene-expression comparison presented in this study could also be used for detailed characterization of particular pathways to identify which are up- or downregulated as part of the cell-line adaptation to in vitro conditions.
This study compared immortalized cell lines to solid tumors of diverse origins. Tissues are complex, heterogeneous mixtures of cell types, whereas cell lines contain just one more-or-less clonal cell type, selected for its ability to grow under in vitro conditions. It is likely that the expression of genes in tumor-derived cell lines is more similar to that in the malignant cells within the tumor tissue. Thus the in vitro signature is a combined effect of in vitro adaptation and selection for subtypes of cells from the tissue. Although at present it would be methodologically very hard to establish the contribution from either of these two phenomena, some general remarks can be made. Genes more highly expressed in the malignant cell would appear upregulated in cell lines as a result of the enrichment of this cell in culture. Because the tumor samples contained at least 50% malignant cells (usually more, see Materials and methods) this 'enrichment effect' could never result in an artificial fold-change of more than 2. In our data, 344 genes (dataset I) and 1,159 genes (dataset II) were upregulated in cell lines with a fold-change exceeding 2. It is therefore impossible that the enrichment effect explains the major part of the observed upregulation of genes in vitro. It could only bias the numbers to a limited extent. On the other hand, the degree of infiltration of stromal cells varies between different solid tumors. There is a possibility that genes upregulated in stromal cells appear downregulated in cell lines as a result of the lack of these cells in culture. This dilution effect could potentially result in an apparent downregulation in cell lines of genes with a fold-change value exceeding 2. For a gene to appear downregulated by more than twofold in cell lines, however, its expression must be more than sixfold higher in a stromal compartment comprising 20% of the cells in the tumor.
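The bounds quoted above can be reproduced with a simple two-compartment mixture model. The sketch below is an assumption made for illustration (a linear mixture of malignant and stromal expression weighted by cell fraction); it is not necessarily the calculation given in Additional data file 5, and the function name is hypothetical.

```python
def tumor_to_cell_line_ratio(malignant_fraction, stromal_over_malignant):
    """Bulk tumor expression relative to a pure malignant-cell culture.

    Assumes a linear mixture: bulk = f*m + (1 - f)*s, where m and s are the
    per-cell expression levels in malignant and stromal cells, and the cell
    line expresses the gene at level m. Returns bulk / m.
    """
    f = malignant_fraction
    return f + (1.0 - f) * stromal_over_malignant


# Enrichment effect: the gene is silent in stroma (s = 0) and the tumor is
# 50% malignant, so the cell line can appear at most twofold upregulated.
max_apparent_upregulation = 1.0 / tumor_to_cell_line_ratio(0.5, 0.0)

# Dilution effect: stroma is 20% of the cells and expresses the gene sixfold
# higher than malignant cells; the tumor then appears twofold higher than
# the cell line (0.8 + 0.2 * 6 = 2).
apparent_downregulation = tumor_to_cell_line_ratio(0.8, 6.0)

print(max_apparent_upregulation, apparent_downregulation)
```

Under this model a higher malignant fraction only tightens the enrichment bound (1/f falls below 2 when f exceeds 0.5), which is consistent with the statement that the effect could bias the numbers only to a limited extent.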
One extreme, but interesting, possibility would be that the cells growing in vitro are derived from a putative 'cancer stem cell'. In that case the enrichment effect could be profound, and the observed expression signature would then be a combination of the in vitro adaptation and selection for a common cancer stem cell signature. These intriguing issues might be resolved using laser-capture microdissection on specific subpopulations of cells within the tumor for cases where reliable stem-cell markers can be established or applying tissue modeling in in vitro three-dimensional culture systems [26, 27]. It must be emphasized, however, that the tumor tissue phenotype is very much dictated by the interplay between different cell types, which is decisively interrupted by growth in vitro [28, 33]. The interplay between malignant cells and stroma can be dissected using xenografts. In a recent study, human cell lines were injected into mice and the effect of stromal components on the gene expression of the malignant cell was specifically investigated. Finally, it is of fundamental importance to pinpoint the common transcriptional differences and similarities of these cell lines to their tissues of origin irrespective of their causes, as in our study. These cell lines are routinely used as model systems of tumors and normal tissues. Therefore the nature and volume of effects related to in vitro culture are profoundly relevant.
It would be interesting to investigate the temporal aspects of the establishment of the in vitro signature. In a recent study 6- and 24-hour primary cultures of hepatocytes were compared to liver tissues. Not surprisingly, it was found that the gene-expression profiles separated gradually with time. However, the genes reported to be upregulated at 6 and 24 hours are not the same as the ones that were found to be universally upregulated in our tumor-derived cell lines, indicating the need for a longer period of time before the in vitro signature becomes established. Other studies have identified higher expression of a limited set of proliferation-associated genes in immortalized cancer cell lines when compared with primary cultures [10, 29]. Therefore, it is likely that the extensive differential expression observed in this study occurs as a result of long-term in vitro selection and adaptation.
This study also introduced a fruitful cross-site approach for quantitative comparison of gene-expression data from different laboratories. The growing wealth of gene-expression data available in public databases offers great opportunities for computational experiments. It must, however, be emphasized that a successful comparison of gene-expression data from different laboratories depends on the quality of the data and similarities in the experimental protocols used. Therefore, careful quality controls and validations of gene-expression comparisons must always be performed. If available, raw data files (that is, CEL files) would enable additional quality controls (such as checking the image for hybridization scratches) and the use of different methods to estimate transcript levels. We developed a quality-control procedure by examining the scaling factors, correlation between similar samples, SVD, and an independent validation dataset. This approach was successful in the analysis of gene-expression data from three different laboratories (using the same Affymetrix Hu6800 platform). Thus, quantitative comparisons of gene-expression data from different sites may be feasible.
This cross-site comparison of gene expression in cell lines, normal tissues, and tumor tissues revealed a distinct in vitro gene-expression signature. This signature deserves attention as a biological phenomenon in itself, as it can teach us about the impressive consequences of in vitro selection and adaptation, with implications for tissue organization and future tissue engineering in vitro.
Materials and methods
We compiled gene-expression data on cell lines, normal, and tumor samples from three different studies [15–17] that all used Affymetrix Hu6800 arrays. The National Cancer Institute NCI60 cell-line gene-expression data were downloaded from Cancer Program Data Sets. The tab-delimited text file (NCI60_aug99_resfile.txt) contained scaled expression data together with 'absolute calls' (absent, present and marginal). The 60 cell lines came from the following tissues: lung (n = 9), colon (n = 7), breast (n = 8), ovary (n = 6), leukemia (n = 6), renal (n = 8), melanoma (n = 8), prostate (n = 2), nervous system (n = 6). Gene-expression data for 59 human tissue samples were downloaded from Human Gene Expression Index in an already normalized format and represented the following samples: blood (n = 1), brain (n = 11), breast (n = 2), colon (n = 1), cervix (n = 1), endometrium (n = 2), esophagus (n = 1), kidney (n = 6), liver (n = 6), lung (n = 6), muscle (n = 6), myometrium (n = 2), ovary (n = 2), placenta (n = 2), prostate (n = 4), spleen (n = 1), stomach (n = 1), testis (n = 1), vulva (n = 3). Gene-expression profiles of 60 normal and 189 tumor samples from 14 different tissue origins were downloaded as raw (unscaled) gene-expression data (GCM_Total.res) from Cancer Program Data Sets. Tumor tissue origins were: breast, prostate, lung, colon, lymphoma, melanoma, bladder, uterus, leukemia, kidney, ovary, mesothelioma, and central nervous system. Normal samples were from the following tissues: breast, prostate, lung, colon, germinal center, bladder, uterus, peripheral blood, kidney, pancreas, ovary and central nervous system. All tumors were biopsy specimens from primary sites, obtained before any treatment and enriched for at least 50% malignant cells. Further details are given in the original publication of this dataset.
An independent validation dataset (dataset II) that contained both in vivo samples (n = 70) and cell lines (n = 25) hybridized to Affymetrix HGU95A arrays was downloaded from the Gene Expression Atlas. The gene-expression data had previously been scaled using the GeneChip Global Scaling algorithm to a target intensity of 200.
Three datasets were used to assess our ability to classify samples into either cell lines or tissues. Dataset III comprised 10 cell lines and 123 tissue samples. Genes were matched between U133A and HGU95A on the basis of best-match spreadsheets from Affymetrix NetAffx. Dataset IV comprised 15 cell lines and 64 tumors (mostly lymphomas). Dataset V comprised 10 cell lines and 81 lung tumors and normal biopsies, and we used UniGene identifiers to map their genes to our Affymetrix array identifiers. Only a limited number of genes (n = 36) of the 576 had a UniGene match. Nevertheless, using only 36 genes most samples were correctly classified as cell lines or tissues. The HUVEC cells of unknown passage from dataset II and FACS-purified cells were excluded from this classification of cell lines and tissues.
To compare the gene-expression data generated in different laboratories we rescaled each sample to equal global chip intensity. The global scaling factor was calculated from the positive average difference values, excluding the top and bottom 2% of average difference values. A reference sample (lung-derived cell line: NSCLC_H460) was chosen on the basis of its average percent present and its average global chip intensity before rescaling. All other samples were rescaled to the same average chip intensity as the reference sample. We thereafter 'thresholded' the data using a ceiling of 16,000 units and a floor of 20 units.
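The rescaling and thresholding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the function names are hypothetical, the 2% trimming is applied to the positive values after non-positive values are dropped (the text does not specify the order), and the example intensities are made up.

```python
def trimmed_mean_positive(values, trim=0.02):
    """Mean of the positive values after dropping the top and bottom `trim` fraction."""
    positives = sorted(v for v in values if v > 0)
    k = int(len(positives) * trim)
    kept = positives[k:len(positives) - k] if k else positives
    return sum(kept) / len(kept)


def rescale_to_reference(sample, reference, floor=20.0, ceiling=16000.0):
    """Scale `sample` to the reference chip's trimmed mean intensity, then threshold."""
    factor = trimmed_mean_positive(reference) / trimmed_mean_positive(sample)
    return [min(max(v * factor, floor), ceiling) for v in sample]


# Made-up intensities: the sample is the reference at half intensity,
# plus one negative average difference value.
reference = [100.0 * i for i in range(1, 101)]
sample = [-5.0] + [0.5 * v for v in reference]
rescaled = rescale_to_reference(sample, reference)
print(rescaled[:3])
```

After rescaling, the negative value is raised to the floor of 20 and the remaining values match the reference intensities.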
Singular value decomposition
Singular value decomposition (SVD) is a standard method in linear algebra and the mathematical details of SVD for gene-expression analysis have been described in detail elsewhere [19–21]. In brief, a gene-expression matrix (with rows of genes and columns of arrays) is decomposed by SVD into the product of three matrices, U S V^T. The left singular vectors (hereafter called eigenarrays) are the columns of matrix U, the diagonal entries of S are the singular values, and the rows of V^T are the right singular vectors. We projected the gene-expression pattern of each sample into a two-dimensional SVD subspace by measuring the correlation between the gene expression of each sample and the first two eigenarrays. Before SVD calculation we pre-processed the expression data for each gene independently to an average expression level of zero and a standard deviation of one. We used the SVD implementation in Numerical Python (version 23.1) for Python 2.3.3.
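The projection described above can be sketched with current NumPy (the original analysis used Numerical Python 23.1, whose interface differs). The function name and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def project_on_eigenarrays(expr, n_components=2):
    """Correlate each array with the first eigenarrays.

    expr: genes x arrays matrix. Returns an arrays x n_components matrix of
    Pearson correlations between each array's standardized profile and the
    leading left singular vectors (columns of U).
    """
    # Standardize each gene (row) to mean 0 and standard deviation 1.
    z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    corr = np.empty((z.shape[1], n_components))
    for j in range(z.shape[1]):
        for k in range(n_components):
            corr[j, k] = np.corrcoef(z[:, j], u[:, k])[0, 1]
    return corr


# Synthetic example: 200 genes, 10 arrays; the first 5 arrays have higher
# expression of the first 100 genes, mimicking two sample groups.
rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 10))
expr[:100, :5] += 3.0
corr = project_on_eigenarrays(expr)
print(np.round(corr[:, 0], 2))
```

In this toy setting the two groups of arrays fall on opposite sides of the first eigenarray. The sign of a singular vector is arbitrary, so only the separation, not its direction, is meaningful.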
Significance analysis of microarrays
We used the significance analysis of microarrays (SAM) method, available as an Excel add-in (version 1.21), to identify the number of differentially expressed genes as a function of the false discovery rate (FDR). We identified statistically significant genes at an estimated FDR of zero and 1% (based on 1,000 permutations), using a fold-change cutoff of 1.5.
Classification of gene-expression profiles
We used the genes identified as differentially expressed in datasets I and II (n = 576) to assess whether we could classify samples in five different datasets into either 'cell lines' or 'tissues'. Datasets I and II correspond to the datasets detailed above (Table 1) and were used as initial controls. Before calculation we pre-processed the expression data for each gene independently to an average expression level of zero and a standard deviation of one for each dataset separately. For each dataset, we then calculated the mean gene-expression levels for each gene independently across all cell lines and tissues, respectively. The average cell line expression profile and tissue profile within each dataset were referred to as the 'cell line centroid' and 'tissue centroid'. Then we calculated the Euclidean distance (De) between each sample and the cell line centroid and tissue centroid, respectively. We integrated the two distances into a simple score by subtracting the Euclidean distance to the cell line centroid from the Euclidean distance to the tissue centroid. Thus, samples that resemble cell lines more than tissues would have a short Euclidean distance to the cell line centroid and a longer distance to the tissue centroid and therefore get a positive score. For all datasets a bimodal distribution of scores was observed (see Additional data file 2 for the distributions of scores for samples in the five datasets). We defined a threshold for each dataset that gave equal numbers of false positives and false negatives. Then all scores above the threshold were classified as 'cell line' and all scores below the threshold as 'tissue'. The performance of the classification was reported as the accuracy, that is, the sum of the true positives and true negatives divided by the total number of predictions for each dataset.
We used GoMiner to analyze the lists of up- and downregulated genes for GO categories that were statistically significantly over-represented. We used the second-generation GoMiner program, which first estimates the p-value using Fisher's exact test and then corrects the p-values for multiple comparisons by estimating the FDR. We reported only GO categories that had corrected p-values of less than 0.05.
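The centroid-based score described above can be sketched as follows. The function names and the toy profiles are hypothetical, the input is assumed to be already standardized per gene, and a fixed threshold of zero stands in for the per-dataset threshold that balances false positives and false negatives.

```python
import math


def centroid(profiles):
    """Gene-wise mean of a list of expression profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def in_vitro_scores(samples, is_cell_line):
    """Score = distance to tissue centroid minus distance to cell line centroid."""
    cell_centroid = centroid([s for s, c in zip(samples, is_cell_line) if c])
    tissue_centroid = centroid([s for s, c in zip(samples, is_cell_line) if not c])
    return [euclidean(s, tissue_centroid) - euclidean(s, cell_centroid) for s in samples]


def accuracy(scores, is_cell_line, threshold):
    """(True positives + true negatives) / total predictions."""
    correct = sum((score > threshold) == label for score, label in zip(scores, is_cell_line))
    return correct / len(scores)


# Toy, already standardized profiles over three genes (made-up values).
samples = [[1.0, 1.2, -0.9], [0.8, 1.0, -1.1], [-1.0, -0.9, 1.0], [-0.8, -1.3, 1.0]]
labels = [True, True, False, False]
scores = in_vitro_scores(samples, labels)
print(scores, accuracy(scores, labels, threshold=0.0))
```

Samples closer to the cell line centroid receive positive scores, matching the sign convention in the text.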
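To illustrate the two statistical steps named above, the sketch below computes a one-sided Fisher's exact test p-value for over-representation (the upper tail of the hypergeometric distribution) and then adjusts a list of p-values. The adjustment shown is the Benjamini-Hochberg procedure, used here purely as a stand-in; GoMiner's own FDR estimate may be computed differently. All function names and numbers are illustrative.

```python
from math import comb


def fisher_overrepresentation_p(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    k: genes in the list annotated to the category
    n: size of the gene list
    K: genes in the background annotated to the category
    N: size of the background
    Returns P(X >= k) under the hypergeometric distribution.
    """
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up example: 3 of 3 listed genes fall in a category of 3 genes out of 10.
p = fisher_overrepresentation_p(3, 3, 3, 10)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
print(p, adjusted)
```

Categories would then be reported when their adjusted value falls below the chosen cutoff (0.05 in the text).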
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 lists the genes found to be differentially expressed in cell lines versus tissues in both datasets, with corresponding gene names, probe identifiers, SAM d scores and fold-change values. The order of the genes in this table is identical to Figure 4. Additional data file 2 contains a figure with a graph of the distribution of scores for all samples in the five different datasets respectively. Additional data file 3 is a high-resolution image of Figure 4 in which all sample names and gene identifiers can be found. Additional data file 4 lists the dataset-specific GO categories downregulated in only cell lines from dataset I. These categories were mainly of immunological processes and are listed with corresponding statistics and GO identifiers. Additional data file 5 describes the calculations used in the discussion to estimate cell composition effects on gene-expression comparisons.
We thank Alexey Kutsenko, Ola Larsson and Ebba Brakenhielm for comments on earlier versions of the manuscript and members of the Ernberg lab for fruitful discussions. The research was funded by the Swedish Knowledge Foundation, the Swedish Cancer Society and the Swedish Research Council.
- Roschke AV, Tonon G, Gehlhaus KS, McTyre N, Bussey KJ, Lababidi S, Scudiero DA, Weinstein JN, Kirsch IR: Karyotypic complexity of the NCI-60 drug-screening panel. Cancer Res. 2003, 63: 8634-8647.
- Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, Van de Rijn M, Waltham M, et al: Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet. 2000, 24: 227-235. 10.1038/73432.
- Johnson JM, Castle J, Garrett-Engele P, Kan Z, Loerch PM, Armour CD, Santos R, Schadt EE, Stoughton R, Shoemaker DD: Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science. 2003, 302: 2141-2144. 10.1126/science.1090100.
- Irish JM, Hovland R, Krutzik PO, Perez OD, Bruserud O, Gjertsen BT, Nolan GP: Single cell profiling of potentiated phospho-protein networks in cancer cells. Cell. 2004, 118: 217-228. 10.1016/j.cell.2004.06.028.
- Masters JR: Human cancer cell lines: fact and fantasy. Nat Rev Mol Cell Biol. 2000, 1: 233-236. 10.1038/35043102.
- Jacks T, Weinberg RA: Taking the study of cancer cell survival to a new dimension. Cell. 2002, 111: 923-925. 10.1016/S0092-8674(02)01229-1.
- Sandberg R, Yasuda R, Pankratz DG, Carter TA, Del Rio JA, Wodicka L, Mayford M, Lockhart DJ, Barlow C: Regional and strain-specific gene expression mapping in the adult mouse brain. Proc Natl Acad Sci USA. 2000, 97: 11038-11043. 10.1073/pnas.97.20.11038.
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA. 2004, 101: 6062-6067. 10.1073/pnas.0400782101.
- Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA. 1999, 96: 6745-6750. 10.1073/pnas.96.12.6745.
- Perou CM, Jeffrey SS, van de Rijn M, Rees CA, Eisen MB, Ross DT, Pergamenschikov A, Williams CF, Zhu SX, Lee JC, et al: Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci USA. 1999, 96: 9212-9217. 10.1073/pnas.96.16.9212.
- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, et al: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403: 503-511. 10.1038/35000501.
- Virtanen C, Ishikawa Y, Honjoh D, Kimura M, Shimane M, Miyoshi T, Nomura H, Jones MH: Integrated classification of lung tumors and cell lines by expression profiling. Proc Natl Acad Sci USA. 2002, 99: 12357-12362. 10.1073/pnas.192240599.
- Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA. 2004, 101: 9309-9314. 10.1073/pnas.0401994101.
- Segal E, Friedman N, Koller D, Regev A: A module map showing conditional activity of expression modules in cancer. Nat Genet. 2004, 36: 1090-1098.
- Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, et al: Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci USA. 2001, 98: 10787-10792. 10.1073/pnas.191368598.
- Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, et al: Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA. 2001, 98: 15149-15154. 10.1073/pnas.211566398.
- Hsiao LL, Dangond F, Yoshida T, Hong R, Jensen RV, Misra J, Dillon W, Lee KF, Clark KE, Haverty P, et al: A compendium of gene expression in normal human tissues. Physiol Genomics. 2001, 7: 97-104.
- Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, Wiltshire T, Orth AP, Vega RG, Sapinoso LM, Moqrich A, et al: Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci USA. 2002, 99: 4465-4470. 10.1073/pnas.012025199.
- Alter O, Brown PO, Botstein D: Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA. 2000, 97: 10101-10106. 10.1073/pnas.97.18.10101.
- Holter NS, Mitra M, Maritan A, Cieplak M, Banavar JR, Fedoroff NV: Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc Natl Acad Sci USA. 2000, 97: 8409-8414. 10.1073/pnas.150242097.
- Wall ME, Rechtsteiner A, Rocha LM: Chapter 5: Singular value decomposition and principal component analysis. A Practical Approach to Microarray Data Analysis. Edited by: Berrar DP, Dubitzky W, Granzow M. 2003, Norwell, MA: Kluwer.
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.
- Ramaswamy S, Ross KN, Lander ES, Golub TR: A molecular signature of metastasis in primary solid tumors. Nat Genet. 2003, 33: 49-54. 10.1038/ng1060.
- Klein U, Tu Y, Stolovitzky GA, Mattioli M, Cattoretti G, Husson H, Freedman A, Inghirami G, Cro L, Baldini L, et al: Gene expression profiling of B cell chronic lymphocytic leukemia reveals a homogeneous phenotype related to memory B cells. J Exp Med. 2001, 194: 1625-1638. 10.1084/jem.194.11.1625.
- Zeeberg BR, Qin H, Narasimhan S, Sunshine M, Cao H, Kane DW, Reimers M, Stephens R, Bryant D, Burt SK, et al: High-Throughput GoMiner, an 'industrial-strength' integrative Gene Ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID). BMC Bioinformatics. 2005, 6: 168. 10.1186/1471-2105-6-168.
- Zhang S: Beyond the Petri dish. Nat Biotechnol. 2004, 22: 151-152. 10.1038/nbt0204-151.
- Cukierman E, Pankov R, Stevens DR, Yamada KM: Taking cell-matrix adhesions to the third dimension. Science. 2001, 294: 1708-1712. 10.1126/science.1064829.
- Hanahan D, Weinberg RA: The hallmarks of cancer. Cell. 2000, 100: 57-70. 10.1016/S0092-8674(00)81683-9.
- Dairkee SH, Ji Y, Ben Y, Moore DH, Meng Z, Jeffrey SS: A molecular 'signature' of primary breast cancer cultures; patterns resembling tumor tissue. BMC Genomics. 2004, 5: 47. 10.1186/1471-2164-5-47.
- Sandberg R, Ernberg I: Assessment of tumor characteristic gene expression in cell lines using a tissue similarity index (TSI). Proc Natl Acad Sci USA. 2005, 102: 2052-2057. 10.1073/pnas.0408105102.
- Stein WD, Litman T, Fojo T, Bates SE: A serial analysis of gene expression (SAGE) database analysis of chemosensitivity: comparing solid tumors with cell lines and comparing solid tumors from different tissue origins. Cancer Res. 2004, 64: 2805-2816.
- Creighton C, Kuick R, Misek DE, Rickman DS, Brichory FM, Rouillard JM, Omenn GS, Hanash S: Profiling of pathway-specific changes in gene expression following growth of human cancer cell lines transplanted into mice. Genome Biol. 2003, 4: R46. 10.1186/gb-2003-4-7-r46.
- Mueller MM, Fusenig NE: Friends or foes - bipolar effects of the tumour stroma in cancer. Nat Rev Cancer. 2004, 4: 839-849. 10.1038/nrc1477.
- Pardal R, Clarke MF, Morrison SJ: Applying the principles of stem-cell biology to cancer. Nat Rev Cancer. 2003, 3: 895-902. 10.1038/nrc1232.
- Emmert-Buck MR, Bonner RF, Smith PD, Chuaqui RF, Zhuang Z, Goldstein SR, Weiss RA, Liotta LA: Laser capture microdissection. Science. 1996, 274: 998-1001. 10.1126/science.274.5289.998.
- Boess F, Kamber M, Romer S, Gasser R, Muller D, Albertini S, Suter L: Gene expression in two hepatic cell lines, cultured primary hepatocytes, and liver slices compared to the in vivo liver gene expression in rats: possible implications for toxicogenomics use of in vitro systems. Toxicol Sci. 2003, 73: 386-402. 10.1093/toxsci/kfg064.
- Chu TM, Deng S, Wolfinger R, Paules RS, Hamadeh HK: Cross-site comparison of gene expression data reveals high similarity. Environ Health Perspect. 2004, 112: 449-455.
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003, 31: e15. 10.1093/nar/gng015.
- Cancer Program Data Sets. [http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi]
- Human Gene Expression Index. [http://www.hugeindex.org]
- Gene Expression Atlas. [http://expression.gnf.org]
- Affymetrix NetAffx. [http://www.affymetrix.com/analysis/index.affx]
- Sturn A, Quackenbush J, Trajanoski Z: Genesis: cluster analysis of microarray data. Bioinformatics. 2002, 18: 207-208. 10.1093/bioinformatics/18.1.207.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.