GeneCount: genome-wide calculation of absolute tumor DNA copy numbers from array comparative genomic hybridization data
© Lyng et al.; licensee BioMed Central Ltd. 2008
Received: 18 January 2008
Accepted: 23 May 2008
Published: 23 May 2008
Absolute tumor DNA copy numbers can currently be achieved only on a single gene basis by using fluorescence in situ hybridization (FISH). We present GeneCount, a method for genome-wide calculation of absolute copy numbers from clinical array comparative genomic hybridization data. The tumor cell fraction is reliably estimated in the model. Data consistent with FISH results are achieved. We demonstrate significant improvements over existing methods for exploring gene dosages and intratumor copy number heterogeneity in cancers.
Array comparative genomic hybridization (aCGH) is widely used for genome-wide mapping of DNA copy number changes in malignant cells [1, 2]. Genetic gains and losses impact gene expression levels, and thereby promote tumor growth and progression [3–5]. Numerous clinical studies have been performed to find tumor characteristics and to classify patients with respect to their prognosis based on the copy number changes [6, 7]. The usefulness of the aCGH data is limited, however, because only relative and not absolute copy numbers are achieved, making the interpretation of the data and comparisons across experiments difficult. Absolute DNA copy numbers can be obtained only on a single gene basis by the use of fluorescence in situ hybridization (FISH). Development of genome-wide methods for this purpose would enable generation of universal gene copy number databases of individual diseases that could be utilized more widely, as is the goal of several public repositories like the Mitelman Database of Chromosome Aberrations in Cancer .
The relative values achieved in aCGH experiments are influenced by the total DNA content (ploidy) of the tumor cells, the proportion of normal cells in the sample, and the experimental bias, in addition to the DNA copy numbers. The values are presented as intensity ratios between tumor and normal DNA . The data are normalized so that the ratio of 1.0 is the baseline for the analysis, and corresponds to two DNA copies in near diploid (2n) tumors. The copy number changes are identified from the ratios deviating from the baseline, using statistical methods for ratio smoothing and breakpoint detection [9–12]. To assign an absolute copy number to each ratio level identified by the statistical analysis and thereby score genetic aberrations are, however, challenging. In aneuploid tumors with gross alterations in the DNA content, the baseline represents a copy number other than 2, like 3 or 4 in tri- or tetraploid tumors, or a non-integer value when the DNA content differs from n, 2n, 3n, ... mn . The presence of normal cells within the sample and experimental bias reduce the ratio dynamics. Moreover, in many tumors, several subpopulations of malignant cells with different genetic characteristics exist, leading to intratumor heterogeneity in the DNA copy numbers [14–16] and increased complexity in the data. Unreliable results occur, therefore, when common ratio levels are used to score gains and losses in tumors with different ploidy and normal cell content.
The confounding effect caused by normal cells within tumor samples is recognized as a problem in aCGH analyses and has been handled by excluding low purity samples [17, 18] or correcting the ratio levels based on histological examination of tumor sections . The latter approach is not satisfactory because only the proportion of connective tissue surrounding the tumor parenchyma, and not the infiltrating immune cells, is precisely quantified. Moreover, the measurements cannot be performed on exactly the same tissue as used in the aCGH experiment and may, therefore, not be representative. A model including the CGH ratios, ploidy, and experimental bias has been proposed for estimation of absolute DNA copy numbers in tumor cell lines . To our knowledge, no method exists that also considers the normal cell content and, thus, is suited for analyses of clinical tumor samples.
We here present a new model, GeneCount, where the proportion of normal cells is estimated and corrected for and possible intratumor heterogeneity in DNA copy numbers is considered. Inputs to our model are the DNA index (DI, where DI = 1/2·tumor ploidy), tumor cell fraction, experimental bias, and aCGH ratios. Predetermined measures of tumor ploidy, determined either by flow or image based cytometry, are needed. The tumor cell fraction can be determined by, for example, flow cytometry on the same part of the sample as used in the aCGH experiment. In cases of unknown normal cell content, the tumor cell fraction is estimated in the model. The experimental bias is determined from the X-chromosome ratio in aCGH experiments where male and female DNA is compared. Smoothed ratio levels from any existing statistical analysis tools for breakpoint detection can be used.
We show that the model enabled automatic and genome-wide calculation of DNA copy numbers from aCGH data of both hematopoietic and solid tumors. The feasibility of GeneCount was demonstrated by analysis of 94 lymphomas, for which the DNA index and tumor cell fraction had been determined by use of flow cytometry and an extensive exploration of DNA copy numbers had been performed by the use of FISH in previous studies [20–25]. The GeneCount results, both based on the pre-determined tumor cell fraction and that determined by the model, were compared with the FISH data of 362 genes with and without gains and losses, showing 97% consistency in both cases. In particular, we explored the copy numbers achieved in the t(14;18) translocated chromosomal region involving BCL2. We further demonstrated the potential of GeneCount in analysis of solid tumors without pre-determined tumor cell fractions by relating the copy number of selected genes in 93 cervical cancers to gene expression and treatment outcome. By use of GeneCount we obtained a higher sensitivity in detecting cervix tumors with copy number changes than was obtained in analysis based directly on the ratio levels. Finally, we identified intratumor heterogeneity of DNA copy numbers in the lymphomas and cervical cancers, and showed how this information could be used to draw conclusions about the evolution of the genetic aberrations in the tumors. GeneCount was implemented in a software package to be used downstream of statistical methods for breakpoint detection, and results based on both the GLAD and CGH-Explorer packages are presented [9, 11]. We supply our method through the open-source and free web-based database BioArray Software Environment (BASE) .
Basis of GeneCount
A further reduction in the ratio dynamics occurs due to experimental bias (Equation 4). The bias, as represented by the dynamic factor, q, can be determined from control experiments, where normal DNA from males and females is cohybridized (Figure 1e). Theoretically, the X-chromosome ratio is 0.5 (-1 on a log2 scale), but the experimental bias reduces the ratio dynamics, leading to a ratio level closer to zero. The absolute value of the log2-transformed ratio level was used as a measure of q (Figure 1e). This value differed little among the slide series used here, ranging from 0.75-0.85 with a mean ± standard deviation of 0.80 ± 0.04 based on 8 control experiments. A q-value of 0.8 and range of 0.7-0.9 was used in the GeneCount calculations in the cases of known and unknown tumor cell fraction, respectively.
To enable automatic calculation of the copy number associated with each array probe, we implemented GeneCount in a program to be run on top of statistical analysis packages for aCGH ratio smoothing and breakpoint detection (Additional data file 1). A separate algorithm was developed for samples with unknown tumor cell fraction, where the fraction was estimated based on two ratio levels and DI (panel B in Additional data file 1), as described in Materials and methods. One decimal was included in the calculated DNA copy numbers when evaluating the results in comparison with FISH data. Otherwise, the numbers were rounded off to the nearest integer values.
GeneCount copy numbers in comparison with FISH data
We compared the GeneCount results of 94 lymphomas with previously published FISH data from the same tumors [20–25]. The FISH probes were located at chromosomal regions with frequent copy number changes (Figure 1 and Additional data file 2), and copy numbers within the range of 0-8 had been measured. The DIs, ranging from 0.95-2.23, and the tumor cell fractions, ranging from 27% to 98%, were used as inputs to GeneCount, together with the smoothed aCGH ratios from the GLAD and CGH-Explorer packages. CGH-Explorer applied a more extensive ratio smoothing than GLAD, and this led occasionally to differences in the ratio levels and breakpoint detection between the two programs.
GeneCount with known tumor cell fraction
GeneCount with unknown tumor cell fraction
DNA copy numbers in translocated chromosomal regions
GeneCount analysis of solid tumors
The feasibility of our method for analysis of solid tumors without information of tumor cell fraction was explored in 99 cervical cancers, for which the DI ranged from 1.00-3.16. The tumor cell fraction could be estimated for 93 and 89 tumors based on GLAD and CGH-Explorer, respectively, fulfilling the requirements for this estimation (Materials and methods). The tumor cell fractions were poorly correlated with the values determined by analysis of histological sections (Additional data file 5). In most cases, the histology result was higher than the GeneCount estimate, probably because immune cells infiltrating the tumor parenchyma were not properly quantified by the histological examination. In a few cases, however, the histology result was higher, probably reflecting that different parts of the sample were used in the aCGH and histology analyses. The tumors for which the tumor cell fraction could be estimated by GeneCount were included in the further analyses.
A higher number of genetic aberrations were generally found in the cervical cancers than in the lymphomas. High level amplifications with more than 2.5-fold increases in gene dosage (that is, copy number, N, relative to total DNA content given by two times the DNA index (N/(2.DI)), were found in about half of the tumors and most frequently on chromosomes 5p and 11q. GeneCount analysis showed copy numbers within the range of 5-80 in these regions, which were often surrounded by gains at lower levels.
The copy number changes of MRPS23 have previously been shown to correlate with survival probability . Survival analysis based on the GeneCount data of MRPS23 identified more patients with poor outcome than the corresponding analysis based on ratio levels (Figure 6d,e). Hence, 15 high risk patients were identified based on the GeneCount results, whereas only 5 patients were classified with high risk based on the ratio levels. Nine of the ten patients that were not identified based on ratio levels (blue curve in Figure 6d) had aneuploid tumors with a DNA index ranging from 1.10-1.92. The remaining diploid tumor had a relatively low tumor cell fraction of 23%.
Intratumor heterogeneity in DNA copy numbers
Nineteen (20%) lymphomas and 44 (50%) cervical cancers had one or more heterogeneous DNA regions with copy numbers 1&2, 2&3, or 3&4 (Additional data files 8 and 9). Reliable detection of heterogeneity required tumor cell fractions above 24% (Additional data file 10) and 5 out of 93 cervical cancers were therefore excluded from this analysis. Lymphoma L309/89 (Figure 7) had previously been identified as heterogeneous by FISH, showing one population with three and another with four copies of MYC and centromeres 8 and 17 . Moreover, several of the heterogeneous aberrations in the cervical cancers, such as loss on chromosome 4 and X and gain on 11q and 17 in C005/01, loss on 6q and gain of 11q in C006/01, and loss on 4 in C023/01, were similar to those detected earlier by conventional CGH . The previous study was, however, based on a different set of biopsies, which probably explains the lack of consistency for some of the tumors.
We have shown that GeneCount is a reliable method for genome-wide calculation of DNA copy numbers in clinical tumor samples. Such data are biologically interesting in themselves but may also lead to improved prediction of treatment outcome and aid in the identification of novel tumor suppressors and oncogenes. We applied the method to lymphomas, for which accurate measures of tumor cell fraction and DNA copy numbers have been obtained by other techniques that could be compared with the GeneCount results. We further used the method on cervical cancers, for which tumor cell fractions representative of the aCGH data are more difficult to achieve by a separate technique. The GeneCount model is simple, due to the use of normal cells with two DNA copies throughout the genome as a reference sample. Moreover, the estimated copy numbers are restricted to positive integers, increasing the robustness of the method. A requirement for achieving the absolute quantification format is the use of pre-determined tumor ploidies, whereas the tumor cell fractions, if not known, and experimental bias can be estimated from the aCGH data.
The experimental bias is mainly caused by signals from unsuppressed repetitive sequences and nonspecific hybridization . The bias influences the test and reference sample equally and independently of the DNA copy number, since it is generated by sequences distributed throughout the genome. The bias could, therefore, be summed up in an array specific factor, q, representing the dynamics of the log-transformed ratios. Mohapatra et al.  included the bias as a constant factor affecting the absolute, rather than the log-transformed, ratios in their model for pure tumor cells. Our approach seems justified because the noise (width) of the log-transformed ratios was independent of the ratios and, therefore, of the DNA copy numbers (Additional data file 6). We allowed for a small variation in q when calculating the tumor cell fraction to account for minor differences in the bias across the tumors. This q-value, optimized for each tumor, was highly similar to the mean q determined from control experiments, indicating that the bias was stable across experiments. Moreover, the discrepancies between the GeneCount and FISH results were related to the specific genetic aberration involved and, therefore, to the breakpoint detection algorithm, rather than to possible uncertainties in q. Recent developments in array CGH technology, utilizing oligonucleotides rather than bacterial artificial chromosome (BAC) clones, led to improved ratio dynamics and reduction in the experimental bias due to less repetitive sequences . Ongoing work in our laboratory shows that by using oligoarrays, GeneCount can be applied with a q value close to 1.
Inclusion of the tumor cell fraction is a prerequisite for the calculation of absolute DNA copy numbers in clinical tumor samples. The lymphoma data were based on single cell suspensions made from the entire lymph nodes. A tumor cell fraction representative of the lymph node could, therefore, be determined with high accuracy by a separate technique like flow cytometry. In solid tumors such as cervical cancers, the normal cells consist of stroma, which is highly heterogeneously distributed within the tissue, and immune cells, which infiltrate the tumor parenchyma. A measure of the tumor cell fraction achieved by, for example, histological examination, which is based on a part of the sample different from that used for the aCGH experiment and/or fails to quantify the proportion of immune cells accurately, is, therefore, not precise enough for the calculation of DNA copy numbers. Histology data may, however, be useful for preselecting tumor enriched samples for the aCGH analysis. Fairly stable estimates of the tumor cell fraction, consistent with the values measured by flow cytometry, were achieved by the use of GeneCount. The estimates led to DNA copy numbers in agreement with the FISH data, suggesting that the accuracy of the tumor cell fractions was sufficient for reliable data analysis. Selection of appropriate ratio levels for the estimation was crucial for achieving this accuracy. We required that the tumors had at least two aberrations with different copy numbers and with more than ten array probes each to reduce errors caused by poorly defined ratio levels and breakpoints. Moreover, only ratio levels deviating more than 0.15 (log2 scale) from the baseline were selected, implying that tumor fractions higher than 24% (diploid) and 36% (tetraploid) were needed when copy numbers were changed to 3 or 5 copies, respectively (Additional data file 12).
The few discrepancies between the GeneCount and FISH data were not related to our model, but rather to the ability of the statistical methods to detect some of the aberrations. Hence, the consistency between the GeneCount and FISH results was similar to the reliability of GLAD in detecting breakpoints in simulated data . The highest accuracy of the GeneCount results was obtained for well defined aberrant regions containing at least three array probes. In these cases a ratio level representative of the corresponding copy number was achieved and the probability of detecting the aberration was high. The increased uncertainty in the results of narrow aberrations implies that they should be confirmed by a separate technique like FISH. Moreover, to ensure sufficient ratio dynamics and, therefore, a high probability of breakpoint detection, a tumor cell fraction higher than a certain value, which depends on the experimental noise and tumor ploidy, is needed. With the noise of our experiments (Additional data file 6), a tumor cell fraction above 23% in diploid, and somewhat higher in hyperdiploid cases, enabled separation of an aberration with more than three array probes (Additional data file 13). This fraction also enabled detection of heterogeneous DNA copy numbers involving more than ten array probes (Additional data file 10). In experiments with more noise, caused by, for example, poor DNA quality, higher tumor cell fractions are required. In comparison, at least 50% tumor cells is suggested for optimal detection of gains and losses by conventional CGH .
The DNA copy number of genes involved in translocations cannot be directly assessed by FISH when a probe covering the breakage region is used, because signals from both the original chromosomes are detected in the translocated derivatives. Correction of the probe signal to achieve the true copy number requires knowledge of the breakpoint and genes involved in the translocation. Reliable FISH analysis in solid tumors, where the translocations are not well identified and may occur throughout the genome  is, therefore, particularly challenging. By aCGH, the probe signal is measured independently of the actual genome organization of the DNA covered by the probe. Hence, in the case of balanced translocations, a correct result will be obtained even if the probe covers the breakpoint. If the probe is located at the start or end of an amplified or deleted region (unbalanced translocation), the aCGH ratios of the adjacent probes ensure that the correct copy number is calculated. Our model therefore provides a novel method for assessment of copy numbers both in balanced and unbalanced translocated regions and without knowing that the translocation exists.
Current methods for analysis of aCGH data generally score genetic gains and losses based on ratio levels [31–36]. The breakpoints in individual tumors can be detected with high accuracy by use of statistical algorithms like GLAD and CGH-Explorer. However, the existing downstream analyses, using common ratio levels for scoring aberrations across tumors, fail to identify gains and losses in cases of high ploidy and normal cell content. By the use of GeneCount, the ratio levels are replaced with the absolute copy numbers relative to the total DNA content as measures of gene dosage, which can be compared across tumors regardless of ploidy and normal cell content. Hence, copy number changes that were not detected by analysis based on ratio levels, but showed significant correlation with gene expression, were found in cervical cancers, suggesting that improved results were achieved. Moreover, many patients with poor outcome that had MRPS23 gain by GeneCount had no gain based on ratio levels. In the latter case, the gain was masked by high content of normal cells or high ploidy, showing that GeneCount is more sensitive in detecting patients with genetic aberrations. The finding further demonstrates that GeneCount applies well to solid tumors for which the tumor cell fraction is generally unknown and must be estimated by the method. Advances in current statistical analysis methods may utilize adjustable ratio levels for scoring gains and losses, optimizing the cut-off ratios for each tumor based on a mathematical evaluation of the ratio dynamics. Such methods may account for varying ploidy and normal cell content across diploid, triploid, and tetraploid tumors. However, the strategy is not useful for tumors with an intermediate ploidy like 1.25 (Figure 1c). In contrast, the absolute DNA copy number relative to the total DNA content, or gene dosage, is comparable also across such tumors.
We also showed that GeneCount can provide genome-wide and high resolution information of intratumor heterogeneity in the DNA copy numbers. Such heterogeneity has previously been detected only on a single gene basis by FISH or at low resolution by conventional CGH analyses [14, 15, 20, 37, 38], probably reflecting a high genomic instability . Detection of heterogeneity involving two DNA copies by the use of FISH is challenging, since the heterogeneous tumor population is difficult to distinguish from normal cells. The probability to detect heterogeneity with GeneCount depends on the fraction of tumor cells with the heterogeneous aberration. Obviously, the probability is largest at a fraction of 50%, but fractions higher than 70% and lower than 30% were also identified. Heterogeneity in low copy numbers, like 1&2 and 2&3, are more easily detected, since the separation between the log-transformed ratio levels are larger. At higher copy numbers, the possibility to detect heterogeneity decreases, depending on the ploidy and normal tissue content. However, we also identified heterogeneous regions with copy number 3&4 in several tumors and 4&5 in one tumor. Finally, the probability to detect heterogeneity also depends on the proportion of the genome that is affected. In our data severe heterogeneity affecting up to 40% of the genome could be analyzed with GeneCount (C002/01; Additional data file 9). With an increasingly larger part of the genome affected, difficulties in finding breakpoints and even homogeneous aberrations eventually occur, leading to unreliable results regardless of analysis method.
The heterogeneity data led to insight into the evolutionary sequence of the copy number changes. The homogeneous aberrations had probably occurred prior to the heterogeneous ones . Moreover, in cases where the heterogeneous aberrations appeared to be present in different fractions of the tumor cell population, these aberrations could be ordered chronologically in a serial and/or parallel sequence. It was not always possible to identify the correct sequence among the proposed ones, as could be done by comparing data for several biopsies from the same tumor . However, identification of the heterogeneous as well as the homogeneous aberrations suggests a further possible investigation of the exact combination of aberrations in each subpopulation, employing, for example, triple-color FISH with one probe for a homogeneous aberrant region and two for the heterogeneous ones.
In the heterogeneity analysis we assumed that the ploidy was the same for all subpopulations of malignant cells. This assumption was justified because no cases were observed with two aneuploid populations by flow cytometry. A possible difference in the ploidy of two aneuploid populations within a tumor was therefore probably smaller than 10%, leading to less than 10% uncertainty in the copy numbers calculated by GeneCount (Equation 4; data not shown). The same uncertainty also applied to near diploid and heterogeneous cervical cancers. These tumors often showed a broad G1 peak with a CV up to 10% by flow cytometry, probably reflecting the existence of several subpopulations with ploidy within the range of 1.0-1.1. Moreover, few or no light chain positive cells were observed in the diploid population of the aneuploid lymphomas, suggesting that the diploid population contained primarily normal cells. It is possible, however, that the diploid population of the aneuploid cervical cancers contained malignant cells, as we have previously shown for aneuploid colorectal cancers . This might have led to larger uncertainties in the heterogeneous copy numbers due to the use of an erroneous DNA index of the diploid population. The data of such tumors can be improved by sorting the diploid and aneuploid fractions by flow cytometry  for separate aCGH and GeneCount analyses.
GeneCount provides reliable DNA copy numbers, both when based on the tumor cell fractions determined by flow cytometry and those estimated by the method. Accurate data are also achieved in translocated chromosomal regions, as demonstrated for the t(14;18) translocation involving BCL2. Our method is the only one to provide genome-wide information of absolute DNA copy numbers. Moreover, the method represents a significant improvement compared to existing methods in the study of gene dosages and intratumor copy number heterogeneities. The robustness of GeneCount implies that the method can be utilized widely in the genomic exploration of both hematopoietic and solid tumors, addressing DNA copy number aspects in a reliable manner, regardless of possible translocations. This may lead to improved assays for disease classification and outcome prediction and aid the identification of efficient targets for new cancer therapies.
Materials and methods
Tumor samples, DNA index, and tumor cell fraction
Samples from 94 patients with B-cell non-Hodgkin's lymphoma and 99 patients with squamous cell carcinoma of the uterine cervix were analyzed. We used fresh frozen lymphoma cell suspensions for which the tumor subtype, stage, patient treatment, and follow-up have been presented previously . The cervical cancers were of FIGO (Fédération Internationale des Gynaecologistes et Obstetristes) stage 1b-4b, treated with radiotherapy. Tumor biopsies taken before the start of treatment were used.
The DI of the lymphomas and cervical cancers and the tumor cell fraction of the lymphomas were determined by use of flow cytometry, and most of these data have been published earlier [14, 23, 25]. The lymphoma cells were labeled with phycoerythrin-labeled antibodies to the tumor characteristic light chains for identifying the tumor cells and Hoechst 33258 for assessment of DNA content. The DI was determined from the G1 peak position of the light chain positive cells relative to the light chain negative cells. Tumor cell fraction was determined as the fraction of light chain positive cells. The DI of the cervical cancers was assessed by preparing clean nuclei, stained with propidium iodide, using the detergent-trypsin method . Cells from a diploid cell line were used as an internal reference. Samples showing two distinct G1 peaks in the DNA histogram were classified as aneuploid, and the DI was determined from the position of the G1 peak of the aneuploid cells relative to the corresponding peak of the diploid cells. Samples with a single G1 peak were classified as near diploid. An estimate of the tumor cell fraction was achieved for each cervical cancer sample by histological examination of hematoxylin and eosin stained sections derived from the middle part of the biopsies. These values were used to compare with the tumor cell fractions estimated by GeneCount.
Genomic array slides produced by the Microarray Facility at the Norwegian Radium Hospital were used . The arrays contained 4,549 unique genomic clones of BACs and P1 artificial chromosomes (PACs) (Wellcome Trust Sanger Institute, Cambridge, UK) that covered the whole genome with a resolution of approximately 1 Mb. The 1 Mb clone collection was supplemented with tiling path probes between 1q12 and 1q25, using overlapping BACs and PACs. The clones were from the RPCI-11 (BAC) and the RPC1-1, -3, -4, and -5 (PAC) libraries. Each clone was printed in 4-8 array spots. The genes covered by the clones were found from Ensembl .
Genomic DNA was isolated from the lymphoma cell suspensions and cervical cancer biopsies according to a standard protocol, including proteinase K, phenol, chloroform, and isoamylalcohol . DNA (1 μg) was digested overnight, using DpnII endonuclease (New England Biolabs, Beverly, MA, USA), and purified using the QIAquick PCR Purification Kit (Qiagen, Valencia, CA, USA). Digested and purified DNA and normal reference DNA (0.5 μg each) were labeled by a random primer reaction (BioPrime DNA Labeling System, Invitrogen, Carlsbad, CA, USA) with Cy3-dCTP and Cy5-dCTP (Perkin-Elmer Life Sciences, Foster City, CA, USA), respectively, and co-hybridized to the array slides . Scanning and image analysis were performed by use of an Agilent scanner (Agilent Technologies Inc., Palo Alto, CA, USA) and the GenePix 6.0 image analysis software (Axon Instruments Inc., Union City, CA, USA). The microarray management and preprocessing software BASE  was used for spot filtering and ratio normalization. The mean value of the 4-8 spots of each genomic clone was used, provided that the standard deviation was less than 0.2. Lowess normalization was performed so that the mean log-transformed ratio of all clones was equal to 0. The GLAD and CGH-Explorer algorithms were used for ratio smoothing and breakpoint detection [9, 11]. Default values of 8 (GLAD) and 1.5 (CGH-Explorer) for the statistical penalty, λ, were used. The smoothed ratios were inputs to GeneCount.
Principle of GeneCount
where Rideal is the aCGH ratio of a sample with n cell populations, and N i , DI i , and F i are the DNA copy number, DNA index, and tissue fraction of cell population i, respectively. We assume that: the reference sample is normal DNA with a copy number of 2 throughout the genome, except for the X and Y chromosomes in males; sex-matched hybridizations are performed; and DI is given relative to the DNA content of normal cells.
where N T , DI T , and F T are the DNA copy number, DNA index, and fraction of malignant cells in the sample, respectively. 1 - F T represents the fraction of normal cells, which have a DI of 1 and DNA copy number (N) of 2.
The dynamic factor represents the systematic, non-random reduction in the log-transformed ratios caused by the experimental bias and has a value between 0 and 1, where the latter value occurs in the ideal situation without any reduction in the ratio dynamics. The factor is a characteristic of the array slide series and the laboratory protocol and was determined from the ratio of the X chromosome in a control experiment hybridizing male versus female normal DNA (Figure 1e). Equation 4 was used in GeneCount to calculate F T and N T from the ratio profile of the sample.
Intratumor heterogeneity in the DNA copy numbers, that is, the cases of several populations of malignant cells in addition to the normal cells, was identified by selecting the tumors for which one or more of the aCGH ratio levels were different from that corresponding to an integer value by visual inspection. The ratio distributions of the potential heterogeneous regions were compared to the distributions of the adjacent homogeneous aberrations by ANOVA analysis, and a P-value of 0.05 was required to classify the aberration as heterogeneous. The fraction of tumor cells with a heterogeneous aberration was calculated, employing the more general Equation 1. The DI was assumed to be the same for all subpopulations of malignant cells.
Implementation of GeneCount in BASE
We used BASE as a platform for GeneCount and linked the algorithm to the output of the GLAD and CGH-Explorer packages, which were implemented in our BASE version. The method can also be developed as a separate program or integrated in other aCGH analysis packages. The algorithm consists of three major steps: data input for all samples; estimation of tumor cell fraction in the cases when this parameter is unknown; and estimation of DNA copy number for each array probe (panel A in Additional data file 1). The smoothed aCGH ratios served as input, together with the DI, the q-value from control experiments with its lower and upper limits (qmin, qmax) and, if available, the tumor cell fraction.
In cases of unknown tumor cell fraction, this value was estimated in a simulation procedure based on two selected ratio levels, using the tumor cell fraction and DNA copy numbers as independent and q as dependent variables. The copy numbers and tumor cell fraction were increased in steps of 1 and 0.01, respectively, and the corresponding q-value was calculated (panel B in Additional data file 1). To ensure high accuracy in the estimated fractions, it was required that the absolute value of the selected ratio levels was larger than 0.15. This implied that samples with a tumor cell fraction lower than 24% in diploid and 36% in tetraploid tumors could not be analyzed when only aberrations involving one copy number change existed (Additional data file 12). Moreover, a minimum absolute difference of 0.2 - that is, approximately two times the standard deviation of the log-transformed ratio levels (Additional data file 6) - between the two selected ratio levels was needed. To further increase the reliability of the estimation, only ratio levels with more than ten probes were selected. We optimized q for each tumor by allowing the value to vary within the limited range of qmin to qmax, typically q ± 10%, leading to fairly stable estimates of the tumor cell fraction. The mean tumor cell fraction based on these estimates and the corresponding mean q-value was used in Equation 4 to estimate the DNA copy numbers of the tumor. In cases of known tumor cell fraction, this fraction and q from control experiments were used in Equation 4. The source code of the module is provided by communication to the authors. A demo version of GeneCount in BASE is also available .
Fluorescence in situhybridization
GeneCount estimates for the lymphomas were compared with direct assessments of gene copy numbers by use of FISH. All FISH analyses have been published previously [20–25]. Dual-color FISH was applied to all 94 tumors. We used spectrum orange labeled locus-specific propidium iodide DNA probes for genes commonly aberrant in lymphomas (CCND3, BMP6, PIM1, MYC, CDKN2A, RB1, TP53, PMAIP1, and MALT1) and spectrum green labeled centromer probes (centromere 1, 6, 8, 17, and 18) (Vysis Inc., Downers Grove, IL, USA) for assessing the quality of the experiment. For exploring DNA copy number calculations in translocated chromosomal regions, BCL2, which is frequently involved in the translocation t(14;18)(q32;q21) in lymphomas, was considered. A dual-color translocation probe involving BCL2 and covering the breakpoint region was used (LSI IGH Spectrum Green/LSI BCL2 Spectrum Orange, Vysis Inc.). Due to splitting of the probe signal in cases of translocation, erroneous high BCL2 copy numbers were derived directly with this probe. The BCL2 copy number was therefore corrected based on the signals from the IGH and centromere 18 probes, as described .
Gene expression microarrays
Gene expressions were determined by microarray analysis of 89 of the cervical cancers and related to the GeneCount estimates. We used array slides produced at the Microarray Facility at the Norwegian Radium Hospital, containing 15,000 cDNA clones. The data from 48 of the patients, with a detailed description of the experimental procedures, have been presented . Cy3- and Cy5-labeled cDNA was synthesized from total RNA by anchored oligo(dT)-primed reverse transcription and co-hybridized with a reference sample (Universal Human Reference RNA, Stratagene, La Jolla, CA, USA) to the array slides overnight at 65°C. Scanning and image analysis were performed with an Agilent scanner and the GenePix 4.1 image analysis software, respectively. Data preprocessing, including correction of saturated intensities, filtering of weak and bad spots, and lowess normalization, was performed in BASE. All hybridizations were performed twice in a dye-swap design, and the average expression ratio based on the two experiments was used in the further analyses.
The array CGH raw data have been deposited to the ArrayExpress repository (E-TABM-398, E-TABM-399).
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a figure showing the calculation steps in GeneCount. Additional data file 2 is a figure showing an example of FISH probe locations. Additional data file 3 is a figure comparing FISH DNA copy numbers and smoothed aCGH ratio levels in non-Hodgkin's lymphomas. Additional data file 4 is a figure illustrating discrepancies between GeneCount and FISH DNA copy numbers. Additional data file 5 is a figure comparing tumor cell fractions derived by histological examination and by GeneCount estimation in cervical cancers. Additional data file 6 is a figure showing the standard deviation (noise) of the log-transformed aCGH ratios. Additional data file 7 is a figure comparing results from ratio level and GeneCount analyses in cervical cancers. Additional data file 8 is a table listing regions with DNA copy number heterogeneity in non-Hodgkin's lymphomas. Additional data file 9 is a table listing regions with DNA copy number heterogeneity in cervical cancers. Additional data file 10 is a figure showing tumor cell fraction required for detection of heterogeneous copy number changes. Additional data file 11 is a figure illustrating analysis of the evolutionary sequence of subpopulations in heterogeneous tumors. Additional data file 12 is a figure showing the minimum tumor cell fraction that can be calculated in GeneCount. Additional data file 13 is a figure showing the tumor cell fraction required for detection of homogeneous copy number changes.
array comparative genomic hybridization
bacterial artificial chromosome
Bioarray Software Environment
fluorescence in situ hybridization
P1 artificial chromosome.
We would like to acknowledge Vegard Nygaard from The Norwegian Microarray Consortium for help with implementing the BASE plug-in module. The study was supported by The Norwegian Cancer Society and the Microarray Platform of The National Programme for Research in Functional Genomics (FUGE) in the Research Council of Norway.
- Mantripragada KK, Buckley PG, de Stahl TD, Dumanski JP: Genomic microarrays in the spotlight. Trends Genet. 2004, 20: 87-94. 10.1016/j.tig.2003.12.008.PubMedView ArticleGoogle Scholar
- Pinkel D, Albertson DG: Array comparative genomic hybridization and its applications in cancer. Nat Genet. 2005, 37 (Suppl): S11-S17. 10.1038/ng1569.PubMedView ArticleGoogle Scholar
- Albertson DG, Collins C, McCormick F, Gray JW: Chromosome aberrations in solid tumors. Nat Genet. 2003, 34: 369-376. 10.1038/ng1215.PubMedView ArticleGoogle Scholar
- Albertson DG: Gene amplification in cancer. Trends Genet. 2006, 22: 447-455. 10.1016/j.tig.2006.06.007.PubMedView ArticleGoogle Scholar
- Knuutila S, Aalto Y, Autio K, Bjorkqvist AM, El-Rifai W, Hemmer S, Huhta T, Kettunen E, Kiuru-Kuhlefelt S, Larramendy ML, Lushnikova T, Monni O, Pere H, Tapper J, Tarkkanen M, Varis A, Wasenius VM, Wolf M, Zhu Y: DNA copy number losses in human neoplasms. Am J Pathol. 1999, 155: 683-694.PubMedPubMed CentralView ArticleGoogle Scholar
- Chin SF, Teschendorff AE, Marioni JC, Wang Y, Barbosa-Morais NL, Thorne NP, Costa JL, Pinder SE, Wiel van de MA, Green AR, Ellis IO, Porter PL, Tavare S, Brenton JD, Ylstra B, Caldas C: High-resolution array-CGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer. Genome Biol. 2007, 8: R215-10.1186/gb-2007-8-10-r215.PubMedPubMed CentralView ArticleGoogle Scholar
- Chen W, Houldsworth J, Olshen AB, Nanjangud G, Chaganti S, Venkatraman ES, Halaas J, Teruya-Feldstein J, Zelenetz AD, Chaganti RS: Array comparative genomic hybridization reveals genomic copy number changes associated with outcome in diffuse large B-cell lymphomas. Blood. 2006, 107: 2477-2485. 10.1182/blood-2005-07-2950.PubMedPubMed CentralView ArticleGoogle Scholar
- The Mitelman Database of Chromosome Aberrations in Cancer. [http://cgap.nci.nih.gov/Chromosomes/Mitelman]
- Hupe P, Stransky N, Thiery JP, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics. 2004, 20: 3413-3422. 10.1093/bioinformatics/bth418.PubMedView ArticleGoogle Scholar
- Jong K, Marchiori E, Meijer G, Vaart AV, Ylstra B: Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics. 2004, 20: 3636-3637. 10.1093/bioinformatics/bth355.PubMedView ArticleGoogle Scholar
- Lingjaerde OC, Baumbusch LO, Liestol K, Glad IK, Borresen-Dale AL: CGH-Explorer: a program for analysis of array-CGH data. Bioinformatics. 2005, 21: 821-822. 10.1093/bioinformatics/bti113.PubMedView ArticleGoogle Scholar
- Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004, 5: 557-572. 10.1093/biostatistics/kxh008.PubMedView ArticleGoogle Scholar
- Rosenberg C, Schut TB, Mostert MC, Tanke HJ, Raap AK, Oosterhuis JW, Looijenga LH: Comparative genomic hybridization in hypotriploid/hyperdiploid tumors. Cytometry. 1997, 29: 113-121. 10.1002/(SICI)1097-0320(19971001)29:2<113::AID-CYTO3>3.0.CO;2-E.PubMedView ArticleGoogle Scholar
- Lyng H, Beigi M, Svendsrud DH, Brustugun OT, Stokke T, Kristensen GB, Sundfor K, Skjonsberg A, De Angelis PM: Intratumor chromosomal heterogeneity in advanced carcinomas of the uterine cervix. Int J Cancer. 2004, 111: 358-366. 10.1002/ijc.20258.PubMedView ArticleGoogle Scholar
- Harada K, Nishizaki T, Ozaki S, Kubota H, Ito H, Sasaki K: Intratumoral cytogenetic heterogeneity detected by comparative genomic hybridization and laser scanning cytometry in human gliomas. Cancer Res. 1998, 58: 4694-4700.PubMedGoogle Scholar
- Harada T, Okita K, Shiraishi K, Kusano N, Kondoh S, Sasaki K: Interglandular cytogenetic heterogeneity detected by comparative genomic hybridization in pancreatic cancer. Cancer Res. 2002, 62: 835-839.PubMedGoogle Scholar
- Beroukhim R, Getz G, Nghiemphu L, Barretina J, Hsueh T, Linhart D, Vivanco I, Lee JC, Huang JH, Alexander S, Du J, Kau T, Thomas RK, Shah K, Soto H, Perner S, Prensner J, Debiasi RM, Demichelis F, Hatton C, Rubin MA, Garraway LA, Nelson SF, Liau L, Mischel PS, Cloughesy TF, Meyerson M, Golub TA, Lander ES, Mellinghoff IK, et al: Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc Natl Acad Sci USA. 2007, 104: 20007-20012. 10.1073/pnas.0710052104.PubMedPubMed CentralView ArticleGoogle Scholar
- Weir BA, Woo MS, Getz G, Perner S, Ding L, Beroukhim R, Lin WM, Province MA, Kraja A, Johnson LA, Shah K, Sato M, Thomas RK, Barletta JA, Borecki IB, Broderick S, Chang AC, Chiang DY, Chirieac LR, Cho J, Fujii Y, Gazdar AF, Giordano T, Greulich H, Hanna M, Johnson BE, Kris MG, Lash A, Lin L, Lindeman N, et al: Characterizing the cancer genome in lung adenocarcinoma. Nature. 2007, 450: 893-898. 10.1038/nature06358.PubMedPubMed CentralView ArticleGoogle Scholar
- Mohapatra G, Moore DH, Kim DH, Grewal L, Hyun WC, Waldman FM, Pinkel D, Feuerstein BG: Analyses of brain tumor cell lines confirm a simple model of relationships among fluorescence in situ hybridization, DNA index, and comparative genomic hybridization. Genes Chromosomes Cancer. 1997, 20: 311-319. 10.1002/(SICI)1098-2264(199712)20:4<311::AID-GCC1>3.0.CO;2-4.PubMedView ArticleGoogle Scholar
- Galteland E, Holte H, Stokke T: c-MYC, RB-1, TP53, and centromere 8 and 17 copy number in B-cell non-Hodgkin's lymphomas assessed by dual-color fluorescence in situ hybridization. Cytometry. 1999, 38: 53-60. 10.1002/(SICI)1097-0320(19990415)38:2<53::AID-CYTO2>3.0.CO;2-0.PubMedView ArticleGoogle Scholar
- Galteland E, Smedshammer L, Suo Z, DeAngelis P, Stokke T: Proliferation-dependent expression and phosphorylation of pRB in B cell non-Hodgkin's lymphomas: dependence on RB1 copy number. Leukemia. 2002, 16: 1549-1555. 10.1038/sj.leu.2402644.PubMedView ArticleGoogle Scholar
- Galteland E, Sivertsen EA, Svendsrud DH, Smedshammer L, Kresse SH, Meza-Zepeda LA, Myklebost O, Suo Z, Mu D, DeAngelis PM, Stokke T: Translocation t(14;18) and gain of chromosome 18/BCL2: effects on BCL2 expression and apoptosis in B-cell non-Hodgkin's lymphomas. Leukemia. 2005, 19: 2313-2323. 10.1038/sj.leu.2403954.PubMedView ArticleGoogle Scholar
- Sivertsen EA, Galteland E, Mu D, Holte H, Meza-Zepeda L, Myklebost O, Patzke S, Smeland EB, Stokke T: Gain of chromosome 6p is an infrequent cause of increased PIM1 expression in B-cell non-Hodgkin's lymphomas. Leukemia. 2006, 20: 539-542. 10.1038/sj.leu.2404094.PubMedView ArticleGoogle Scholar
- Stokke T, Galteland E, Holte H, Smedshammer L, Suo Z, Smeland EB, Borresen-Dale AL, DeAngelis P, Steen HB: Oncogenic aberrations in the p53 pathway are associated with a high S phase fraction and poor patient survival in B-cell Non-Hodgkin's lymphoma. Int J Cancer. 2000, 89: 313-324. 10.1002/1097-0215(20000720)89:4<313::AID-IJC1>3.0.CO;2-D.PubMedView ArticleGoogle Scholar
- Stokke T, DeAngelis P, Smedshammer L, Galteland E, Steen HB, Smeland EB, Delabie J, Holte H: Loss of chromosome 11q21-23.1 and 17p and gain of chromosome 6p are independent prognostic indicators in B-cell non-Hodgkin's lymphoma. Br J Cancer. 2001, 85: 1900-1913. 10.1054/bjoc.2001.2164.PubMedPubMed CentralView ArticleGoogle Scholar
- Saal LH, Troein C, Vallon-Christersson J, Gruvberger S, Borg A, Peterson C: BioArray Software Environment (BASE): a platform for comprehensive management and analysis of microarray data. Genome Biol. 2002, 3: SOFTWARE0003-10.1186/gb-2002-3-8-software0003.PubMedPubMed CentralView ArticleGoogle Scholar
- Lyng H, Brovig RS, Svendsrud DH, Holm R, Kaalhus O, Knutstad K, Oksefjell H, Sundfor K, Kristensen GB, Stokke T: Gene expressions and copy numbers associated with metastatic phenotypes of uterine cervical cancer. BMC Genomics. 2006, 7: 268-10.1186/1471-2164-7-268.PubMedPubMed CentralView ArticleGoogle Scholar
- Davies JJ, Wilson IM, Lam WL: Array CGH technologies and their applications to cancer genomes. Chromosome Res. 2005, 13: 237-248. 10.1007/s10577-005-2168-x.PubMedView ArticleGoogle Scholar
- Kallioniemi OP, Kallioniemi A, Piper J, Isola J, Waldman FM, Gray JW, Pinkel D: Optimizing comparative genomic hybridization for analysis of DNA sequence copy number changes in solid tumors. Genes Chromosomes Cancer. 1994, 10: 231-243. 10.1002/gcc.2870100403.PubMedView ArticleGoogle Scholar
- Mitelman F, Johansson B, Mertens F: The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer. 2007, 7: 233-245. 10.1038/nrc2091.PubMedView ArticleGoogle Scholar
- Kim SW, Kim JW, Kim YT, Kim JH, Kim S, Yoon BS, Nam EJ, Kim HY: Analysis of chromosomal changes in serous ovarian carcinoma using high-resolution array comparative genomic hybridization: Potential predictive markers of chemoresistant disease. Genes Chromosomes Cancer. 2007, 46: 1-9. 10.1002/gcc.20384.PubMedView ArticleGoogle Scholar
- Nigro JM, Misra A, Zhang L, Smirnov I, Colman H, Griffin C, Ozburn N, Chen M, Pan E, Koul D, Yung WK, Feuerstein BG, Aldape KD: Integrated array-comparative genomic hybridization and expression array profiles identify clinically relevant molecular subtypes of glioblastoma. Cancer Res. 2005, 65: 1678-1686. 10.1158/0008-5472.CAN-04-2921.PubMedView ArticleGoogle Scholar
- Pierga JY, Reis-Filho JS, Cleator SJ, Dexter T, Mackay A, Simpson P, Fenwick K, Iravani M, Salter J, Hills M, Jones C, Ashworth A, Smith IE, Powles T, Dowsett M: Microarray-based comparative genomic hybridisation of breast cancer patients receiving neoadjuvant chemotherapy. Br J Cancer. 2007, 96: 341-351. 10.1038/sj.bjc.6603483.PubMedPubMed CentralView ArticleGoogle Scholar
- Ruano Y, Mollejo M, Ribalta T, Fiano C, Camacho FI, Gomez E, de Lope AR, Hernandez-Moneo JL, Martinez P, Melendez B: Identification of novel candidate target genes in amplicons of Glioblastoma multiforme tumors detected by expression and CGH microarray profiling. Mol Cancer. 2006, 5: 39-10.1186/1476-4598-5-39.PubMedPubMed CentralView ArticleGoogle Scholar
- van Wieringen WN, Belien JA, Vosse SJ, Achame EM, Ylstra B: ACE-it: a tool for genome-wide integration of gene dosage and RNA expression data. Bioinformatics. 2006, 22: 1919-1920. 10.1093/bioinformatics/btl269.PubMedView ArticleGoogle Scholar
- van Duin M, van Marion R, Watson JE, Paris PL, Lapuk A, Brown N, Oseroff VV, Albertson DG, Pinkel D, de Jong P, Nacheva EP, Dinjens W, van Dekken H, Collins C: Construction and application of a full-coverage, high-resolution, human chromosome 8q genomic microarray for comparative genomic hybridization. Cytometry A. 2005, 63: 10-19.PubMedView ArticleGoogle Scholar
- Di VA, Infusini E, Peveri C, Sciutto A, Orecchia R, Geido E, Monaco R, Giaretti W: Intratumor heterogeneity of chromosome 1, 7, 17, and 18 aneusomies obtained by FISH and association with flow cytometric DNA index in human colorectal adenocarcinomas. Cytometry. 1999, 35: 369-375. 10.1002/(SICI)1097-0320(19990401)35:4<369::AID-CYTO10>3.0.CO;2-6.View ArticleGoogle Scholar
- Jenkins RB, Qian J, Lieber MM, Bostwick DG: Detection of c-myc oncogene amplification and chromosomal anomalies in metastatic prostatic carcinoma by fluorescence in situ hybridization. Cancer Res. 1997, 57: 524-531.PubMedGoogle Scholar
- Lengauer C, Kinzler KW, Vogelstein B: Genetic instabilities in human cancers. Nature. 1998, 396: 643-649. 10.1038/25292.PubMedView ArticleGoogle Scholar
- De Angelis PM, Stokke T, Beigi M, Flatberg G, Enger M, Haug K, Aass HC, Schjolberg A, Andresen PA, Ariansen S, Bo AS, Mjaland O, Clausen OP: Chromosomal 20q gain in the DNA diploid component of aneuploid colorectal carcinomas. Int J Cancer. 2007, 120: 2734-2738. 10.1002/ijc.22537.PubMedView ArticleGoogle Scholar
- Vindelov LL, Christensen IJ, Nissen NI: A detergent-trypsin method for the preparation of nuclei for flow cytometric DNA analysis. Cytometry. 1983, 3: 323-327. 10.1002/cyto.990030503.PubMedView ArticleGoogle Scholar
- Meza-Zepeda LA, Kresse SH, Barragan-Polania AH, Bjerkehagen B, Ohnstad HO, Namlos HM, Wang J, Kristiansen BE, Myklebost O: Array comparative genomic hybridization reveals distinct DNA copy number differences between gastrointestinal stromal tumors and leiomyosarcomas. Cancer Res. 2006, 66: 8984-8993. 10.1158/0008-5472.CAN-06-1972.PubMedView ArticleGoogle Scholar
- Ensembl. [http://www.ensembl.org/Homo_sapiens/searchview]
- De Angelis PM, Clausen OP, Schjolberg A, Stokke T: Chromosomal gains and losses in primary colorectal carcinomas detected by CGH and their associations with tumour DNA ploidy, genotypes and phenotypes. Br J Cancer. 1999, 80: 526-535. 10.1038/sj.bjc.6690388.PubMedView ArticleGoogle Scholar
- Demo version of GeneCount in BASE. [http://alba.uio.no/base/local/genecountdemo/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.