Correction of technical bias in clinical microarray data improves concordance with known biological information
 Aron C Eklund^{1, 2} and
 Zoltan Szallasi^{1, 2}Email author
DOI: 10.1186/gb200892r26
© Eklund and Szallasi; licensee BioMed Central Ltd. 2008
Received: 1 October 2007
Accepted: 4 February 2008
Published: 04 February 2008
Abstract
The performance of gene expression microarrays has been well characterized using controlled reference samples, but the performance on clinical samples remains less clear. We identified sources of technical bias affecting many genes in concert, thus causing spurious correlations in clinical data sets and false associations between genes and clinical variables. We developed a method to correct for technical bias in clinical microarray data, which increased concordance with known biological relationships in multiple data sets.
Background
The introduction of massively parallel measurement techniques such as gene expression microarrays has led to a much disputed paradigm shift in experimental design [1]. When a single biochemical entity is quantified it is customary and expected to include replicates and carefully selected controls such as calibration curves, which increase the confidence in the validity of the measurements. However, due to the relatively high expense of microarrays and the scarcity of starting material, many microarray data sets feature only a single measurement per sample. Beyond financial pressure the main justification for this approach has been the notion that the overall behavior of a large number or all probes on a given gene chip provides a reliable estimate of the systematic measurement bias associated with any given probe on the microarray. For example, the widely used microarray normalization methods are based on the assumption that if the average probe intensity tends to be brighter on a given chip than on others, then the estimated expression value of any given probe will be overestimated; therefore, individual probe intensities on the brighter chip have to be adjusted accordingly, bringing the average intensity in line with other chips.
Controlled reference data sets featuring spikeins or mixtures have demonstrated the possibility of high accuracy in measurements and have been valuable for comparison of normalization algorithms and analytical techniques [2–4]. However, it is unclear whether these data sets are representative of realworld clinical data, or whether an algorithm that performs well on these data sets also performs well on clinical data.
Results and discussion
In an effort to increase the number of replicate pairs clustering together, we tried various thresholds for filtering out lowvariance genes. This common practice is intended to reduce noise by removing genes that are essentially unchanged between samples [7]. As expected, we found that using increasingly stringent thresholds yielded more replicate pairs clustered together (Figure 1d). When we removed all but the most variable 100 genes, each of the 9 replicate pairs clustered together (Figure 1b). Satisfyingly, seven normal (nontumor) samples also clustered together, which had not occurred with the unfiltered data. Thus, at least in this data set, strong filtering of expression data results in dendrograms that more accurately reflect known relationships between samples. Such filtering might be appropriate if the goal is to extract robust clinical markers. However, for other types of analysis (for example, when seeking information about a predetermined set of genes with known biological function) such extreme filtering may eliminate valuable information, in which case correction of noisy data would be preferable to its removal.
Although it is remotely possible that the dendrogram resulting from the original, unfiltered data (Figure 1a) indicates biologically relevant relationships, we assumed that the disruption of replicate pairs is an artifact caused by noise in the filteredout genes. If this noise were entirely random, filtering out genes would be the only way to reduce noise. However, if the noise were systematically biased, it might be possible to characterize the bias and correct the measured expression values. Notably, even a small amount of systematic bias could have substantial influence on clustering, if the bias affects a large enough subset of genes in a coordinated fashion.
Because PSQN only partially resolved the misclustering of replicates in the Signoretti data, we sought to understand the underlying causes of the bias. Two sources of variation, those arising during hybridization and those intrinsic to the starting RNA, seem especially likely to influence many probe sets in a coordinated manner. During hybridization, a subset of probes is susceptible to nonlinear effects caused by saturation at high intensities or by the noise floor at low intensities [15]. For these probes, changes in target concentration have a diminished effect on probe intensity compared to probes in the linear regime. Thus, if samples are loaded onto arrays at slightly different concentrations, or are hybridized or washed under slightly different conditions, any subsequent normalization is likely to overcorrect probes in the nonlinear range and undercorrect those in the linear range. The use of multiple probes and robust normalization schemes may reduce the effect of these nonlinear probes, but our observation of intensitydependent bias (Figure 2a) indicates that many normalization schemes that do not account for systematic bias may be inadequate in some cases.
A second potential source of bias is the variable quality and quantity of starting RNA, which may be especially problematic in data sets featuring samples from surgical specimens [16]. For an individual transcript to contribute to the measured signal, it must be intact between the region targeted by microarray probes and the polyA tail, and the polymerases involved in Eberwinetype amplification/labeling steps must complete their processes over this entire distance. Therefore, any variation in mRNA integrity or in polymerase processivity should affect the measured intensity by a factor proportional to the number of bases between the probe target and the 3' polyA tail. Also, variable amounts of starting mRNA could affect the final diversity of the amplified cRNA, such that a sample with less starting mRNA would tend to have fewer reliably measured genes and, therefore, a higher signaltonoise ratio.
In our current analysis we hypothesized that a set of four metrics could characterize the relative amount of bias affecting each sample in a data set. To capture the saturation and noise floor of the raw probe intensities, we used the median and interquartile range (IQR) of the perfectmatch probes on the array ('PM median' and 'PM IQR'). The effect of RNA degradation was estimated by the average decrease in expression between 5' probes and 3' probes ('degradation'). Finally, the diversity of starting mRNA was characterized by the IQR of the RMAsummarized expression values ('RMA IQR').
As an alternative to our bias correction method, we considered batch adjustment, in which the expression matrix is adjusted, one probe set at a time, to remove any difference in means between the batches. From the scan dates embedded in the CEL files, we inferred that the Signoretti breast cancer data were scanned in three batches. We applied batch adjustment to this data set and found that the number of correctly paired replicates increased from four to five (data not shown). Thus, on this data set, batch adjustment is less effective than our bias correction method. The other data sets used in this study were each scanned in a relatively large number of batches (median, 14; range, 831), so batch adjustment was not attempted.
Conclusion
Various analytical approaches are susceptible to errors caused by technical bias. For simpler analyses of individual genes (for example, detection of differentially expressed genes between previously defined groups), technical bias may manifest as an unmodeled factor that complicates significance testing [20]. However, if a bias metric is correlated with a biological or clinical variable of interest, separating bias effects from biologically relevant effects may be difficult. A strong correlation may indicate that the groups of samples were collected or processed under inconsistent conditions, a situation that should be avoided [21].
More complex analyses have inferred functional relationships between genes based on their coordinated expression [22]. Such approaches may be especially sensitive to technical bias, because the coordinated effect of technical bias on multiple probe sets causes spurious correlations between these probe sets. On the other hand, coordinated bias on multiple genes also affects correlation between samples, such that two samples with dissimilar levels of bias may appear to have very different expression profiles. We have demonstrated that this bias affects sample clustering, but other visualization methods, such as principal components analysis, can be similarly distorted (data not shown). Similarly, recent work has shown that cellular network reconstruction from gene expression data can be strongly influenced by artifacts introduced during normalization [23].
We have found that technical bias can substantially affect clinical microarray data, and our results indicate that it should be corrected, or at least considered as a potential confounding effect, in any analysis. We have presented a simple approach for correction of biased data, but it may not be optimal for all data sets. Our choice of bias metrics was based on consideration of the processes involved in the generation of microarray data, but other variables, such as those used to justify the exclusion of outlying arrays, may also quantify technical bias [24]. As more clinical data sets with known biological relationships (for example, replicates) become available, it will be possible to explore more sophisticated models for bias correction.
Materials and methods
All analysis was performed using the R statistical environment [25], with use of the 'affy' package from Bioconductor [26]. Each data set was normalized/summarized individually using RMA [6], unless otherwise noted. All correlations are Pearson productmoment correlations. Hierarchical clustering was performed using complete linkage and Pearson correlation distance. An R package implementing our method, as well as raw data and scripts to reproduce our results, are available on our website [27].
Postsummary quantile normalization
A reference distribution was calculated by averaging the sorted expression values across all samples. Then, for each sample, the original expression values were replaced by those of the reference distribution in the same order, with ties broken at random. This results in all samples in a data set having exactly the same distribution of expression values.
Intensitydependent correlation bias
First, the n rows (probe sets) of the expression matrix were sorted according to mean expression value. The rows were then split into 50 bins of approximately equal size. Thus, the first bin contains the n/50 probe sets having the lowest mean signal, and the last bin contains the n/50 probe sets having the highest mean signal. For each possible pair of bins, all (n/50) × (n/50) pairwise Pearson correlation coefficients (between rows) were calculated, and the median correlation coefficient was recorded. These median correlation coefficients for each pair of bins were plotted as a colorcoded 50 × 50 matrix. Note that for the diagonals of the colorcoded matrix, which corresponds to the correlation between a bin and itself, only correlations between nonidentical probe sets were counted.
Bias metrics
The four bias metrics defined here were calculated for each microarray sample: PM Median isthe median of the log_{2}transformed perfectmatch probe intensities; PM IQR is the interquartile range of the log_{2}transformed perfectmatch probe intensities; RMA IQR is the interquartile range of all summarized RMA values (which are already log_{2}transformed); and 'degradation' is the slope of the leastsquares regression of the log_{2}transformed perfectmatch probe intensities versus their relative position (within the probe set) along the targeted transcript  this was calculated with the AffyRNAdeg function in the Bioconductor package affy [26].
Correlation between bias metrics and expression values
To calculate the random distributions in Figure 3, each bias metric was randomly permuted 100 times. We computed Pearson correlation coefficients between each permutation and expression values of each of the 12,625 probe sets; thus, each density curve represents 100 × 12,625 correlation coefficients. This procedure was repeated for each of the four bias metrics.
Bias correction
Correlationofcorrelations
For each data set, we calculated the Pearson correlation coefficient of each probe set with the binary ER status. The CC is the Pearson correlation coefficient of the resulting vector with the corresponding vector from a different data set. Thus, the CC indicates how well two data sets agree on the extent to which each probe set is correlated with the ER status.
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a figure showing that intensitydependent correlation bias is present in all data sets tested.
Abbreviations
 CC:

correlation of correlations
 ER:

estrogen receptor
 GCRMA:

GeneChip robust multiarray average
 IQR:

interquartile range
 MAS5:

Microarray Suite 5
 MBEI:

modelbased expression index
 PSQN:

postsummary quantile normalization
 RMA:

robust multiarray average.
Declarations
Acknowledgements
We thank Andrea Richardson, Hanni Willenbrock, Simon Kasif, Chris Workman, and Andrea Vala for helpful suggestions and reading of the manuscript. This work was supported in part by the National Institutes of Health through grants P50 CA089393, and R21LM00882301A1, the Department of Defense through grant W81XWH0410549, and by the Charlotte Geyer Foundation.
Authors’ Affiliations
References
 Lee ML, Kuo FC, Whitmore GA, Sklar J: Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci USA. 2000, 97: 98349839. 10.1073/pnas.97.18.9834.PubMedPubMed CentralView ArticleGoogle Scholar
 Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol. 2005, 6: R1610.1186/gb200562r16.PubMedPubMed CentralView ArticleGoogle Scholar
 Cope LM, Irizarry RA, Jaffee HA, Wu Z, Speed TP: A benchmark for Affymetrix GeneChip expression measures. Bioinformatics. 2004, 20: 323331. 10.1093/bioinformatics/btg410.PubMedView ArticleGoogle Scholar
 MAQC Consortium, Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Schrf U, ThierryMieg J, Wang C, Wilson M, et al: The MicroArray Quality Control (MAQC) project shows inter and intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006, 24: 11511161. 10.1038/nbt1239.View ArticleGoogle Scholar
 Signoretti S, Di Marcotullio L, Richardson A, Ramaswamy S, Isaac B, Rue M, Monti F, Loda M, Pagano M: Oncogenic role of the ubiquitin ligase subunit Skp2 in human breast cancer. J Clin Invest. 2002, 110: 633641. 10.1172/JCI200215795.PubMedPubMed CentralView ArticleGoogle Scholar
 Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249264. 10.1093/biostatistics/4.2.249.PubMedView ArticleGoogle Scholar
 Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC, Brown PO, Botstein D, Lønning PE, BørresenDale AL: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA. 2001, 98: 1086910874. 10.1073/pnas.191367098.PubMedPubMed CentralView ArticleGoogle Scholar
 Hess KR, Anderson K, Symmans WF, Valero V, Ibrahim N, Mejia JA, Booser D, Theriault RL, Buzdar AU, Dempsey PJ, Rouzier R, Sneige N, Ross JS, Vidaurre T, Gómez HL, Hortobagyi GN, Pusztai L: Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J Clin Oncol. 2006, 24: 42364244. 10.1200/JCO.2006.05.6861.PubMedView ArticleGoogle Scholar
 Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci USA. 2005, 102: 1355013555. 10.1073/pnas.0506230102.PubMedPubMed CentralView ArticleGoogle Scholar
 Minn AJ, Gupta GP, Siegel PM, Bos PD, Shu W, Giri DD, Viale A, Olshen AB, Gerald WL, Massague J: Genes that mediate breast cancer metastasis to lung. Nature. 2005, 436: 518524. 10.1038/nature03799.PubMedPubMed CentralView ArticleGoogle Scholar
 Pawitan Y, Bjöhle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, Liu ET, Miller L, Nordgren H, Ploner A, Sandelin K, Shaw PM, Smeds J, Skoog L, Wedrén S, Bergh J: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two populationbased cohorts. Breast Cancer Res. 2005, 7: R953964. 10.1186/bcr1325.PubMedPubMed CentralView ArticleGoogle Scholar
 Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijervan Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Geneexpression profiles to predict distant metastasis of lymphnodenegative primary breast cancer. Lancet. 2005, 365: 671679.PubMedView ArticleGoogle Scholar
 Li C, Wong WH: Modelbased analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001, 98: 3136. 10.1073/pnas.011404098.PubMedPubMed CentralView ArticleGoogle Scholar
 Wu Z, Irizarry RA: Preprocessing of oligonucleotide array data. Nat Biotechnol. 2004, 22: 656658. 10.1038/nbt0604656b.PubMedView ArticleGoogle Scholar
 Skvortsov D, Abdueva D, Curtis C, Schaub B, Tavare S: Explaining differences in saturation levels for Affymetrix GeneChip arrays. Nucleic Acids Res. 2007, 35: 41544163. 10.1093/nar/gkm348.PubMedPubMed CentralView ArticleGoogle Scholar
 Lin DW, Coleman IM, Hawley S, Huang CY, Dumpit R, Gifford D, Kezele P, Hung H, Knudsen BS, Kristal AR, Nelson PS: Influence of surgical manipulation on prostate gene expression: implications for molecular correlates of treatment effects and disease prognosis. J Clin Oncol. 2006, 24: 37633770. 10.1200/JCO.2005.05.1458.PubMedView ArticleGoogle Scholar
 Freije WA, CastroVargas FE, Fang Z, Horvath S, Cloughesy T, Liau LM, Mischel PS, Nelson SF: Gene expression profiling of gliomas strongly predicts survival. Cancer Res. 2004, 64: 65036510. 10.1158/00085472.CAN040452.PubMedView ArticleGoogle Scholar
 Parmigiani G, GarrettMayer ES, Anbazhagan R, Gabrielson E: A crossstudy comparison of gene expression studies for the molecular classification of lung cancer. Clin Cancer Res. 2004, 10: 29222927. 10.1158/10780432.CCR030490.PubMedView ArticleGoogle Scholar
 Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, HaibeKains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006, 98: 262272.PubMedView ArticleGoogle Scholar
 Leek JT, Storey JD: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007, 3: 17241735. 10.1371/journal.pgen.0030161.PubMedView ArticleGoogle Scholar
 Ransohoff DF: Bias as a threat to the validity of cancer molecularmarker research. Nat Rev Cancer. 2005, 5: 142149. 10.1038/nrc1550.PubMedView ArticleGoogle Scholar
 Stuart JM, Segal E, Koller D, Kim SK: A genecoexpression network for global discovery of conserved genetic modules. Science. 2003, 302: 249255. 10.1126/science.1087447.PubMedView ArticleGoogle Scholar
 Lim WK, Wang K, Lefebvre C, Califano A: Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks. Bioinformatics. 2007, 23: i282288. 10.1093/bioinformatics/btm201.PubMedView ArticleGoogle Scholar
 Jones L, Goldstein DR, Hughes G, Strand AD, Collin F, Dunnett SB, Kooperberg C, Aragaki A, Olson JM, Augood SJ, Faull RL, LuthiCarter R, Moskvina V, Hodges AK: Assessment of the relationship between prechip and postchip quality measures for Affymetrix GeneChip expression data. BMC Bioinformatics. 2006, 7: 21110.1186/147121057211.PubMedPubMed CentralView ArticleGoogle Scholar
 R Development Core Team: R: A Language and Environment for Statistical Computing. 2007, Vienna: R Foundation for Statistical ComputingGoogle Scholar
 Gautier L, Cope L, Bolstad BM, Irizarry RA: affy  analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004, 20: 307315. 10.1093/bioinformatics/btg405.PubMedView ArticleGoogle Scholar
 Supplementary Material. [http://www.cbs.dtu.dk/suppl/biascorr/]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.