Exon arrays provide accurate assessments of gene expression
© Kapur et al.; licensee BioMed Central Ltd. 2007
Received: 11 December 2006
Accepted: 15 May 2007
Published: 15 May 2007
We have developed a strategy for estimating gene expression on Affymetrix Exon arrays. The method includes a probe-specific background correction and a probe selection strategy in which a subset of probes with highly correlated intensities across multiple samples are chosen to summarize gene expression. Our results demonstrate that the proposed background model offers improvements over the default Affymetrix background correction and that Exon arrays may provide more accurate measurements of gene expression than traditional 3' arrays.
Microarray technology is a widely used high-throughput tool for measuring gene expression [1–3]. One of the most popular platforms is the Affymetrix GeneChip microarray. In this platform, gene-level expression indices are computed based on hybridization intensity measurements from multiple perfect match (PM) and mismatch (MM) probes targeting the 3' end of the mRNA sequence. The current generation of such 3' expression arrays includes the Human U133 Plus 2.0 array and the Mouse 430 2.0 array.
In this paper, we propose a strategy for computing gene expression indices on Exon arrays, called GeneBASE (Gene-level Background Adjusted Selected probe Expression). We employ a probe-specific background correction motivated by several recent studies using model-based approaches on 3' arrays  and on tiling arrays . We combine a background correction model with our recently developed probe selection strategy  to construct a gene-level expression index for each transcript region. Although the ultimate goal of the new Exon array design is to provide exon-level expression information, the construction of a gene-level expression index is an important first task as it will provide a baseline against which to judge the expression of individual exons. To address the question of whether Exon arrays can provide improved gene-level expression measurements, we conducted a systematic comparison of Exon arrays and 3' arrays based on SAGE data. The results suggest that Exon arrays offer improved sensitivity and specificity of absence/presence (A/P) calls, and may allow more accurate quantitative measurement of the level of gene expression.
Strategy for estimation of gene-level expression
Exon arrays differ from traditional 3' arrays in the approach to modeling non-specific hybridization. Exon arrays contain a set of 37,687 genomic and antigenomic background probes that are not matched to any putative transcript region. By default, the Affymetrix software estimates probe background signal (that is, its response to non-specific hybridization to off-target sequences) by the median response of all background probes with matching GC content to the probe in question. This background signal is then subtracted from the probe intensity to yield a background-corrected intensity. To explore whether it is possible to improve this background correction, we adapted a statistical model recently developed for Affymetrix genome tiling arrays  to predict background signal on Exon arrays. The model is a simple linear model of the log-probe-intensity that incorporates position specific nucleotide indicators and quadratic terms for nucleotide counts of the probe sequence (see Materials and methods for details). This model is referred to as the MAT model, named after the tiling array software in .
Training the MAT background model using different sets of probes
Train on background probes, test on background probes R2
Train on full probes, test on background probes R 2
Comparison of MAT and Affymetrix GC bin background models
Train on background probes, test on full probes R2
After background correction, normalization and probe selection, we are ready to compute a gene-level expression index. We do this by adapting a model, first proposed in  for 3' arrays, to fit the gene expression index for each gene (see Materials and methods). This approach requires multiple arrays (four or more are recommended) and the gene expression indices are estimated more accurately when there is substantial variation in gene-level expression across the different arrays. We have found that the data from the Exon array platform is sufficiently uniform after normalization so that arrays from different studies may be combined for model fitting. Thus, for small scale studies with a limited number of arrays, we recommend combining the arrays with the Affymetrix tissue panel arrays to compute the expression index.
Finally, in any experiment, a substantial number of genes may be expressed at such a low level that they are undetectable by the array. It is often desirable to remove such 'absent' genes from further quantitative analyses . To identify the absent genes, we compare the observed probe intensities to their background levels predicted by the MAT model. A gene is called 'absent' if the statistical test shows that the probe intensities are not significantly different from background (see Materials and methods).
Evaluation of exon array gene-level expression measurements
We used existing tissue panel expression datasets to study the performance of Exon arrays and 3' arrays. The datasets consist of expression profiles from Human Exon 1.0 ST arrays and U133 Plus 2.0 arrays on 11 human tissues. For each array type, three replicates are available for each tissue. To evaluate the accuracy of A/P calls from Exon arrays as discussed above, we first construct gold-standard sets of 'present' genes and gold-standard sets of 'absent' genes using an independent data source. For each tissue type, these gold-standard sets are constructed based on an analysis of available SAGE libraries of the same tissue type (see Materials and methods).
The recently released Exon arrays differ from the previous 3' arrays in terms of the number, placement and annotational confidence of oligonucleotide probes. As a result, new methods that take advantage of Exon array design features can improve gene-level expression estimates. In this manuscript we propose a strategy for computing gene expression indices on Exon arrays that combines a probe-specific background correction with a probe selection procedure. Analysis of independent SAGE data demonstrates that A/P calls generated from the MAT background model offer substantial improvements. This improvement is likely due to both the increased number of probes per gene and the improvement of the MAT background model over the default Affymetrix background correction. Furthermore, we observed that Exon array gene expression indices show a high degree of correlation between human and mouse orthologs.
This work represents a step in the continued development of accurate gene-level expression estimates from microarray data. However, other approaches to estimating gene-level expression are possible, including model-based approaches [8, 12–14], and methods based on physical models [15, 16]. Various methods for estimating probe background intensities have also been proposed, incorporating probe content [17, 18] or using physical models . With so many probes interrogating background signal, Exon arrays present the opportunity to fit improved background models. Extensions of the MAT model  may offer future improvements in estimating background intensities. Alternative methods for probe selection are also possible. For example, Affymetrix proposes an iterative probe selection strategy, IterPLIER , in which a subset of probes of fixed size (11 probes) is iteratively chosen.
Accurate gene-level expression estimates are useful not only for high-level analysis but also serve as a first step for detecting alternative splicing [21–25]. For example, to detect alternative splicing we can track the exon-level expression relative to overall gene expression. Accurate baseline gene expressions will enable more sensitive detection of alternative splicing.
We describe a strategy for estimating gene-level expression indices on Exon arrays that incorporates a probe-specific background correction with a probe selection procedure. We validated our approach using independent SAGE data and cross-species comparisons.
Materials and methods
MAT background model
where PM i is the intensity of the perfect match probe i, n ik is the count of nucleotides of type k in the probe sequence, I ijk is the indicator of nucleotide of type k in position j, α, β jk , γ k are parameters in the model, and ε i is a probe-specific error term. Parameters were estimated by least squares using a set of probe intensities believed to be detecting mostly background. In this paper MAT parameters were fit using either the set of background probes or the set of full probes on Exon arrays.
R2model fit statistic
We compared the effect of training the MAT background model using different sets of probes. The MAT parameters were estimated using the given set of probes on one array. On a separate array, a single scale parameter was fit to adjust for overall background abundance. The R2 statistic was reported for the set of estimated background probe intensities on the separate array.
We compared the MAT background model with the Affymetrix GC bin background correction by computing an R2 statistic as follows. Each background model was trained using the background probes on a given array. On the same array, we computed the R2 statistic from the estimated intensities of full probes. The R2 statistic is given in terms of the log probe intensities.
Tissue panel dataset
We downloaded the public Human and Mouse Exon 1.0 ST Array and U133Plus2.0 Array tissue panel dataset  consisting of 11 tissues (breast, cerebellum, heart, kidney, liver, muscle, pancreas, prostate, spleen, testes, and thyroid), each with three replicates.
Probe selection was performed using the Affymetrix tissue panel dataset. The details of the probe selection algorithm have been detailed previously .
Exon array gene-level expression estimation
Background corrected, normalized, selected probe intensities were used to summarize gene expression using the model described in .
Mapping genes between 3' arrays and exon arrays
We used the mapping file provided by Affymetrix  to match genes between the U133 Plus 2.0 array and the Human Exon array. We required that a gene had at least one core probe on the exon array. The filtering resulted in a set of 17,165 genes mapped between the two arrays.
For the 3' arrays, the Affymetrix MAS 5.0 algorithm  uses the PM and MM probes to calculate a discrimination score for each probe pair:
R = (PM - MM)/(PM + MM)
First, the algorithm tests whether the PM and MM probe pairs are saturated. If all probe pairs are saturated, then the probeset is automatically declared present. Otherwise, the set of discrimination scores corresponding to probe pairs within a probeset are used to carry out a one-sided Wilcoxon's signed rank test, with null hypothesis parameter τ = 0.015 as the default value. Each probeset is assigned a detection p value used to make A/P calls.
For the Exon arrays, Affymetrix implements the detection above background (DABG) method for making A/P calls. Each probe is assigned an empirical p value, obtained by comparing the probe intensity to the distribution of background probe intensities with identical GC content. The p values are transformed via the formula x = -2log(p). Under the null hypothesis that the probes are detecting background, the p values follow a uniform [0,1] distribution and the transformed p values follow a distribution. The sum of the transformed p values follows a distribution, where k equals the number of probes.
We applied the MAT model to generate A/P calls. After fitting the MAT model, we use the linear model to compute an estimate of standard error for the predicted values. Each probe intensity is given a Z-score. The sum of the Z-scores is standardized to follow a standard Normal distribution.
Gold-standard A/P genes from SAGE data
For each tissue, SAGE data were used to construct gold-standard expressed and unexpressed sets of genes. We used the NCBI SAGEmap tool  to select tags present in the Cancer Genome Anatomy Project (CGAP) SAGE normal tissue libraries (GEO accession: GSE14). Eight of the eleven tissues surveyed in the tissue panel dataset had corresponding libraries (breast, cerebellum, heart, kidney, liver, pancreas, prostate, and thyroid). SAGE tags were mapped to the most likely corresponding UniGene cluster, which was subsequently mapped to genes on the U133 Plus 2.0 and Exon arrays. A gene was declared to belong to the reference set of expressed genes if it had a value of at least 100 tags per million (tpm). A gene was declared to belong to the reference set of unexpressed genes if: it had no SAGE tags observed in the given tissue; and it was expressed (that is, tpm ≥ 100) in at least one other SAGE tissue library.
Absence/presence ROC curves
We downloaded the publicly available human and mouse tissue panel datasets from Affymetrix. Exon array gene expressions were computed using the background correction, scale normalization and probe selection procedure described here. For each tissue, gene-level expression indices were averaged over the three replicates.
Using NCBI's HomoloGene database , we extracted 16,044 non-redundant orthologous gene pairs between human and mouse. A total of 10,480 gene pairs were represented on human and mouse Exon arrays. For a given tissue, we computed the cross-species correlation as the Pearson or Spearman correlation computed over the set of ortholog gene pairs.
We provide freely available software to compute gene expression indices on Exon arrays. We implement an additive MAT background correction and a simple normalization scheme where the background corrected intensities of core probes are scaled to have median intensity of 100. Additional background correction and normalization options are provided. See  for further details.
We would like to thank Hongkai Ji, Quntian Wang, Erik Miller, Eric Chiao, and Matthew Scott for helpful discussions and Wesley Miaw and Sebastian Maerkl for computational assistance. This research was supported by NSF grant DMS0505732 and NIH grant R01HG002341.
- Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 2006, 7: 55-65. 10.1038/nrg1749.PubMedView ArticleGoogle Scholar
- Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270: 467-470. 10.1126/science.270.5235.467.PubMedView ArticleGoogle Scholar
- Schena M, Shalon D, Heller R, Chai A, Brown PO, Davis RW: Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proc Natl Acad Sci USA. 1996, 93: 10614-10619. 10.1073/pnas.93.20.10614.PubMedPubMed CentralView ArticleGoogle Scholar
- Affymetrix: Exon Array Design Datasheet. [http://www.affymetrix.com/support/technical/datasheets/exon_arraydesign_datasheet.pdf]
- Wu Z: A model-based background adjustment for oligonucleotide expression arrays. J Am Stat Assoc. 2004, 99: 909-917. 10.1198/016214504000000683.View ArticleGoogle Scholar
- Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M, Liu XS: Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci USA. 2006, 103: 12457-12462. 10.1073/pnas.0601180103.PubMedPubMed CentralView ArticleGoogle Scholar
- Xing Y, Kapur K, Wong WH: Probe selection and expression index computation of affymetrix exon arrays. PLoS ONE. 2006, 1: e88-10.1371/journal.pone.0000088.PubMedPubMed CentralView ArticleGoogle Scholar
- Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001, 98: 31-36. 10.1073/pnas.011404098.PubMedPubMed CentralView ArticleGoogle Scholar
- McClintick JN, Edenberg HJ: Effects of filtering by Present call on analysis of microarray experiments. BMC Bioinformatics. 2006, 7: 49-10.1186/1471-2105-7-49.PubMedPubMed CentralView ArticleGoogle Scholar
- Liao BY, Zhang J: Evolutionary conservation of expression profiles between human and mouse orthologous genes. Mol Biol Evol. 2006, 23: 530-540. 10.1093/molbev/msj054.PubMedView ArticleGoogle Scholar
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA. 2004, 101: 6062-6067. 10.1073/pnas.0400782101.PubMedPubMed CentralView ArticleGoogle Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264. 10.1093/biostatistics/4.2.249.PubMedView ArticleGoogle Scholar
- Millenaar FF, Okyere J, May ST, van Zanten M, Voesenek LA, Peeters AJ: How to decide? Different methods of calculating gene expression from short oligonucleotide array data will give different results. BMC Bioinformatics. 2006, 7: 137-10.1186/1471-2105-7-137.PubMedPubMed CentralView ArticleGoogle Scholar
- Affymetrix: Guide to Probe Logarithmic Intensity Error (PLIER) Estimation. [http://www.affymetrix.com/support/technical/technotes/plier_technote.pdf]
- Hekstra D, Taussig AR, Magnasco M, Naef F: Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Res. 2003, 31: 1962-1968. 10.1093/nar/gkg283.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang L, Miles MF, Aldape KD: A model of molecular interactions on short oligonucleotide microarrays. Nat Biotechnol. 2003, 21: 818-821. 10.1038/nbt836.PubMedView ArticleGoogle Scholar
- Naef F, Magnasco MO: Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys Rev E Stat Nonlin Soft Matter Phys. 2003, 68: 011906-PubMedView ArticleGoogle Scholar
- Wu Z, Irizarry RA: Preprocessing of oligonucleotide array data. Nat Biotechnol. 2004, 22: 656-658. 10.1038/nbt0604-656b.PubMedView ArticleGoogle Scholar
- Marcelino LA, Backman V, Donaldson A, Steadman C, Thompson JR, Preheim SP, Lien C, Lim E, Veneziano D, Polz MF: Accurately quantifying low-abundant targets amid similar sequences by revealing hidden correlations in oligonucleotide microarray data. Proc Natl Acad Sci USA. 2006, 103: 13629-13634. 10.1073/pnas.0601476103.PubMedPubMed CentralView ArticleGoogle Scholar
- Affymetrix: Gene signal estimates from exon arrays. [http://www.affymetrix.com/support/technical/whitepapers/exon_gene_signal_estimate_whitepaper.pdf]
- Clark TA, Sugnet CW, Ares M: Genomewide analysis of mRNA processing in yeast using splicing-specific microarrays. Science. 2002, 296: 907-910. 10.1126/science.1069415.PubMedView ArticleGoogle Scholar
- Affymetrix: Alternative Transcript Analysis Methods for Exon Arrays. [http://www.affymetrix.com/support/technical/whitepapers/exon_alt_transcript_analysis_whitepaper.pdf]
- Cuperlovic-Culf M, Belacel N, Culf AS, Ouellette RJ: Microarray analysis of alternative splicing. Omics. 2006, 10: 344-357. 10.1089/omi.2006.10.344.PubMedView ArticleGoogle Scholar
- Li HR, Wang-Rodriguez J, Nair TM, Yeakley JM, Kwon YS, Bibikova M, Zheng C, Zhou L, Zhang K, Downs T, et al: Two-dimensional transcriptome profiling: identification of messenger RNA isoform signatures in prostate cancer from archived paraffin-embedded cancer specimens. Cancer Res. 2006, 66: 4079-4088. 10.1158/0008-5472.CAN-05-4264.PubMedView ArticleGoogle Scholar
- Lee C, Roy M: Analysis of alternative splicing with microarrays: successes and challenges. Genome Biol. 2004, 5: 231-10.1186/gb-2004-5-7-231.PubMedPubMed CentralView ArticleGoogle Scholar
- Affymetrix: Affymetrix Tissue Panel Exon Array Data. [http://www.affymetrix.com/support/technical/sample_data/exon_array_data.affx]
- Affymetrix: HuEx-1_0-st-v2 Transcript Cluster ID Mappings to HG-U133 Plus 2.0. [http://www.affymetrix.com/support/technical/byproduct.affx?product=huexon-st]
- Liu WM, Mei R, Di X, Ryder TB, Hubbell E, Dee S, Webster TA, Harrington CA, Ho MH, Baid J, et al: Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics. 2002, 18: 1593-1599. 10.1093/bioinformatics/18.12.1593.PubMedView ArticleGoogle Scholar
- Lash AE, Tolstoshev CM, Wagner L, Schuler GD, Strausberg RL, Riggins GJ, Altschul SF: SAGEmap: a public gene expression resource. Genome Res. 2000, 10: 1051-1060. 10.1101/gr.10.7.1051.PubMedPubMed CentralView ArticleGoogle Scholar
- HomoloGene. [ftp://ftp.ncbi.nih.gov/pub/HomoloGene/]
- GeneBASE. [http://biogibbs.stanford.edu/~kkapur/GeneBASE/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.