Detecting transcriptionally active regions using genomic tiling arrays
© Halasz et al.; licensee BioMed Central Ltd. 2006
Received: 26 September 2005
Accepted: 5 July 2006
Published: 19 July 2006
We have developed a method for interpreting genomic tiling array data, implemented as the program TranscriptionDetector. Probed loci expressed above background are identified by combining replicates in a way that makes minimal assumptions about the data. We performed medium-resolution Anopheles gambiae tiling array experiments and found extensive transcription of both coding and non-coding regions. Our method also showed improved detection of transcriptional units when applied to high-density tiling array data for ten human chromosomes.
A complete understanding of an organism's biology requires identification of the complete set of RNA transcripts it expresses. Elucidating this 'transcriptome' has proven challenging for two reasons. First, even when a complete genome sequence is available, it has proven difficult to define the exact location and number of protein-coding genes . Second, many transcripts are non-coding RNAs, which are thought to play a largely regulatory role, and are often active at relatively low levels, or in a tissue-specific manner. Expressed sequence tag (EST) sequencing and similar techniques will, therefore, often fail to detect them.
To fully catalog transcripts, several groups have used genomic microarrays, which assay expression with probes spaced more or less evenly along the genome [2–15]. These tools have higher sensitivity than EST sequencing, and provide a high-throughput way of measuring RNAs from different samples and cellular contexts. Whole-genome array studies of Arabidopsis thaliana [12, 14], Drosophila melanogaster, Saccharomyces cerevisiae [4, 10], Oryza sativa, Mus musculus and Homo sapiens [2, 3, 6, 7, 9, 11, 15] all detect a great deal of transcription outside known protein-coding regions.
Despite the usefulness and recent popularity of whole-genome arrays, to date there is no standard way to perform such experiments or analyze their data. Existing studies vary, among other things, in how they set the threshold above which transcripts are considered to be expressed, in their choice of negative controls (if any) used to obtain this threshold, and in how they combine information from multiple arrays. One feature that is usually shared, however, is the inference of transcriptional activity based on the signal intensities of multiple adjacent probes [2–9, 11, 15, 17].
Various approaches are also used to account for background intensity (cross-hybridization to probes by partially complementary transcripts), probe sequence features that systematically bias signal measurements, and variability in the range of intensities between different arrays. Several studies have explicitly modeled signal intensities to distinguish signal from background noise. These models incorporate parameters for transcript concentration and probe-specific affinities [18, 19] and array- and dye-associated variability in signal intensity, or explain signal intensity for a probe as a function of its sequence using statistical and thermodynamic models [21–25]. They usually differentiate between signal arising from hybridization of cognate transcript to the probe (specific hybridization) and signal arising from cross-hybridization. Finally, normalization procedures have been developed to remove non-biological variability between replicate microarray experiments.
In this paper, we introduce a strategy for designing and interpreting genome-wide tiling experiments, the final result of our analysis being a list of probed loci that are putatively expressed. Like some other methods [3, 6, 7, 10, 14], we make use of negative control probes that represent non-specific background hybridization to evaluate the significance of expression of individual probed loci. However, we combine information from replicates in a way that makes minimal assumptions about the distribution of signal intensities and avoids putting a threshold on individual replicates. In addition, we model the dependence of non-specific hybridization on probe sequence; subtracting the systematic bias explained by these models greatly improves our ability to detect transcripts. For high-density arrays, the signal of neighboring probes can be combined to take advantage of the fact that the same transcript will contribute to the intensity of multiple probes, but this is not essential to our approach, which can, therefore, be successfully applied to low-density tiling array data as well.
Correcting for the effect of probe sequence on non-specific hybridization
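Table 1. Summary of sequence correction models
Columns: Model; Number of parameters; Average adjusted R2; Number of transcriptionally active regions
GC model: $\log I = \beta_0 + \beta_{GC}(N_C + N_G)$
Nucleotide-specific model: $\log I = \beta_0 + \beta_A N_A + \beta_C N_C + \beta_G N_G$
Bilinear model: $\log I = \beta_0 + \sum_{i=1}^{n} \gamma_i \beta_{b(i)}$; number of parameters: 41 = 36 + 4 + 1
Full position-specific model: $\log I = \beta_0 + \sum_{i=1}^{n} \delta_{i,b(i)}$; number of parameters: 109 = 36 × 3 + 1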
Dealing with variation in signal intensity across channels
Our approach solves this problem by pooling data from different channels in a fully non-parametric way, thereby avoiding any assumptions about how the different channels relate to each other. The only assumption we make is that, for a given channel, signal intensity increases monotonically with transcript abundance once the intensities have been corrected for probe sequence bias, as described above. The first step assigns a channel-specific 'single-channel' p value to each probe, defined as the fraction of negative control probes (NCPs) whose signal intensity is larger than that of the probe within the same channel. The second step combines the single-channel p values for each probe into a single 'multi-channel p value' (MCPV), reflecting the likelihood that the set of intensities observed for that probe can be interpreted as background signal. This approach obviates the need to explicitly model dye- and array-specific effects.
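The single-channel step can be sketched in a few lines of Python. This is a minimal illustration rather than the TranscriptionDetector implementation; the function name and the toy intensities are hypothetical, and the intensities are assumed to be already corrected for sequence bias.

```python
from bisect import bisect_right

def single_channel_p_values(probe_intensities, ncp_intensities):
    """Fraction of NCPs in the same channel that are brighter than each probe."""
    ncp_sorted = sorted(ncp_intensities)
    n = len(ncp_sorted)
    p_values = []
    for x in probe_intensities:
        # bisect_right counts NCPs with intensity <= x; the rest are strictly larger
        n_larger = n - bisect_right(ncp_sorted, x)
        p_values.append(n_larger / n)
    return p_values

ncp = [1.0, 2.0, 3.0, 4.0]                       # toy NCP intensities for one channel
print(single_channel_p_values([2.5, 0.5, 4.0], ncp))  # [0.5, 1.0, 0.0]
```

Because the p value depends only on ranks within a channel, any monotonic rescaling of that channel leaves it unchanged. Note that a probe brighter than every NCP receives a p value of exactly zero, so the resolution is limited by the number of NCPs on the array.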
Residual bias of negative control probes after sequence correction
In a classic approach to combining the results from multiple, independent statistical tests performed for the same feature, the product of individual p values is interpreted as a new test statistic, and transformed to a variable that is uniformly distributed between zero and one under the null assumption of independent tests for that feature, using a property of the χ2 distribution or an equivalent geometric approach. We will refer to the resulting p value as a 'Fisher p value'.
Multi-channel p values: integrating evidence for transcription across channels
The existence of a subtle correlation between channels, presumably due to specific off-target hybridization, violates the independence assumption behind Fisher p values, so they cannot be used to integrate single-channel p values across multiple channels. However, we still want to integrate weak evidence for transcription from individual channels for the exon probes (EPs) and non-exon probes (NEPs). This goal can be achieved by first computing the product of single-channel p values (derived from the NCP intensity distribution) for the NCPs as well as for the EPs and NEPs. The multi-channel p value (MCPV) of an EP or NEP is then defined as the fraction of NCPs with a p value product smaller than that of the probe in question (see Materials and methods). Comparison of Figure 1 with Figure 4b shows the increased separation between NCP, NEP, and EP distributions when evidence for transcription is integrated across channels.
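A minimal Python sketch of the MCPV definition follows. It assumes that each probe is represented by a list of its single-channel p values; the function name and the toy values are hypothetical.

```python
from bisect import bisect_left
from math import prod

def multi_channel_p_values(probe_p, ncp_p):
    """probe_p, ncp_p: one list of single-channel p values per probe."""
    # Empirical null distribution: p value products of the negative controls
    ncp_products = sorted(prod(p) for p in ncp_p)
    n = len(ncp_products)
    mcpvs = []
    for p in probe_p:
        tau = prod(p)
        # bisect_left counts NCPs whose product is strictly smaller than tau
        mcpvs.append(bisect_left(ncp_products, tau) / n)
    return mcpvs

ncp_p = [[0.9, 0.8], [0.5, 0.5], [0.2, 0.1], [0.6, 0.7]]   # toy values, 2 channels
probe_p = [[0.1, 0.1], [0.5, 0.6], [1.0, 1.0]]
print(multi_channel_p_values(probe_p, ncp_p))  # [0.0, 0.5, 1.0]
```

Since the null distribution is built from the NCP products themselves, any correlation between channels that affects the NCPs is absorbed into the reference distribution instead of being assumed away.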
Application to low-density genomic array data for mosquito
Application to high-density human tiling array data
On high-resolution tiling arrays, where probes are spaced closely together, a given transcript will contribute to the signal intensity of multiple consecutive probes. The more probes with a low MCPV we encounter in a given genomic region, the more confident we are that the region is transcribed. This reasoning is in direct analogy with that used to derive MCPVs in the first place: instead of integrating evidence across channels, we now wish to integrate evidence across adjacent probes. We achieved this by adding a 'smoothing' step, in which the MCPV of each probe is replaced by the Fisher p value obtained by combining its MCPV with that of its nearby neighbors. It is crucial that only non-overlapping neighboring probes be included in this neighborhood set, to guarantee the statistical independence of the various MCPVs that are being combined.
We compared the results of our method to that obtained by Cheng et al.  in their analysis of 10 human chromosomes using 25 base-pair (bp) probes at 5 bp resolution. This study lacked NCPs specifically designed not to match any genomic region, so we used a set of 2,634 non-spiked-in bacterial probe pairs instead. When smoothing using n probes on either side of the central probe (that is, combining 2n + 1 MCPVs), we found that performance increased up to n = 5 and then stabilized, so we settled on that value, which corresponds to a region of approximately 275 bp. Applying a threshold to the resulting smoothed MCPVs classifies each probe as 'expressed' or 'not expressed'. Optionally, we applied the 'minrun' and 'maxgap' criteria used by Cheng et al.  (see Materials and methods).
We have described a method for designing and interpreting genomic tiling array data that makes minimal assumptions about intensity distribution and variation between replicates. Combining the results from any number of hybridizations to a microarray whose design includes a set of NCPs, our algorithm assigns one MCPV to each probe, which can be used to determine which probed loci are transcriptionally active. Applying a signal intensity threshold only after the evidence from multiple channels has been combined enhances the sensitivity of our method. Including NCPs in the design of our microarray allowed us to quantitatively model the dependence of background signal intensity on probe sequence, without the need to simultaneously parameterize specific and non-specific contributions to signal intensity [21–24]. Reducing the variance of the NCP probe intensities by accounting for sequence bias using this model greatly increased the number of transcripts detected. More sophisticated sequence models could further improve our method's sensitivity.
The probe sequence correction (and for high-density tiling arrays the size of the smoothing neighborhood) is the only parametric component of our method. Beyond that, our algorithm uses a completely non-parametric approach to the problem of signal variability across channels; no assumptions are made about the distribution of signal intensities in each channel. Of course, there is the risk of decreased statistical power when using non-parametric methods when a parametric one would be justified. To address this issue explicitly, we calculated channel-specific Z-scores for each probe based on the mean and standard deviation of NCP intensity for each channel, and averaged these across channels for each probe. Alternatively, we performed quantile normalization , and then averaged intensities across channels for each probe. In both cases, the normalized and averaged intensities were subsequently used to derive a multi-channel p value for each probe. These parametric variants of our method gave results very similar to the approach defined in Figure 4. The Z-score-based approach identifies 96% to 99% of the probes reported in Table 1, while reporting 1% to 10% novel probes, depending on the sequence correction used; the corresponding ranges for the normalization-based scheme are 94% to 97% and 1% to 2%, respectively. In summary, this comparison shows that we are not sacrificing statistical power for the sake of simplicity.
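The Z-score variant described above can be sketched as follows. The text does not say whether the sample or population standard deviation was used; the sample standard deviation is assumed here, and all names and toy values are hypothetical.

```python
from statistics import mean, stdev

def mean_z_scores(probe_by_channel, ncp_by_channel):
    """Average, across channels, of Z-scores computed against each channel's NCPs.

    probe_by_channel[c][k] is the intensity of probe k in channel c;
    ncp_by_channel[c] holds the NCP intensities of channel c.
    """
    z_per_channel = []
    for probe_vals, ncp_vals in zip(probe_by_channel, ncp_by_channel):
        mu, sd = mean(ncp_vals), stdev(ncp_vals)   # sample SD (assumption)
        z_per_channel.append([(x - mu) / sd for x in probe_vals])
    # zip(*...) regroups the per-channel lists into one tuple of Z-scores per probe
    return [mean(zs) for zs in zip(*z_per_channel)]

ncp = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]      # two channels on different scales
probes = [[3.0, 2.0], [30.0, 20.0]]              # two probes measured in both channels
print(mean_z_scores(probes, ncp))  # [1.0, 0.0]
```

Unlike the rank-based procedure, this variant assumes that the mean and standard deviation are adequate summaries of the NCP distribution in each channel.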
Our initial attempt at integrating evidence across channels using Fisher p values uncovered a systematic probe-specific bias in NCP signal that persists across channels even after sequence correction (compare Figure 4a). It is interesting to note that this bias also manifests itself in the Z-score representation: if we compute the mean Z-score for each NCP probe across channels, the standard deviation of these means (0.638) is about twice as large as the inverse square root of 10, that is, the value that would be expected for 10 independent channels. Presumably, this effect is due to the sequence-specific partial hybridization between each control probe and a subset of the RNA transcripts present in the cell. This underscores the fact that, despite being designed to have at least three mismatches, NCPs are subject to substantial cross-hybridization. While it cannot be excluded that tiling probes experience a somewhat different spectrum of cross-hybridization contributions due to internal similarities within the genome, it seems reasonable to use the NCP intensities to estimate their variance.
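The arithmetic behind this comparison can be made explicit. Assuming that the per-channel NCP Z-scores have unit variance (which follows from standardizing against the NCP mean and standard deviation) and that the 10 channels are independent, the standard deviation of the per-probe mean would be:

```latex
\sigma_{\bar{Z}} = \frac{1}{\sqrt{10}} \approx 0.316,
\qquad
\frac{0.638}{0.316} \approx 2.02
```

The observed spread of the mean Z-scores is therefore roughly twice what independence would predict, which is the signature of a probe-specific component shared across channels.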
The fraction of significantly expressed probed loci found for A. gambiae is considerably lower than the figure we previously reported for D. melanogaster. We attribute this discrepancy to two improvements in our analysis: a change in the definition of negative control probes, and our more stringent way of computing MCPVs. Repeating our analysis of A. gambiae using Fisher p values caused 43% of probed non-exonic loci and 75% of exonic loci to be classified as transcriptionally active, numbers that are very similar to those from the earlier D. melanogaster analysis.
Given the relatively sparse placement of probes on the A. gambiae arrays, and to avoid making assumptions about the structure or size of transcribed regions, we determined the significance of each probed locus independently of its neighbors. As we demonstrate using a high-density human data set, our method can be readily extended to take advantage of the fact that, at higher probe densities, a single transcript can contribute to the signal intensity of multiple adjacent probes. It is, therefore, useful for interpreting both high-density tiling arrays, where spatial dependencies can be exploited, and low-density arrays, where adjacent probes are too far apart to yield such information.
Materials and methods
The NASA Oligonucleotide Probe Selection Algorithm (NOPSA) was used to select optimal 36-mer probes measuring expression from EPs and NEPs. Coding and non-coding regions were identified based on annotations from the Ensembl database (file anopheles_gambiae_core_15_2). As a control for non-specific EP and NEP hybridization, 4,000 dodecanucleotides absent from the A. gambiae genome were identified computationally. NCPs were then formed by random concatenation of three such 12-mers, guaranteeing that each NCP had at least three mismatches relative to any 36 nucleotide stretch of the Anopheles genome. Five microarrays, each containing an identical set of 76,782 EPs, 94,469 NEPs and 1,000 NCPs were synthesized using Maskless array synthesizer (MAS) technology .
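The mismatch guarantee follows from the construction: whichever 36-nucleotide genomic window an NCP is aligned against, each of its three 12-mer segments is absent from the genome and so must differ from the aligned genomic 12-mer in at least one position. The toy sketch below illustrates the idea with 2-mers instead of 12-mers. The function names are hypothetical, only the forward strand is considered, and exhaustive enumeration is used for simplicity (it would be slow for k = 12).

```python
import random
from itertools import product

def absent_kmers(genome, k):
    """All k-mers over ACGT that never occur in the genome string."""
    present = {genome[i:i + k] for i in range(len(genome) - k + 1)}
    return [kmer for kmer in ("".join(t) for t in product("ACGT", repeat=k))
            if kmer not in present]

def build_ncps(absent, n_probes, parts=3, seed=0):
    """Concatenate randomly chosen absent k-mers into control probes."""
    rng = random.Random(seed)
    return ["".join(rng.choice(absent) for _ in range(parts))
            for _ in range(n_probes)]

def min_mismatches(probe, genome):
    """Smallest number of mismatches over all ungapped alignments to the genome."""
    width = len(probe)
    return min(sum(a != b for a, b in zip(probe, genome[i:i + width]))
               for i in range(len(genome) - width + 1))

toy_genome = "ACGTACGGTTAACC"
absent = absent_kmers(toy_genome, k=2)
ncps = build_ncps(absent, n_probes=5)
print(all(min_mismatches(p, toy_genome) >= 3 for p in ncps))  # True
```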
Samples and hybridization
Three to five day old A. gambiae adults (G3 strain) were sorted by sex and homogenized in Trizol. Total RNA was isolated using Heavy phase lock gel columns (Invitrogen, Carlsbad, CA, USA) and polyadenylated RNA was extracted using oligodT chromatography columns (BioRad, Hercules, CA, USA). We labeled 3 μg of each experimental sample by chemical coupling of Cy3 or Cy5 dyes (Amersham, Piscataway, NJ, USA) to the aminoallyl nucleotide introduced during cDNA synthesis (Powerscript reverse transcriptase, BD Biosciences, Franklin Lakes, NJ, USA). Labeled samples were purified using RNeasy columns (Qiagen, Valencia, CA, USA) and hybridized overnight at 52°C to high density oligonucleotide microarrays. The arrays were scanned using an Axon scanner (Molecular Devices Corporation, Sunnyvale, CA, USA). Males were labeled twice with Cy3 and three times with Cy5; the reverse was done for females. Each array measured RNA from both sexes.
Probe sequence bias correction
Five different models were used to relate NCP sequence to signal intensity (Table 1). The most basic is the 'GC model', which assumes a linear relationship between signal log-intensity and GC content. The 'Nucleotide-specific model' is slightly more complex, explaining the signal in terms of the representation of each base, not just G and C. The remaining two models take position dependencies into account by allowing different segments of the probe to make independent contributions to binding, and are described below.
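As an illustration of the simplest case, the GC model can be fitted to NCP data by ordinary least squares, and the fitted background can then be subtracted from any probe. This is a sketch under stated assumptions: natural logarithms are used (the base is not specified in the text), and the function names and toy data are hypothetical.

```python
from math import exp, log

def gc_count(seq):
    return seq.count("G") + seq.count("C")

def fit_gc_model(sequences, intensities):
    """Least-squares fit of log I = b0 + b_gc * (N_C + N_G) on negative controls."""
    x = [gc_count(s) for s in sequences]
    y = [log(i) for i in intensities]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b_gc = sxy / sxx
    return mean_y - b_gc * mean_x, b_gc

def corrected_log_intensity(seq, intensity, b0, b_gc):
    """Log-intensity minus the background predicted from probe sequence."""
    return log(intensity) - (b0 + b_gc * gc_count(seq))

# Toy NCPs whose log-intensity rises by 0.5 per G or C
b0, b_gc = fit_gc_model(["AAAA", "AAAC", "AAGC"], [exp(1.0), exp(1.5), exp(2.0)])
print(round(b0, 6), round(b_gc, 6))  # 1.0 0.5
```

Fitting on the NCPs alone means that the model describes non-specific hybridization only, so no term for specific signal has to be estimated at the same time.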
The 'Bilinear model' derives both base- and position-specific parameters, under the assumption that these two variable types are independent. The signal log-intensity of each probe is then given by:
$$\log I = \beta_0 + \sum_{i=1}^{n} \gamma_i \, \beta_{b(i)}$$
where γ_i is the weight for position i along the probe, β_b is the weight for base b, b(i) is the base at position i, and n is the length of the probe. The values for the two sets of model parameters were determined by iterating between regression of γ and β until convergence.
The 'Full Position-specific model' combines the base and position weights into a single parameter δ_{i,b}, reflecting the weight associated with having base b at position i. The signal log-intensity is then simply given by:
$$\log I = \beta_0 + \sum_{i=1}^{n} \delta_{i,b(i)}$$
This last model is essentially that of previous work, which explained most of the variance in signal intensity with weights associated with a particular base at a particular position, and found that terms modeling features of secondary structure were less important. Other studies have used very similar models, but parameterize the positional dependence for each base as a polynomial or using a spline.
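To make the two position-dependent models concrete, the sketch below evaluates the predicted log-intensity of a probe from given parameters. It assumes the forms log I = β0 + Σ γ_i β_b(i) for the bilinear model and log I = β0 + Σ δ_i,b(i) for the full position-specific model, which follow from the parameter definitions above. Treating one base per position as a zero-weight reference is an assumption suggested by the parameter count of 36 × 3 + 1; all names and numbers are hypothetical.

```python
def bilinear_log_intensity(seq, beta0, gamma, beta):
    """gamma[i]: weight of position i; beta[b]: weight of base b."""
    return beta0 + sum(gamma[i] * beta[b] for i, b in enumerate(seq))

def position_specific_log_intensity(seq, beta0, delta):
    """delta[i][b]: weight of base b at position i.

    A base missing from delta[i] is treated as the reference base (weight 0).
    """
    return beta0 + sum(delta[i].get(b, 0.0) for i, b in enumerate(seq))

# Toy 3-mer example
gamma = [1.0, 2.0, 3.0]
beta = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
delta = [{"A": 0.5}, {"C": -0.25}, {}]
print(position_specific_log_intensity("ACG", 1.0, delta))  # 1.25
```

The bilinear form needs far fewer parameters because every position shares the same four base weights, scaled by a position-specific factor.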
Computing Fisher p values for putatively independent channels
For each probe k, we first computed a test statistic τ_k equal to the product of all single-channel p values P_{kc}:
$$\tau_k = \prod_{c=1}^{n} P_{kc}$$
where c labels the channel and n is the total number of channels. Fisher p values were then computed as the probability that n uniformly distributed independent random variables would yield a product at least as small as that observed for a given probe. This probability is given by:
$$P_k^{\mathrm{Fisher}} = \tau_k \sum_{i=0}^{n-1} \frac{(-\ln \tau_k)^i}{i!}$$
See  for details.
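The closed-form expression for the distribution of a product of n independent uniform variables, P = τ Σ_{i=0}^{n-1} (−ln τ)^i / i!, can be evaluated directly. The sketch below is illustrative only and the function name is hypothetical; for very small products one would work with sums of logarithms to avoid underflow.

```python
from math import factorial, log, prod

def fisher_p_value(p_values):
    """P(product of n independent U(0,1) variables <= observed product)."""
    n = len(p_values)
    tau = prod(p_values)
    if tau == 0.0:
        return 0.0                      # log(0) is undefined; the limit is 0
    neg_log_tau = -log(tau)
    return tau * sum(neg_log_tau ** i / factorial(i) for i in range(n))

print(fisher_p_value([0.3]))            # 0.3: a single p value is returned unchanged
print(round(fisher_p_value([0.5, 0.5]), 4))  # 0.5966
```

Two unremarkable p values of 0.5 combine to about 0.60 rather than 0.25, which shows why the raw product cannot itself be read as a p value.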
Multi-channel p values and false discovery rate procedure
Since cross-hybridizing transcripts invalidate the independence assumption, MCPVs were ultimately used in our procedure. These were obtained by comparing the τ statistic (as defined above) for each probe to a null distribution composed of the τ-values for the NCPs. A significance threshold was derived using a false discovery rate (FDR) procedure, using an FDR of 5%. Briefly, the n MCPVs were ranked in non-decreasing order: P_1 ≤ P_2 ≤ ... ≤ P_n. The largest i for which:
$$P_i \le \frac{i}{n}\,\alpha$$
where α = 0.05, identifies P_i as the largest MCPV that is still significant. Probes with MCPV less than or equal to P_i are, therefore, considered to detect loci expressed above background.
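The step-up rule described here (find the largest rank i with P_i ≤ (i/n)α) can be written compactly. This is a generic sketch of that rule with a hypothetical function name, not code taken from TranscriptionDetector.

```python
def fdr_threshold(p_values, alpha=0.05):
    """Largest p value P_i satisfying P_i <= (i / n) * alpha, or None if none does."""
    ranked = sorted(p_values)
    n = len(ranked)
    threshold = None
    for i, p in enumerate(ranked, start=1):
        if p <= i / n * alpha:
            threshold = p               # keep the largest rank that passes
    return threshold

mcpvs = [0.03, 0.001, 0.5, 0.02]
cutoff = fdr_threshold(mcpvs)
print(cutoff)                           # 0.03
print([p <= cutoff for p in mcpvs])     # [True, True, False, True]
```

Every probe at or below the cutoff is called significant, including probes whose own p value did not pass its rank-specific bound.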
Evidence integration for adjacent probes on high-density tiling arrays
For each probe, Fisher p values were calculated over its own MCPV and those of up to n upstream and n downstream probes. If there were fewer than 2n such probes within 30 × n nucleotides of the central probe, only those were used in the calculation. Because overlapping probes are not independent, only completely non-overlapping probes were used. The Fisher p value itself was calculated in exactly the same way as for putatively independent channels; the test statistic is now:
$$\tau_k = \prod_{i=k-n}^{k+n} P_i$$
where k labels the central probe being evaluated, i runs over the central probe and its selected non-overlapping neighbors, and P_i is the MCPV for probe i.
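One way to implement this neighborhood selection is sketched below. Several details are assumptions: probes are described by sorted start coordinates on a single chromosome, neighbors are picked greedily while walking away from the central probe, and the default values (25-base probes, a window of 30 × n nucleotides) are taken from the human data set described in the text. The function names are hypothetical.

```python
from math import factorial, log, prod

def fisher_p_value(p_values):
    """Probability that independent uniform p values give a product this small."""
    tau = prod(p_values)
    if tau == 0.0:
        return 0.0
    neg_log_tau = -log(tau)
    return tau * sum(neg_log_tau ** i / factorial(i) for i in range(len(p_values)))

def non_overlapping_neighborhood(starts, k, n=5, probe_len=25, max_step=30):
    """Indices of probe k plus up to n non-overlapping probes on each side."""
    chosen = [k]
    for step in (1, -1):                          # walk right, then left
        last, taken, j = starts[k], 0, k + step
        while 0 <= j < len(starts) and taken < n:
            if abs(starts[j] - starts[k]) > max_step * n:
                break                             # outside the 30 x n window
            if abs(starts[j] - last) >= probe_len:
                chosen.append(j)                  # no overlap with previous pick
                last, taken = starts[j], taken + 1
            j += step
    return sorted(chosen)

def smoothed_mcpv(starts, mcpvs, k, n=5):
    idx = non_overlapping_neighborhood(starts, k, n)
    return fisher_p_value([mcpvs[i] for i in idx])

starts = list(range(0, 500, 5))                   # toy 5 bp tiling path
print(non_overlapping_neighborhood(starts, 50, n=1))  # [45, 50, 55]
```

With 25-base probes at 5 bp spacing and n = 5, the selected set contains 11 probes covering about 275 bp, in line with the region size quoted in the text.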
Analysis of Affymetrix high-density human tiling array data
Affymetrix CEL expression files, CDF probe annotation files, and negative control probe data were downloaded. An array-specific p value was computed for each tiling path probe by comparing its log(PM/MM) value to a negative control distribution of non-spiked-in bacterial probe pairs. P values for different replicates were combined into a single MCPV per probe, and the MCPVs were in turn smoothed as described in the previous section, using n = 5. To keep our comparison with Cheng et al. focused, we did not sequence correct probe intensities, and we applied the same minrun (50 bp) and maxgap (30 bp) criteria as described in that study. Under these criteria, probes passing a given smoothed MCPV threshold are considered positive; if two such positive probes are within maxgap bases of each other, all probes between them are also considered positive; and a contiguous stretch of positive probes must be at least minrun bases in length, otherwise the probes in the 'failed' run are considered negative.
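The maxgap and minrun rules can be sketched as a two-pass procedure over probes sorted by coordinate. How distances are measured is an assumption here: gaps are taken between start coordinates of consecutive positive probes, and run length is taken from the start of the first probe to the end of the last. The function name and the toy data are hypothetical.

```python
def apply_maxgap_minrun(starts, positive, probe_len=25, maxgap=30, minrun=50):
    """Join nearby positive probes into runs, then drop runs shorter than minrun."""
    pos_idx = [i for i, flag in enumerate(positive) if flag]
    result = [False] * len(starts)
    if not pos_idx:
        return result
    # Pass 1: split the positive probes into runs wherever the gap exceeds maxgap
    runs = []
    run_start = prev = pos_idx[0]
    for i in pos_idx[1:]:
        if starts[i] - starts[prev] > maxgap:
            runs.append((run_start, prev))
            run_start = i
        prev = i
    runs.append((run_start, prev))
    # Pass 2: keep only runs spanning at least minrun bases, filling their gaps
    for first, last in runs:
        if starts[last] + probe_len - starts[first] >= minrun:
            for i in range(first, last + 1):
                result[i] = True
    return result

starts = list(range(0, 100, 5))
positive = [i in (2, 6, 7, 15) for i in range(len(starts))]
called = apply_maxgap_minrun(starts, positive)
print([i for i, flag in enumerate(called) if flag])  # [2, 3, 4, 5, 6, 7]
```

In the toy example the isolated positive probe at index 15 is discarded because its run spans only 25 bases, while the probes between indices 2 and 7 are filled in.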
ROC curve analysis
Transcribed regions ('transfrags') predicted by Cheng et al. (cytosolic/polyA+ samples only) were downloaded, and a union was taken across all cell lines. UCSC genome annotation files for ESTs, mRNAs, and annotated ('known') genes were also downloaded. Probes overlapping any part of these UCSC regions were taken to be our gold standard, relative to which sensitivity and specificity were calculated. For Cheng et al., the predicted probes were considered to be those overlapping their predicted transfrags. For our analysis, predicted probes were obtained as described in the previous section, using a range of MCPV thresholds.
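The probe-level ROC calculation amounts to recomputing sensitivity and specificity at each MCPV threshold. The sketch below assumes per-probe boolean gold-standard labels and calls a probe positive when its smoothed MCPV is at or below the threshold; the function names and toy values are hypothetical, and both classes are assumed to be present in the gold standard.

```python
def sensitivity_specificity(predicted, gold):
    """predicted, gold: equal-length lists of booleans, one entry per probe."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    tn = sum(1 for p, g in pairs if not p and not g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    return tp / (tp + fn), tn / (tn + fp)

def roc_points(mcpvs, gold, thresholds):
    """(false positive rate, true positive rate) for each MCPV threshold."""
    points = []
    for t in thresholds:
        sens, spec = sensitivity_specificity([p <= t for p in mcpvs], gold)
        points.append((1 - spec, sens))
    return points

mcpvs = [0.01, 0.2, 0.03, 0.9]
gold = [True, False, False, True]
print(roc_points(mcpvs, gold, [0.05, 1.0]))  # [(0.5, 0.5), (1.0, 1.0)]
```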
Raw expression data for the present study has been submitted to the NCBI Gene Expression Omnibus as series GSE5196.
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 contains probe sequence and raw signal intensities for exon probes. Additional data file 2 contains probe sequence and raw signal intensities for non-exon probes. Additional data file 3 contains probe sequence and raw signal intensities for negative control probes. Additional data file 4 contains genomic coordinates for regions measured by exon probes. Additional data file 5 contains genomic coordinates for regions measured by non-exon probes. Additional data file 6 contains four supplementary figures: supplementary Figure 1 demonstrates that signal variability between different probe populations on the same channel is not explained by probe sequence composition; supplementary Figure 2 shows Q-Q plots for NCP signal intensities in different channels, showing that these have heterogeneous and non-normal distributions; supplementary Figure 3 demonstrates that signal variability between negative control probes on different channels is not explained by probe sequence composition; supplementary Figure 4 has two ROC curves showing true positive rate versus false positive rate relative to (a) mRNA and (b) EST transcripts annotated in the UCSC database (the '+' symbol corresponds to the transfrags as defined by Cheng et al. ; and lines correspond to our algorithm as applied with/without neighborhood smoothing and with/without minrun/maxgap post-processing).
We are grateful to an anonymous reviewer for valuable and detailed comments. HJB was supported by grants from the National Institutes of Health (HG003008, CA121852). KPW was supported by grants from the WM Keck Foundation, the Arnold and Mabel Beckman Foundation, and the NIH/NHGRI. MFvB was supported by grant BMI-050.50.201 from the Netherlands Organization for Scientific Research (NWO). GH was supported by an NIH training program in molecular biophysics (GM08281).
- Hogenesch JB, Ching KA, Batalov S, Su AI, Walker JR, Zhou Y, Kay SA, Schultz PG, Cooke MP: A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell. 2001, 106: 413-415. 10.1016/S0092-8674(01)00467-6.
- Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, et al: Global identification of human transcribed sequences with genome tiling arrays. Science. 2004, 306: 2242-2246. 10.1126/science.1103388.
- Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G, et al: Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science. 2005, 308: 1149-1154. 10.1126/science.1108625.
- David L, Huber W, Granovskaia M, Toedling J, Palm CJ, Bofkin L, Jones T, Davis RW, Steinmetz LM: A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci USA. 2006, 103: 5320-5325. 10.1073/pnas.0601091103.
- Frey BJ, Mohammad N, Morris QD, Zhang W, Robinson MD, Mnaimneh S, Chang R, Pan Q, Sat E, Rossant J, et al: Genome-wide analysis of mouse transcripts using exon microarrays and factor graphs. Nat Genet. 2005, 37: 991-996. 10.1038/ng1630.
- Kampa D, Cheng J, Kapranov P, Yamanaka M, Brubaker S, Cawley S, Drenkow J, Piccolboni A, Bekiranov S, Helt G, et al: Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. 2004, 14: 331-342. 10.1101/gr.2094104.
- Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SP, Gingeras TR: Large-scale transcriptional activity in chromosomes 21 and 22. Science. 2002, 296: 916-919. 10.1126/science.1068597.
- Li L, Wang X, Stolc V, Li X, Zhang D, Su N, Tongprasit W, Li S, Cheng Z, Wang J, Deng XW: Genome-wide transcription analyses in rice using tiling microarrays. Nat Genet. 2006, 38: 124-129. 10.1038/ng1871.
- Rinn JL, Euskirchen G, Bertone P, Martone R, Luscombe NM, Hartman S, Harrison PM, Nelson FK, Miller P, Gerstein M, et al: The transcriptional activity of human Chromosome 22. Genes Dev. 2003, 17: 529-540. 10.1101/gad.1055203.
- Samanta MP, Tongprasit W, Sethi H, Chin CS, Stolc V: Global identification of noncoding RNAs in Saccharomyces cerevisiae by modulating an essential RNA processing pathway. Proc Natl Acad Sci USA. 2006, 103: 4192-4197. 10.1073/pnas.0507669103.
- Schadt EE, Edwards SW, GuhaThakurta D, Holder D, Ying LVS, Svetnik V, Hart KW, Russell A, Li G, Cavet C, et al: A comprehensive transcript index of the human genome generated using microarrays and computational approaches. Genome Biol. 2004, 5: R73. 10.1186/gb-2004-5-10-r73.
- Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, Scholkopf B, Weigel D, Lohmann JU: A gene expression map of Arabidopsis thaliana development. Nat Genet. 2005, 37: 501-506. 10.1038/ng1543.
- Stolc V, Gauhar Z, Mason C, Halasz G, van Batenburg MF, Rifkin SA, Hua S, Herreman T, Tongprasit W, Barbano PE, et al: A gene expression map for the euchromatic genome of Drosophila melanogaster. Science. 2004, 306: 655-660. 10.1126/science.1101312.
- Yamada K, Lim J, Dale JM, Chen H, Shinn P, Palm CJ, Southwick AM, Wu HC, Kim C, Nguyen M, et al: Empirical analysis of transcriptional activity in the Arabidopsis genome. Science. 2003, 302: 842-846. 10.1126/science.1088305.
- Shoemaker DD, Schadt EE, Armour CD, He YD, Garrett-Engele P, McDonagh PD, Loerch PM, Leonardson A, Lum PY, Cavet G, et al: Experimental annotation of the human genome using microarray technology. Nature. 2001, 409: 922-927. 10.1038/35057141.
- Royce TE, Rozowsky JS, Bertone P, Samanta M, Stolc V, Weissman S, Snyder M, Gerstein M: Issues in the analysis of oligonucleotide tiling microarrays for transcript mapping. Trends Genet. 2005, 21: 466-475. 10.1016/j.tig.2005.06.007.
- Frey BJ, Morris QD, Zhang W, Mohammad N, Hughes TR: Genrate: a generative model that finds and scores new genes and exons in genomic microarray data. Pac Symp Biocomput. 2005, 495-506.
- Hubbell E, Liu WM, Mei R: Robust estimators for expression analysis. Bioinformatics. 2002, 18: 1585-1592. 10.1093/bioinformatics/18.12.1585.
- Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001, 98: 31-36. 10.1073/pnas.011404098.
- Kerr MK, Churchill GA: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc Natl Acad Sci USA. 2001, 98: 8961-8965. 10.1073/pnas.161273698.
- Hekstra D, Taussig AR, Magnasco M, Naef F: Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Res. 2003, 31: 1962-1968. 10.1093/nar/gkg283.
- Held GA, Grinstein G, Tu Y: Modeling of DNA microarray data by using physical properties of hybridization. Proc Natl Acad Sci USA. 2003, 100: 7575-7580. 10.1073/pnas.0832500100.
- Mei R, Hubbell E, Bekiranov S, Mittmann M, Christians FC, Shen MM, Lu G, Fang J, Liu WM, Ryder T, et al: Probe selection for high-density oligonucleotide arrays. Proc Natl Acad Sci USA. 2003, 100: 11237-11242. 10.1073/pnas.1534744100.
- Zhang L, Miles MF, Aldape KD: A model of molecular interactions on short oligonucleotide microarrays. Nature Biotechnol. 2003, 21: 818-821. 10.1038/nbt836.
- Wu Z, Irizarry RA, Gentleman R, Murillo FM, Spencer F: A Model Based Background Adjustment for Oligonucleotide Expression Arrays. Department of Biostatistics Working Papers. 2004, Baltimore, MD: Johns Hopkins University
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19: 185-193. 10.1093/bioinformatics/19.2.185.
- Naef F, Magnasco MO: Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys Rev E Stat Nonlin Soft Matter Phys. 2003, 68: 011906.
- Fisher RA: Statistical Methods for Research Workers. 1950, Edinburgh: Oliver & Boyd, 11
- Bailey TL, Gribskov M: Estimating and evaluating the statistics of gapped local-alignment scores. J Comput Biol. 2002, 9: 575-593. 10.1089/106652702760138637.
- Huang JC, Morris QD, Hughes TR, Frey BJ: GenXHC: a probabilistic generative model for cross-hybridization compensation in high-density genome-wide microarray data. Bioinformatics. 2005, 21 (Suppl 1): i222-i231. 10.1093/bioinformatics/bti1045.
- TranscriptionDetector Information and Software. [http://bussemakerlab.org/software/TranscriptionDetector/]
- Nuwaysir EF, Huang W, Albert TJ, Singh J, Nuwaysir K, Pitas A, Richmond T, Gorski T, Berg JP, Ballin J, et al: Gene expression analysis using oligonucleotide arrays produced by maskless photolithography. Genome Res. 2002, 12: 1749-1755. 10.1101/gr.362402.
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc. 1995, 57: 289-300.
- Affymetrix Human Transcriptome Project. [http://transcriptome.affymetrix.com/publication/transcriptome_10chromosomes/]
- UCSC Genome Annotation Database. [http://hgdownload.cse.ucsc.edu/goldenpath/10april2003/database/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.