Observation of intermittency in gene expression on cDNA microarrays
© BioMed Central Ltd 2002
Received: 27 May 2002
Published: 29 May 2002
We used scaled factorial moments to search for intermittency in the log expression ratios (LERs) for thousands of genes spotted on cDNA microarrays (gene chips). Results indicate varying levels of intermittency in gene expression. The observation of intermittency in the data analyzed provides a complimentary handle on moderately expressed genes, generally not tackled by conventional techniques.
Scaled factorial moments have found widespread use in high-energy physics for detecting intermittency in particle production [1,2,3,4,5,6,7,8,9,10]. The presence of jet-like structures and perhaps quark-gluon plasma phase in particle production can result in clustering of data in bins leading to holes and spikes in the rapidity distribution. This investigation is based on the type of intermittency defined as nonstatistical fluctuations invariant over the scale of resolution of particle rapidity [11,12,13,14,15,16,17,18]. We do not consider, for example, the type of intermittency found in turbulence which produces non-Gaussian tails in temperature distributions . In high energy physics, Bialas and Peschanksi  reported that the true bin probabilities can only be observed with infinite statistics. In the case of finite particles, the observed distribution of particles Q(p1, ..., p M ) smears out the Bernoulli component, as shown in (5.2) of . To overcome this, scaled factorial moments of the observed data are used to measure the scaled moments of the true distribution. If only statistical fluctuations are present in the rapidity d istribution of particles, then there will be no intermittency. The added value of scaled factorial moments is that they also remove the Poissonian noise to reveal dynamical fluctuations which may be present. This paper describes the use of scaled factorial moments to search for intermittency in gene expression on complimentary DNA (cDNA) microarrays (gene chips) which contain simultaneous expression levels for thousands of genes. Intermittency in this context implies that we are in search of abundances of gene expression values within the Gaussian-like distribution of expression. If there is no intermittency, then we would expect smooth Gaussian-like distributions of expression with only statistical variation.
During nucleic acid transcription, each "coding" gene synthesizes a messenger ribonucleic acid (mRNA) using a deoxyribonucleic acid (DNA) template. mRNA transits from the cell nucleus to the ribosome which resides in the cellular cytoplasm. In the ribosome, mRNAs are translated into proteins consisting of amino acids. For each gene, there are usually multiple copies of mRNA found in the cytoplasm. In the laboratory, mRNA is extracted from treated (diseased) and control (normal) cells. Reverse transcription (RT) is then used to generate a complimentary copy of DNA (cDNA) from each RNA. During RT, cDNAs from experimental cells are labelled with Cy5 dye which fluoresces in the red wavelength, and cDNAs from normal cells are labelled with Cy3 dye which fluouresces in the green wavelength. The labelled cDNAs are then aliquoted onto a cDNA microarray which has been spotted with DNA targets that are complimentary to the cDNA. During hybridization, the red and green-labelled cDNAs competitively bind with the spotted DNAs. Following drying, excimer laser scanning and computer image processing of a microarray, the pixel-averaged intensity of red and green signals for each spot (DNA) reveals the level of expression of a particular gene in the treated and normal cells. The logarithm of the ratio of intensities is known as the log expression ratio (LER), which compares gene expression in treatment tissue with normal tissue. (Spot intensities are normalized for total treated and normal mRNA used for the hybridization). Positive values of LER indicate greater gene expression (upregulation) in treated (or diseased) cells, whereas negative values of LER indicate lower expression (downregulation) in treated cells when compared with normal cells. In summary, cDNA microarrays use competitive hybridization to compare concentration levels of thousands of genes simultaneously expressed in treated and normal cells.
Let N represent the total number of genes spotted on a cDNA microarray. Let the range of LERs (y) on a microarray be Δy = y max - y min . Consider M non-overlapping equally-spaced bins with width Δy = y/M. The qth (q = 2, ..., 5) scaled factorial moment  is de-fined as
where M is the total number of bins, n m is the number of genes whose LER value falls within bin m, and N = Σ m n m is the total number of genes on the array.
We considered two sets of data for our analysis. The first was based on expression of 2,466 genes in the yeast S. Cerevisiae at different times following various experimental treatments (79 arrays) available at web site http://genome-www4.stanford.edu/MicroArray/SMD/publications . These data reflect gene expression of S. Cerevisiae during experimental treatment with alpha factor arrest ("alpha"), centrifugal elutriation ("elu"), temperature sensitive mutation ("cdc15"), sporulation ("spo"), high temperature ("heat"), reducing agent dithiothrietol ("dtt"), low temperature ("cold"), and diauaxic shift ("diau"). The second data set consisted of expression for 9,706 genes in 60 cancer cell lines available at web site http://discover.nci.nih.gov/nature2000 . Cancers represented are melanoma ("ME"), lung ("LC"), central nervous system ("CNS"), colorectal ("CO"), leukemia ("LE"), ovarian ("OV"), renal ("RE"), prostate ("PR"), and breast ("BR").
Calculations began by first determining "base" bin counts for the maximum number of bins possible for eacharray, M max = Δy/0.01, where 0.01 was the precision of the data. We observed that F2 increases rapidly when the bin size is smaller than 0.01, mostly likely due to round-off error in the creation of the LER values from the raw data. Round-off error can create artificial holes and spikes in the data. We, therefore, only consider bin sizes larger than 0.01. F q was calculated for observed and simulated LERs at total bin numbers M = M max /L (L = 2, 3, ..., M max /2). A lower bound of 30 was used for M in all calculations. Bin counts n m for observed LERs were tabulated using M equally spaced bins of width y = 0.01 × L. Bin counts for simulated data were based on kernel density estimation  in the form
where f(m) is the simulated bin count for the mth bin, N is the total number of LERs, h = 1.06 N-0.2 is the bandwidth, and is the standard deviation of LERs on the array. K is the Epanechnikov kernel function  defined as
where u = (LER i - y m )/h, LER i is the value of each LER, and y m is the lower bound of the mth bin. The simulation is essentially a smoothed function of the data with attendant statistical fluctuation. A second round of F q calculations were made after subtracting 0.1 from y min and y max , redetermining bin cutoffs, and recalculating n m and f(m) in order to shift the scale of the phase space. This allowed us to look more closely at statistical fluctuations and also provided twice as many values of F q for observed and simulated LERs. Plots of ln F q vs. ln M were constructed for each array and each value of q. The difference between slopes for observed and simulated data was based on fitting the linear model
The observed intermittency in the data considered may suggest correlations in the abundances of expression levels within the Gaussian-like distributions of LERs. In the S. Cerevisiae sporulation experiments, a majority of genes whose LERs fell within the spikes in Figure 1 had dramatically altered expression values later on in the sporulation experiments. Chu et al.  reported temporal changes in expression among a large number of genes throughout the sporulation process. In the cancer cell lines whose LER distributions were investigated, changes in intermittency over the arrays are likely due to cancer-specific alterations in cell-cycle control, DNA repair, oncogenesis, tumor supression, apoptosis, and angiogenenesis, all of which affect tumor growth, severity and evasion from attack by the immune system . Cancers vary in their cause and severity and there may be a wide range of unknown gene-gene and gene-environment interactions which impact gene expression.
Errors in reproducibility among the LERs considered were not provided by the groups that generated the data. However, several recent reports [26,27,28,29,30] give errors from various sources (probe preparation, spot size variability, scanning errors, software sophistication, etc.). Wildsmith et al.  reported a 28% standard error of the common logarithm of expression based on 64 replicate arrays containing 1248 duplicate spots. Lee et al.  reported a maximum misclassification of 9% based on three replicate arrays containing 288 genes. In this study, the spikes, for example, in Figure 1 are not likely due to misclassification. Error, on the other hand, affects the bin size and we have seen the effect throughout the bulk of bin-size range.
This study was a first step to search for intermittency without establishing biological relevance. There is a growing literature on the identification of microarray-based regulatory gene networks [31,32,33,34]. We have already begun looking at individual genes and their contribution to F2 on a single array. Our current effort to develop microarray-based promoter models for co-expressed genes based on the Werner approach  will facilitate our understanding of regulatory control of genes with high contribution to F2. Progress in this effort is limited by the rate at which we can manually select genes with high contribution to F2, exon map the genes, fetch their upstream DNA promoter sequences, and then search for common transcription binding sites among the multiple promoter sequences in order to infer coregulation.
The observation of intermittency in the data analyzed provides a complimentary handle on moderately expressed genes, generally not tackled by conventional techniques. Biologists often focus on strongly downregulated or upregulated genes which are characterized by large negative and positive LERs. Our method of looking at intermittency in gene expression focused on the clustering of LERs independent of their absolute expression value. Thus, we were able to detect large density fluctuations among small LERs. As an example, spikes near the center of the binned distribution in Figure 1 whose LER-values were low greatly increased the factorial moments. Therefore, fold-change analysis, which focuses on large negative and positive LERs, or other multivariate statistical methods such as hierarchical cluster and principal component analyses, can miss unique density fluctuations at low LER values which are detected by factorial moments.
L.E.P. acknowledges the support of grant CA-78199-04 of the National Cancer Institute, and C. Aime, F. Kun, A. Loctionov, M. Patra, R. Peschanki, and E. Sarkisyan-Grinbaum for helpful discussions on intermittency. K.L. is supported in part by the U.S. Department of Energy, Grant no. DE-FG03-96ER41004, and in part by the Texas Advanced Research Program, Grant no. 3652-0023-1999.
- OPAL Collaboration. Eur. Phys. J. C. 1999, 11: 239-10.1007/s100520050629.
- EHS/NA22 Collaboration. Phys. Lett. B. 1996, 382: 305-10.1016/0370-2693(96)00749-6.
- EMU-01 Collaboration. Z. Phys. C. 1997, 76: 659-10.1007/s002880050588.
- SLDCollaboration, SLAC-PUB-95-7027. 1996
- WA80 Collaboration. Nuc. Phys. A. 1992, 545: 311c-10.1016/0375-9474(92)90471-U.
- Kun F, Sorge H, Sailer K, Bardos G, Greinber W: Phys. Lett. B. 1995, 355: 349-10.1016/0370-2693(95)00697-J.View ArticleGoogle Scholar
- Jie Z, Shaoshun W: Phys. Lett. B. 1996, 370: 159-10.1016/0370-2693(96)00068-8.View ArticleGoogle Scholar
- Gary JW: Nuc. Phys. B. 1999, 71S: 158-Google Scholar
- Bozek P, Ploszajczak M: Nuc. Phys. A. 1992, 545: 297c-10.1016/0375-9474(92)90470-5.View ArticleGoogle Scholar
- Bialas A, Peschanksi R: Nuc. Phys. B. 1986, 273: 703-10.1016/0550-3213(86)90386-X.View ArticleGoogle Scholar
- Bialas A, Peschanksi R: Nuc. Phys. B. 1988, 308: 857-10.1016/0550-3213(88)90131-9.View ArticleGoogle Scholar
- Bialas A: Nuc. Phys. A. 1991, 525: 345c-10.1016/0375-9474(91)90344-6.View ArticleGoogle Scholar
- Blazek M: Int. J. Mod. Phys. A. 1997, 12: 839-10.1142/S0217751X97000645.View ArticleGoogle Scholar
- Bozek P, Ploszajczak M, Botet R: Phys. Rep. 1995, 252: 101-10.1016/0370-1573(94)00075-E.View ArticleGoogle Scholar
- Fu J, Wu Y, Liu L: Phys. Lett. B. 2000, 472: 161-10.1016/S0370-2693(99)01380-5.View ArticleGoogle Scholar
- Shaoshun W, Ran L, Zhaomin W: Phys. Lett. B. 1998, 438: 353-10.1016/S0370-2693(98)01129-0.View ArticleGoogle Scholar
- Sarcevic I: Nuc. Phys. A. 1991, 525: 361c-10.1016/0375-9474(91)90345-7.View ArticleGoogle Scholar
- DeWolf EA, Dremin IM, Kittle W: Phys. Rep. 1996, 270: 1-10.1016/0370-1573(95)00069-0.View ArticleGoogle Scholar
- Kadanoff LP: Physics Today, August. 2001, 34:Google Scholar
- Spellman PT, Sherlock G, Zhang MQ, et al: Mol. Biol. Cell. 1998, 12: 3273-View ArticleGoogle Scholar
- Ross DT, Scherf U, Eisen MB, et al: Nat. Genet. 2000, 24: 208-10.1038/73400.View ArticleGoogle Scholar
- Fadda D, Slezak E, Bijaoui A: Astron. Astrophys. Suppl. Ser. 1998, 127: 335-10.1051/aas:1998355.View ArticleGoogle Scholar
- Epanechnikov VA: Theor. Prob. Appl. 1969, 14: 163-View ArticleGoogle Scholar
- Chu S, DeRisi JL, Eisen MB, et al: Science. 1998, 282: 699-10.1006/jmbi.1998.2134.PubMedView ArticleGoogle Scholar
- Baylin SB: Science. 1997, 277: 1948-10.1126/science.277.5334.1948.PubMedView ArticleGoogle Scholar
- Wildsmith SE, Archer GE, Winkley AJ, et al: Biotechniques. 2001, 30: 202-PubMedGoogle Scholar
- Brazma A, Hingcamp P, Quackenbush J, et al: Nature Genet. 2001, 29: 365-10.1038/ng1201-365.PubMedView ArticleGoogle Scholar
- Lee M-L.T, Kuo FC, Whitmore GA, et al: P.N.A.S. 2000, 97: 9834-10.1073/pnas.97.18.9834.PubMedPubMed CentralView ArticleGoogle Scholar
- Brown CS, Goodwin PC, Sorger PK: P.N.A.S. 2001, 98: 8944-10.1073/pnas.161242998.PubMedPubMed CentralView ArticleGoogle Scholar
- Wang X, Ghosh S, Guo S-W: Nucl. Acids. Res. 2001, 29: e75-10.1093/nar/29.15.e75.PubMedPubMed CentralView ArticleGoogle Scholar
- Wuensche A: Pac. Symp. Biocomp. 1998, 3: 89-Google Scholar
- Lockhart DJ, Winzeler EA: Nature. 2000, 405: 827-10.1038/35015701.PubMedView ArticleGoogle Scholar
- Werner T: Biomol. Eng. 2001, 17: 87-10.1016/S1389-0344(00)00071-X.PubMedView ArticleGoogle Scholar
- Pilpel Y, Sudarsanam P, Church GM: Nat. Genet. 2001, 29: 153-10.1038/ng724.PubMedView ArticleGoogle Scholar