- Deposited research article
Observation of intermittency in gene expression on cDNA microarrays
Genome Biology volume 3, Article number: preprint0005.1 (2002)
We used scaled factorial moments to search for intermittency in the log expression ratios (LERs) for thousands of genes spotted on cDNA microarrays (gene chips). Results indicate varying levels of intermittency in gene expression. The observation of intermittency in the data analyzed provides a complimentary handle on moderately expressed genes, generally not tackled by conventional techniques.
Scaled factorial moments have found widespread use in high-energy physics for detecting intermittency in particle production [1,2,3,4,5,6,7,8,9,10]. The presence of jet-like structures and perhaps quark-gluon plasma phase in particle production can result in clustering of data in bins leading to holes and spikes in the rapidity distribution. This investigation is based on the type of intermittency defined as nonstatistical fluctuations invariant over the scale of resolution of particle rapidity [11,12,13,14,15,16,17,18]. We do not consider, for example, the type of intermittency found in turbulence which produces non-Gaussian tails in temperature distributions . In high energy physics, Bialas and Peschanksi  reported that the true bin probabilities can only be observed with infinite statistics. In the case of finite particles, the observed distribution of particles Q(p1, ..., p M ) smears out the Bernoulli component, as shown in (5.2) of . To overcome this, scaled factorial moments of the observed data are used to measure the scaled moments of the true distribution. If only statistical fluctuations are present in the rapidity d istribution of particles, then there will be no intermittency. The added value of scaled factorial moments is that they also remove the Poissonian noise to reveal dynamical fluctuations which may be present. This paper describes the use of scaled factorial moments to search for intermittency in gene expression on complimentary DNA (cDNA) microarrays (gene chips) which contain simultaneous expression levels for thousands of genes. Intermittency in this context implies that we are in search of abundances of gene expression values within the Gaussian-like distribution of expression. If there is no intermittency, then we would expect smooth Gaussian-like distributions of expression with only statistical variation.
During nucleic acid transcription, each "coding" gene synthesizes a messenger ribonucleic acid (mRNA) using a deoxyribonucleic acid (DNA) template. mRNA transits from the cell nucleus to the ribosome which resides in the cellular cytoplasm. In the ribosome, mRNAs are translated into proteins consisting of amino acids. For each gene, there are usually multiple copies of mRNA found in the cytoplasm. In the laboratory, mRNA is extracted from treated (diseased) and control (normal) cells. Reverse transcription (RT) is then used to generate a complimentary copy of DNA (cDNA) from each RNA. During RT, cDNAs from experimental cells are labelled with Cy5 dye which fluoresces in the red wavelength, and cDNAs from normal cells are labelled with Cy3 dye which fluouresces in the green wavelength. The labelled cDNAs are then aliquoted onto a cDNA microarray which has been spotted with DNA targets that are complimentary to the cDNA. During hybridization, the red and green-labelled cDNAs competitively bind with the spotted DNAs. Following drying, excimer laser scanning and computer image processing of a microarray, the pixel-averaged intensity of red and green signals for each spot (DNA) reveals the level of expression of a particular gene in the treated and normal cells. The logarithm of the ratio of intensities is known as the log expression ratio (LER), which compares gene expression in treatment tissue with normal tissue. (Spot intensities are normalized for total treated and normal mRNA used for the hybridization). Positive values of LER indicate greater gene expression (upregulation) in treated (or diseased) cells, whereas negative values of LER indicate lower expression (downregulation) in treated cells when compared with normal cells. In summary, cDNA microarrays use competitive hybridization to compare concentration levels of thousands of genes simultaneously expressed in treated and normal cells.
Let N represent the total number of genes spotted on a cDNA microarray. Let the range of LERs (y) on a microarray be Δy = y max - y min . Consider M non-overlapping equally-spaced bins with width Δy = y/M. The qth (q = 2, ..., 5) scaled factorial moment  is de-fined as
where M is the total number of bins, n m is the number of genes whose LER value falls within bin m, and N = Σ m n m is the total number of genes on the array.
We considered two sets of data for our analysis. The first was based on expression of 2,466 genes in the yeast S. Cerevisiae at different times following various experimental treatments (79 arrays) available at web site http://genome-www4.stanford.edu/MicroArray/SMD/publications . These data reflect gene expression of S. Cerevisiae during experimental treatment with alpha factor arrest ("alpha"), centrifugal elutriation ("elu"), temperature sensitive mutation ("cdc15"), sporulation ("spo"), high temperature ("heat"), reducing agent dithiothrietol ("dtt"), low temperature ("cold"), and diauaxic shift ("diau"). The second data set consisted of expression for 9,706 genes in 60 cancer cell lines available at web site http://discover.nci.nih.gov/nature2000 . Cancers represented are melanoma ("ME"), lung ("LC"), central nervous system ("CNS"), colorectal ("CO"), leukemia ("LE"), ovarian ("OV"), renal ("RE"), prostate ("PR"), and breast ("BR").
Calculations began by first determining "base" bin counts for the maximum number of bins possible for eacharray, M max = Δy/0.01, where 0.01 was the precision of the data. We observed that F2 increases rapidly when the bin size is smaller than 0.01, mostly likely due to round-off error in the creation of the LER values from the raw data. Round-off error can create artificial holes and spikes in the data. We, therefore, only consider bin sizes larger than 0.01. F q was calculated for observed and simulated LERs at total bin numbers M = M max /L (L = 2, 3, ..., M max /2). A lower bound of 30 was used for M in all calculations. Bin counts n m for observed LERs were tabulated using M equally spaced bins of width y = 0.01 × L. Bin counts for simulated data were based on kernel density estimation  in the form
where f(m) is the simulated bin count for the mth bin, N is the total number of LERs, h = 1.06N-0.2 is the bandwidth, and is the standard deviation of LERs on the array. K is the Epanechnikov kernel function  defined as
where u = (LER i - y m )/h, LER i is the value of each LER, and y m is the lower bound of the mth bin. The simulation is essentially a smoothed function of the data with attendant statistical fluctuation. A second round of F q calculations were made after subtracting 0.1 from y min and y max , redetermining bin cutoffs, and recalculating n m and f(m) in order to shift the scale of the phase space. This allowed us to look more closely at statistical fluctuations and also provided twice as many values of F q for observed and simulated LERs. Plots of ln F q vs. ln M were constructed for each array and each value of q. The difference between slopes for observed and simulated data was based on fitting the linear model
ln F q = 0 + 1 (ln M), (4)
separately for observed and simulated data, where
0 is the y-intercept and 1 is the slope. Slope difference was calculated as
1 = 1, observed - 1, simulated. (5)
Positive values of Δ
1 implies intermittency among the observed LERs.
Figure 1 shows the frequency histogram for N = 2, 402 LERs binned in M = 106 bins of width y = 0.02 for the spo0 microarray in the sporulation experiments on S. Cerevisiae . Several large spikes and holes are visible in the LER distribution, suggesting clustering of LERs at the scale of the bin size y = 0.02. The greatest contribution to F q is from the spikes near the central peak of the distribution. Holes near the tails contribute little to F q . Figure 2 illustrates plots of ln F q vs. ln M for the same spo0 array whose binned LERs are shown in Figure 1. Replicate values of ln F q are also plotted at various values of ln M as a result of calculating F q after a 0.1 shift in the scale of y. The spread in the data shows the level of bias in the choice of the binning. The slope of simulated data is usually very small, indicative of lack of intermittency in the simulated data. Slightly negative slopes for simulated data were also observed for smooth gaussian-like distributions, where F q ~ Constant which changes with M.
Figure 3 shows values of Δ1 for the 79 arrays in the S. Cerevisiae data set . Overall, the Δ1 values are positive, with an average value of 0.03-0.04, indicative of intermittency in the data. The greatest signals were observed in the experiments for sporulation (spo0 array) and diauxic shift (diaua array). For the alpha factor arrest (alpha) experiment, a slightly elevated signal was seen early on, dropped, and then picked up again toward the latter time points. Interestingly, the signal fluctuated over time periods in the centrifugal elutriation experiment to the extent that periodicity can be noticed. For the experiment involving the temperature-sensitive mutant cdc15, the signal was similar to that observed in the centrifugal elutriation experiment. During sporulation, the signal was largest at the beginning, dropped thereafter, and continued to be jumpy until the cold and diauxic shift experiment. There is no discernible trend within a given treatment, and there is no abrupt transition from one array to the next. The different treatments are not ordered in any particular way in Figure 3. The higher factorial moments follow the same trend as that of F2, only with higher values. This is not surprising because Δ1 for the qth moment is about q/2 times that of F2 if the only source of intermittency is bin-by-bin fluctuations. The plotted error bars for each data point in Figure 3 are based on the error in the fitting of the slope, as well as the error of the simulated data. As one can see, the significance of the signal is usually more than 3 standard deviations.
Figure 4 shows the Δ1 for the 60 cancer cell lines. Among them, the greatest intermittency signal was detected for the colon cancer cell line CO-HT29 and central nervous system cancer cell line CNS-U251, since they both had the greatest values of Δ1 for all values of q. Wide variation in Δ1 can be observed across all of the data, with no consistent signal occurring for each type of cancer. The most jumpy transition in Δ1 was seen for central nervous system cancer cell lines. It is noteworthy that the average level of intermittency, as measured by Δ1 for F2, is about 0.01, about 3-4 times lower than that of the yeast data. The main difference is that there are about 10,000 genes in each cancer array, about 4 times more genes than the yeast array. The significance of the signal (signal/error), however, is comparable to that of the yeast data.
The observed intermittency in the data considered may suggest correlations in the abundances of expression levels within the Gaussian-like distributions of LERs. In the S. Cerevisiae sporulation experiments, a majority of genes whose LERs fell within the spikes in Figure 1 had dramatically altered expression values later on in the sporulation experiments. Chu et al.  reported temporal changes in expression among a large number of genes throughout the sporulation process. In the cancer cell lines whose LER distributions were investigated, changes in intermittency over the arrays are likely due to cancer-specific alterations in cell-cycle control, DNA repair, oncogenesis, tumor supression, apoptosis, and angiogenenesis, all of which affect tumor growth, severity and evasion from attack by the immune system . Cancers vary in their cause and severity and there may be a wide range of unknown gene-gene and gene-environment interactions which impact gene expression.
Errors in reproducibility among the LERs considered were not provided by the groups that generated the data. However, several recent reports [26,27,28,29,30] give errors from various sources (probe preparation, spot size variability, scanning errors, software sophistication, etc.). Wildsmith et al.  reported a 28% standard error of the common logarithm of expression based on 64 replicate arrays containing 1248 duplicate spots. Lee et al.  reported a maximum misclassification of 9% based on three replicate arrays containing 288 genes. In this study, the spikes, for example, in Figure 1 are not likely due to misclassification. Error, on the other hand, affects the bin size and we have seen the effect throughout the bulk of bin-size range.
This study was a first step to search for intermittency without establishing biological relevance. There is a growing literature on the identification of microarray-based regulatory gene networks [31,32,33,34]. We have already begun looking at individual genes and their contribution to F2 on a single array. Our current effort to develop microarray-based promoter models for co-expressed genes based on the Werner approach  will facilitate our understanding of regulatory control of genes with high contribution to F2. Progress in this effort is limited by the rate at which we can manually select genes with high contribution to F2, exon map the genes, fetch their upstream DNA promoter sequences, and then search for common transcription binding sites among the multiple promoter sequences in order to infer coregulation.
The observation of intermittency in the data analyzed provides a complimentary handle on moderately expressed genes, generally not tackled by conventional techniques. Biologists often focus on strongly downregulated or upregulated genes which are characterized by large negative and positive LERs. Our method of looking at intermittency in gene expression focused on the clustering of LERs independent of their absolute expression value. Thus, we were able to detect large density fluctuations among small LERs. As an example, spikes near the center of the binned distribution in Figure 1 whose LER-values were low greatly increased the factorial moments. Therefore, fold-change analysis, which focuses on large negative and positive LERs, or other multivariate statistical methods such as hierarchical cluster and principal component analyses, can miss unique density fluctuations at low LER values which are detected by factorial moments.
OPAL Collaboration. Eur. Phys. J. C. 1999, 11: 239-10.1007/s100520050629.
EHS/NA22 Collaboration. Phys. Lett. B. 1996, 382: 305-10.1016/0370-2693(96)00749-6.
EMU-01 Collaboration. Z. Phys. C. 1997, 76: 659-10.1007/s002880050588.
SLDCollaboration, SLAC-PUB-95-7027. 1996
WA80 Collaboration. Nuc. Phys. A. 1992, 545: 311c-10.1016/0375-9474(92)90471-U.
Kun F, Sorge H, Sailer K, Bardos G, Greinber W: Phys. Lett. B. 1995, 355: 349-10.1016/0370-2693(95)00697-J.
Jie Z, Shaoshun W: Phys. Lett. B. 1996, 370: 159-10.1016/0370-2693(96)00068-8.
Gary JW: Nuc. Phys. B. 1999, 71S: 158-
Bozek P, Ploszajczak M: Nuc. Phys. A. 1992, 545: 297c-10.1016/0375-9474(92)90470-5.
Bialas A, Peschanksi R: Nuc. Phys. B. 1986, 273: 703-10.1016/0550-3213(86)90386-X.
Bialas A, Peschanksi R: Nuc. Phys. B. 1988, 308: 857-10.1016/0550-3213(88)90131-9.
Bialas A: Nuc. Phys. A. 1991, 525: 345c-10.1016/0375-9474(91)90344-6.
Blazek M: Int. J. Mod. Phys. A. 1997, 12: 839-10.1142/S0217751X97000645.
Bozek P, Ploszajczak M, Botet R: Phys. Rep. 1995, 252: 101-10.1016/0370-1573(94)00075-E.
Fu J, Wu Y, Liu L: Phys. Lett. B. 2000, 472: 161-10.1016/S0370-2693(99)01380-5.
Shaoshun W, Ran L, Zhaomin W: Phys. Lett. B. 1998, 438: 353-10.1016/S0370-2693(98)01129-0.
Sarcevic I: Nuc. Phys. A. 1991, 525: 361c-10.1016/0375-9474(91)90345-7.
DeWolf EA, Dremin IM, Kittle W: Phys. Rep. 1996, 270: 1-10.1016/0370-1573(95)00069-0.
Kadanoff LP: Physics Today, August. 2001, 34:
Spellman PT, Sherlock G, Zhang MQ, et al: Mol. Biol. Cell. 1998, 12: 3273-
Ross DT, Scherf U, Eisen MB, et al: Nat. Genet. 2000, 24: 208-10.1038/73400.
Fadda D, Slezak E, Bijaoui A: Astron. Astrophys. Suppl. Ser. 1998, 127: 335-10.1051/aas:1998355.
Epanechnikov VA: Theor. Prob. Appl. 1969, 14: 163-
Chu S, DeRisi JL, Eisen MB, et al: Science. 1998, 282: 699-10.1006/jmbi.1998.2134.
Baylin SB: Science. 1997, 277: 1948-10.1126/science.277.5334.1948.
Wildsmith SE, Archer GE, Winkley AJ, et al: Biotechniques. 2001, 30: 202-
Brazma A, Hingcamp P, Quackenbush J, et al: Nature Genet. 2001, 29: 365-10.1038/ng1201-365.
Lee M-L.T, Kuo FC, Whitmore GA, et al: P.N.A.S. 2000, 97: 9834-10.1073/pnas.97.18.9834.
Brown CS, Goodwin PC, Sorger PK: P.N.A.S. 2001, 98: 8944-10.1073/pnas.161242998.
Wang X, Ghosh S, Guo S-W: Nucl. Acids. Res. 2001, 29: e75-10.1093/nar/29.15.e75.
Wuensche A: Pac. Symp. Biocomp. 1998, 3: 89-
Lockhart DJ, Winzeler EA: Nature. 2000, 405: 827-10.1038/35015701.
Werner T: Biomol. Eng. 2001, 17: 87-10.1016/S1389-0344(00)00071-X.
Pilpel Y, Sudarsanam P, Church GM: Nat. Genet. 2001, 29: 153-10.1038/ng724.
L.E.P. acknowledges the support of grant CA-78199-04 of the National Cancer Institute, and C. Aime, F. Kun, A. Loctionov, M. Patra, R. Peschanki, and E. Sarkisyan-Grinbaum for helpful discussions on intermittency. K.L. is supported in part by the U.S. Department of Energy, Grant no. DE-FG03-96ER41004, and in part by the Texas Advanced Research Program, Grant no. 3652-0023-1999.
About this article
Cite this article
Peterson, L.E., Lau, K. Observation of intermittency in gene expression on cDNA microarrays. Genome Biol 3, preprint0005.1 (2002). https://doi.org/10.1186/gb-2002-3-7-preprint0005