Observation of intermittency in gene expression on cDNA microarrays

Peterson, Leif E; Lau, Kwong

doi:10.1186/gb-2002-3-7-preprint0005

Deposited research article
Published: 29 May 2002

Observation of intermittency in gene expression on cDNA microarrays

Leif E Peterson¹ &
Kwong Lau²

Genome Biology volume 3, Article number: preprint0005.1 (2002) Cite this article

6196 Accesses
5 Citations
Metrics details

Abstract

We used scaled factorial moments to search for intermittency in the log expression ratios (LERs) for thousands of genes spotted on cDNA microarrays (gene chips). Results indicate varying levels of intermittency in gene expression. The observation of intermittency in the data analyzed provides a complimentary handle on moderately expressed genes, generally not tackled by conventional techniques.

PACS: 87.10.+e

Scaled factorial moments have found widespread use in high-energy physics for detecting intermittency in particle production [1,2,3,4,5,6,7,8,9,10]. The presence of jet-like structures and perhaps quark-gluon plasma phase in particle production can result in clustering of data in bins leading to holes and spikes in the rapidity distribution. This investigation is based on the type of intermittency defined as nonstatistical fluctuations invariant over the scale of resolution of particle rapidity [11,12,13,14,15,16,17,18]. We do not consider, for example, the type of intermittency found in turbulence which produces non-Gaussian tails in temperature distributions [19]. In high energy physics, Bialas and Peschanksi [10] reported that the true bin probabilities can only be observed with infinite statistics. In the case of finite particles, the observed distribution of particles Q(p₁, ..., p_M) smears out the Bernoulli component, as shown in (5.2) of [10]. To overcome this, scaled factorial moments of the observed data are used to measure the scaled moments of the true distribution. If only statistical fluctuations are present in the rapidity d istribution of particles, then there will be no intermittency. The added value of scaled factorial moments is that they also remove the Poissonian noise to reveal dynamical fluctuations which may be present. This paper describes the use of scaled factorial moments to search for intermittency in gene expression on complimentary DNA (cDNA) microarrays (gene chips) which contain simultaneous expression levels for thousands of genes. Intermittency in this context implies that we are in search of abundances of gene expression values within the Gaussian-like distribution of expression. If there is no intermittency, then we would expect smooth Gaussian-like distributions of expression with only statistical variation.

During nucleic acid transcription, each "coding" gene synthesizes a messenger ribonucleic acid (mRNA) using a deoxyribonucleic acid (DNA) template. mRNA transits from the cell nucleus to the ribosome which resides in the cellular cytoplasm. In the ribosome, mRNAs are translated into proteins consisting of amino acids. For each gene, there are usually multiple copies of mRNA found in the cytoplasm. In the laboratory, mRNA is extracted from treated (diseased) and control (normal) cells. Reverse transcription (RT) is then used to generate a complimentary copy of DNA (cDNA) from each RNA. During RT, cDNAs from experimental cells are labelled with Cy5 dye which fluoresces in the red wavelength, and cDNAs from normal cells are labelled with Cy3 dye which fluouresces in the green wavelength. The labelled cDNAs are then aliquoted onto a cDNA microarray which has been spotted with DNA targets that are complimentary to the cDNA. During hybridization, the red and green-labelled cDNAs competitively bind with the spotted DNAs. Following drying, excimer laser scanning and computer image processing of a microarray, the pixel-averaged intensity of red and green signals for each spot (DNA) reveals the level of expression of a particular gene in the treated and normal cells. The logarithm of the ratio of intensities is known as the log expression ratio (LER), which compares gene expression in treatment tissue with normal tissue. (Spot intensities are normalized for total treated and normal mRNA used for the hybridization). Positive values of LER indicate greater gene expression (upregulation) in treated (or diseased) cells, whereas negative values of LER indicate lower expression (downregulation) in treated cells when compared with normal cells. In summary, cDNA microarrays use competitive hybridization to compare concentration levels of thousands of genes simultaneously expressed in treated and normal cells.

Let N represent the total number of genes spotted on a cDNA microarray. Let the range of LERs (y) on a microarray be Δy = y_max - y_min. Consider M non-overlapping equally-spaced bins with width Δy = y/M. The qth (q = 2, ..., 5) scaled factorial moment [10] is de-fined as

where M is the total number of bins, n_m is the number of genes whose LER value falls within bin m, and N = Σ_mn_m is the total number of genes on the array.

We considered two sets of data for our analysis. The first was based on expression of 2,466 genes in the yeast S. Cerevisiae at different times following various experimental treatments (79 arrays) available at web site http://genome-www4.stanford.edu/MicroArray/SMD/publications [20]. These data reflect gene expression of S. Cerevisiae during experimental treatment with alpha factor arrest ("alpha"), centrifugal elutriation ("elu"), temperature sensitive mutation ("cdc15"), sporulation ("spo"), high temperature ("heat"), reducing agent dithiothrietol ("dtt"), low temperature ("cold"), and diauaxic shift ("diau"). The second data set consisted of expression for 9,706 genes in 60 cancer cell lines available at web site http://discover.nci.nih.gov/nature2000 [21]. Cancers represented are melanoma ("ME"), lung ("LC"), central nervous system ("CNS"), colorectal ("CO"), leukemia ("LE"), ovarian ("OV"), renal ("RE"), prostate ("PR"), and breast ("BR").

Calculations began by first determining "base" bin counts for the maximum number of bins possible for eacharray, M_max = Δy/0.01, where 0.01 was the precision of the data. We observed that F₂ increases rapidly when the bin size is smaller than 0.01, mostly likely due to round-off error in the creation of the LER values from the raw data. Round-off error can create artificial holes and spikes in the data. We, therefore, only consider bin sizes larger than 0.01. F_q was calculated for observed and simulated LERs at total bin numbers M = M_max/L (L = 2, 3, ..., M_max/2). A lower bound of 30 was used for M in all calculations. Bin counts n_m for observed LERs were tabulated using M equally spaced bins of width y = 0.01 × L. Bin counts for simulated data were based on kernel density estimation [22] in the form

where f(m) is the simulated bin count for the mth bin, N is the total number of LERs, h = 1.06N^-0.2 is the bandwidth, and is the standard deviation of LERs on the array. K is the Epanechnikov kernel function [23] defined as

where u = (LER_i - y_m)/h, LER_i is the value of each LER, and y_m is the lower bound of the mth bin. The simulation is essentially a smoothed function of the data with attendant statistical fluctuation. A second round of F_q calculations were made after subtracting 0.1 from y_min and y_max, redetermining bin cutoffs, and recalculating n_m and f(m) in order to shift the scale of the phase space. This allowed us to look more closely at statistical fluctuations and also provided twice as many values of F_q for observed and simulated LERs. Plots of ln F_q vs. ln M were constructed for each array and each value of q. The difference between slopes for observed and simulated data was based on fitting the linear model

ln F_q = ₀ + ₁ (ln M), (4)

separately for observed and simulated data, where

₀ is the y-intercept and ₁ is the slope. Slope difference was calculated as

Δ

₁ = _{1, observed} - _{1, simulated}. (5)

Positive values of Δ

₁ implies intermittency among the observed LERs.

Figure 1 shows the frequency histogram for N = 2, 402 LERs binned in M = 106 bins of width y = 0.02 for the spo0 microarray in the sporulation experiments on S. Cerevisiae [20]. Several large spikes and holes are visible in the LER distribution, suggesting clustering of LERs at the scale of the bin size y = 0.02. The greatest contribution to F_q is from the spikes near the central peak of the distribution. Holes near the tails contribute little to F_q. Figure 2 illustrates plots of ln F_q vs. ln M for the same spo0 array whose binned LERs are shown in Figure 1. Replicate values of ln F_q are also plotted at various values of ln M as a result of calculating F_q after a 0.1 shift in the scale of y. The spread in the data shows the level of bias in the choice of the binning. The slope of simulated data is usually very small, indicative of lack of intermittency in the simulated data. Slightly negative slopes for simulated data were also observed for smooth gaussian-like distributions, where F_q ~ Constant which changes with M.

Figure 3 shows values of Δ₁ for the 79 arrays in the S. Cerevisiae data set [20]. Overall, the Δ₁ values are positive, with an average value of 0.03-0.04, indicative of intermittency in the data. The greatest signals were observed in the experiments for sporulation (spo0 array) and diauxic shift (diaua array). For the alpha factor arrest (alpha) experiment, a slightly elevated signal was seen early on, dropped, and then picked up again toward the latter time points. Interestingly, the signal fluctuated over time periods in the centrifugal elutriation experiment to the extent that periodicity can be noticed. For the experiment involving the temperature-sensitive mutant cdc15, the signal was similar to that observed in the centrifugal elutriation experiment. During sporulation, the signal was largest at the beginning, dropped thereafter, and continued to be jumpy until the cold and diauxic shift experiment. There is no discernible trend within a given treatment, and there is no abrupt transition from one array to the next. The different treatments are not ordered in any particular way in Figure 3. The higher factorial moments follow the same trend as that of F₂, only with higher values. This is not surprising because Δ₁ for the qth moment is about q/2 times that of F₂ if the only source of intermittency is bin-by-bin fluctuations. The plotted error bars for each data point in Figure 3 are based on the error in the fitting of the slope, as well as the error of the simulated data. As one can see, the significance of the signal is usually more than 3 standard deviations.

Figure 4 shows the Δ₁ for the 60 cancer cell lines. Among them, the greatest intermittency signal was detected for the colon cancer cell line CO-HT29 and central nervous system cancer cell line CNS-U251, since they both had the greatest values of Δ₁ for all values of q. Wide variation in Δ₁ can be observed across all of the data, with no consistent signal occurring for each type of cancer. The most jumpy transition in Δ₁ was seen for central nervous system cancer cell lines. It is noteworthy that the average level of intermittency, as measured by Δ₁ for F₂, is about 0.01, about 3-4 times lower than that of the yeast data. The main difference is that there are about 10,000 genes in each cancer array, about 4 times more genes than the yeast array. The significance of the signal (signal/error), however, is comparable to that of the yeast data.

The observed intermittency in the data considered may suggest correlations in the abundances of expression levels within the Gaussian-like distributions of LERs. In the S. Cerevisiae sporulation experiments, a majority of genes whose LERs fell within the spikes in Figure 1 had dramatically altered expression values later on in the sporulation experiments. Chu et al. [24] reported temporal changes in expression among a large number of genes throughout the sporulation process. In the cancer cell lines whose LER distributions were investigated, changes in intermittency over the arrays are likely due to cancer-specific alterations in cell-cycle control, DNA repair, oncogenesis, tumor supression, apoptosis, and angiogenenesis, all of which affect tumor growth, severity and evasion from attack by the immune system [25]. Cancers vary in their cause and severity and there may be a wide range of unknown gene-gene and gene-environment interactions which impact gene expression.

Errors in reproducibility among the LERs considered were not provided by the groups that generated the data. However, several recent reports [26,27,28,29,30] give errors from various sources (probe preparation, spot size variability, scanning errors, software sophistication, etc.). Wildsmith et al. [26] reported a 28% standard error of the common logarithm of expression based on 64 replicate arrays containing 1248 duplicate spots. Lee et al. [28] reported a maximum misclassification of 9% based on three replicate arrays containing 288 genes. In this study, the spikes, for example, in Figure 1 are not likely due to misclassification. Error, on the other hand, affects the bin size and we have seen the effect throughout the bulk of bin-size range.

This study was a first step to search for intermittency without establishing biological relevance. There is a growing literature on the identification of microarray-based regulatory gene networks [31,32,33,34]. We have already begun looking at individual genes and their contribution to F₂ on a single array. Our current effort to develop microarray-based promoter models for co-expressed genes based on the Werner approach [33] will facilitate our understanding of regulatory control of genes with high contribution to F₂. Progress in this effort is limited by the rate at which we can manually select genes with high contribution to F₂, exon map the genes, fetch their upstream DNA promoter sequences, and then search for common transcription binding sites among the multiple promoter sequences in order to infer coregulation.

The observation of intermittency in the data analyzed provides a complimentary handle on moderately expressed genes, generally not tackled by conventional techniques. Biologists often focus on strongly downregulated or upregulated genes which are characterized by large negative and positive LERs. Our method of looking at intermittency in gene expression focused on the clustering of LERs independent of their absolute expression value. Thus, we were able to detect large density fluctuations among small LERs. As an example, spikes near the center of the binned distribution in Figure 1 whose LER-values were low greatly increased the factorial moments. Therefore, fold-change analysis, which focuses on large negative and positive LERs, or other multivariate statistical methods such as hierarchical cluster and principal component analyses, can miss unique density fluctuations at low LER values which are detected by factorial moments.

References

OPAL Collaboration. Eur. Phys. J. C. 1999, 11: 239-10.1007/s100520050629.
EHS/NA22 Collaboration. Phys. Lett. B. 1996, 382: 305-10.1016/0370-2693(96)00749-6.
EMU-01 Collaboration. Z. Phys. C. 1997, 76: 659-10.1007/s002880050588.
SLDCollaboration, SLAC-PUB-95-7027. 1996
WA80 Collaboration. Nuc. Phys. A. 1992, 545: 311c-10.1016/0375-9474(92)90471-U.
Kun F, Sorge H, Sailer K, Bardos G, Greinber W: Phys. Lett. B. 1995, 355: 349-10.1016/0370-2693(95)00697-J.
Article CAS Google Scholar
Jie Z, Shaoshun W: Phys. Lett. B. 1996, 370: 159-10.1016/0370-2693(96)00068-8.
Article Google Scholar
Gary JW: Nuc. Phys. B. 1999, 71S: 158-
Google Scholar
Bozek P, Ploszajczak M: Nuc. Phys. A. 1992, 545: 297c-10.1016/0375-9474(92)90470-5.
Article Google Scholar
Bialas A, Peschanksi R: Nuc. Phys. B. 1986, 273: 703-10.1016/0550-3213(86)90386-X.
Article Google Scholar
Bialas A, Peschanksi R: Nuc. Phys. B. 1988, 308: 857-10.1016/0550-3213(88)90131-9.
Article Google Scholar
Bialas A: Nuc. Phys. A. 1991, 525: 345c-10.1016/0375-9474(91)90344-6.
Article Google Scholar
Blazek M: Int. J. Mod. Phys. A. 1997, 12: 839-10.1142/S0217751X97000645.
Article CAS Google Scholar
Bozek P, Ploszajczak M, Botet R: Phys. Rep. 1995, 252: 101-10.1016/0370-1573(94)00075-E.
Article CAS Google Scholar
Fu J, Wu Y, Liu L: Phys. Lett. B. 2000, 472: 161-10.1016/S0370-2693(99)01380-5.
Article CAS Google Scholar
Shaoshun W, Ran L, Zhaomin W: Phys. Lett. B. 1998, 438: 353-10.1016/S0370-2693(98)01129-0.
Article Google Scholar
Sarcevic I: Nuc. Phys. A. 1991, 525: 361c-10.1016/0375-9474(91)90345-7.
Article Google Scholar
DeWolf EA, Dremin IM, Kittle W: Phys. Rep. 1996, 270: 1-10.1016/0370-1573(95)00069-0.
Article Google Scholar
Kadanoff LP: Physics Today, August. 2001, 34:
Google Scholar
Spellman PT, Sherlock G, Zhang MQ, et al: Mol. Biol. Cell. 1998, 12: 3273-
Article Google Scholar
Ross DT, Scherf U, Eisen MB, et al: Nat. Genet. 2000, 24: 208-10.1038/73400.
Article Google Scholar
Fadda D, Slezak E, Bijaoui A: Astron. Astrophys. Suppl. Ser. 1998, 127: 335-10.1051/aas:1998355.
Article Google Scholar
Epanechnikov VA: Theor. Prob. Appl. 1969, 14: 163-
Article Google Scholar
Chu S, DeRisi JL, Eisen MB, et al: Science. 1998, 282: 699-10.1006/jmbi.1998.2134.
Article PubMed CAS Google Scholar
Baylin SB: Science. 1997, 277: 1948-10.1126/science.277.5334.1948.
Article PubMed CAS Google Scholar
Wildsmith SE, Archer GE, Winkley AJ, et al: Biotechniques. 2001, 30: 202-
PubMed CAS Google Scholar
Brazma A, Hingcamp P, Quackenbush J, et al: Nature Genet. 2001, 29: 365-10.1038/ng1201-365.
Article PubMed CAS Google Scholar
Lee M-L.T, Kuo FC, Whitmore GA, et al: P.N.A.S. 2000, 97: 9834-10.1073/pnas.97.18.9834.
Article PubMed CAS PubMed Central Google Scholar
Brown CS, Goodwin PC, Sorger PK: P.N.A.S. 2001, 98: 8944-10.1073/pnas.161242998.
Article PubMed CAS PubMed Central Google Scholar
Wang X, Ghosh S, Guo S-W: Nucl. Acids. Res. 2001, 29: e75-10.1093/nar/29.15.e75.
Article PubMed CAS PubMed Central Google Scholar
Wuensche A: Pac. Symp. Biocomp. 1998, 3: 89-
Google Scholar
Lockhart DJ, Winzeler EA: Nature. 2000, 405: 827-10.1038/35015701.
Article PubMed CAS Google Scholar
Werner T: Biomol. Eng. 2001, 17: 87-10.1016/S1389-0344(00)00071-X.
Article PubMed CAS Google Scholar
Pilpel Y, Sudarsanam P, Church GM: Nat. Genet. 2001, 29: 153-10.1038/ng724.
Article PubMed CAS Google Scholar

Download references

Acknowledgments

L.E.P. acknowledges the support of grant CA-78199-04 of the National Cancer Institute, and C. Aime, F. Kun, A. Loctionov, M. Patra, R. Peschanki, and E. Sarkisyan-Grinbaum for helpful discussions on intermittency. K.L. is supported in part by the U.S. Department of Energy, Grant no. DE-FG03-96ER41004, and in part by the Texas Advanced Research Program, Grant no. 3652-0023-1999.

Author information

Authors and Affiliations

Departments of Medicine, Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, 77030, USA
Leif E Peterson
Department of Physics, University of Houston, Texas, 77204, USA
Kwong Lau

Authors

Leif E Peterson
View author publications
You can also search for this author in PubMed Google Scholar
Kwong Lau
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Leif E Peterson.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Peterson, L.E., Lau, K. Observation of intermittency in gene expression on cDNA microarrays. Genome Biol 3, preprint0005.1 (2002). https://doi.org/10.1186/gb-2002-3-7-preprint0005

Download citation

Received: 27 May 2002
Published: 29 May 2002
DOI: https://doi.org/10.1186/gb-2002-3-7-preprint0005

Observation of intermittency in gene expression on cDNA microarrays

Abstract

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Genome Biology

Contact us

Observation of intermittency in gene expression on cDNA microarrays

Abstract

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Genome Biology

Contact us