Observations on shifted cumulative regulation
© BioMed Central Ltd 2011
Published: 27 April 2011
A response to Dynamic cumulative activity of transcription factors as a mechanism of quantitative gene regulation by F He, J Buer, AP Zeng and R Balling. Genome Biol 2007, 8:R181.
Comment on He et al.: http://genomebiology.com/2007/8/9/R181
Studying the collaborative effects of multiple regulators is key to understanding the basic principles of gene regulation. He et al. proposed a shifted cumulative model to dissect combinatorial gene regulation. They discovered significant correlations between the combined expression profiles of regulators and the expression time series of their target gene. The work highlighted the importance of identifying integrative effects of multiple transcription factors and showed that this identification was possible. We performed a series of experiments to study possible combinatorial regulatory mechanisms following their strategy, but we found that the correlation among three genes can increase significantly after time-shifted combination regardless of whether regulatory relationships exist among them. Our observations led to the conclusion that such increases are not sufficient to infer cumulative regulatory relationships.
where m is the number of regulators (m = 2 in our study, as we considered only combinations of two regulators). We used the Pearson correlation coefficient (PCC) to measure the correlation between a transcription factor (TF), or the combined profile, and the target gene. We adjusted τ_i to obtain the combined profile that has the largest correlation with the target gene. The analysis of He et al. indicates that a notable increase in the correlation of a target gene with the combined profile after time-shifting could indicate the existence of collaborative regulation.
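For reference, the PCC is the standard sample correlation. The symbols c_t and g_t below are shorthand introduced here for illustration only (they are not notation from He et al.): c_t denotes the combined regulator profile and g_t the target profile at time point t = 1, ..., n, with bars denoting means over time.

```latex
r(c, g) = \frac{\sum_{t=1}^{n} (c_t - \bar{c})(g_t - \bar{g})}
               {\sqrt{\sum_{t=1}^{n} (c_t - \bar{c})^2}\,\sqrt{\sum_{t=1}^{n} (g_t - \bar{g})^2}}
```

Because r is unchanged by a positive rescaling or a constant offset of either profile, only the relative weights and the time shifts of the regulators can alter the score.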
One can understand the reason for the above observation using the framework of vector decomposition. Any time series of n points can be treated as a vector in an n-dimensional space, so it can be expressed as a weighted sum of any n linearly independent vectors. When considering two regulators and their target gene, the time-shifting procedure is equivalent to searching through all combinations of two vectors to best represent the target vector. Such a search can be expected to improve the correlation between the combined profile and the target even if the genes are unrelated. This can also be viewed as an overfitting problem, as there are too many parameters in the model. If we can further restrict the number of parameters or their search space by properly introducing extra knowledge or hypotheses, the overfitting problem may be eased or solved.
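The effect described above can be reproduced with purely random profiles. The following is a minimal sketch, not the procedure of He et al. nor the exact experiments reported here. It assumes ten time points, Gaussian noise profiles, integer shifts of up to three time points applied by truncating to the overlapping window, and weights fitted by ordinary least squares with an intercept. The function name `best_shifted_combination` and all parameter values are hypothetical.

```python
import numpy as np


def pcc(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


def best_shifted_combination(x1, x2, y, max_shift=3):
    """Search shifts (tau1, tau2) and fit weights by least squares.

    Returns the largest PCC between the target y and a weighted sum of the
    two shifted regulator profiles, and the shifts that achieve it.
    Assumption: a shift tau pairs the regulator at time t - tau with the
    target at time t, using only the time points where both exist.
    """
    n = len(y)
    best_r, best_taus = -1.0, (0, 0)
    for tau1 in range(max_shift + 1):
        for tau2 in range(max_shift + 1):
            start = max(tau1, tau2)
            a = x1[start - tau1 : n - tau1]
            b = x2[start - tau2 : n - tau2]
            target = y[start:]
            # Weighted sum of the two shifted profiles plus an intercept.
            design = np.column_stack([a, b, np.ones(len(target))])
            coef, *_ = np.linalg.lstsq(design, target, rcond=None)
            r = pcc(design @ coef, target)
            if r > best_r:
                best_r, best_taus = r, (tau1, tau2)
    return best_r, best_taus


# Unrelated "regulators" and "target": independent Gaussian noise.
rng = np.random.default_rng(0)
n_points, n_trials = 10, 200
single, combined = [], []
for _ in range(n_trials):
    x1, x2, y = rng.standard_normal((3, n_points))
    # Best single-regulator score without shifting (absolute value,
    # so that activation and repression are treated alike).
    single.append(max(abs(pcc(x1, y)), abs(pcc(x2, y))))
    combined.append(best_shifted_combination(x1, x2, y)[0])

single = np.array(single)
combined = np.array(combined)
print("mean best single |PCC|:", single.mean())
print("mean best combined PCC:", combined.mean())
```

Because the unshifted pair is among the candidates, and a least-squares fit cannot correlate worse with the target than either regulator alone, the combined score is at least as large as the best single-regulator |PCC| in every trial, even though the profiles are unrelated by construction.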
In conclusion, our experiments illustrate that a significant correlation observed after time-shifting may not be sufficient to infer shifted cumulative regulation. Although we believe that dynamic cumulative regulation can occur in cells, further data and other methods of data analysis are needed to identify it.
The observations reported by Ye et al. above describe the well-known problem of overfitting in computational biology. Their experiments seem to indicate that the shifted cumulative model we reported, which uses combinatorial expression profiles based on the integration of conversion efficiencies and time delays, may not be usable for inferring shifted cumulative gene regulation.
In Figure 4b, equations (i) and (ii) require that the time when a given regulator starts to function is independent of its individual target genes in the corresponding convergence mode. Note that the starting times of different regulators in a given convergence mode may differ from each other. The same applies to the constraints concerning the conversion efficiency and the latest starting time of different regulators. Equations (iii) and (iv) ensure that the conversion efficiency used for a given regulator is the same for different target genes in the corresponding convergence mode. In addition to restricting our analysis to convergence modes with more than one target gene (equation (viii)), we also required that the target genes are not activated (or suppressed) earlier than the time when the regulators start to function (equation (v)). Furthermore, the time when a given regulator starts to function is constrained to be within one cell cycle (we used ten time points in the data of Cho et al., which cover approximately one cycle) by equations (vi) and (vii). We explained all the constraints used in our work in the sections 'Quantification of shifted cumulative regulation of gene expression: principle of the approach' and 'Conversion efficiency and time delay among regulators' of our original paper. All eight equations were used as constraints to optimize the correlation between the combinatorial expression profile of the two regulators and the profiles of all their target genes at the same time (defined in paragraph 2 of the section 'Time delay from regulators to target genes' of our original paper). The same constraints were also used for randomized networks (see the sections 'Significant difference between results for the original and randomly generated expression data and between results for the original network and randomly generated networks' and 'Multiple hypothesis testing' in our original paper).
Ye et al. state, 'In these experiments, we removed regulator pairs that have only a single target ...', which indicates that they used the constraint given by equation (viii). They also write, 'The time shift between two regulators is fixed among their multiple targets'. This does not necessarily mean that the time when a given regulator starts to function is fixed among the multiple targets. Even if they fixed the time when a given regulator starts to function (as required by our equations (i) and (ii)), the other five important constraints (equations (iii), (iv), (v), (vi) and (vii)) out of the eight were apparently not used in their approach. It is also not clear whether they used the same definition of optimal correlation as we did.
With the eight constraints and our definition of optimal correlation applied, the success percentage at each threshold in a relatively high score range is significantly higher in the original network than in random networks (for details see the section 'Significant difference between results for the original and randomly generated expression data and between results for the original network and randomly generated networks' in our original paper). The average ratio of the success percentages between the original network and random networks in the range of significant correlation thresholds (≥13) is 1.865.
In addition, it seems to us from Figures 1 and 2 that Ye et al. have mixed the low scores and high scores together, which dilutes the contribution of high scores to the average values. This leads to a loss of information about the proportion of high scores and should not be done. In contrast to Ye et al., we used only the scores in a relatively high range because those high scores might indicate biological relevance and cannot be easily obtained by chance. We therefore successfully reduced the overfitting problem, as shown in Figure 2b,d of the original paper.
The overfitting problem is one of the key issues in computational/systems biology and is often not appropriately addressed. In almost all modeling approaches, attempts are made to strike a balance between the number of variables and the number of constraints. We tried to integrate as many constraints as possible to maintain the biological relevance of the model. It seems to us that the inability of Ye et al. to derive significant differences between the experimental and random networks is due to the fact that they used far fewer constraints, leading to overfitting.