Circular reasoning rather than cyclic expression
© BioMed Central Ltd 2008
Published: 23 June 2008
A response to Combined analysis reveals a core set of cycling genes by Y Lu, S Mahony, PV Benos, R Rosenfeld, I Simon, LL Breeden and Z Bar-Joseph. Genome Biol 2007, 8:R146.
Transcriptome analyses have identified hundreds of genes that are periodically expressed during the mitotic cell cycle in each of four distantly related eukaryotes (budding yeast [1–3], fission yeast [4–6], human and Arabidopsis thaliana). In a paper published in Genome Biology, Lu and co-workers challenge the results of earlier comparative studies [4, 10–15] by claiming that cell-cycle-regulated transcription is much more conserved at the level of individual genes than previously thought. However, we question the validity of their analysis as it relies on circular reasoning, allows evidence from homologous genes to overrule experimental evidence from a gene itself, assesses conservation on the basis of homology rather than orthology, and equates cell-cycle function with cell-cycle regulation.
Previous global studies of cell-cycle-regulated expression analyzed the microarray data from each organism individually and then used orthology relationships derived from sequence homology to compare the regulation of conserved genes. By contrast, Lu and co-workers also use sequence homology to transfer the evidence for periodic expression between sequence homologs within and between organisms [9, 16]. If a conserved gene appears periodic in, say, the two yeasts and the plant, then the algorithm may transfer this evidence to the human ortholog of the gene and conclude that it too is periodically expressed. A simplified interpretation of the method is thus that it averages the evidence for and against periodic expression across homologous genes. However, homology transfer is only valid if the transferred property is indeed highly conserved, and it logically follows that one cannot use a method that transfers a property to assess how conserved the property is. The main conclusion of Lu et al., namely that cell-cycle regulation is more conserved than suggested by earlier studies, is thus based on circular reasoning as it is a built-in assumption of their method.
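The circularity can be made concrete with a toy calculation. The sketch below follows only the simplified interpretation given above (averaging evidence across homologs); it is not the model used by Lu et al., and the scores, the 0.5 cutoff, the equal weighting and the function name are all hypothetical.

```python
from statistics import mean


def transferred_score(own_score, homolog_scores, weight=0.5):
    """Blend a gene's own periodicity score with the mean score of its homologs.

    Hypothetical illustration: `weight` is the share given to homolog evidence.
    """
    if not homolog_scores:
        return own_score  # nothing to transfer
    return (1 - weight) * own_score + weight * mean(homolog_scores)


THRESHOLD = 0.5  # assumed cutoff for calling a gene "periodic"

# A gene with weak evidence of its own, but strongly periodic homologs.
own = 0.2
homologs = [0.9, 0.8, 0.9]

before = own >= THRESHOLD                               # call from the gene's own data
after = transferred_score(own, homologs) >= THRESHOLD   # call after transfer
print(before, after)
```

With these made-up numbers the gene is called non-periodic on its own data and periodic after transfer, so counting it as "conserved" would only restate the assumption that made the transfer possible.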
Nonetheless, Lu et al. say that only "5% to 7% of cycling genes in each of four species have cycling homologs in all other species" and thus agree with previous studies that the vast majority of the cycling genes in an organism do not have cycling homologs in other eukaryotes. When taking into account the limited sensitivity of microarray experiments, we estimate on the basis of our genome-wide comparison that 2% to 8% of the genes in an organism (5 to 22 orthologous groups) belong to the core set of conserved cycling genes (see Supplementary Information of our earlier paper). Whether this is much or little is clearly in the eye of the beholder.
The opposite scenario is illustrated by the genes mcm3 and mcm5, both of which are mentioned specifically by Lu et al. and are even included on the list of fission-yeast genes whose periodicity is supposedly conserved across all four organisms (a class designated by Lu et al. as CCC4). These genes exhibit only low-amplitude oscillations in one of ten timecourses, and this is unlikely to be due to active regulation. In fact, mcm5 is among the 30% least cycling genes in fission yeast according to our analysis [13, 14, 17]. The combined algorithm by Lu and co-workers thus produces both false negatives and false positives by letting evidence transferred by sequence homology overrule experimental data on the gene itself.
Fission yeast mcm3 and mcm5 belong to a group of six genes, each encoding a distinct subunit of the hexameric MCM complex, which is involved in initiation of DNA replication. The MCM genes are all conserved as 1:1:1:1 orthologs across the four organisms studied [14, 18, 19]. However, although Lu et al. have all six MCM genes from budding yeast as "conserved cycling genes" (CCC4), only mcm5 is present on all four CCC4 lists. The underlying problem is that their algorithm [9, 16], unlike earlier global analyses [4, 11, 13, 14], does not distinguish between orthologs and homologs. A gene cluster may thus contain paralogous genes that arose from gene duplication before the last common ancestor of present-day eukaryotes. This is well illustrated in Figure 1d of Lu et al., in which the four orthologous CDC6 genes form a cluster that also contains ORC1 from human and budding yeast (but not from fission yeast and A. thaliana). Although both CDC6 and ORC1 are presumed to share ancestry with archaeal cdc6 [19, 20], they perform distinct, conserved functions in eukaryotes. We consider it questionable to make inferences about, for example, the expression of human ORC1 based on expression data from budding yeast CDC6.
The orthology problem affects many proteins, including probably the most studied of all cell-cycle proteins, the cyclins (Figure 1c in Lu et al.). Whereas we agree that the periodic expression of B-type cyclins is conserved, the list of human conserved cycling genes from Lu et al. also includes those encoding A-, E- and F-type cyclins, although these do not exist in yeasts [18, 19]. Tubulins are also listed as conserved cycling genes for each of the four organisms, but the cycling tubulins listed for A. thaliana are beta-tubulins, whereas none of the human beta-tubulins cycles. It logically follows that none of the tubulins has periodically expressed orthologs in all four organisms. Systematic, manual checking of all genes on the CCC4 lists reveals that the orthology problem affects almost half of them. The use of the term "conserved cycling gene" is, in our view, therefore misleading, as it does not imply that cyclic expression is conserved between functionally equivalent, orthologous genes.
Given the problems described above, how then can it be that the numerous comparisons with other data presented by Lu and co-workers all point in the direction that their lists are better than existing ones? The answer lies in the subtle but important distinction between 'cell-cycle function' and 'cell-cycle regulation'. Figure 1 of this Correspondence exemplifies the difference: whereas all six genes are involved in the cell cycle, only four of them (Plo1, CDC5, Sid2, and DBF2) are transcriptionally regulated during the cell cycle. Many of the tests performed by Lu et al. to support the validity of their proposed cycling genes do not assess cycling expression per se. Datasets from conditions such as stationary-phase budding yeast, non-proliferating human tissues, developmentally arrested A. thaliana and nitrogen-starved fission yeast are measures of downregulation in non-proliferating cells, which do not necessarily correlate with cyclic expression. The problem is that any gene involved in the cell cycle should be downregulated under these conditions, whether it is expressed in a phase-specific manner or not. The authors also analyze the enrichment for essential genes and genes annotated with relevant Gene Ontology terms; however, no statistical analysis can change the fact that these are inherently related to the phenotype or function of a gene rather than to its regulation. The vast majority of the comparisons by Lu et al. only show that their set of conserved cycling genes is enriched for genes with cell-cycle function, but not that they are subject to transcriptional cell-cycle regulation. Indeed, we have previously observed that methods with good performance on a benchmark set based on functional evidence often perform poorly on more reliable benchmark sets based on regulatory evidence.
Lu and co-workers also compared their list of cycling genes from budding yeast with the targets of nine cell-cycle transcription factors [23, 24]. This is, in our view, a much better gold standard as it is based on experimental evidence that is directly linked to cell-cycle regulation and not to cell-cycle function. However, this benchmark showed that the list proposed by Lu et al. and the original list proposed by Spellman et al. are equally enriched for targets of cell-cycle transcription factors. Similar benchmarks based on regulatory evidence from the three other organisms even suggest that transfer of evidence between homologous genes leads to a decrease in performance.
In summary, homology-based transfer of expression data and other experimental evidence is a powerful strategy for function prediction, as protein function is often conserved over long evolutionary distances. However, several studies have shown that the regulation of genes and proteins changes much more rapidly during evolution than their function [4, 10–14, 26–32]. We have previously shown that, despite the lack of conserved regulation at the single-gene level, the organisms regulate the same protein complexes, but do so via different subunits [14, 15]. By transferring cell-cycle expression data between distantly related genes, Lu et al. were thus able to identify genes with cell-cycle function that cannot be identified as such on the basis of the expression of the genes themselves (for example, fission yeast mcm3 and mcm5; Figure 1). Selecting the correct evolutionary timescale for the property in question, be that function or regulation, is the key to success for any homology-based method.
Yong Lu, Shaun Mahony, Panayiotis V Benos, Roni Rosenfeld, Itamar Simon, Linda L Breeden and Ziv Bar-Joseph respond:
The central claim Jensen et al. raise in this Correspondence is that our method is circular. We believe that they confuse assumptions with circularity. Any computational method relies on specific assumptions and, if these assumptions are wrong, the conclusions of that method may be wrong as well. For example, sequence alignment relies heavily on assumptions regarding the parameters used for match, mismatch and gaps. As Dewey et al.  nicely show, these parameters can have a big impact on the results of aligning non-coding regions. Nonetheless, researchers have been using these methods for a long time with specific parameter choices and have arrived at very specific biological conclusions. Like our method, these findings are dependent, at least in part, on the choice of parameters for matches that are directly related to the conclusions drawn. Yet they have proved both useful and accurate when validating with independent data.
One of the major difficulties in identifying genes whose cell-cycle-regulated transcription is conserved across evolution is that cell-cycle microarray data are noisy and often contradictory. Jensen et al. identified the top 300 periodic transcripts from each of four human datasets and found only 63 transcripts in common to all four. With only a 20% overlap between the most periodic 300 transcripts in four datasets from the same organism, there is little doubt that a comparison across four highly diverged species is problematic. The approach of Jensen et al. was to use thresholds that are "more conservative than those originally proposed" and to analyze a smaller, more reliable subset of cyclic transcripts. Our goal was not to exclude, but to capture as many cyclic transcripts as possible, with the view that interesting candidates could be subjected to further verification.
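The overlap figure quoted above is simple set arithmetic. The following minimal sketch shows the calculation; the helper name and the toy ranked lists are hypothetical, and only the numbers 63 and 300 come from the text.

```python
def shared_fraction(ranked_lists, top_n):
    """Fraction of the top_n entries that appear in the top_n of every list."""
    top_sets = [set(ranked[:top_n]) for ranked in ranked_lists]
    common = set.intersection(*top_sets)
    return len(common) / top_n


# Figures quoted in the text: 63 transcripts shared among the top 300
# of each of four human datasets, i.e. roughly one in five.
quoted_overlap = 63 / 300

# Toy example with made-up transcript identifiers.
toy_lists = [
    ["a", "b", "c", "d"],
    ["b", "a", "x", "c"],
    ["a", "b", "y", "z"],
]
toy_overlap = shared_fraction(toy_lists, top_n=3)
```

In the toy example, only "a" and "b" are in the top three of all lists, so two of three positions are shared.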
Our approach was motivated by the plot in Figure 2, which shows that fission-yeast orthologs of cycling budding-yeast genes fall just below the fission-yeast threshold for periodicity far more than expected from chance (p-value < 0.01 using Wilcoxon rank-sum test, p-value < 0.03 using Kolmogorov-Smirnov double-sided test). We have attempted to capture these borderline genes by lowering the threshold for borderline genes if their homologs in other species are cyclic and raising them if they are not cyclic. This strategy will certainly lead to more false assignments, but it has also allowed us to identify hundreds of promising candidates for further investigation. Still, almost all the genes that are elevated to a cyclic status by our method have a rather high cyclic expression score to begin with. Figure 3 shows the difference between the initial score (based on expression alone) and the posterior score from our method. As can be seen in the plot, the ranks for most genes do not change much.
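The idea of moving the cutoff for borderline genes can be sketched as follows. This is only an illustration of the strategy as described in words above, not the published algorithm: the fixed shift, the majority rule and all names are hypothetical.

```python
def adjusted_threshold(base, homolog_is_cyclic, shift=0.1):
    """Lower the cutoff when most homologs cycle, raise it when most do not.

    `base`, `shift` and the majority rule are hypothetical choices made for
    illustration; they are not taken from the published method.
    """
    if not homolog_is_cyclic:
        return base  # no homologs, so no adjustment
    cyclic_fraction = sum(homolog_is_cyclic) / len(homolog_is_cyclic)
    if cyclic_fraction > 0.5:
        return base - shift
    if cyclic_fraction < 0.5:
        return base + shift
    return base


def is_called_cyclic(score, base, homolog_is_cyclic, shift=0.1):
    """Call a gene cyclic if its own score clears the adjusted cutoff."""
    return score >= adjusted_threshold(base, homolog_is_cyclic, shift)


# A gene just below an assumed cutoff of 0.5 whose homologs mostly cycle.
borderline_promoted = is_called_cyclic(0.45, 0.5, [True, True, False])
# A gene just above the cutoff whose homologs do not cycle.
borderline_demoted = is_called_cyclic(0.55, 0.5, [False, False])
```

Under these assumptions the first gene is promoted to cyclic status and the second is demoted, while genes far from the cutoff are unaffected.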
Jensen et al. also question the complementary datasets we used to validate the CCC sets identified by our algorithm. They claim that the complementary datasets we used only point to cell-cycle function rather than cell-cycle regulation. However, the 'functional rather than regulatory identification' claim does not provide an explanation as to how our algorithm was able to identify these 'functional' cell-cycle genes. In our analysis we used controls for both types of data (expression and sequence). Specifically, for the essentiality analysis we show that only 16% of cycling yeast genes are essential. If one uses sequence data, so that only genes with conserved homologs in other species are retained, this percentage increases to 27%. If what we find is indeed functional rather than regulatory signal, cyclic expression in other species would not have been a factor and the only advantage we would have would come from using sequence data. However, when we use both sequence conservation and conserved cyclic expression, as determined by our method, the percentage rises to 46%, a more than 70% increase over sequence alone. Similar results were obtained for the human conserved set. We have repeated this type of positive control for the other types of complementary analysis and have shown that expression conservation leads to much stronger cell-cycle characteristics.
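The "more than 70%" figure is a relative increase between the two percentages of essential genes. A minimal check of that arithmetic, using only the percentages quoted above (the function and variable names are illustrative):

```python
def relative_increase(old_pct, new_pct):
    """Relative change from old_pct to new_pct, expressed in percent."""
    return (new_pct - old_pct) / old_pct * 100


# Percentages of essential genes quoted in the text.
sequence_only = 27            # conserved homologs, sequence data alone
sequence_and_expression = 46  # sequence conservation plus conserved cycling

gain = relative_increase(sequence_only, sequence_and_expression)
print(round(gain, 1))
```

The gain is 19 percentage points on a base of 27, which is just over 70% in relative terms.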
We have also carried out direct regulatory analysis. Table 1 in our original paper presents the results of motif searches for genes in CCC2, the set of cycling genes conserved between the two yeasts. We show that these genes have a remarkably well conserved motif for G1 and some of the S-phase transcription factors. In sharp contrast, non-cycling homologs of genes in CCC2 do not have these motifs conserved. The fact that motif conservation agrees with our expression conservation findings strongly supports the CCC2 set assignment.
The other major issue raised here by Jensen et al. relates to the problem of identifying conserved periodic genes whose products carry out the same function in all four of these highly divergent species. Jensen et al.  used a combination of sequence similarity and manual curation to identify orthologous groups. In most cases, it cannot be determined whether these groups are really functionally equivalent or whether all such groups have been identified. Nevertheless, on the basis of these assignments, only a quarter of all the cycling genes they studied had orthologs in all four species and these form the basis for their comparison. Of the 60 cycling genes in Arabidopsis with orthologs in all four species, one-third of their orthologs also cycle in pairwise comparisons with each of the other three species, but only five cycle in all four species. All five of these orthology groups represent well studied genes and nothing new was identified.
We purposely avoided restricting our analysis only to genes with clear orthologs across species. Rather, we used BLAST analysis followed by a Markov cluster algorithm, which leads to the identification of multi-domain homologous proteins. This difference between the definitions of homologs impacts on the conclusions reached by us and Jensen et al. Our method results in large families that show high homology overall but cannot be parsed into one-to-one orthologous pairs across species. In our original paper, we presented analysis of the results of this procedure for the CCC2 set of conserved cycling genes. We found that 82% of budding yeast genes in CCC2 are indeed curated homologs of the fission yeast CCC2 genes, a very high rate that indicates the accuracy of the resulting CCC2 set.
As we compare the genes from more divergent species, we are much less likely to be able to ascribe functional equivalence to any given pair. This is especially true for signaling and regulatory proteins that often arise from duplicated genes, and which cannot be forced into functionally equivalent orthology groups until we have a complete understanding of what they do in every species. Jensen et al. are correct that there is no cyclin E ortholog in yeast. There is also no cyclin E in Arabidopsis. However, all four species encode related cyclin genes carrying out functions in late G1 that are important for the transition to S phase, and most of these cyclins are cell-cycle-regulated at the transcriptional level. These are the very types of gene products that we are most interested in identifying.
Towards this end we used an objective and comprehensive strategy for identifying multi-domain sequence homologies across all four genomes. In so doing, we have identified groups of genes that share some truly remarkable properties. The 72 conserved cyclic budding-yeast genes that are also conserved in fission yeast and humans (CCC3) are eight times more likely to be targets of cyclin-dependent kinases than those tested at random, and six times more likely to be involved in protein-protein interactions. Some of these genes encode unexpected proteins (for example, alkaline phosphatase and metal transporters) and there are others about which nothing is known. To further study this set we carried out new experiments to identify the set of cycling genes in primary human cells (our previous analysis, as well as that of Jensen et al., is based on expression data from transformed (HeLa) cells). As we discuss in that work, the set of genes cycling in primary cells is significantly more enriched than the HeLa set for orthologs of cycling genes in budding and fission yeast. We hope that our study will spur the collection of more cell-cycle data and the development of new strategies for identifying conserved periodically transcribed genes.
Correspondence should be sent to Ziv Bar-Joseph: Department of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA. Email: firstname.lastname@example.org