The determinants of gene order conservation in yeasts
© Poyatos and Hurst; licensee BioMed Central Ltd. 2007
Received: 10 July 2007
Accepted: 5 November 2007
Published: 05 November 2007
Why do some groups of physically linked genes stay linked over long evolutionary periods? Although several factors are associated with the formation of gene clusters in eukaryotic genomes, the particular contribution of each feature to clustering maintenance remains unclear.
We quantify the strength of the proposed factors in a yeast lineage. First we identify the magnitude of each variable to determine linkage conservation by using several comparator species at different distances to Saccharomyces cerevisiae. For adjacent gene pairs, in line with null simulations, intergenic distance acts as the strongest covariate. Which of the other covariates appear important depends on the comparator, although high co-expression is related to synteny conservation commonly, especially in the more distant comparisons, these being expected to reveal strong but relatively rare selection. We also analyze those pairs that are immediate neighbors through all the lineages considered. Current intergene distance is again the best predictor, followed by the local density of essential genes and co-regulation, with co-expression and recombination rate being the weakest predictors. The genome duplication seen in yeast leaves some mark on linkage conservation, as adjacent pairs resolved as single copy in all post-whole genome duplication species are more often found as adjacent in pre-duplication species.
Current intergene distance is consistently the strongest predictor of synteny conservation as expected under a simple null model. Other variables are of lesser importance and their relevance depends both on the species comparison in question and the fate of the duplicates following genome duplication.
The precise location of genes in eukaryotic genomes was assumed to be largely random not so long ago. This was motivated by the understanding that, unlike in bacteria, there need not be chromosomal domains associated with high rates of gene transcription. Common reports of chromosome inversions with little effect on phenotype confirmed the picture of random placement of genes and a lack of selective constraint on gene order.
However, recent studies in diverse eukaryotes challenge this initial intuition. Indeed, in all well-studied eukaryotic genomes, genes of similar expression tend to cluster more commonly than expected by chance. For example, in humans both broadly [4, 5] and highly [6, 7] expressed genes cluster, while in yeast highly co-expressed genes are neighbors more commonly than expected. The same tendency for genes that are physically close to be co-expressed might additionally explain why genes whose proteins are close in either the metabolic [9, 10] or protein-protein interaction network [11, 12] are in close chromosomal proximity more commonly than expected. More subtle organizations have also been claimed, such as periodicity in gene location, but this appears to be caused by data biases. Not all patterns are necessarily associated with co-expression of some variety. Most notably, in yeast, essential genes cluster into domains of low recombination. The clustering of essential genes may be more to do with ensuring precise control over expression (that is, minimal noise), rather than co-expression per se.
While all these previous analyses helped to clarify some of the factors associated with gene order, they also opened new questions. Most particularly, how important, in absolute and relative terms, are all of these features? If we take a pair of genes adjacent to each other in Saccharomyces cerevisiae, we can then ask whether the same two genes are also adjacent or not in a different species. How relevant are the above parameters in explaining which genes are adjacent in both species? In yeast, intergene distance and co-expression have been shown to be two independent determinants of gene order conservation using Candida albicans as comparator species . Intergene distance is expected under the simplest neutral null model of gene order evolution. This is because we suppose that a re-arrangement that disrupts a gene will not be tolerated, hence those genes currently with a large intergene distance between them are more likely to be affected by viable gene re-ordering events, all else being equal (note that under this simplest null model it follows that overlapping genes are impossible to break up). Likewise, a pair of genes that currently have a large intergene distance between them are more likely to have had in the past a large intergene distance, even if they were not immediately next to the genes that are currently their neighbors.
Other evidence suggests that this null model alone is not adequate. Notably, essential genes tend to stay together more commonly than expected by chance, although their mean intergenic distance is unexceptional [15, 18]. Whether this is owing to selection per se or simply a reduced probability of chromosomal re-arrangements in domains of low recombination  remains unclear. We can then ask a series of questions. First, if we treat each parameter in isolation, we can ask whether that parameter explains a significant proportion of conservation of gene order. Second, in a fuller model we can ask how relatively important and independent each of the parameters might be. Third, are the results of the above analyses sensitive to which comparator species we employ to compare with S. cerevisiae? Fourth, can we predict the characteristics of those gene pairs that through all lineages of yeast have remained physically together? Finally, what will be the effect of the differential gene silencing associated with the whole genome duplication in the yeast lineage?
To address these questions, we computed a group of potential determinants in S. cerevisiae and quantified how they determined linkage conservation in a full yeast lineage.
Results and discussion
A neutral model of gene order evolution
While in principle a relationship between intergene distance and conservation rates of genes that are immediate neighbors seems reasonable, a problem in demonstrating this derives from the fact that intergene distance data that we can directly obtain from genome sequencing describe the situation after the process of evolution from an ancestor. If we assume that DNA is neither lost nor gained, then a gene pair with a small intergene distance in S. cerevisiae may have a small distance either because the pair have always resided together and the intergene distance has not changed or because the pair came together following an inversion and this inversion just happened to bring with it a small intergene spacer. Moreover, two genes may be together in both S. cerevisiae and a relatively distant comparator, for example, C. albicans, not because they have always been immediate neighbors but because repeated events broke them up but also re-positioned them, bringing them back together. To further investigate the extent to which intergene distance might differ between genes that are immediate neighbors in any two species and those that are immediate neighbors only in one of the two species, we performed a set of neutral simulations.
In these simulations we consider a chromosome with 400 genes. The intergene distance between each adjacent gene pair is randomly selected from intergene distances currently observed in S. cerevisiae (after removal of overlapping transcripts). We then randomly select a position on the chromosome and accept this position if it lies in an intergene spacer. We then pick a point that is approximately 5 kb upstream or downstream of the selected chromosome location and accept it as the end point of the inversion if it also lies in an intergene spacer. This distance approximately matches the mean size of the small inversions seen in yeasts. We then invert the sequence, thereby altering the intergene distance between, at most, two pairs of genes. We then carry on evolving the new chromosome over numerous rounds of inversions. Each simulation runs for 1,000 inversions and is repeated 100 times.
Predictors of gene order conservation
We consider specifically seven factors either previously associated with the formation of clusters or that could predict linkage conservation in S. cerevisiae. These predictors are the following: met, metabolic relationship [9, 10, 20]; cex, gene co-expression; igd, physical proximity (that is, intergenic distance); let, density of lethals (that is, local essential gene density); rec, recombination rate; cre, gene co-regulation (number of common regulatory motifs between two genes); and pro, distance in the protein-protein interaction network [11, 12].
In asking whether the above parameters predict gene order conservation, we could be asking two different questions. First, we could ask whether gene pairs of a particular class, given they are of the same class, are preserved as immediate neighbors more than those not in the same class. Second, we can ask whether, in determining which genes are preserved in linkage, the fact of belonging to the same class can explain much of the variation. The difference in analysis can be easily illustrated. Consider that there were just two genes in the genome that belonged to a given class (perhaps there are just two genes involved in a given cell process, process X). Consider also that these two genes were always preserved in linkage for some reason. At first sight process X looks like a strong predictor, as, given that two genes both belong to class X and are neighbors in one species, we can be sure they are neighbors in another. If we approach the analysis using the first method we would conclude that belonging to class X was important. However, as a variable to explain patterns of conservation or not of gene pairs in general, it explains very little of the conservation of gene order (just one pair) and most conservation of synteny has nothing to do with belonging to class X or not. The second mode of analysis would report that belonging to class X is not an important variable. A priori then, we expect the answers to depend on precisely what questions we ask.
We concentrate predominantly on the second mode of analysis. We take two broad approaches. First, for given pairwise comparisons (S. cerevisiae versus other species) we ask about statistical models that act to explain the variation between gene pairs as to whether they are syntenic (immediate neighbors) in both species or not. Second, we ask about the properties of gene pairs that are syntenic in all of the species concerned. While the first question allows us to ask whether predictors of conservation of gene order are dependent on the taxa compared and their phylogenetic distance, the second mode permits us to distil the properties that enable gene order conservation in the long term.
Determinants of gene order conservation
Quantifying predictor relevance in single species
We use multivariate analysis to disentangle the contribution of each of the previous factors to gene order conservation. The general idea is to describe the relationship between a dependent variable, the response, and a group of independent variables, the predictors, by means of a multiple regression model. In our case the response variable takes two discrete values; an adjacent gene pair in S. cerevisiae could be found as adjacent or nonadjacent in a given comparator species, so we apply logistic regression . We consider several complementary strategies to estimate the relevance of each linkage predictor. In addition, since the correlation between some of the determinants could also be relevant, for example, essential clusters having low recombination rates , a phenomenon termed multicollinearity in regression modeling, we compute the correlation matrix of the parameter estimates in the logistic equation to quantify this effect (Materials and methods). The final outcome of all these combined studies is the simplest logistic model capable of predicting the observed conservation patterns.
S. cerevisiae versus C. glabrata logistic regression analyses
The relationship between the probability that an adjacent pair in S. cerevisiae was adjacent also in C. glabrata, Pr, and their intergenic distance and co-regulation is:
logit Pr = 3.028 - 0.001 igd - 0.526 cre
where the logit transformation is given by logit Pr = ln[Pr/(1 - Pr)], and igd and cre denote intergenic distance (in units of base-pairs) and co-regulation score (with a maximum value of 1), respectively. The model indicates that the probability of being adjacent in both species decreases with intergenic distance and co-regulation, the latter being the weaker of the two determinants. The relevance of each variable is easily determined by comparing the coefficients in Table 2, where variables were scaled in standard deviations (standardized data). A higher absolute value of an estimate in these units, and its order of appearance in the stepwise regression, reflects this relevance. Moreover, the previous model gives us the effect of a change in one determinant when controlling for the other. Thus, an increase in intergene distance of one base-pair, for a fixed co-regulation score, multiplies the odds of adjacency by exp(-0.001) = 0.999, while the maximal effect of an increase in co-regulation, controlling for spacer, is exp(-0.526) = 0.591 with cre = 1. Are these effects independent? Analyzing the correlation between estimates, we find some dependence between both predictors (correlation of cre-igd coefficients: -0.19). We could in turn add an additional term in the model to account for this interaction. However, the decrease in deviance achieved by this more complex model is small, so we can still consider the two-component model as a valid description. Overall, this corroborates that an increase in intergene distance diminishes the probability that genes are adjacent in both species. This also suggests that non-adjacently conserved pairs exhibit stronger co-regulation, which is, at first sight, a counter-intuitive result. Analyzing this behavior in detail (data not shown), we find that high co-regulation scores are associated with a low density of regulatory motifs, that is, regulation by a small number of common transcriptional factors. It is probably this low density of regulatory sites that ensures that any re-arrangement is less likely to be opposed by purifying selection.
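As a worked illustration of how the coefficients of the C. glabrata model translate into odds ratios and probabilities, the following sketch evaluates the fitted equation directly. The function names are hypothetical, and the coefficients are the rounded values printed in the text.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse of the logit transformation."""
    return 1 / (1 + math.exp(-x))

def pr_adjacent(igd, cre, b0=3.028, b_igd=-0.001, b_cre=-0.526):
    """Probability that an S. cerevisiae pair is also adjacent in C. glabrata."""
    return inv_logit(b0 + b_igd * igd + b_cre * cre)

def odds(p):
    return p / (1 - p)

# Odds ratios: the factor by which the odds of adjacency change.
odds_ratio_igd = math.exp(-0.001)   # per extra base-pair of spacer
odds_ratio_cre = math.exp(-0.526)   # co-regulation score going from 0 to 1

print(round(odds_ratio_igd, 3), round(odds_ratio_cre, 3))
print(round(pr_adjacent(igd=500, cre=0.0), 3))
```

Because the model is linear on the logit scale, the ratio `odds(pr_adjacent(x, 1)) / odds(pr_adjacent(x, 0))` equals `odds_ratio_cre` for any spacer length `x`, which is what "controlling for spacer" means here.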
Predictors of linkage conservation in a yeast lineage
How would the previous model change with the comparator species? To analyze this question, we consider five comparator species: C. glabrata (discussed above), Saccharomyces castelli, Kluyveromyces waltii, Kluyveromyces lactis, and Ashbya gossypii. We apply the same methodology as in the previous section (Additional data file 1) to obtain the corresponding reduced logistic models. These models include only those terms which significantly contributed to explain the conservation pattern.
Logistic models of gene order conservation for different comparator species
C. glabrata: logit Pr = 3.028 - 0.001 igd - 0.526 cre
S. castelli: logit Pr = 2.246 - 0.0005 igd
K. waltii: logit Pr = 1.159 + 0.741 cex - 0.001 igd - 0.422 rec
K. lactis: logit Pr = 0.62 + 1.047 cex - 0.001 igd
A. gossypii: logit Pr = 0.753 + 0.849 cex - 0.001 igd
Thus, an 'averaged' adjacent pair in S. cerevisiae with zero intergenic distance is less likely to be found as adjacent in a given comparator as phylogenetic distance increases. While this is, naturally, trivial, it goes some way to validating the method. More interesting is how this behavior changes when these pair types have non-zero intergenic distances. For a characteristic spacer of 500 bp, the probabilities of remaining adjacent are (0.93, 0.85, 0.75, 0.53, 0.56), which correspond to a percentage decrease with respect to the zero-distance values of ~(2.1%, 6.6%, 9.6%, 18.5%, 17.6%). Gene pairs with a large intergene distance should then be disproportionately more likely to be conserved as adjacent the closer the comparator species is to S. cerevisiae. Put differently, as the time to the common ancestor increases, the probability that the genes are not in synteny in the comparator rises with intergene distance at an accelerating rate.
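The comparison between zero and 500 bp spacers can be sketched by evaluating each reduced model with all covariates other than igd set to zero. That setting, and the assignment of equations to species in the order the comparators are listed in the text, are assumptions of this example. The printed coefficients are rounded, so the output need not match the figures quoted above exactly.

```python
import math

# (intercept, igd coefficient per bp); other model terms are set to zero here.
MODELS = {
    "C. glabrata": (3.028, -0.001),
    "S. castelli": (2.246, -0.0005),
    "K. waltii": (1.159, -0.001),
    "K. lactis": (0.62, -0.001),
    "A. gossypii": (0.753, -0.001),
}

def pr(intercept, b_igd, igd):
    """Probability of remaining adjacent for a given spacer length (bp)."""
    return 1 / (1 + math.exp(-(intercept + b_igd * igd)))

table = {name: (pr(b0, b, 0), pr(b0, b, 500)) for name, (b0, b) in MODELS.items()}

for name, (p0, p500) in table.items():
    drop = 100 * (p0 - p500) / p0          # percentage decrease
    print(f"{name:12s} igd=0: {p0:.2f}  igd=500: {p500:.2f}  drop: {drop:.1f}%")
```

The relative drop is larger where the intercept is smaller, because the same shift on the logit scale moves the probability most when it is near 0.5.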
Co-expression and intergenic distance act as the two main determinants of order conservation in pre-WGD species (clustering of essential genes near the adjacent pairs appears as a third determinant in K. waltii). These variables appear to be independent since the correlation of their corresponding estimates is low in all proposed models (<0.08 in all three pre-WGD comparators). To analyze the role of co-expression in more detail, we discretize the co-expression values so that a unit increase in the model corresponds to an increase of 0.1 in the correlation (estimates did not change very much with respect to those in Table 2). What is the effect of an increase in co-expression? This is given by the exponential of the corresponding coefficient in the logistic model. For K. waltii, this reads as exp(0.78) = 2.18, which indicates that, controlling for intergene distance and recombination rate, each increase of 0.1 in the correlation multiplies the odds that a pair remains adjacent by 2.18 (slightly higher values apply to K. lactis and A. gossypii).
Properties of gene pairs preserved throughout yeast evolution
Given that clusters of essential genes also have low recombination rates, we can, in addition, ask whether the retention of synteny of those gene pairs in the middle of essential gene clusters is due to the low recombination rate. In wheat, for example, it is observed that chromosomal domains associated with low recombination rates also have low re-arrangement rates , potentially consistent with a model in which recombination is associated with the generation of re-arrangements. By contrast, recent simulations suggest that selection to preserve essential genes in chromosomal domains of low gene expression noise (open chromatin) will result in low rates of disruption of gene order in essential gene clusters, independent of any effect of recombination . To ask whether essential gene clusters are conserved owing to their low recombination rates, we considered two subclasses: those gene pairs with very few essential genes in their vicinity (the 'low' group: N = 385) and those with many (the 'high' group: N = 34). Next, within each group we ask about the recombination rate of those pairs conserved in synteny across all lineages and those not so. If recombination is an independent predictor, then in both the high and low groups the recombination rate of those in conserved synteny should be lower. For the 83 pairs in the low group conserved as a pair, the recombination rate is slightly lower than that of the low group as a whole (1.04 versus 1.06), but not significantly so (P = 0.5). In the high class, 10 of the 34 are retained in synteny. These have, if anything, a slightly higher recombination rate than the average for the high group (0.98 versus 0.95), but again, not significantly so (P = 0.47). So, in sum, while clusters of essential genes have low recombination rates, the recombination rate does not in and of itself explain the conservation of synteny. 
This is not to say that what is reported in wheat is wrong nor, indeed, more generally that recombination does not induce re-arrangements. The most important problem in this analysis is that the recombination rate measurements come from current laboratory yeasts while the synteny conservation data spans events over hundreds of millions of years. Problematically, recombination rates are thought to evolve quite fast. To better resolve this issue it might be better to compare telomeric (high recombination rate) and centromeric (low recombination rates) domains, rather than asking about conservation of gene pairs in isolation.
Predictors of linkage conservation and reciprocal gene loss
How would the conservation of synteny of a given pair of neighboring genes be influenced by the processes associated with the WGD event? We focus our attention on two possible effects. First, linkage conservation might be influenced by the fate of the pre-WGD adjacent pairs after the WGD event. We could compare two opposite situations. Either both adjacent pairs have lost the same (orthologue) copy of the corresponding gene, or both remained duplicated in all three post-WGD species. These are actually the most common fates of ancestral loci in yeasts . According to this, one could imagine, for instance, that since a duplicate gene might contribute to perform part of the function originally associated with a single gene (sub-functionalization model), adjacent genes with duplicates could experience less pressure to remain linked, as part of the function is implemented by the duplicate. We would predict then that adjacent pairs resolved as single copy in all post-WGD species would more often be found as adjacent in pre-WGD species. This is indeed what we obtain. Single copy adjacent genes were more likely conserved as adjacent in all pre-WGD species: K. waltii (χ2 = 5.83, P < 0.02, d.f. = 1); K. lactis (χ2 = 5.77, P < 0.02, d.f. = 1); and A. gossypii (χ2 = 5.41, P = 0.02, d.f. = 1).
Alternatively, as deletion of one duplicate is the most common process after the WGD, linkage conservation could be influenced by how this deletion is resolved in the different post-WGD lineages. Divergent classes are those in which some of the genes lost are paralogues in the three post-WGD species, while convergent classes imply that all lost genes are orthologues. This latter class implies a less random choice of gene loss. We find that adjacent pairs both belonging to the convergent class are more conserved than expected in four out of five species: C. glabrata, χ2 = 4.18, P = 0.04; K. waltii, χ2 = 6.64, P = 0.01; K. lactis, χ2 = 9.56, P < 0.01; A. gossypii, χ2 = 6.81, P < 0.01; d.f. = 1 in all cases.
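The χ2 tests reported above are tests on 2 × 2 contingency tables with one degree of freedom. A minimal sketch follows; the counts used in the example are hypothetical, and whether the authors applied a continuity correction is not stated, so none is applied here.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts: rows = single copy / duplicated, columns = adjacent / not.
stat = chi2_2x2(30, 10, 20, 20)
print(round(stat, 3), round(p_value_df1(stat), 4))

# Converting the statistics quoted in the text to P values (d.f. = 1):
for reported in (5.83, 5.77, 5.41, 4.18, 6.64):
    print(reported, round(p_value_df1(reported), 3))
```

For one degree of freedom the tail probability has the closed form erfc(sqrt(χ2/2)), so no statistics library is needed.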
In asking about what factors determine gene order conservation, despite the dependence of the answer on the question, one regularity appears. This is the finding that gene pairs currently with a short intergene spacer are less likely to have been re-arranged. This fits with data from microsporidians in which gene overlap is common and gene order rearrangements are rare . The null model, assuming nothing more than an intolerance to inversions that cut within genes, provides a strikingly good fit for such a simple model. The model was made deliberately simple by not assuming that gene orientation would make a difference and takes no account of the density of functional sites between genes. As noted above, these are unrealistic assumptions. This indeed may explain in part the conservation of gene pairs that are co-expressed, as inversions could, for example, break bidirectional promoters between genes in divergent orientation.
Beyond the role of the intergene spacer, further answers are dependent on just how one asks the question. We can, for example, ask whether gene pairs in a given class tend to be more conserved than gene pairs not in the specified class. For example, gene pairs that specify proteins close in either the metabolic or protein interaction network do tend to be more commonly conserved as neighbors than gene pairs that also specify proteins that feature in the relevant network but are not close in the network. By contrast, if we ask whether network proximity is generally an important predictor of synteny conservation, the answer is no, largely because most proteins do not explicitly feature within the network. Second, when asking about predictors of linkage conservation, the answer depends on which species one is comparing. Close comparators highlight co-regulation, whilst more distant comparators suggest co-expression and maybe the recombination rate (as measured in S. cerevisiae) as important predictors. Analysis of the properties of the gene pairs preserved as a pair in all species points to the density of flanking essential genes as an important predictor, suggesting that essential gene clusters tend to be frozen, as previously noted [15, 18].
That the results are dependent on the species under comparison perhaps reflects a difference in the strength of selection to preserve a class of gene pairs and the commonality of such pairs. Consider, for example, the possibility that the top 2% of co-expressed gene pairs are under very strong selection to remain linked. Would this be transparent in comparisons between closely related species? The answer is probably not. In our close comparators, approximately 90% of gene pairs remain as immediate neighbors. If just the 2% most highly co-expressed genes resist re-arrangement, there may not even have been a single re-arrangement that might have occurred between linked highly co-expressed genes that was rejected by selection. Hence there would be no signal of co-expression as an important factor in linkage conservation. As the distance between comparators increases, however, the resilience of the 2% will start to appear as an ever stronger signal, assuming the co-expression to be both ancestral and under selection (in different ecologies different co-expression profiles might be under selection). In sum, strong but relatively rare selection will be discernible only in distant comparators. Put differently, the more distant comparisons and the analysis of those pairs always conserved home in on the special subclass of genes for which selection acts to preserve the gene order.
Perhaps then relatively little is to be learnt from relatively close comparators as so few re-arrangements will have been sampled. In this context, however, there exists one apparent oddity. In the close species comparisons intergene distance and co-regulation appear as important predictors. However, against expectations, gene pairs with a high level of co-regulation, that is, that share much of the same transcription factor-based regulation, are more, not less, likely to be broken up. When analyzed in detail, however, we find that this strong signal is associated with a low density of regulatory motifs: very high co-regulation scores are disproportionately associated with gene pairs with only one (the same) transcriptional motif, hence a low motif density. It is this low motif density that most likely contributes to the lack of conservation of the gene pairs in the short term.
Even if we assume that longer distance phylogenetic comparisons are best, the yeast analysis suggests that phylogenetic distance alone is not the sole arbiter. Rather than the comparator distance, the duplication event experienced in the lineage seems also to be influencing the fate of adjacent pairs. The potential relaxation of the functional constraints associated with the pair members, because of either being duplicated or being divergently conserved, is reflected in a smaller tendency to remain as immediate neighbors.
The results presented here no doubt do not reflect the full complexity of gene order evolution. For example, while we expect that the absolute rate of gene order evolution should scale monotonically with the amount of intergene spacer, this model fails to make any sense of the much higher re-arrangement rates seen in rodents than in primates , although the low rate seen in chicken is consistent, the chicken genome being relatively compact. We can also ask whether the other forces we have identified might have any general applicability? Prior reports have found that clusters of housekeeping genes in mammalian genomes tend to have preserved synteny  and that essential gene clusters in mice are also conserved . In these instances it will be informative to ask about the relationship between the two parameters (there is a broad overlap between essential genes and housekeeping genes in mammals) and how intergene distance and recombination rate might interrelate. More generally, when more whole genome dispensability data are available it will be interesting to see if the preservation of essential clusters is a common phenomenon and, in turn, ask about the underlying rationale.
Materials and methods
Comparator species and genome data
We used data from the Yeast Gene Order Browser. This collection includes seven hemiascomycete species, four of them having diverged before the whole genome duplication (WGD) event occurred in the lineage. We considered only six for our study: three pre-WGD (A. gossypii, K. lactis and K. waltii) and three post-WGD (C. glabrata, S. castelli and S. cerevisiae). To compute unambiguously whether a given syntenic pair in S. cerevisiae is conserved as syntenic in a comparator species, we analyzed only pairs from a subset of genes termed ancestral loci. Each member of this set corresponds to a locus in a pre-WGD species, or the corresponding duplicated pair of loci in the post-WGD species, and has been defined by homology and genome context information.
We examined the metabolic relationship of a gene pair by means of a metabolic network of S. cerevisiae recently reconstructed using genomic, biochemical and physiological information. More specifically, we considered this network as a graph whose nodes and edges are the metabolic genes and the metabolic reactions, respectively, and quantified the metabolic relationship of a pair by its shortest distance in the network (the graph has 851 nodes, and 294 of them are ancestral loci). We computed how many syntenic pairs belonging to this network (up to three intervening genes between them in S. cerevisiae) are conserved as syntenic (non-syntenic) in C. glabrata. We found 29 genes conserved in C. glabrata, 4 of which are non-syntenic. The mean graph shortest distance of those conserved (not conserved) as syntenic is 3.58 (5). This hints at metabolic network distance as a plausible predictor of linkage conservation, that is, the closer in the graph the more likely to be conserved as a linked pair, corroborating previous studies [9, 20]. For the extended study with all syntenic genes included, we assigned a characteristic distance value to those adjacent pairs without metabolic information. We set this characteristic value to the network mean value (3.83).
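The network-distance predictor can be sketched with a breadth-first search over an adjacency list. The toy graph and gene labels below are hypothetical. Treating disconnected pairs like pairs without network information (assigning the characteristic value) is an assumption of this example, not something the text specifies.

```python
from collections import deque

NETWORK_MEAN = 3.83  # characteristic value for the metabolic network (from the text)

def shortest_distance(graph, source, target):
    """Number of edges on the shortest path, or None if no path exists."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

def pair_distance(graph, g1, g2, default=NETWORK_MEAN):
    """Network distance of a gene pair, falling back to the characteristic value."""
    if g1 not in graph or g2 not in graph:
        return default                    # pair lacks network information
    dist = shortest_distance(graph, g1, g2)
    return default if dist is None else dist

# Hypothetical toy network: a chain A - B - C - D plus an isolated gene E.
toy = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "E": []}
print(pair_distance(toy, "A", "D"), pair_distance(toy, "A", "X"))
```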
Co-expression and intergenic distance
To quantify gene co-expression, we used 40 different sets of genome-wide transcription time series from ExpressDB, as previously compiled. In our analyses, co-expression for a given gene pair thus denotes the mean of the 40 correlation coefficients of mRNA expression, one for each of the 40 experiments. All sequence information was obtained from the Saccharomyces Genome Database.
Density of lethals
We used a list of essential genes included in the Saccharomyces Genome Database, which contains information on a large-scale knockout study. We introduced a score quantifying the number of essential genes around a syntenic pair. For each pair, the density of lethals reads as the mean number of essential genes located at the -3, -2, -1, pair1, pair2, 1, 2, 3 gene coordinates, that is, up to 4 genes in either the 5' to 3' or the 3' to 5' direction around each member of the pair.
We considered a previously published recombination data set, in which recombination rate is estimated by double-strand break analysis. To each gene we assigned a recombination rate. For a syntenic pair we took the mean of the rates of its two genes.
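The density-of-lethals score can be sketched as a sliding window over genes in chromosomal order. Reading "mean number" as the fraction of essential genes in the eight-gene window, and truncating the window at chromosome ends, are assumptions of this example. The function name is hypothetical.

```python
def lethal_density(essential, i, flank=3):
    """Fraction of essential genes around the adjacent pair (i, i + 1).

    essential: list of booleans, one per gene, in chromosomal order.
    The window covers `flank` genes upstream, the two pair members,
    and `flank` genes downstream (truncated at chromosome ends).
    """
    lo = max(0, i - flank)
    hi = min(len(essential), i + 2 + flank)
    window = essential[lo:hi]
    return sum(window) / len(window)

# Hypothetical chromosome of 10 genes in which genes 4 and 5 are essential.
flags = [False] * 10
flags[4] = flags[5] = True
print(lethal_density(flags, 4))
```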
Here |...| denotes the size of the set, ∩ the intersection and m_i the number of regulatory motifs for gene i in the pair.
We used the high confidence data set of multi-validated protein interactions in S. cerevisiae and quantified the protein-protein relationship of a pair by its shortest distance in the network as before. Only 168 adjacent genes are included in this graph (with 1088 nodes). We found 155 genes conserved in S. castelli, 18 of which remained adjacent. The mean graph shortest distance of those conserved (not conserved) as adjacent is 5.12 (5.28). This hints at protein-protein network distance as a plausible predictor of linkage conservation, that is, the closer in the graph the more likely to be conserved as a linked pair. For the extended study with all adjacent genes included, we assigned a characteristic distance value to those adjacent pairs without protein-protein interaction information. We set this characteristic value to the network mean value (5.35).
Logistic regression is a class of multivariate analyses usually applied to describe the dependence of a binary response on a set of variables. While multiple regression usually quantifies the influence of several factors on a continuous dependent variable, logistic regression extends these techniques to the study of qualitative features. In our case, the binary (discrete) response to model is whether an adjacent pair in S. cerevisiae is retained as syntenic in a given comparator. Genomic data are known to be noisy, and their analysis can produce highly significant but totally misleading results. To estimate the relevance of each linkage predictor in a robust way, we therefore considered several complementary strategies: simple (logistic) regression of each of the variables; forward stepwise regression according to the Akaike criterion (a measure of goodness of fit), in which predictors are included in the model only if they improve the goodness of fit; multiple regression, in which all proposed variables are considered; and principal component multiple regression (Additional data file 1). Unless otherwise indicated, we scaled each of the covariates to zero mean and unit variance (standardized data) before carrying out the studies. Correlation between some of the determinants could be relevant, for example, essential clusters having low recombination rates. To quantify the correlation between independent variables, one can compute the correlation matrix of the parameter estimates in the logistic equation. Large values of these correlations indicate that multicollinearity may be complicating the modeling process. Conversely, if the correlations between the estimates are fairly small, removing one variable from the model is not expected to change the coefficients and P values of the other variables much.
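The multiple logistic regression and the correlation matrix of the parameter estimates described above can be sketched as follows. This is a minimal standard-library illustration, not the software used in the study: the Newton-Raphson fit, the use of the sample standard deviation for scaling, the two covariates chosen (igd and cex) and all data values are assumptions made for the example.

```python
from math import exp, sqrt

def standardize(col):
    """Scale a covariate to zero mean and unit (sample) variance."""
    n = len(col)
    mean = sum(col) / n
    sd = sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def probabilities(X, beta):
    """Fitted probability of conservation for each row of X."""
    return [1.0 / (1.0 + exp(-sum(b * x for b, x in zip(beta, row)))) for row in X]

def information(X, p):
    """Fisher information matrix X' W X with W = diag(p (1 - p))."""
    k = len(X[0])
    return [[sum(pi * (1.0 - pi) * row[a] * row[c] for row, pi in zip(X, p))
             for c in range(k)] for a in range(k)]

def fit_logistic(X, y, iters=25):
    """Newton-Raphson fit; returns coefficients and their covariance matrix."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        p = probabilities(X, beta)
        grad = [sum((yi - pi) * row[j] for row, yi, pi in zip(X, y, p))
                for j in range(k)]
        step = solve(information(X, p), grad)
        beta = [b + s for b, s in zip(beta, step)]
    # Covariance of the estimates: inverse of the information at the optimum.
    info = information(X, probabilities(X, beta))
    cols = [solve(info, [1.0 if i == j else 0.0 for i in range(k)])
            for j in range(k)]
    cov = [[cols[j][i] for j in range(k)] for i in range(k)]
    return beta, cov

def correlation_of_estimates(cov):
    """Correlation matrix of the parameter estimates."""
    k = len(cov)
    return [[cov[i][j] / sqrt(cov[i][i] * cov[j][j]) for j in range(k)]
            for i in range(k)]

# Hypothetical data: intergene distance (igd), co-expression (cex), and
# whether the adjacent pair is conserved in the comparator (1) or not (0).
igd = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1, 1.3, 1.6, 1.8, 2.2]
cex = [0.5, 0.1, 0.3, 0.0, 0.4, 0.2, 0.1, 0.3, 0.0, 0.2]
conserved = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

X = [[1.0, a, b] for a, b in zip(standardize(igd), standardize(cex))]
beta, cov = fit_logistic(X, conserved)   # beta = [intercept, igd, cex]
corr = correlation_of_estimates(cov)     # off-diagonals flag multicollinearity
```

Because the covariates are standardized, the magnitudes of the fitted coefficients can be compared directly, and a large off-diagonal entry in `corr` would suggest that two predictors are hard to separate.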
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is an extended discussion of the regression and null models corresponding to several comparator species.
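Abbreviations
- cex: co-expression
- cre: co-regulation
- igd: intergene distance
- let: density of lethals
- met: distance in the metabolic network
- pro: distance in the protein-protein interaction network
- rec: recombination rate
- WGD: whole genome duplication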
We thank Oscar Rueda, Andrés Cañada and Kevin Byrne for technical assistance and valuable discussions. This research has been partially supported by the Spanish Ministerio de Educación y Ciencia Grant FIS2006-10368 (JFP) and the UK Biotechnology and Biological Sciences Research Council (LDH).
- Cavalier-Smith T: Evolution of the eukaryotic genome. The Eukaryotic Genome: Organization and Regulation. Edited by: Broda P, Oliver S, Sims P. 1993, Cambridge: Cambridge University Press, 333-385.
- Maynard-Smith J: Evolutionary Genetics. 1998, Oxford: Oxford University Press, 2nd edition.
- Hurst LD, Pál C, Lercher MJ: The evolutionary dynamics of eukaryotic gene order. Nat Rev Genet. 2004, 5: 299-310. 10.1038/nrg1319.
- Lercher MJ, Urrutia AO, Hurst LD: Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat Genet. 2002, 31: 180-183. 10.1038/ng887.
- Lercher MJ, Urrutia AO, Pavlícek A, Hurst LD: A unification of mosaic structures in the human genome. Hum Mol Genet. 2003, 12: 2411-2415. 10.1093/hmg/ddg251.
- Caron H, van Schaik B, van der Mee M, Baas F, Riggins G, van Sluis P, Hermus MC, van Asperen R, Boon K, Voûte PA, et al: The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science. 2001, 291: 1289-1292. 10.1126/science.1056794.
- Versteeg R, van Schaik BD, van Batenburg MF, Roos M, Monajemi R, Caron H, Bussemaker HJ, van Kampen AH: The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res. 2003, 13: 1998-2004. 10.1101/gr.1649303.
- Cohen BA, Mitra RD, Hughes JD, Church GM: A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression. Nat Genet. 2000, 26: 183-186. 10.1038/79896.
- Lee JM, Sonnhammer EL: Genomic gene clustering analysis of pathways in eukaryotes. Genome Res. 2003, 13: 875-882. 10.1101/gr.737703.
- Wong S, Wolfe KH: Birth of a metabolic gene cluster in yeast by adaptive gene relocation. Nat Genet. 2005, 37: 777-782. 10.1038/ng1584.
- Teichmann SA, Veitia RA: Genes encoding subunits of stable complexes are clustered on the yeast chromosomes: an interpretation from a dosage balance perspective. Genetics. 2004, 167: 2121-2125. 10.1534/genetics.103.024505.
- Poyatos JF, Hurst LD: Is optimal gene order impossible?. Trends Genet. 2006, 22: 420-423. 10.1016/j.tig.2006.06.003.
- Képès F: Periodic epi-organization of the yeast genome revealed by the distribution of promoter sites. J Mol Biol. 2003, 329: 859-865. 10.1016/S0022-2836(03)00535-7.
- Lercher MJ, Hurst LD: Co-expressed yeast genes cluster over a long range but are not regularly spaced. J Mol Biol. 2006, 359: 825-831. 10.1016/j.jmb.2006.03.051.
- Pál C, Hurst LD: Evidence for co-evolution of gene order and recombination rate. Nat Genet. 2003, 33: 392-395. 10.1038/ng1111.
- Batada N, Hurst LD: Evolution of chromosome organization driven by selection for reduced gene expression noise. Nat Genet. 2007, 39: 945-949. 10.1038/ng2071.
- Hurst LD, Williams EJ, Pál C: Natural selection promotes the conservation of linkage of co-expressed genes. Trends Genet. 2002, 18: 604-606. 10.1016/S0168-9525(02)02813-5.
- Fischer G, Rocha EP, Brunet F, Vergassola M, Dujon B: Highly variable rates of genome rearrangements between hemiascomycetous yeast lineages. PLoS Genet. 2006, 2: e32. 10.1371/journal.pgen.0020032.
- McLysaght A, Seoighe C, Wolfe K: High frequency of inversions during eukaryote gene order evolution. Comparative Genomics. Edited by: Sankoff D, Nadeau JH. 2000, Dordrecht: Kluwer Academic Publishers, 47-58.
- Kharchenko P, Church GM, Vitkup D: Expression dynamics of a cellular metabolic network. Mol Syst Biol. 2005, 1: 2005.0016. 10.1038/msb4100023.
- Kruglyak S, Tang H: Regulation of adjacent yeast genes. Trends Genet. 2000, 16: 109-111. 10.1016/S0168-9525(99)01941-1.
- Byrne KP, Wolfe KH: Visualizing syntenic relationships among the hemiascomycetes with the Yeast Gene Order Browser. Nucleic Acids Res. 2006, 34 (Database issue): D452-455. 10.1093/nar/gkj041.
- Scannell DR, Byrne KP, Gordon JL, Wong S, Wolfe KH: Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature. 2006, 440: 341-345. 10.1038/nature04562.
- Agresti A: Categorical Data Analysis. 2002, New Jersey: John Wiley and Sons, 2nd edition.
- Akhunov ED, Akhunova AR, Linkiewicz AM, Dubcovsky J, Hummel D, Lazo G, Chao S, Anderson OD, David J, Qi L, et al: Synteny perturbations between wheat homoeologous chromosomes caused by locus duplications and deletions correlate with recombination rates. Proc Natl Acad Sci USA. 2003, 100: 10836-10841. 10.1073/pnas.1934431100.
- Slamovits CH, Fast NM, Law JS, Keeling PJ: Genome compaction and stability in microsporidian intracellular parasites. Curr Biol. 2004, 14: 891-896. 10.1016/j.cub.2004.04.041.
- Bourque G, Zdobnov EM, Bork P, Pevzner PA, Tesler G: Comparative architectures of mammalian and chicken genomes reveal highly variable rates of genomic rearrangements across different lineages. Genome Res. 2005, 15: 98-110. 10.1101/gr.3002305.
- Singer GA, Lloyd AT, Huminiecki LB, Wolfe KH: Clusters of co-expressed genes in mammalian genomes are conserved by natural selection. Mol Biol Evol. 2005, 22: 767-775. 10.1093/molbev/msi062.
- Hentges KE, Pollock DD, Liu B, Justice MJ: Regional variation in the density of essential genes in mice. PLoS Genet. 2007, 3: e72. 10.1371/journal.pgen.0030072.
- Förster J, Famili I, Fu P, Palsson BO, Nielsen J: Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Res. 2003, 13: 244-253. 10.1101/gr.234503.
- Kafri R, Bar-Even A, Pilpel Y: Transcription control reprogramming in genetic backup circuits. Nat Genet. 2005, 37: 295-299. 10.1038/ng1523.
- Ball CA, Jin H, Sherlock G, Weng S, Matese JC, Andrada R, Binkley G, Dolinski K, Dwight SS, Harris MA, et al: Saccharomyces Genome Database provides tools to survey gene expression and functional analysis data. Nucleic Acids Res. 2001, 29: 80-81. 10.1093/nar/29.1.80.
- Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, Andre B, Bangham R, Benito R, Boeke JD, Bussey H, et al: Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science. 1999, 285: 901-906. 10.1126/science.285.5429.901.
- Gerton JL, DeRisi J, Shroff R, Lichten M, Brown PO, Petes TD: Inaugural article: Global mapping of meiotic recombination hotspots and coldspots in the yeast Saccharomyces cerevisiae. Proc Natl Acad Sci USA. 2000, 97: 11383-11390. 10.1073/pnas.97.21.11383.
- Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002, 298: 799-804. 10.1126/science.1075090.
- Batada NN, Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hurst LD, Tyers M: Stratus not altocumulus: A new view of the yeast protein interaction network. PLoS Biol. 2006, 4: e317. 10.1371/journal.pbio.0040317.
- Drummond DA, Raval A, Wilke CO: A single determinant dominates the rate of yeast protein evolution. Mol Biol Evol. 2006, 23: 327-337. 10.1093/molbev/msj038.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.