Identification of co-regulated transcripts affecting male body size in Drosophila

Factor analysis is applied to microarray data in order to relate gene networks to complex traits and identifies a factor associated with body size in Drosophila simulans.


Background
Unraveling complex traits requires an understanding of how genetic variation results in variation among transcript levels, proteins, and metabolites, and how this variation generates phenotypic variation. These distinct levels in the biological system are interdependent. The ability to model interactions among loci at each of these levels, and relationships between levels, is key to providing insight into complex traits. The promise of genomic and proteomic technology is in capturing variation for thousands of loci simultaneously. This affords an unprecedented opportunity to understand the consequences of genetic variation. Many studies have exploited this ability through the use of mutant analysis applied to wholegenome transcript arrays. Mutant analysis provides insight into the impact of a mutation on a gene network and wholegenome studies of transcription have revealed misexpression due to gene knockouts and have established redundancy and specificity of transcriptional regulation [1]. Cluster analysis has been successfully combined with tests of differential expression to study whole-genome response to mutation in order to develop hypotheses about co-regulation and coordinated expression [2,3].
However, the consequences of such strong perturbations are difficult to apply to pathways in non-mutant individuals. In addition, the mutations chosen usually cause a severe alteration in a single gene, such as a knockout. Natural variants introduce smaller changes in pathways [4] and natural variants may exhibit allelic differences at several loci. Natural variation in the transcriptome as a consequence of genetic variation has been demonstrated [5,6]. Natural genotypes can also be mated in a deliberate manner and the progeny of such matings can be used to estimate the genetic architecture of individual traits [7,8], and to link traits across different levels of the biological system [9,10]. We focus here on providing insight into how coordinated gene expression affects phenotype. Links between transcript abundance and phenotypic variation have been established [11][12][13][14][15]. What is needed now is an analytic approach that allows interpretation of the relationships among transcript levels and modeling of the link between transcript level and complex trait.
Factor analysis is an analytic approach that describes the covariation among a set of genes through the estimation of factors. One may interpret the factor as the mechanism, for example transcription factors, microRNAs (miRNAs), and so on, by which genes are co-regulated. The resulting factor model represents sets of coordinately expressed genes. Genes may participate in multiple factors. Principal components analysis, spectral map analysis and correspondence analysis are alternative multivariate techniques for microarray analysis [16,17] that can all be used in this capacity. Factor analysis, however, provides a convenient representation of the gene network by describing each gene's association with the factor as a load (between -1 and 1), where the strength of the load indicates how much influence the transcript level of that gene has upon the factor. The factor can then be examined for associations with complex traits [18]. Factor analysis is the extension of Sewell Wright's work on the correspondence among traits [19], and as such is perfectly suited for modeling the relationships among transcript levels for a set of crosses. The high dimensionality of genome-wide expression data presents special challenges. This challenge, primarily the illconditioned matrices resulting from such studies, has been well described and explicitly acknowledged in much of the literature on the analysis of gene-expression data [20][21][22]. If thousands of genes belonging to dozens of networks are simultaneously considered as current theory indicates, spurious associations may emerge and/or true associations may be obscured [23,24]. Previous applications of factor analyses to array data [25,26] dealt with this issue by an initial reduction of dimensionality through the use of cluster analysis.
Using simulation studies, we evaluate the utility of factor analysis for identifying covariation in gene-expression data and identifying underlying factors. We compare the performance of factor analysis to hierarchical clustering and tight clustering [27]. We then test the estimation of factors on a set of Drosophila lines for genes involved in the immune pathway. The immune system provides a relatively well understood set of interactions and as such allows a real data check on the applicability of factor analysis to microarray data.
A logical next step is to use factor analysis to relate variation among transcript levels to phenotypic variation, a step not possible in a cluster analysis. For Drosophila, body size is a complex trait where latitudinal clines in body size have been repeatedly demonstrated across ectotherms [28]. In D. sub-obscura, a body-size cline evolved in 12 decades, thus ranking body size in flies among the fastest-evolving morphological traits ever observed in natural populations [29]. The proximate reasons for these clines are complex, especially given that body size in flies is positively correlated with mating success in males [30][31][32]. Of further interest are data suggesting that the same genomic regions are involved in adaptation in two of these clines, South America and Australia [33]. In contrast to the immune system, there is little a priori information on how the candidates genes are related to one another. In addition, identification of factors associated with variation in body size in natural populations of Drosophila is a question of great evolutionary interest.

Simulations
In the initial scenario, a sample size of 100 individuals was examined. This sample size is large for a microarray experiment, but is in the low range of the minimum sample size suggested in factor analysis methodology [34]. We simulated a high degree of correlation among genes within a factor (ρ = 0.80), three factors with a manageable number of genes associated within each factor (correlated genes: 30), and some genes not associated with any factor (noise genes: 100). We assume genetically variable lines for which differences in transcript abundance among lines was moderate within each of the three factors [35]. Factor analysis on these data was performed. Factors were identified by examining the eigenvalues of the correlation matrix [23]. The first five eigenvalues were 25.3, 21.3, 18.3, 4.3, and 4.1. The substantial drop between the third and fourth eigenvalues (from 18.3 to 4.3) indicates that three factors (the number simulated) are clearly identified, explaining 34% of the variation. We then set the number of factors in the analysis to three, and estimated factor loadings in order to examine the structure of the factor. All (100%, n = 90) of the correlated genes loaded [36] on the correct factor, with none of the noise genes loading on any factor (see Table 1). Reducing the correlation among factors, and reducing the effect size do not affect the ability of factor analysis to identify the correct underlying structure ( Table 1).
Results of a hierarchical cluster analysis found that the three groups of genes clustered together with the noise genes which formed two distinct clusters. However, discriminating the true clusters from the noise clusters was not obvious using standard approaches. Tight clustering [27], where a resampling strategy is used to separate noise genes from signal, on these data was interesting. If the number of clusters is set to the true value of three, all 190 genes are identified as noise. If the number of clusters is set to five, 45 of the 100 noise genes are correctly identified as noise. All of the correlated genes are placed into the correct clusters. The remaining 55 noise genes are placed into clusters.
For the case with lower effect size and lower correlation, the dendrogram resulting from hierarchical cluster analysis is given in Figure 1. As in the factor analysis, the three groups of genes clustered together well, although not perfectly. Once again, however, statistics for determining the appropriate number of clusters did not clearly identify the correct number of clusters. The noise genes also seem to follow discernible clustering patterns. In tight clustering, when the correct number of clusters are specified and the number of extra clusters (k0) is set to 6-7, 23 of the 90 correlated genes are identified as noise and all of the noise genes are correctly identified. Setting the number of clusters higher results in clusters of noise genes. In this simple case, factor analysis clearly outperforms both traditional hierarchical cluster analysis and tight clustering, as it is easily able to discern the correct number of underlying clusters.
We then increased the number of genes from a total of 190 (90 in the three networks and 100 noise genes) to 1,900 (900 in the three factors and 1,000 noise genes). Using factor analysis, we easily identified the correct number of factors and 100% of the genes in each factor loaded on the correct factor (see Table 1). Lowering the correlation among genes in a factor to ρ = 0.4 resulted in the reduction of the explanatory power of the factor analysis. The number of underlying factors was correctly identified although, as expected, the total variation explained by the factors was reduced. Of the correlated genes, 66% loaded on the correct factors and only one noise gene (out of 1,000) was mistakenly placed into a factor. Given the reasonable fractions identified when the number of genes in factors differs by an order of 10 (190 versus 1,900), and the fact that our recovery of the structure was virtually unchanged, it is apparent that the number of genes in a factor does not impact on the ability of factor analysis to recover the factor structure. In contrast, hierarchical cluster analysis performs less well as the number of noise genes increases, with the noise genes increasing in their dispersion among clusters.
In a set of simulations to match our Drosophila experimental design, 10 genotypes with three replicates per genotype for a total of 30 samples (chips) were simulated. Averaging transcript abundance within each genotype removed uninteresting variation and increased resolution (data not shown). We began with three factors of 30 genes each, and The number of genotypes simulated is given in the first column. The number of underlying latent factors is given in the second column, followed by the number of genes simulated that are not a part of any underlying factor. The number of genes on each factor is given next, and are simulated as a multivariate normal with pairwise correlation among genes within the factor of ρ. The mean for the first genotype is drawn from a gamma distribution, and the subsequent means were drawn from a multivariate normal, with standard deviation of one such that the maximum difference between the means can be interpreted as the genotypic effect size. Thus, for each underlying factor the simulated genotypic effect is the maximum difference in transcript abundance among genotypes for the first, second, and third factor, respectively. Factors are considered to be clearly identified if there is a substantial drop in the eigenvalues of the correlation matrix, and a reasonable proportion of the total variation is explained. 100 noise genes. The maximum difference in transcript abundance among genotypes was large (1, 2 and 3). In this case, the three factors were not clearly identified. To determine whether genes would be correctly placed inside factors, the number of factors was set to three. When difference in transcript abundance among genotypes was lowest, fewer genes were placed in the correct factor (13/30). On the other hand, where difference in transcript abundance among genotypes was highest and 100% of the correlated genes loaded on the correct factor, 29 of the genes from the other two factors and 29 noise genes were erroneously identified as a part of this factor. When correlation among genes within a factor was reduced to ρ = 0.40, the difficulty in identifying genes on the correct factor increased.
We hypothesized that the presence of many noise genes, coincident with a sample size of ten, was responsible for the difficulty in identifying the correct number of gene factors. Without the noise genes, when correlation was ρ = 0.8, each of the underlying factors was clearly identified. All (100%) of the genes were correctly placed in the corresponding factors, although several genes were placed in multiple factors. When correlation among genes in a factor was reduced to ρ = 0.40, the number of factors was underestimated and many of the genes were identified in multiple factors. This indicates a complex interplay between the difference in transcript abundance among genotypes, the sample size, the number of noise genes and the correlation structure in the identification of factors.
We also examined the impact of increasing the number of factors beyond the number of genotypes. We simulated a set of 600 genes belonging to 20 factors, with 30 genes in each factor and 100 noise genes. In an analysis with nine factors (the maximum estimable with ten genotypes), factor 1 seemed to capture the majority of all the genes in the 20 factors. When effect size was large, the majority of the noise genes (58%) were identified by their failure to load highly (greater than 0.7) on any factor, and the majority of genes that were correlated did load highly on at least one factor (96%). Lowering the correlation lowers the ability to identify correlated genes as loading highly. In this case, 30% are identified and approximately the same fraction of noise genes (60%) are identified. Consistent with previous factor analysis theory [23,36], when the number of factors is larger than the sample size the number and composition of the factors cannot be estimated.
We then wanted to determine whether hierarchical cluster analysis would resolve this structure more clearly. We plotted the results in a dendrogram where each of the simulated fac-tors are plotted with a separate color ( Figure 2). No clear pattern of clustering was found. However, the clusters do form some 'kernels', so that biological knowledge of pathways could potentially be applied to interpret some of the groupings, as is common practice. However, the clear presence of noise genes throughout the cluster structure clearly makes interpretation difficult in cases where the true structure is unknown.
We then applied tight clustering to these data. When the correct number of clusters is specified, tight clustering does identify the structure of the 20 clusters. Each cluster consists of a subset of genes that belong to that cluster. Notably, it does not erroneously place noise genes into clusters. It performs less well in identifying the correct number of correlated genes failing to place 50% into clusters, instead classifying these correlated genes as noise. When a larger number of clusters (25) is specified, then there are some 'extra' clusters of noise genes and the clusters of genes are themselves not as distinct, that is 40 genes of the 600 are incorrectly grouped and 221 or 37% of the genes which should be in a cluster are designated as noise.
Overall, the simulation results indicate that focusing on a manageable number of possible factors with a measurable amount of difference in transcript abundance among lines can result in successful identification of factor structure, even when the number of genes examined is large relative to the sample size. We also find that the factor loadings can distinguish noise genes even in complex cases, although the factor structure can not be resolved clearly in those cases.

Data analysis Loci showing evidence for transcript variation
The GeneChip Drosophila Genome Array was used for this study and of the approximately 13,500 genes on the array, 7,886 showed expression on at least one array, and of these, 4,667 showed evidence for variation among genotypes. As we are studying covariation, we restricted our examination to this list of 4,667 loci (see Additional data file 1 [Supplementary Table 1]).

The immune pathway
To provide an assessment of the performance of factor analysis on a set of well characterized genes, FlyBase [37] was queried for candidate genes involved in the immune pathway [38]. We compared the candidate genes to the list of 4,667 transcripts, and 54 genes were identified. Factor analysis on these transcript levels resulted in the identification of three factors (see Figure 3). Notably, the first factor contained all the lysogen genes present in the study, and the second factor Hierarchical cluster plot of simulation with two genotypes, 100 noise genes, and three factors Figure 1 (see previous page) Hierarchical cluster plot of simulation with two genotypes, 100 noise genes, and three factors. ρ = 0.40, effect size = 0.2, 0.4, and 0.6. Blue, noise genes; green, genes from underlying factor 1; red, genes from underlying factor 2; black, genes from underlying factor 3. contained all cecropins (Table 2). Factor-analysis groups coregulated genes in a manner consistent with our understanding of the immune pathway. Hierarchical cluster analysis was also performed on this set of genes. As with the factor analysis, hierarchical cluster analysis found that the lysogen genes clustered together, as did the cecropins. However, determining the appropriate number of clusters was problematic and did not lead to any clear interpretation of the appropriate number of clusters. The final factor model for the immune genes included three factors, therefore we examined a kmeans clustering analysis with three clusters for comparison. We found that 17/24 (71%) of the genes that loaded high on factor 1 were on the same cluster. This cluster also includes a few genes that loaded on factors 2 and 3. Genes that loaded high on the second and third factor were distributed among the remaining two clusters. The genes in these groups that did not load significantly on any factor were distributed among the three clusters.

Candidate loci for body size
We again queried FlyBase for a list of genes involved in body size determination and found 92 body size candidates in our list of 4,667 transcripts. Four factors were identified ( Figure   3). The identification of covarying genes in the same factor was intriguing. Of particular note is the presence of loci that are contained within quantitative trait loci (QTL) for body size on factor 1: Cdk4, trx, akt1, fru, Dr, mask, khc, and InR [33].
In our analysis, we regressed transcript level for individual candidate genes on the body size phenotype. We found 2,892 genes significant at a nominal level of 0.05 (false discovery rate (FDR), 16%, see Table 3). At a more stringent nominal threshold of 0.01, there were 14 candidate genes for body size which showed significant association between transcript abundance and phenotypic variation for male body size (FDR of 7%). Of the QTL candidates, only InR showed significant association with the phenotype.
We then tested the hypothesis that the estimated factors were directly related to phenotypic variation in body size. In our analysis, we regressed the estimated factor (latent variable) on the phenotype body size for each genotype. The regression of factor 1 on body size showed evidence of an association between the factor and the phenotype of body size (P = 0.04, Figure 4a).
Hierarchical cluster plot of simulation with ten genotypes, 100 noise genes, and 20 factors Figure 2 (see previous page) Hierarchical cluster plot of simulation with ten genotypes, 100 noise genes, and 20 factors. ρ = 0.4, effect size = 1, 2,...20. Blue, noise genes; other colors represent genes that should cluster together.

Targets of miRNAs
Of 535 putative miRNA targets [39], 203 were contained in our set of 4,667 gene transcripts. Factor analysis resulted in the identification of four gene factors ( Table 4). The second factor contained four of the same genes as factor 1 for body size (puc, Eh, mys, bon) with 76 additional genes contained in this factor (loaded at 0.40 or greater). However, this factor was not associated with body size (P = 0.55). While some of the QTL candidates are also putative targets of miRNA regulation, (Cdk4, trx, Dr) these genes did not participate in this factor but were common to the factor identified by the third factor (see Table 4), for which 69 additional genes loaded. This third factor was negatively correlated with body size (P = 0.04, see Figure 4b). (Cdk4, trx, Dr) were not associated with body size in regressions between these individual genes and body size.

Discussion
We applied factor analysis to high-dimensional microarray data. Using simulated data to estimate factors, we found that when correlation among genes is strong, the number of factors and their structure can be estimated, even in the case where genes unrelated to the factor structure (noise genes) are included. We also found that when noise genes were included hierarchical cluster analysis was unable to separate the noise genes from the signal, or to correctly identify the number of clusters. When the number of genes is large relative to the sample size, as is common in array studies, the number of factors and the genes belonging to each factor can still be identified, as long as the number of factors is less than the sample size. In contrast, hierarchical cluster analysis did not identify the number of clusters even when the number of clusters was smaller than the sample size. We found that if the number of factors is larger than the number of genotypes, the majority (58%) of noise genes still do not load on any factor, while all but 26 of the 600 correlated genes do load on at least one factor. However, the correct association between individual genes and factors is lost. We conclude that while factor analysis is effective at separating the signal from the noise, the structure of the signal is not estimable. This is consistent with reports in the literature [23,36]. Using hierarchical clustering, the number and the structure of clusters is not recovered and noise genes are scattered throughout the cluster structure. Kernels of tightly correlated genes were visible, however, indicating that kernel identification is possible in cases where biological knowledge is present. Any separation of signal from noise is purely serendipitous.
These results are not unexpected as the mathematical properties of gene-expression data or the high dimensionality of the data lead to problems for any analysis. When the number of columns (in this case observations) is less than the number of rows (or variables), the matrix is considered ill conditioned [40]. Ill-conditioned matrices can cause problems for many types of statistical analysis and can lead to overfitting of predictive models, among other problems [41,42]. Some of the effects of the ill-conditioned matrices can be mitigated; however, the problem of an overdetermined system will always exist. An example is in multiple regression, where models with more variables than we have data can not be fit [43].
Tight clustering [27] represents a significant advance over hierarchical clustering in the estimation of cluster structure for microarray data. It provides a reasonable way of identifying most noise genes. In simple cases, however, the algorithm needs to have some flexibility when specifying starting values; that is, more rather than fewer clusters improve the chances of correctly identifying true clusters. The algorithm requires that the number of clusters be specified a priori. In contrast, in many simple cases the number and structure of factors can be recovered precisely using a factor analysis. In the most complex case examined (20 factors, 600 genes and 100 noise genes), when the number of clusters is correctly specified, tight clustering identifies 100% of the noise genes and the 20 clusters with correct genes within each cluster. However, it also incorrectly identifies 50% of genes with signal as noise genes. If too many clusters are specified (25), then the number of genes identified in clusters increases to 63%, although the correct structure is no longer maintained and 20% of the noise genes are incorrectly clustered. In contrast, factor analysis separates the signal from the noise, correctly identifying 96% of the noise genes as noise and 58% of the genes as having signal. In this complex case it is difficult to say whether factor analysis or tight clustering is   Factor analysis for candidate genes for body size. There were 92 candidate genes and a four factor model was fit. The genes that loaded with a value greater than 0.40 are listed here. For each factor, the first column is the gene symbol name from [37] and the second column is the loading value for that gene. Genes are considered as loading 'significantly' if the absolute value of the loading value is greater than or equal to 0.40. Genes are considered as loading 'high' if the absolute value of the loading value is greater than or equal to 0.70. The third column for each factor is the p-value for the individual gene expression value regression on male body size (p 1 ).  Table 4 Factor analysis for putative targets of miRNAs 'better'. Both of these approaches clearly outperform hierarchical clustering, and they are complementary in their approach. If the primary goal is to separate the correlated genes from the uncorrelated genes then factor analysis performs better than tight clustering. If determining the number and structure of factors (coexpressed genes) is the goal, using both approaches and comparing findings will be reasonable.

Regression plots
Clear network structures can be identified with factor analysis and can then be followed up experimentally. Furthermore, unlike cluster analysis (hierarchical or tight), which provides no summary of the clusters into a single variable, the factor loading values are directly interpretable as the degree of participation of that locus in a factor. In contrast to cluster analysis, the factor loadings for genes give us information on the relative strength of a gene on a factor. We can use factor loadings to identify the most 'significant' genes on a factor and we can use the factor loadings to remove genes that are not contributing to any factor. As such, genes that all load highly on the same factor can be said to be coordinately expressed and putatively co-regulated, and noise genes can be identified as they will not load high on any factor. Unlike clustering, factor analysis allows genes to participate in several factors, thus reflecting biological reality more accurately.
Extensions of factor analysis have been developed and include allowing for nonlinearity [44,45] and the application of Bayesian approaches [46]. Factor analysis is itself a subclass of modeling techniques known as structural equation models [47] and the ongoing theoretical interest in these approaches allows for expansion of consideration to more than the present circumstance.
We focus our Drosophila analyses on a set of genotypes that are mated according to a round-robin design where the parental lines are natural variants. Analyses of such lines allows inferences about the underlying genetic contribution to variation and inferences about variation in pathways that Factor analysis for putative targets of miRNAs. There were 203 candidate genes and a four-factor model was fit. The genes that loaded with a value greater than 0.40 are listed here. For each factor, the first column is the gene symbol name from [37] and the second column is the loading value for that gene. Genes are considered as loading 'significantly' if the absolute value of the loading value is ≥ 0.40. Genes are considered as loading 'high' if the absolute value of loading value is ≥ 0.70. The third column for each factor is the the P-value for the individual gene expression value regression on male body size (p 1 ). Factor analysis for putative targets of miRNAs occurs as a result of such natural genetic variation. By using candidate loci from reverse genetic and mutational projects, the importance of these loci in a broad context can be assessed. The factor analysis for the list of genes annotated as body size candidates resulted in the estimation of four factors. Factor 1 is of great interest as several of the genes that load on this factor -Cdk4, trx, akt1, fru, Dr, mask, woc, Khc, and InR -are contained in QTL for body size [33]. This overlap is exciting as the data from this QTL analyses are independent of our factor analysis. The identification of the same set of loci lends weight to the evidence that these loci are involved with the formation of body size in a natural population. Of these loci, only InR is directly correlated with body size. Given our limited knowledge of pathways for body size, it was exciting to note that two of the genes in this factor -trx and M(2)21ABhave been shown to interact [48]. In addition, the early sexdetermination cascade genes Sxl, tra, and tra2 are all associated with a different factor (factor 2).
If the expression of multiple genes is regulated by a common underlying factor such as a transcription factor or a miRNA, and this regulation exhibits genetic variation, then we expect that the gene expression among these genes in a set of genotypes from a mating design will also be correlated. Factor analysis on putative targets of miRNA control revealed that putative targets of the same miRNA often occurred on the same factor. The number of targets for each miRNA was small, and so we cannot conclude that the coexpression is greater than we would expect by chance. Nevertheless, this is the first opportunity to examine the correlation structure of these putative targets of miRNAs. We found that of the 203 genes identified as miRNA targets, there was evidence for 188 participating in at least one of the four factors. While four genes on factor 1 for body size (puc, Eh, mys, and bon), and nine additional candidate body size genes (Sh, Abd-B, trx, fng, qm, woc, Dr, and Cdk4) are putative targets of miRNAs [39], the resulting miRNA factors are uncorrelated with the factors for body size. One of the miRNA factors is associated with the body size phenotype.
In summary, factor analysis, a technique developed to discover and model underlying mechanisms in complex social and psychiatric situations, seems to offer a reasonable middle ground for gaining understanding of coordinated gene expression. Our simulations show that when the number of underlying factors is larger than the sample size, it is not straightforward to recover the structure of the simulated data, although signal can be separated from noise. In contrast, target groups, even when the number of genes is large, can be used to identify several underlying regulatory mechanisms.
In the case of body size for Drosophila, factor analysis offers an exciting opportunity to estimate gene networks, as relatively little is known about how genes involved in body size work together.

Simulations
In trying to estimate the structure of gene networks, one of the first questions to be addressed is whether the number of genes contributing to each network affects the ability of the analysis to determine the structure of the network. Accordingly, we varied the number of genes examined in our simulations to explore this process (see Table 1). Gene networks were simulated as a set of correlated expression values. These networks represent genes affected by some common underlying biological process and the correlation structure that results from this biological connectedness is the concrete evidence of the network. In our simulations, for each network, a mean from a gamma distribution (γ) is drawn. A gamma distribution was chosen for the mean values across genes as this distribution has been shown to fit the distribution of transcript levels seen from genes on an array [49]. Genes in the network were simulated according to a multivariate normal distribution with correlation among genes indicated by the parameter (ρ). We looked at two levels of correlation ρ = 0.40 (weak) and ρ = 0.80 (strong). Genetic variation for individual genes within a network was simulated as a linear combination of the multivariate normal mean, a fixed genotypic effect and random noise. Genotypic effects for different lines differed in magnitude. The standard deviation was set to 1, and the standard normal was used, so that differences between lines can be directly interpreted as the size of the effect, that is the effect sizes are absolute and not relative to the magnitude of the gene expression (or amount of variation) for a particular gene. When the maximum difference among genotypes is less than 0.2 the effects are small, when differences are between 0.2 and 0.6 the effects are medium and when effects are larger than 0.8 they are relatively large [35]. Between networks the maximum difference of transcript abundance (effect size) among lines was allowed to vary.
Genes not participating in any network, were also simulated. These genes, uncorrelated to each other, are hereafter referred to as noise genes. For noise genes, the mean value for gene expression was drawn from a gamma distribution and variation from that mean was random and without regard to the genotype.
Concurrently, factor analysis on the data was performed and oblique rotations were used for estimating factor loadings [36]. We used the eigenvalues to determine the number of factors according to standard factor analytic approaches [36]. The plot of the eigenvalues, against the ordinal number for the eigenvalue (SCREE plot) is examined for the last substantial drop in the magnitude of the eigenvalues and a model with the same number of factors as the number of eigenvalues before the last substantial drop are retained [23,36]. Hierarchical cluster analysis was performed using a Ward distance. Tight clustering, a resampling-based approach to cluster analysis, was performed using software provided by [27].

Drosophila lines
Isogenic lines of Drosophila simulans were made from flies caught in Wolfskill orchard [50] and crossed in a round-robin mating design as follows: the ten parental stocks (randomly sampled and independently derived from a large natural reference population) were crossed together in ten combinations to create heterozygous lines such that each parent was present twice, once as a dam and once as a sire. Crosses were (dam × sire): SIM6 × SIM11, SIM11 × SIM36, SIM36 × SIM 37, SIM37 × SIM43, SIM43 × SIM61, SIM61 × SIM77, SIM77 × SIM85, SIM85 × SIM99, SIM99 × SIM105, and SIM105 × SIM6. The resulting heterozygous genotypes were used for the experiment as follows: RNA from males was extracted and labeled 4-7 days post-eclosion [6,50]. Transcript level was estimated the average difference using MAS 5 [51]. Genes without positive signal on at least one array were removed from further consideration. The remaining genes were normalized to the chip median and log transformed. Genes lacking variation among genotypes cannot be meaningfully assessed for covariation. Accordingly, any gene that showed evidence (P ≤ 0.2) for variation in transcript level across lines [50] or evidence for additive genetic effect (P ≤ 0.05) [6] was considered in further analyses. Adult male flies were measured for body size [52].

Drosophila data analysis
Candidate gene lists were developed using FlyBase queries. The miRNA targets were taken from Enright and colleagues [39]. The resulting lists were matched against the set of loci for which we had evidence of genetic variation in transcript abundance among lines.
In the first step of the data analysis, regression of transcript abundance for individual genes that were candidates for body size, or putative miRNA targets, was conducted to test the hypothesis that individual genes transcript levels were associated with body size. The average transcript abundance for each gene i within genotype j (g ij ) was regressed on the average male body size (Y j ) for each genotype j as follows: The mean of all genotypes is µ and the random error is ε. Cluster analysis was performed on the set of immune genes. This was done using a hierarchical cluster analysis on the standardized values of gene expression. The results were plotted in a dendrogram.
Factor analysis on standardized values of gene expression, for each of the three lists (genes in the immune pathway, candidates for body size and putative miRNA targets) was conducted separately and factor loadings were estimated using an oblique rotation. The resulting set of eigenvalues was plotted in a SCREE plot, and the number of factors chosen such that the drop between eigenvalues was apparent, and a reasonable proportion of the variation was explained [36]. Each factor identified represents a gene network. Once the number of factors was identified, factor analysis was repeated for that fixed number of factors and loading values (the correlation between individual genes and the estimated factor structure) were estimated. All factor analyses were conducted in SAS (PROC FACTOR) using standard options, no special coding was required.
The effect of the factor upon the genotype is also estimable in the form of a factor value. Factor values were computed as a linear combination of the factor loading times the standardized value for the gene expression [18,23]. The factor value for genotype j (GN j ) is an estimate of the impact of the network upon the genotype and is estimated as where g ij is the expression value for gene i within genotype j and l i is the loading value estimated from the factor analysis for gene i across all genotypes. Differences in factor values represent differences between genotypes for the factor.
As all the genotypes are related in a mating design, we expect that differences in networks due to genetic variation which result in phenotypic variation should be detectable. Accordingly, the factor values were then regressed on the phenotype where Y j = µ + GN j + ε j (2) where GN j is the factor value for genotype j.

Additional data files
The following additional data is available with the online version of this paper. Additional data file 1 lists the median values of gene expression for each of the ten lines (after normalization) for the 4,667 genes found to have some evidence of a line effect. Some basic annotation information from Affymetrix is also provided along with the feature number.
Additional File 1 Line 1 through Line 10 medians for the 4667 genes considered in this analysis, after normalization (line1 median ) and standardization (line1 median_st ). Some basic annotation is also included as well as the Affymetrix feature number http://www.Flybase.com, Fall 2004) Line 1 through Line 10 medians for the 4667 genes considered in this analysis, after normalization (line1 median ) and standardization (line1 median_st ). Some basic annotation is also included as well as the Affymetrix feature number http://www.Flybase.com, Fall 2004).
The median values of gene expression for each of the ten lines (after normalization) are provided for the 4,667 genes found to have some evidence of a line effect. These are labeled as line1_median-line10_median. Standardized values are also provided and labeled as line1_median _st-line1-_median_st. Some basic annotation information from Affymetrix is also provided along with the feature number. The chromosome number was inferred based upon the cytogenetic map position. Click here for file