Identification of co-regulated transcripts affecting male body size in Drosophila

Coffman, Cynthia J; Wayne, Marta L; Nuzhdin, Sergey V; Higgins, Laura A; McIntyre, Lauren M

doi:10.1186/gb-2005-6-6-r53

Method
Open access
Published: 01 June 2005

Identification of co-regulated transcripts affecting male body size in Drosophila

Cynthia J Coffman^1,2,
Marta L Wayne³,
Sergey V Nuzhdin⁴,
Laura A Higgins³ &
…
Lauren M McIntyre^2,5

Genome Biology volume 6, Article number: R53 (2005) Cite this article

6580 Accesses
18 Citations
Metrics details

Abstract

Factor analysis is an analytic approach that describes the covariation among a set of genes through the estimation of 'factors', which may be, for example, transcription factors, microRNAs (miRNAs), and so on, by which the genes are co-regulated. Factor analysis gives a direct mechanism by which to relate gene networks to complex traits. Using simulated data, we found that factor analysis clearly identifies the number and structure of factors and outperforms hierarchical cluster analysis. Noise genes, genes that are not correlated with any factor, can be distinguished even when factor structure is complex. Applied to body size in Drosophila simulans, an evolutionarily important complex trait, a factor was directly associated with body size.

Background

Unraveling complex traits requires an understanding of how genetic variation results in variation among transcript levels, proteins, and metabolites, and how this variation generates phenotypic variation. These distinct levels in the biological system are interdependent. The ability to model interactions among loci at each of these levels, and relationships between levels, is key to providing insight into complex traits. The promise of genomic and proteomic technology is in capturing variation for thousands of loci simultaneously. This affords an unprecedented opportunity to understand the consequences of genetic variation. Many studies have exploited this ability through the use of mutant analysis applied to whole-genome transcript arrays. Mutant analysis provides insight into the impact of a mutation on a gene network and whole-genome studies of transcription have revealed misexpression due to gene knockouts and have established redundancy and specificity of transcriptional regulation [1]. Cluster analysis has been successfully combined with tests of differential expression to study whole-genome response to mutation in order to develop hypotheses about co-regulation and coordinated expression [2, 3].

However, the consequences of such strong perturbations are difficult to apply to pathways in non-mutant individuals. In addition, the mutations chosen usually cause a severe alteration in a single gene, such as a knockout. Natural variants introduce smaller changes in pathways [4] and natural variants may exhibit allelic differences at several loci. Natural variation in the transcriptome as a consequence of genetic variation has been demonstrated [5, 6]. Natural genotypes can also be mated in a deliberate manner and the progeny of such matings can be used to estimate the genetic architecture of individual traits [7, 8], and to link traits across different levels of the biological system [9, 10]. We focus here on providing insight into how coordinated gene expression affects phenotype. Links between transcript abundance and phenotypic variation have been established [11–15]. What is needed now is an analytic approach that allows interpretation of the relationships among transcript levels and modeling of the link between transcript level and complex trait.

Factor analysis is an analytic approach that describes the covariation among a set of genes through the estimation of factors. One may interpret the factor as the mechanism, for example transcription factors, microRNAs (miRNAs), and so on, by which genes are co-regulated. The resulting factor model represents sets of coordinately expressed genes. Genes may participate in multiple factors. Principal components analysis, spectral map analysis and correspondence analysis are alternative multivariate techniques for microarray analysis [16, 17] that can all be used in this capacity. Factor analysis, however, provides a convenient representation of the gene network by describing each gene's association with the factor as a load (between -1 and 1), where the strength of the load indicates how much influence the transcript level of that gene has upon the factor. The factor can then be examined for associations with complex traits [18]. Factor analysis is the extension of Sewell Wright's work on the correspondence among traits [19], and as such is perfectly suited for modeling the relationships among transcript levels for a set of crosses. The high dimensionality of genome-wide expression data presents special challenges. This challenge, primarily the ill-conditioned matrices resulting from such studies, has been well described and explicitly acknowledged in much of the literature on the analysis of gene-expression data [20–22]. If thousands of genes belonging to dozens of networks are simultaneously considered as current theory indicates, spurious associations may emerge and/or true associations may be obscured [23, 24]. Previous applications of factor analyses to array data [25, 26] dealt with this issue by an initial reduction of dimensionality through the use of cluster analysis.

Using simulation studies, we evaluate the utility of factor analysis for identifying covariation in gene-expression data and identifying underlying factors. We compare the performance of factor analysis to hierarchical clustering and tight clustering [27]. We then test the estimation of factors on a set of Drosophila lines for genes involved in the immune pathway. The immune system provides a relatively well understood set of interactions and as such allows a real data check on the applicability of factor analysis to microarray data.

A logical next step is to use factor analysis to relate variation among transcript levels to phenotypic variation, a step not possible in a cluster analysis. For Drosophila, body size is a complex trait where latitudinal clines in body size have been repeatedly demonstrated across ectotherms [28]. In D. subobscura, a body-size cline evolved in 12 decades, thus ranking body size in flies among the fastest-evolving morphological traits ever observed in natural populations [29]. The proximate reasons for these clines are complex, especially given that body size in flies is positively correlated with mating success in males [30–32]. Of further interest are data suggesting that the same genomic regions are involved in adaptation in two of these clines, South America and Australia [33]. In contrast to the immune system, there is little a priori information on how the candidates genes are related to one another. In addition, identification of factors associated with variation in body size in natural populations of Drosophila is a question of great evolutionary interest.

Results

Simulations

In the initial scenario, a sample size of 100 individuals was examined. This sample size is large for a microarray experiment, but is in the low range of the minimum sample size suggested in factor analysis methodology [34]. We simulated a high degree of correlation among genes within a factor (ρ = 0.80), three factors with a manageable number of genes associated within each factor (correlated genes: 30), and some genes not associated with any factor (noise genes: 100). We assume genetically variable lines for which differences in transcript abundance among lines was moderate within each of the three factors [35]. Factor analysis on these data was performed. Factors were identified by examining the eigenvalues of the correlation matrix [23]. The first five eigenvalues were 25.3, 21.3, 18.3, 4.3, and 4.1. The substantial drop between the third and fourth eigenvalues (from 18.3 to 4.3) indicates that three factors (the number simulated) are clearly identified, explaining 34% of the variation. We then set the number of factors in the analysis to three, and estimated factor loadings in order to examine the structure of the factor. All (100%, n = 90) of the correlated genes loaded [36] on the correct factor, with none of the noise genes loading on any factor (see Table 1). Reducing the correlation among factors, and reducing the effect size do not affect the ability of factor analysis to identify the correct underlying structure (Table 1).

Table 1 Gene expression simulations

Full size table

Results of a hierarchical cluster analysis found that the three groups of genes clustered together with the noise genes which formed two distinct clusters. However, discriminating the true clusters from the noise clusters was not obvious using standard approaches. Tight clustering [27], where a resampling strategy is used to separate noise genes from signal, on these data was interesting. If the number of clusters is set to the true value of three, all 190 genes are identified as noise. If the number of clusters is set to five, 45 of the 100 noise genes are correctly identified as noise. All of the correlated genes are placed into the correct clusters. The remaining 55 noise genes are placed into clusters.

For the case with lower effect size and lower correlation, the dendrogram resulting from hierarchical cluster analysis is given in Figure 1. As in the factor analysis, the three groups of genes clustered together well, although not perfectly. Once again, however, statistics for determining the appropriate number of clusters did not clearly identify the correct number of clusters. The noise genes also seem to follow discernible clustering patterns. In tight clustering, when the correct number of clusters are specified and the number of extra clusters (k0) is set to 6-7, 23 of the 90 correlated genes are identified as noise and all of the noise genes are correctly identified. Setting the number of clusters higher results in clusters of noise genes. In this simple case, factor analysis clearly outperforms both traditional hierarchical cluster analysis and tight clustering, as it is easily able to discern the correct number of underlying clusters.

We then increased the number of genes from a total of 190 (90 in the three networks and 100 noise genes) to 1,900 (900 in the three factors and 1,000 noise genes). Using factor analysis, we easily identified the correct number of factors and 100% of the genes in each factor loaded on the correct factor (see Table 1). Lowering the correlation among genes in a factor to ρ = 0.4 resulted in the reduction of the explanatory power of the factor analysis. The number of underlying factors was correctly identified although, as expected, the total variation explained by the factors was reduced. Of the correlated genes, 66% loaded on the correct factors and only one noise gene (out of 1,000) was mistakenly placed into a factor. Given the reasonable fractions identified when the number of genes in factors differs by an order of 10 (190 versus 1,900), and the fact that our recovery of the structure was virtually unchanged, it is apparent that the number of genes in a factor does not impact on the ability of factor analysis to recover the factor structure. In contrast, hierarchical cluster analysis performs less well as the number of noise genes increases, with the noise genes increasing in their dispersion among clusters.

In a set of simulations to match our Drosophila experimental design, 10 genotypes with three replicates per genotype for a total of 30 samples (chips) were simulated. Averaging transcript abundance within each genotype removed uninteresting variation and increased resolution (data not shown). We began with three factors of 30 genes each, and 100 noise genes. The maximum difference in transcript abundance among genotypes was large (1, 2 and 3). In this case, the three factors were not clearly identified. To determine whether genes would be correctly placed inside factors, the number of factors was set to three. When difference in transcript abundance among genotypes was lowest, fewer genes were placed in the correct factor (13/30). On the other hand, where difference in transcript abundance among genotypes was highest and 100% of the correlated genes loaded on the correct factor, 29 of the genes from the other two factors and 29 noise genes were erroneously identified as a part of this factor. When correlation among genes within a factor was reduced to ρ = 0.40, the difficulty in identifying genes on the correct factor increased.

We hypothesized that the presence of many noise genes, coincident with a sample size of ten, was responsible for the difficulty in identifying the correct number of gene factors. Without the noise genes, when correlation was ρ = 0.8, each of the underlying factors was clearly identified. All (100%) of the genes were correctly placed in the corresponding factors, although several genes were placed in multiple factors. When correlation among genes in a factor was reduced to ρ = 0.40, the number of factors was underestimated and many of the genes were identified in multiple factors. This indicates a complex interplay between the difference in transcript abundance among genotypes, the sample size, the number of noise genes and the correlation structure in the identification of factors.

We also examined the impact of increasing the number of factors beyond the number of genotypes. We simulated a set of 600 genes belonging to 20 factors, with 30 genes in each factor and 100 noise genes. In an analysis with nine factors (the maximum estimable with ten genotypes), factor 1 seemed to capture the majority of all the genes in the 20 factors. When effect size was large, the majority of the noise genes (58%) were identified by their failure to load highly (greater than 0.7) on any factor, and the majority of genes that were correlated did load highly on at least one factor (96%). Lowering the correlation lowers the ability to identify correlated genes as loading highly. In this case, 30% are identified and approximately the same fraction of noise genes (60%) are identified. Consistent with previous factor analysis theory [23, 36], when the number of factors is larger than the sample size the number and composition of the factors cannot be estimated.

We then wanted to determine whether hierarchical cluster analysis would resolve this structure more clearly. We plotted the results in a dendrogram where each of the simulated factors are plotted with a separate color (Figure 2). No clear pattern of clustering was found. However, the clusters do form some 'kernels', so that biological knowledge of pathways could potentially be applied to interpret some of the groupings, as is common practice. However, the clear presence of noise genes throughout the cluster structure clearly makes interpretation difficult in cases where the true structure is unknown.

We then applied tight clustering to these data. When the correct number of clusters is specified, tight clustering does identify the structure of the 20 clusters. Each cluster consists of a subset of genes that belong to that cluster. Notably, it does not erroneously place noise genes into clusters. It performs less well in identifying the correct number of correlated genes failing to place 50% into clusters, instead classifying these correlated genes as noise. When a larger number of clusters (25) is specified, then there are some 'extra' clusters of noise genes and the clusters of genes are themselves not as distinct, that is 40 genes of the 600 are incorrectly grouped and 221 or 37% of the genes which should be in a cluster are designated as noise.

Overall, the simulation results indicate that focusing on a manageable number of possible factors with a measurable amount of difference in transcript abundance among lines can result in successful identification of factor structure, even when the number of genes examined is large relative to the sample size. We also find that the factor loadings can distinguish noise genes even in complex cases, although the factor structure can not be resolved clearly in those cases.

Data analysis

Loci showing evidence for transcript variation

The GeneChip Drosophila Genome Array was used for this study and of the approximately 13,500 genes on the array, 7,886 showed expression on at least one array, and of these, 4,667 showed evidence for variation among genotypes. As we are studying covariation, we restricted our examination to this list of 4,667 loci (see Additional data file 1 [Supplementary Table 1]).

The immune pathway

To provide an assessment of the performance of factor analysis on a set of well characterized genes, FlyBase [37] was queried for candidate genes involved in the immune pathway [38]. We compared the candidate genes to the list of 4,667 transcripts, and 54 genes were identified. Factor analysis on these transcript levels resulted in the identification of three factors (see Figure 3). Notably, the first factor contained all the lysogen genes present in the study, and the second factor contained all cecropins (Table 2). Factor-analysis groups co-regulated genes in a manner consistent with our understanding of the immune pathway. Hierarchical cluster analysis was also performed on this set of genes. As with the factor analysis, hierarchical cluster analysis found that the lysogen genes clustered together, as did the cecropins. However, determining the appropriate number of clusters was problematic and did not lead to any clear interpretation of the appropriate number of clusters. The final factor model for the immune genes included three factors, therefore we examined a k-means clustering analysis with three clusters for comparison. We found that 17/24 (71%) of the genes that loaded high on factor 1 were on the same cluster. This cluster also includes a few genes that loaded on factors 2 and 3. Genes that loaded high on the second and third factor were distributed among the remaining two clusters. The genes in these groups that did not load significantly on any factor were distributed among the three clusters.

Table 2 Factor analysis for candidate genes for immune function

Full size table

Candidate loci for body size

We again queried FlyBase for a list of genes involved in body size determination and found 92 body size candidates in our list of 4,667 transcripts. Four factors were identified (Figure 3). The identification of covarying genes in the same factor was intriguing. Of particular note is the presence of loci that are contained within quantitative trait loci (QTL) for body size on factor 1: Cdk4, trx, akt1, fru, Dr, mask, khc, and InR [33].

In our analysis, we regressed transcript level for individual candidate genes on the body size phenotype. We found 2,892 genes significant at a nominal level of 0.05 (false discovery rate (FDR), 16%, see Table 3). At a more stringent nominal threshold of 0.01, there were 14 candidate genes for body size which showed significant association between transcript abundance and phenotypic variation for male body size (FDR of 7%). Of the QTL candidates, only InR showed significant association with the phenotype.

Table 3 Factor analysis for candidate genes for body size

Full size table

We then tested the hypothesis that the estimated factors were directly related to phenotypic variation in body size. In our analysis, we regressed the estimated factor (latent variable) on the phenotype body size for each genotype. The regression of factor 1 on body size showed evidence of an association between the factor and the phenotype of body size (P = 0.04, Figure 4a).

Targets of miRNAs

Of 535 putative miRNA targets [39], 203 were contained in our set of 4,667 gene transcripts. Factor analysis resulted in the identification of four gene factors (Table 4). The second factor contained four of the same genes as factor 1 for body size (puc, Eh, mys, bon) with 76 additional genes contained in this factor (loaded at 0.40 or greater). However, this factor was not associated with body size (P = 0.55). While some of the QTL candidates are also putative targets of miRNA regulation, (Cdk4, trx, Dr) these genes did not participate in this factor but were common to the factor identified by the third factor (see Table 4), for which 69 additional genes loaded. This third factor was negatively correlated with body size (P = 0.04, see Figure 4b). (Cdk4, trx, Dr) were not associated with body size in regressions between these individual genes and body size.

Table 4 Factor analysis for putative targets of miRNAs

Full size table

Discussion

We applied factor analysis to high-dimensional microarray data. Using simulated data to estimate factors, we found that when correlation among genes is strong, the number of factors and their structure can be estimated, even in the case where genes unrelated to the factor structure (noise genes) are included. We also found that when noise genes were included hierarchical cluster analysis was unable to separate the noise genes from the signal, or to correctly identify the number of clusters. When the number of genes is large relative to the sample size, as is common in array studies, the number of factors and the genes belonging to each factor can still be identified, as long as the number of factors is less than the sample size. In contrast, hierarchical cluster analysis did not identify the number of clusters even when the number of clusters was smaller than the sample size.

We found that if the number of factors is larger than the number of genotypes, the majority (58%) of noise genes still do not load on any factor, while all but 26 of the 600 correlated genes do load on at least one factor. However, the correct association between individual genes and factors is lost. We conclude that while factor analysis is effective at separating the signal from the noise, the structure of the signal is not estimable. This is consistent with reports in the literature [23, 36]. Using hierarchical clustering, the number and the structure of clusters is not recovered and noise genes are scattered throughout the cluster structure. Kernels of tightly correlated genes were visible, however, indicating that kernel identification is possible in cases where biological knowledge is present. Any separation of signal from noise is purely serendipitous.

These results are not unexpected as the mathematical properties of gene-expression data or the high dimensionality of the data lead to problems for any analysis. When the number of columns (in this case observations) is less than the number of rows (or variables), the matrix is considered ill conditioned [40]. Ill-conditioned matrices can cause problems for many types of statistical analysis and can lead to overfitting of predictive models, among other problems [41, 42]. Some of the effects of the ill-conditioned matrices can be mitigated; however, the problem of an overdetermined system will always exist. An example is in multiple regression, where models with more variables than we have data can not be fit [43].

Tight clustering [27] represents a significant advance over hierarchical clustering in the estimation of cluster structure for microarray data. It provides a reasonable way of identifying most noise genes. In simple cases, however, the algorithm needs to have some flexibility when specifying starting values; that is, more rather than fewer clusters improve the chances of correctly identifying true clusters. The algorithm requires that the number of clusters be specified a priori. In contrast, in many simple cases the number and structure of factors can be recovered precisely using a factor analysis. In the most complex case examined (20 factors, 600 genes and 100 noise genes), when the number of clusters is correctly specified, tight clustering identifies 100% of the noise genes and the 20 clusters with correct genes within each cluster. However, it also incorrectly identifies 50% of genes with signal as noise genes. If too many clusters are specified (25), then the number of genes identified in clusters increases to 63%, although the correct structure is no longer maintained and 20% of the noise genes are incorrectly clustered. In contrast, factor analysis separates the signal from the noise, correctly identifying 96% of the noise genes as noise and 58% of the genes as having signal. In this complex case it is difficult to say whether factor analysis or tight clustering is 'better'. Both of these approaches clearly outperform hierarchical clustering, and they are complementary in their approach. If the primary goal is to separate the correlated genes from the uncorrelated genes then factor analysis performs better than tight clustering. If determining the number and structure of factors (coexpressed genes) is the goal, using both approaches and comparing findings will be reasonable.

Clear network structures can be identified with factor analysis and can then be followed up experimentally. Furthermore, unlike cluster analysis (hierarchical or tight), which provides no summary of the clusters into a single variable, the factor loading values are directly interpretable as the degree of participation of that locus in a factor. In contrast to cluster analysis, the factor loadings for genes give us information on the relative strength of a gene on a factor. We can use factor loadings to identify the most 'significant' genes on a factor and we can use the factor loadings to remove genes that are not contributing to any factor. As such, genes that all load highly on the same factor can be said to be coordinately expressed and putatively co-regulated, and noise genes can be identified as they will not load high on any factor. Unlike clustering, factor analysis allows genes to participate in several factors, thus reflecting biological reality more accurately.

Extensions of factor analysis have been developed and include allowing for nonlinearity [44, 45] and the application of Bayesian approaches [46]. Factor analysis is itself a subclass of modeling techniques known as structural equation models [47] and the ongoing theoretical interest in these approaches allows for expansion of consideration to more than the present circumstance.

We focus our Drosophila analyses on a set of genotypes that are mated according to a round-robin design where the parental lines are natural variants. Analyses of such lines allows inferences about the underlying genetic contribution to variation and inferences about variation in pathways that occurs as a result of such natural genetic variation. By using candidate loci from reverse genetic and mutational projects, the importance of these loci in a broad context can be assessed. The factor analysis for the list of genes annotated as body size candidates resulted in the estimation of four factors. Factor 1 is of great interest as several of the genes that load on this factor - Cdk4, trx, akt1, fru, Dr, mask, woc, Khc, and InR - are contained in QTL for body size [33]. This overlap is exciting as the data from this QTL analyses are independent of our factor analysis. The identification of the same set of loci lends weight to the evidence that these loci are involved with the formation of body size in a natural population. Of these loci, only InR is directly correlated with body size. Given our limited knowledge of pathways for body size, it was exciting to note that two of the genes in this factor - trx and M(2)21AB - have been shown to interact [48]. In addition, the early sex-determination cascade genes Sxl, tra, and tra2 are all associated with a different factor (factor 2).

If the expression of multiple genes is regulated by a common underlying factor such as a transcription factor or a miRNA, and this regulation exhibits genetic variation, then we expect that the gene expression among these genes in a set of genotypes from a mating design will also be correlated. Factor analysis on putative targets of miRNA control revealed that putative targets of the same miRNA often occurred on the same factor. The number of targets for each miRNA was small, and so we cannot conclude that the coexpression is greater than we would expect by chance. Nevertheless, this is the first opportunity to examine the correlation structure of these putative targets of miRNAs. We found that of the 203 genes identified as miRNA targets, there was evidence for 188 participating in at least one of the four factors. While four genes on factor 1 for body size (puc, Eh, mys, and bon), and nine additional candidate body size genes (Sh, Abd-B, trx, fng, qm, woc, Dr, and Cdk4) are putative targets of miRNAs [39], the resulting miRNA factors are uncorrelated with the factors for body size. One of the miRNA factors is associated with the body size phenotype.

In summary, factor analysis, a technique developed to discover and model underlying mechanisms in complex social and psychiatric situations, seems to offer a reasonable middle ground for gaining understanding of coordinated gene expression. Our simulations show that when the number of underlying factors is larger than the sample size, it is not straightforward to recover the structure of the simulated data, although signal can be separated from noise. In contrast, target groups, even when the number of genes is large, can be used to identify several underlying regulatory mechanisms. In the case of body size for Drosophila, factor analysis offers an exciting opportunity to estimate gene networks, as relatively little is known about how genes involved in body size work together.