Community transcriptomics reveals universal patterns of protein sequence conservation in natural microbial communities

Background Combined metagenomic and metatranscriptomic datasets make it possible to study the molecular evolution of diverse microbial species recovered from their native habitats. The link between gene expression level and sequence conservation was examined using shotgun pyrosequencing of microbial community DNA and RNA from diverse marine environments, and from forest soil. Results Across all samples, expressed genes with transcripts in the RNA sample were significantly more conserved than non-expressed gene sets relative to best matches in reference databases. This discrepancy, observed for many diverse individual genomes and across entire communities, coincided with a shift in amino acid usage between these gene fractions. Expressed genes trended toward GC-enriched amino acids, consistent with a hypothesis of higher levels of functional constraint in this gene pool. Highly expressed genes were significantly more likely to fall within an orthologous gene set shared between closely related taxa (core genes). However, non-core genes, when expressed above the level of detection, were, on average, significantly more highly expressed than core genes based on transcript abundance normalized to gene abundance. Finally, expressed genes showed broad similarities in function across samples, being relatively enriched in genes of energy metabolism and underrepresented by genes of cell growth. Conclusions These patterns support the hypothesis, predicated on studies of model organisms, that gene expression level is a primary correlate of evolutionary rate across diverse microbial taxa from natural environments. Despite their complexity, meta-omic datasets can reveal broad evolutionary patterns across taxonomically, functionally, and environmentally diverse communities.


Background
Variation in the rate and pattern of amino acid substitution is a fundamental property of protein evolution. Understanding this variation is intrinsic to core topics in evolutionary analysis, including phylogenetic reconstruction, quantification of selection pressure, and identification of proteins critical to cellular function [1,2]. A diverse range of factors has been postulated to affect the rate of sequence evolution within individual genomes, including mutation and recombination rate [3], genetic contributions to fitness (that is, gene essentiality) [4], timing of replication [5], number of protein-protein interactions [6][7][8], and gene expression level [9]. Among these, gene expression level has emerged as the strongest predictor of evolutionary rate across diverse taxa, with highly expressed genes experiencing high sequence conservation [9][10][11][12][13][14]. However, these studies have focused on model organisms or small numbers of target species. The links between gene expression and broader evolutionary properties, including evolutionary rate, and the mechanistic basis for these relationships remain poorly described for the vast majority of organisms, notably non-model taxa from diverse natural communities.
Deep-coverage sequencing of microbial community DNA and RNA (metagenomes and metatranscriptomes) provides an unprecedented opportunity to explore protein-coding genes across diverse organisms from natural populations. Such studies have yielded valuable insight into the genetic potential and functional activity of natural communities [15][16][17][18][19], but thus far have been applied only sparingly to questions of evolution. Furthermore, only a subset of studies present coupled DNA-RNA datasets for comparison [17,[19][20][21]. When analyzed in tandem, coupled DNA-RNA datasets facilitate categorization of the relative transcription levels of different gene categories, potentially revealing properties of sequence evolution driven in part by expression level variation. However, it remains uncertain whether broad evolutionary correlates of gene expression, potentially including sequence conservation, would even be detectable in community-level samples, which contain sequences from potentially thousands of widely divergent taxa. Here, we compare microbial metagenomic and metatranscriptomic datasets from marine and terrestrial habitats to explore fundamental properties of sequence evolution in the expressed gene set.
Specifically, we use coupled microbial (Bacteria and Archaea) metagenomic and metatranscriptomic datasets to explore the hypothesis that highly expressed genes are more conserved than minimally expressed genes. In lieu of conservation estimates based on alignments of orthologous genes, which are not feasible using fragmentary shotgun data containing tens of thousands of genes, sequence conservation was estimated based on amino acid identity relative to top matches in a reference database. Our results indicate a strong inverse relationship between evolutionary rate and gene expression level in natural microbial communities, measured here by proxy using transcript abundance. Furthermore, these results demonstrate broad consistencies in protein-coding gene expression, amino acid usage, and metabolic function across ecologically and taxonomically diverse microorganisms from different environments. This study illustrates the utility of environmental meta-omic datasets for informing theoretical predictions based (largely) on model organisms in controlled laboratory settings.

Results and discussion
Expressed genes evolve slowly The relationship between gene expression (transcript abundance) and sequence conservation was examined for protein-coding genes in coupled metagenome and metatranscriptome datasets generated by shotgun pyrosequencing of microbial community DNA and RNA, respectively. These datasets represent varied environments, including the oligotrophic water column from two subtropical open ocean sites in the Bermuda Atlantic Time Series (BATS) and Hawaii Ocean Time Series (HOT) projects, the oxygen minimum zone (OMZ) formed in the nutrient-rich coastal upwelling zone off northern Chile, and the surface soil layer from a North American temperate forest (Tables 1 and 2). Prior studies have experimentally validated the metatranscriptomic protocols used here (RNA amplification, cDNA synthesis, pyrosequencing; see Materials and methods), confirming that estimates of relative transcript abundance inferred from pyrosequencing accurately parallel measurements based on quantitative PCR [15,17,19]. Here, amino acid identity relative to a top match reference sequence identified by BLASTX against the National Center for Biotechnology Information nonredundant protein database (NCBI-nr) is used to estimate sequence conservation.
In all the samples, amino acid identities, averaged across all genes per dataset, were significantly higher for RNA-derived sequences (metatranscriptomes) compared to DNA-derived sequences (metagenomes), with an average difference of 8.9% between paired datasets (range, 4.4 to 14.7%; P < 0.001, t-test; Table 2). Further analysis of a representative sample (OMZ, 50 m) showed that RNA identities remained consistently elevated across a gradient of high-scoring segment pair (HSP) alignment lengths ( Figure 1). This pattern suggests that the DNA-RNA difference was not driven by the (on average) shorter read lengths in the RNA transcript pool (length data not shown), which could have imposed selection for reads with higher identity in order to meet the bit score cutoff (see Materials and methods). This pattern was not observed in the highest alignment length bin (>100 amino acids), likely due to the small number of genes (n = 53) detected among the RNA reads falling into this category (for example, 0.4% of those in the 40 to 50 amino acid bin; see error bars in Figure 1).
To further rule out that the DNA-RNA discrepancy was due to methodological differences in DNA-and RNA-derived samples (for example, error rate variation due to differential sample processing; see Materials and methods), we examined amino acid identities in expressed and non-expressed genes derived from the DNA dataset only. Hereafter, we operationally define 'non-expressed' genes as those detected only in the DNA datasets, whereas 'expressed' genes are those detected in both the DNA and RNA datasets (gene counts per fraction are provided in Table 3). Across all datasets, mean identities for DNA-derived non-expressed genes were significantly lower (mean difference, 10.6%; range, 3.7 to 19.4%; P < 0.001, t-test; Table 2) than those of DNAderived expressed genes, whose values were similar to those of RNA transcripts that matched expressed genes ( Table 2). This trend was consistent across all samples ( Table 2) and independent of the database used for identifying reads, as comparisons against the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Global Ocean Sampling (GOS) protein databases for a representative sample (OMZ, 50 m) revealed a similar RNA-DNA incongruity (Table 4). Furthermore, this pattern was unchanged when ribosomal proteins were excluded from the datasets (Table 4), as has been done previously to avoid bias due to the high expression and conservation of these proteins [14]. These data confirm a significantly higher level of sequence conservation in expressed versus non-expressed genes, broadly defined based on the presence or absence of transcripts.
Given the differences observed between expressed and non-expressed categories, a positive correlation between conservation and the relative level of gene expression may also be anticipated [9]. Here, per-gene expression level was measured as the ratio of gene transcript abundance in the RNA relative to gene abundance in the DNA, with abundance normalized to dataset size. Correlations between amino acid identity and expression ratio were not observed in any of the samples when all genes representing all taxa were combined (r 2 = 0 to 0.02; see Figure 2 for a representative dataset). This pattern suggests that for a substantial portion of the metatranscriptome, transcriptional activity cannot be used as a predictor of evolutionary rate. This is likely due in part to the difficulty of accurately estimating expression ratios for low frequency genes, which constitute the majority of the metatranscriptome at the sequencing depths used in this study [22,23]. However, across all samples, mean amino acid identity consistently increased with expression ratio when genes were binned into broad categories: all genes, top 10%, top 1%, and top 0.1% most highly expressed ( Figure 3). These data indicate that while transcript abundance is a poor quantitative indicator of sequence conservation on a gene-bygene basis in community datasets, the most highly expressed genes are, on average, more highly conserved than those expressed at lower levels.

Genome-level corroboration
It is possible that differences in the relative representation of genes in the BLAST databases may cause the incongruity in sequence conservation between expressed and nonexpressed genes. Specifically, if expressed genes are more abundant in the database (which may be likely if these genes are also more abundant in nature), an expressed gene sampled from the environment will have a higher likelihood of finding a close match in the database, relative to a non-expressed gene. We therefore examined the discrepancy between expressed and non-expressed gene sets only for DNA reads whose top hits match the same reference genome. Under a null hypothesis of uniform evolutionary rates across a genome, all genes in a sample whose closest relative is the same reference genome should exhibit uniform divergence from the reference. The link between expression level and sequence conservation was observed at the level of individual genomes. Figure 4 (left panel) shows the discrepancy in amino acid identity between expressed versus non-expressed genes that match the top five most abundant reference taxa (whole genomes) in each sample. In all genomes, excluding Bradyrhizobium japonicum from the soil sample, the mean amino acid identity of expressed genes was significantly greater than that of non-expressed genes (P < 0.001, t-test). These taxon-specific patterns argue against an overall bias due to varying levels of gene representation in the database. Rather, assuming that the sequences that match the expressed and non-expressed gene fractions of a given reference genome are indeed present in the same genome in the sampled environment (an assumption that might be unwarranted if these two gene fractions experience varying rates of recombination or horizontal transfer among divergent taxa -see below), these results suggest that differential conservation levels, and not sampling artifacts, are driving the overall discrepancy between expressed and non-expressed genes.

Core genes are overrepresented in the expressed gene fraction
Our data confirm an inverse relationship between expression level and evolutionary rate in natural microbial communities. However, it remains unclear to what extent gene expression level depends on a gene's functional importance to organism fitness (that is, essentiality) versus other potential explanations, such as 'translational accuracy or robustness' [24]. It has been argued that orthologous genes retained across divergent taxa ('core' genes) may mediate basic cellular functions and that such genes are more likely to be more essential than non-core (taxon-specific) genes [25][26][27]. Here, we calculated the proportional representation of expressed and non-expressed genes in the core genome, determined separately for each of the top five most abundant organisms in each of the samples (18 taxa total). Each taxon's core genome is composed of a relative orthologous gene set determined from comparison to a closely related sister taxon (or taxa; Table 5). The exact number of genes within each core set would likely vary if different sister taxa were used for comparison [28]. Here, the proportion of each genome that fell within the core set varied widely, from 17 to 80% (Table 5), reflecting natural variation and variation in the availability of whole genomes from different taxonomic groups. Expressed genes were significantly more likely to fall within a core gene set shared across taxa. Figure 4 (right panel) shows the difference in core genome representation (percentage of genes within core set) between expressed and non-expressed gene fractions for each reference organism. In 52 of the 60 comparisons (87%), the percentage of expressed genes falling within the core set was greater than that for the non-expressed gene fraction; of these differences, 38 (73%) were significant (P < 0.0009, chi-square). In some taxa, such as Prochlorococcus marinus str. NATL2A, core genome representation was over 30% greater among expressed genes relative to non-expressed genes. In contrast, for the HOT 500 m dataset, expressed genes were not enriched in core genes, which we speculate may be due to the activity of the microbial community at this depth (see Conclusions section below). Overall, however, the data support the broad trend that highly expressed genes are more likely to belong to an orthologous set shared across multiple taxa.
The differential representation of core genes within expressed and non-expressed genes may influence the relative sequence conservation levels of these two gene fractions. Gene acquisition from external sources (for example, homologous recombination, horizontal gene transfer (HGT)) is an important source of genetic variation in bacteria [29]. A conserved core genome is traditionally thought to undergo lower rates of recombination and HGT relative to more flexible genomic regions (for example, genomic islands) [30], though the horizontal transfer of core genes may also be common in some taxa [31]. A central limitation to shotgun sequencing datasets is that disparate sequences cannot be definitively linked to the same genome, making it challenging to evaluate the relative contributions of HGT, homologous recombination, and mutation to sequence divergence. Consequently, it is possible that the higher levels of sequence divergence observed in the non-expressed gene set are due in part to enhanced rates of HGT among the noncore genes that predominate in this gene set.
Surprisingly, within the expressed gene fraction, noncore genes were more highly expressed than core genes. Among the datasets representing the five most abundant taxa per sample (n = 60, as above), 80% showed higher expression levels (expression ratio) of non-core genes relative to core genes ( Figure 5). Averaged across all of these taxa, the expression ratio was 34% higher in non-core genes relative to core genes (2.5 versus 1.9; n = 13,324 and  30,096, respectively; P < 0.00001). This pattern seemingly conflicts with studies based on cultured organisms. For example, a prior comparative survey of 17 bacterial proteomes showed a relative enrichment of peptides representing proteins encoded within the core genome [28]. Also, essential proteins necessary for organism survival have been shown to be expressed at higher abundances than nonessential proteins in cultures of both Escherichia coli [32] and Pseudomonas aeruginosa [33]. This observation indirectly links core genome representation and gene expression, as essential orthologs have been shown to be more broadly represented among diverse taxonomic groups than nonessential genes [34]. Our data, representing diverse taxa from the natural environment, raise the hypothesis that core genes are more likely to be expressed (above the level of detection at the sequencing depths used here). However, non-core genes, when expressed, are more likely to be expressed at higher levels. The high expression of non-core genes, also observed previously for Prochlorococcus [19], may reflect the importance of taxonspecific genes for adaptation to individual niches in a heterogenous environment [30].

Functional patterns in expressed gene sets
The degree to which expressed gene sets share functional similarity across microbial communities from diverse habitats is unclear. Hewson et al. [16] observed shared functional gene content among metatranscriptome samples taken from the same depth zone (upper photic layer) at eight sites in the open ocean. Also, the four OMZ metatranscriptome datasets analyzed in this study have been shown to cluster separately from the corresponding metagenome datasets based on functional category abundances, suggesting similar expressed gene content across depths [35]. However, this clustering was likely influenced in part by variation in per-gene sequence abundance (evenness) between the metagenomes and metatranscriptome, and did not explicitly compare expressed and nonexpressed gene fractions. Here, we explored functional differences between expressed and non-expressed genes (as defined above) within metagenome (DNA) samples, for which the relative read copy number per gene is    highlighting functional differences between ocean and soil communities (Figures 6 and 7). Among the four ocean metagenomes, expressed gene sets clustered together to the exclusion of the non-expressed genes from the same samples ( Figure 6). Indeed, shifts in functional gene usage between expressed and non-expressed fractions were broadly similar across all samples ( Figures  8 and 9). Instances in which all five samples showed the same direction of change (increase or decrease) in KEGG gene category abundance occurred in 14 of the 25 functional categories shown in Figure 8 (marked by open stars), significantly higher (nine times) than random expectations if ignoring potential covariance between categories (P < 0.0002, chi-square). Notably, across all five samples, the expressed gene set was significantly enriched in genes involved in energy and nucleotide metabolism, transcription, and protein folding, sorting, and degradation ( Figure 8). In contrast, the   Figure 4 Expressed and non-expressed genes differ in amino acid identity (left) and core genome representation (right). Data are from DNA sequence sets and include the five most abundant taxa per sample, with taxon abundance determined by the proportion of total reads with top matches to protein-coding genes in each genome (BLASTX of all DNA reads against NCBI-nr). 'Core genome representation' is calculated as the percentage of each gene set (that is, expressed or non-expressed genes) falling within the core genome of each taxon, as defined in the text. All differences (left and right panels) are significant (P < 0.001), unless marked with an asterisk. non-expressed gene set was enriched in genes mediating lipid metabolism and glycan biosynthesis and metabolism; in all ocean samples but not the soil sample, DNA replication and repair was also significantly overrepresented among non-expressed genes (P < 0.0004, chisquare). At the finer resolution provided at the KEGG pathway level, genes involved in oxidative phosphorylation, chaperones and protein folding catalysis, translation factors, and photosynthesis were consistently and significantly (P < 0.0001, chi-square) overrepresented among expressed genes in all samples, whereas genes of peptidoglycan biosynthesis, mismatch repair, and amino sugar and nucleotide sugar metabolism were proportionally more abundant in the non-expressed fraction ( Figure 9). These data indicate broad similarities in functional gene expression across diverse microbial communities, with expressed gene pools biased towards tasks of energy metabolism and protein synthesis but relatively underrepresented by genes of cell growth (for example, lipid metabolism, DNA replication).

Database-independent analysis
Our characterization of relative evolutionary rates in expressed versus non-expressed genes is based on sequence divergence relative to closest relatives in the sequence database (NCBI-nr). It is unclear to what extent this same trend may be detected within clusters of related sequences within our samples, independent of comparison to an external reference database. We therefore examined variability in amino acid divergence within clusters of expressed and non-expressed proteincoding sequences for five representative samples, including shallow and deep depths from the OMZ and HOT oceanic sites, and the surface soil sample (Table 6).
Mean identity per cluster was consistently higher for DNA sequences in non-expressed clusters compared to DNA sequences from expressed clusters (mean difference 5.3%; Table 6). This pattern is opposite to that observed in comparisons of sequences to external reference databases (above). However, we argue that this inverse pattern is indeed consistent with our hypothesis that expressed genes are more likely to be part of a core set shared across taxa ( Figure 4). If this hypothesis is true, then the DNA-only cluster set (non-expressed genes) will be relatively enriched in non-core genes, including those present in only one taxon/genome and lacking any known homologs (for example, orphans) [36,37]. In environmental sequence sets, if these sequences appear multiple times, they are more likely to be identical, or nearly so, because they come from a single taxon population and therefore cluster only with themselves (homologs from other taxa are by definition absent and will not fall into the cluster).
In contrast, if expressed genes are more likely to fall within the core genome, clusters containing both DNA-and RNA-derived sequences (that is, expressed sequences) will be relatively enriched in homologs that occur across multiple divergent taxa. By definition, therefore, DNA+RNA clusters will be relatively enriched in sequences differing at both the population level and at higher taxonomic levels (for example, 'species'), while DNA-only clusters will be enriched in sequences differing only at the population level. Given this explanation, we would predict that DNA+RNA clusters (with RNA sequences excluded) are larger than DNA-only clusters and that the DNA-only cluster set as a whole is enriched in high identity clusters. Indeed, DNA+RNA clusters are, on average, approximately 20 to 33% larger than DNA-only clusters (RNA sequences not included in counts) and DNA-only cluster sets, notably those of the OMZ samples, are enriched in clusters with identities greater than 98% ( Figure 10). These data indicate that expressed gene clusters recruit a larger and more diverse set of sequences, consistent with the hypothesis that expressed genes are more likely to represent core genes shared across taxa. More generally, the contrast between this self-clustering approach and the BLAST-  Figure 5 Mean expression level of core and non-core genes across the five most abundant taxa per sample. Per gene expression level is measured as a ratio -(Transcript abundance in RNA sample)/(Gene abundance in the DNA sample) -with abundance normalized to dataset size. 'Core' genes are determined individually for each taxon based on orthology with a closely related sister taxon, as described in the main text. Asterisks mark taxa for which expression ratios differed significantly between core and non-core genes (P < 0.001, t-test). based comparisons (above) demonstrates how divergence measurements taken relative to an external top match reference can differ from those relative to a top match internal reference from the same dataset, with the latter more likely to involve comparisons between highly related sequences from the same strains/ populations.

GC content and amino acid usage differ between expressed and non-expressed genes
The discrepancy in sequence conservation between expressed and non-expressed genes coincided with differences in nucleotide composition and amino acid usage between these two sequence pools. GC content was substantially higher in the soil compared to the ocean samples (approximately 20 to 25% enrichment) and consistent between the DNA and RNA pools ( Table  7). In contrast, across all 11 ocean samples, RNAderived protein-coding sequences were significantly elevated in GC relative to those from the DNA (mean RNA-DNA difference, 6%; Table 7), suggesting a broad shift towards GC enrichment in the expressed gene pool. Surprisingly, however, DNA sequences corresponding to expressed genes consistently had a lower GC content than DNA reads matching non-expressed genes (mean difference, 1.9%). These data suggest that the DNA versus-RNA discrepancy in GC content may be driven by a subset of transcripts in the RNA pool, likely those at high abundance. Indeed, analysis of the RNA reads from one sample (OMZ 50 m) showed a progressive increase in GC content with transcript abundance (when transcripts are subdivided into four categories (top 10%, 1%, 0.1% 0.01%) based on the rank abundance of the genes they encode (data not shown). Consistent with the GC pattern, amino acid usage of protein-coding sequences differed significantly between the DNA and RNA samples (Table 8, Figures 11, 12, 13,  and 14). Notably, with the exception of three ocean samples (HOT 500 m, OMZ 110 m and 200 m) and the outlying soil sample, RNA datasets from diverse regions and depths grouped separately from DNA samples when clustered based on amino acid frequencies (Figure 12), suggesting a global distinction between the metagenomic and metatranscriptomic amino acid sequence pools in marine microbial communities. Indeed, of 240 comparisons of amino acid proportions in DNA versus RNA datasets (  (P < 0.0002, chi-square; Table 8, Figure 13). (The high proportion of significant changes is due to the large sample sizes in the analysis.) On average, alanine, glycine, and tryptophan (high GC content) underwent the largest proportional increases from DNA to RNA, while lysine, isoleucine, and asparagine (low GC content) all decreased substantially in frequency. These shifts were largely consistent among ocean samples, but clearly distinct from the pattern observed in soil, where several amino acids changed in frequency in the direction opposite to that in the ocean samples.  Figure 7 Relative abundance of KEGG k2 functional categories (25 most abundant) across nr-reference genes identified in five representative DNA datasets. Reference genes detected among the DNA reads were classified as unique to the DNA dataset (non-expressed) or shared between the DNA and RNA datasets (expressed).
Among the ocean datasets, DNA-RNA shifts in amino acid frequency were strongly related to amino acid GC content (Figure 11a; see Materials and methods). Amino acids with an intermediate GC content (0.5) constituted equivalent fractions, 40% and 36%, of the total number of amino acid frequency increases and decreases, respectively. Strikingly, amino acids with GC content below 0.5 were significantly less abundant in the RNA, being involved in 61% of all decreasing DNA-RNA amino acid frequency changes. In contrast, frequency increases were dominated by amino acids enriched in GC: 50% of increases involved amino acids with GC greater than 0.5, significantly higher than the representation of these amino acids in changes involving a decrease (3%). A similar, but less dramatic, shift in amino acid usage is observed when the DNA reads were binned into expressed and non-expressed gene sets (Figure 11b). In general, the magnitude of the proportional shift in amino acid usage decreases with depth in the water column ( Figure 14). This pattern may reflect an overall decrease in microbial activity with depth, such that the transcriptome, less weighted by highly expressed and highly conserved genes, more closely resembles the metagenome as activity declines. Together, these results suggest a significant shift towards GC-rich amino acids in the expressed gene pool.  Prior studies describe a relationship among gene expression level, sequence conservation, and amino acid usage [38][39][40][41][42]. Specifically, significant enrichment in GC-rich amino acids among highly expressed genes has been demonstrated for individual bacterial taxa, including Prochlorococcus [38,39]. GC richness in expressed genes is potentially driven by a combination of factors, including selection against metabolically costly amino acids (for example, AT-enriched phenylalanine and tyrosine) [40], or selection against AT-richness in highly expressed genes. Alternatively, this pattern may stem from an overall enhanced conservation level in highly expressed genes [12]. Assuming an underlying GC-to-AT mutational bias, which may be a universal trend in bacteria [41,42], selectively constrained genes are predicted to retain a GC-rich signature relative to less-constrained genes. Therefore, the proportional increase in GC-enriched amino acids in expressed genes compared to non-expressed genes in this study is consistent with our observation of enhanced sequence conservation in the expressed community gene pool, and confirms a fundamental distinction in amino acid usage related to gene expression level.

Conclusions
Microbial metagenomes and metatranscriptomes are amalgams of thousands of taxonomically and functionally diverse microorganisms, each of which experiences unique evolutionary pressures. Such complexity might be expected to preclude the detection of bulk evolutionary signals in meta-omic data. Here, we show broad trends in protein coding sequence conservation that transcend variation in both taxonomic composition and habitat type.
Specifically, we confirm that the hypothesized positive relationship between gene expression level and sequence conservation, which has been well established for individual taxa under experimental conditions [11], is a universal trend across diverse microbial communities in both marine and terrestrial environments. Detecting this trend required binning genes into broad categories (expressed versus non-expressed) based on the detection of transcripts, which depended, in part, on the depth of sequencing per sample. Deeper sequencing would reveal a greater proportion of the expressed gene pool and potentially lead to more accurate measurements of expression level for low frequency genes [22,23]. However, the tremendous taxonomic diversity inherent in microbial communities, as well as the temporal heterogeneity of the environment in which these communities exist, likely confounds any attempt to predict protein conservation based on transcript abundance on a geneby-gene basis using meta-omic data. Nonetheless, the broad discrepancy in sequence conservation between expressed and non-expressed gene fractions is significant, operates consistently across diverse taxa (Figure 4), and confirms that expression level is a primary determinant of evolutionary rate in naturally occurring microorganisms.
The mechanism linking evolutionary rates and expression level is still debated. Sequence conservation in highly expressed proteins has been hypothesized to be driven by selection acting to minimize the costs of protein misfolding, which should increase in tandem with expression level (protein copy number per cell) [14], though the harmful effects of misfolding have been brought into question [43]. This selection for 'translational robustness/ accuracy' is predicted to be largely decoupled from a protein's functional importance [13,14,24]. Here, using mRNA abundance as a proxy for expression level, our results demonstrate broad commonalities in expressed gene content across communities in widely different habitats (ocean versus soil). These data indicate a trend toward genes of protein synthesis and energy metabolism in the more actively expressed gene fraction and toward genes of cell replication and growth in the less expressed fraction. Additionally, this finding, in the context of our results demonstrating enhanced sequence conservation among expressed genes, indirectly suggests that the expression-conservation relationship may partially be constrained by protein function. However, these data cannot be used to justify this conclusion, since both gene expression level and functional importance may independently co-vary with protein evolution rates, as has been demonstrated for isolates of Pseudomonas aeruginosa [33]. Though characterizing the mechanism linking gene expression level and evolutionary rate is beyond the scope of this study, metatranscriptomic data may inform future studies exploring the relative effect of protein function on sequence conservation. We show that the expressed gene set, compared to the non-expressed set, is more likely to contain genes that belong to an orthologous core genome shared across closely related sister taxa. This pattern was broadly consistent across the marine samples and the soil sample. Interestingly, the overrepresentation of expressed genes within the core set was not observed in the HOT 500 m sample (Figure 4), which we hypothesize may be related to an overall decline of metabolic activity at deeper depths within the water column. Microbial transcriptomes can vary significantly in response to the growth phase of the organism [44][45][46]. In less actively growing communities (for example, stationary phase), expression level might be more uniform across the genome (that is, background expression), with both core and non-core genes having a relatively equal probability of detection. In contrast, in actively growing communities, the distribution of transcripts might become dominated by a subset of highly expressed genes (for example, genes mediating energy metabolism, membrane transport), as we have observed in other samples. If such genes fall within the core genome, core genome representation in the expressed gene set would be predicted to be greater in more active communities.
Our datasets highlight similarities in gene expression and sequence evolution across very different microbial habitats, but differ markedly in other attributes. Notably, the soil community was a clear outlier with respect to functional gene content (Figures 6, 7, and 9) and amino acid usage (Figures 12 and 13), likely due to the distinct community composition of this habitat (Figure 4). However, given our analysis of a single soil metatranscriptome, and the use of different RNA extraction kits for soil versus marine samples (see Materials and methods), we urge caution when comparing microbial community composition between soil and marine datasets. A more comprehensive comparison of taxonomy and functional gene expression would involve extended metatranscriptome sampling across multiple soil types (and locations), as well as optimization of RNA extraction protocols to ensure unbiased lysis of all microorganisms. Such an analysis was not the focus of this study. However, the inclusion of the soil sample confirmed a positive relationship between expression level and sequence conservation at both the genome and community levels   ( Figures 3 and 4), as well as an overrepresentation of core genes within the highly expressed gene set ( Figure 4). Though it is possible that such patterns may not be observed in other sample types, or following different extraction protocols, our results provide strong evidence for universal features of protein-coding gene evolution in natural microbial communities. The composition of metatranscriptomic and metagenomic sequence datasets depends not only on intrinsic biological factors (for example, community composition, metabolic state) but also on the physical and chemical environment at the time of sampling. Furthermore, interpretation of the resulting data can vary based on the analytical method (for example, database-dependent versus -independent analyses, as shown here) and on the availability and biases of the reference sequences to which the data are compared. Here, we attempt to rule out potential database artifacts by analyses at both the community and genome level. In so doing, our results suggest that environmental meta-omic datasets, despite their inherent complexity, can inform theoretical evolutionary predictions and reveal universal trends across ecologically and phylogenetically diverse microbial communities.

Materials and methods
We examined protein-coding sequences in coupled microbial metagenomes and metatranscriptomes from multiple depths at three distinct oceanographic sites and from surface soil in a temperate forest (Table 1). Ocean datasets (excluding the HOT 110 m RNA sample, which was sequenced in this study) were generated in prior studies using the Roche 454 Genome Sequencer with FLX series chemistry and extracted from public databases (see Table 1 for accession numbers). Sequences from the soil sample were generated in this study ( [17,19,23,35]. It is possible that these kits may lyse different microbial taxa at varying efficiencies. As this possibility has not been assessed, we urge caution when comparing the compositions of soil and marine communities based on metatranscriptome data, though this was not the focus of our study. Total DNA was quantified and used directly for pyrosequencing. Total RNA was further processed, as described below.
rRNA subtraction, RNA amplification and cDNA synthesis Total RNA was amplified and prepared for pyrosequencing using established protocols. Briefly, the proportion of ribosomal RNA transcripts (bacterial and archaeal 16S and 23S rRNA and eukaryotic 18S and 28S rRNA) in total soil RNA was reduced via a published subtractive hybridization protocol using sample-specific rRNA probes [23]. Following rRNA subtraction, total RNA was amplified as described previously using a modification of the MessageAmp™ II-Bacteria kit (Ambion) [17,19]. Briefly, total RNA was polyadenylated and converted to double-stranded cDNA via reverse transcription. cDNA was then transcribed in vitro (37°C, 12 to 14 h) to produce microgram quantities of single-stranded antisense RNA. The amplified products (approximately 5 to 10 μg aliquot) were converted back to double-stranded cDNA using the SuperScript ® III First-Strand Synthesis System (Invitrogen, Carlsbad, CA, USA) for first-strand synthesis (random hexamer priming), and the SuperScript™ Double-Stranded cDNA synthesis kit (Invitrogen) for second-strand synthesis. cDNA was then purified (QIAquick PCR purification kit, Qiagen, Valencia, CA, USA), digested with BpmI (37°C, 3 h) to remove poly (A) tails, and used directly for pyrosequencing.

Pyrosequencing
Soil DNA and cDNA were purified for sequencing via the Agencourt ® AMPure ® kit (Beckman Coulter Genomics, Danvers, MA, USA) and used for the generation   Table 1.

Data analysis Homology searches
Pyrosequencing reads matching ribosomal RNA genes were identified in cDNA and DNA datasets by BLASTN searches against a custom database of prokaryotic and eukaryotic small and large subunit rRNA sequences (5S, 16S, 18S, 23S and 28S rRNA) taken from microbial genomes and the ARB SILVA LSU and SSU databases [47]. Reads matching rRNA with bit scores >50 were identified and removed from further analysis. Replicate sequences sharing 100% nucleotide similarity and length, which may represent artifacts generated by the pyrosequencing protocol [23,48], were identified among non-rRNA sequences using the open-source program CD-HIT [49] and removed from each dataset. Non-replicate, non-rRNA sequences were characterized by BLASTX searches against NCBI-nr, KEGG, and assembled protein sequences taken from the GOS database. Protein-coding sequences were identified as the top reference (database) gene(s) matching each read above a bit score of 50. When reads matched multiple genes with equal bit score, each matching gene was considered a top hit, with its representation scaled proportionate to the number of genes sharing the same bit score.
Amino acid identities, relative taxon abundance, and expression ratio Local amino acid identities were recovered from within the top HSP for each sequence having a significant match (bit score >50) to a reference gene in the nr database; gaps were not included in identity calculations. Identities of multiple reads matching the same reference gene were averaged to obtain per gene identities, and these values were then averaged across all reference genes per dataset. Relative taxon abundance per sample was determined as the number of sequence reads matching the same reference taxon as a top hit in BLASTX searches of NCBI-nr. The relative transcriptional activity (expression level) per expressed gene was normalized to account for variations in gene abundance in the DNA pool, as calculated by the expression ratio: (RNA reads per gene/Total RNA reads matching NCBInr)/(DNA reads per gene/Total DNA reads matching NCBI-nr)

Core genome
The proportional representation of expressed and nonexpressed genes in the orthologous gene set shared between taxa was determined separately for the top five most abundant organisms in each of the samples. The core genome of Prochlorococcus sp., which was common in our samples, was defined as those genes present across all 12 of the sequenced Prochlorococcus strains, based on the analysis of [50]. For all other taxa, a core gene set was more broadly defined by reciprocal BLASTP searches against a closely related sister taxon (bit score cutoff = 50). Sister taxa used for core genome identification, and the number of genes shared between taxa, are listed in Table 5.

Hierarchical clustering
Average-linkage clustering was performed in Cluster 3.0 using Pearson coefficients (centered) derived from pairwise correlations of KEGG functional category abundances ( Figures 6 and 9), amino acid proportions ( Figure 12), or proportional changes in amino acid usage ( Figure 13). KEGG category clustering was based on expressed and non-expressed gene sets within the DNA data only, as defined in the main text based on matches to nr reference proteins. Abundance per KEGG category was calculated as the total number of reads per category as a proportion of total reads matching the KEGG hierarchy at either the category (Figure 8) or pathway level (Figure 9).

Database-independent cluster analysis
Sequence divergence was determined for clusters of related sequences within pooled DNA + RNA datasets. To avoid spurious clustering of non-coding regions (for example, small RNAs (sRNAs)), amino acid sequences present in the DNA and RNA reads (those with significant matches to the nr database) were identified using the ab initio gene-finding program Metagene [51]. (Comparisons of amino acid sequences determined from Metagene and sequences determined from BLAST HSP regions produced nearly identical results.) Sequences derived from DNA and RNA reads were pooled and clustered using the program CD-HIT [49] applying a low sequence identity threshold (55%) over the aligned region. Resulting clusters contained RNA-derived sequences only, DNA-derived sequences only, or both DNA-and RNA-derived sequences. For all clusters containing DNA-derived sequences only (nonexpressed) or DNA + RNA-derived (expressed) sequences, the divergence between each DNA-derived sequence and the reference (longest) sequence per cluster was calculated along the length of the aligned region. These values were averaged to estimate divergence per cluster. Clusters containing only one sequence (singletons) and clusters in which the reference sequence was the only DNA sequence were excluded from further analysis. For clusters in which the reference sequence was RNA-derived, but that contained only one DNA sequence, the percentage identity of the sole DNA sequence was recorded. Divergence per cluster was then averaged for all expressed and nonexpressed sequence/gene clusters.

Amino acid usage
We compared amino acid usage between DNA and RNA datasets and DNA datasets parsed into expressed versus non-expressed gene categories as above. Specifically, for each dataset, reads matching nr proteins (see above for BLASTX descriptions) were examined to assess amino acid usage and GC content, using two approaches. First, amino acids were counted across all HSPs in alignments between sequence reads and their closest match in the nr database. Second, amino acids were counted across open reading frames that were identified using Metagene [51]. These two methods returned nearly identical results; we therefore restrict our discussion to the Metagene data. Amino acids were then categorized based on GC content, where GC content was measured by the relative proportion of Gs and Cs among the synonymous codons for each amino acid (for example, alanine has four synonymous codons; 10 of the 12 total bases (0.83) representing these codons are either a G or C). Proportional changes in amino acid usage, KEGG gene category and pathway abundances, and core genome representation in expressed and non-expressed gene fractions were compared using chi-square tests. Mean amino acid identities and expression ratios were compared using standard two-tailed t-tests. Significance was determined at alpha <0.05 following Bonferroni corrections for multiple comparisons.