Community transcriptomics reveals universal patterns of protein sequence conservation in natural microbial communities
© Stewart et al.; licensee BioMed Central Ltd. 2011
Received: 5 January 2011
Accepted: 22 March 2011
Published: 22 March 2011
Combined metagenomic and metatranscriptomic datasets make it possible to study the molecular evolution of diverse microbial species recovered from their native habitats. The link between gene expression level and sequence conservation was examined using shotgun pyrosequencing of microbial community DNA and RNA from diverse marine environments, and from forest soil.
Across all samples, expressed genes with transcripts in the RNA sample were significantly more conserved than non-expressed gene sets relative to best matches in reference databases. This discrepancy, observed for many diverse individual genomes and across entire communities, coincided with a shift in amino acid usage between these gene fractions. Expressed genes trended toward GC-enriched amino acids, consistent with a hypothesis of higher levels of functional constraint in this gene pool. Highly expressed genes were significantly more likely to fall within an orthologous gene set shared between closely related taxa (core genes). However, non-core genes, when expressed above the level of detection, were, on average, significantly more highly expressed than core genes based on transcript abundance normalized to gene abundance. Finally, expressed genes showed broad similarities in function across samples, being relatively enriched in genes of energy metabolism and underrepresented by genes of cell growth.
These patterns support the hypothesis, predicated on studies of model organisms, that gene expression level is a primary correlate of evolutionary rate across diverse microbial taxa from natural environments. Despite their complexity, meta-omic datasets can reveal broad evolutionary patterns across taxonomically, functionally, and environmentally diverse communities.
Variation in the rate and pattern of amino acid substitution is a fundamental property of protein evolution. Understanding this variation is intrinsic to core topics in evolutionary analysis, including phylogenetic reconstruction, quantification of selection pressure, and identification of proteins critical to cellular function [1, 2]. A diverse range of factors has been postulated to affect the rate of sequence evolution within individual genomes, including mutation and recombination rate , genetic contributions to fitness (that is, gene essentiality) , timing of replication , number of protein-protein interactions [6–8], and gene expression level . Among these, gene expression level has emerged as the strongest predictor of evolutionary rate across diverse taxa, with highly expressed genes experiencing high sequence conservation [9–14]. However, these studies have focused on model organisms or small numbers of target species. The links between gene expression and broader evolutionary properties, including evolutionary rate, and the mechanistic basis for these relationships remain poorly described for the vast majority of organisms, notably non-model taxa from diverse natural communities.
Deep-coverage sequencing of microbial community DNA and RNA (metagenomes and metatranscriptomes) provides an unprecedented opportunity to explore protein-coding genes across diverse organisms from natural populations. Such studies have yielded valuable insight into the genetic potential and functional activity of natural communities [15–19], but thus far have been applied only sparingly to questions of evolution. Furthermore, only a subset of studies present coupled DNA-RNA datasets for comparison [17, 19–21]. When analyzed in tandem, coupled DNA-RNA datasets facilitate categorization of the relative transcription levels of different gene categories, potentially revealing properties of sequence evolution driven in part by expression level variation. However, it remains uncertain whether broad evolutionary correlates of gene expression, potentially including sequence conservation, would even be detectable in community-level samples, which contain sequences from potentially thousands of widely divergent taxa. Here, we compare microbial metagenomic and metatranscriptomic datasets from marine and terrestrial habitats to explore fundamental properties of sequence evolution in the expressed gene set.
Specifically, we use coupled microbial (Bacteria and Archaea) metagenomic and metatranscriptomic datasets to explore the hypothesis that highly expressed genes are more conserved than minimally expressed genes. In lieu of conservation estimates based on alignments of orthologous genes, which are not feasible using fragmentary shotgun data containing tens of thousands of genes, sequence conservation was estimated based on amino acid identity relative to top matches in a reference database. Our results indicate a strong inverse relationship between evolutionary rate and gene expression level in natural microbial communities, measured here by proxy using transcript abundance. Furthermore, these results demonstrate broad consistencies in protein-coding gene expression, amino acid usage, and metabolic function across ecologically and taxonomically diverse microorganisms from different environments. This study illustrates the utility of environmental meta-omic datasets for informing theoretical predictions based (largely) on model organisms in controlled laboratory settings.
Results and discussion
Expressed genes evolve slowly
Read counts and accession numbers of pyrosequencing datasets
This study, SRA028811
This study, sra028811
This study, sra028811
This study, sra028811
Mean percentage amino acid identity of 454 reads matching database reference genes (NCBI-nr) shared between and unique to DNA and RNA samples
Percentage identity to reference genes present ina
Unique reference genes shared between and unique to DNA and RNA datasets
Reference genes present ina
Mean percentage amino acid identity of OMZ 50-m reads with top matches to distinct reference databases (GOS, KEGG, NCBI-nr) and with ribosomal proteins removed
Percentage identity to reference genes present inb
Without ribosomal proteinsf
It is possible that differences in the relative representation of genes in the BLAST databases may cause the incongruity in sequence conservation between expressed and non-expressed genes. Specifically, if expressed genes are more abundant in the database (which may be likely if these genes are also more abundant in nature), an expressed gene sampled from the environment will have a higher likelihood of finding a close match in the database, relative to a non-expressed gene. We therefore examined the discrepancy between expressed and non-expressed gene sets only for DNA reads whose top hits match the same reference genome. Under a null hypothesis of uniform evolutionary rates across a genome, all genes in a sample whose closest relative is the same reference genome should exhibit uniform divergence from the reference.
Core genes are overrepresented in the expressed gene fraction
Proportion of reference taxon genes shared with sister taxon (that is, core gene set)
Number of CDSb
Percentage of cored
Alpha Proteobacterium HIMB114
Pelagibacter ubique HTCC1062
Ca. Kuenenia stuttgartiensis
Planctomyces limnophilus DSM 3776
Ca. Pelagibacter sp. HTCC7211
Pelagibacter ubique HTCC1062
Ca. Pelagibacter ubique HTCC1002
Pelagibacter sp. HTCC7211
Ca. Pelagibacter ubique HTCC1062
Pelagibacter sp. HTCC7211
Prochlorococcus marinus AS9601
All Pro. strains
Prochlorococcus marinus CCMP1375
All Pro. strains
Prochlorococcus marinus MIT 9312
All Pro. strains
Prochlorococcus marinus MIT9301
All Pro. strains
Prochlorococcus marinus NATL1A
All Pro. strains
Prochlorococcus marinus NATL2A
All Pro. strains
Uncultured SUP05 cluster bacterium
Ca. Ruthia magnifica
Solibacter usitatus Ellin6076
Acidobacterium capsulatum ATCC 51196
Ca. Koribacter versatilis Ellin345
Acidobacterium capsulatum ATCC 51196
Acidobacterium capsulatum ATCC 51196
Solibacter usitatus Ellin6076
Bradyrhizobium japonicum USDA 110
Bradyrhizobium sp. BTAi1
Verrucomicrobium spinosum DSM 4136
Expressed genes were significantly more likely to fall within a core gene set shared across taxa. Figure 4 (right panel) shows the difference in core genome representation (percentage of genes within core set) between expressed and non-expressed gene fractions for each reference organism. In 52 of the 60 comparisons (87%), the percentage of expressed genes falling within the core set was greater than that for the non-expressed gene fraction; of these differences, 38 (73%) were significant (P < 0.0009, chi-square). In some taxa, such as Prochlorococcus marinus str. NATL2A, core genome representation was over 30% greater among expressed genes relative to non-expressed genes. In contrast, for the HOT 500 m dataset, expressed genes were not enriched in core genes, which we speculate may be due to the activity of the microbial community at this depth (see Conclusions section below). Overall, however, the data support the broad trend that highly expressed genes are more likely to belong to an orthologous set shared across multiple taxa.
The differential representation of core genes within expressed and non-expressed genes may influence the relative sequence conservation levels of these two gene fractions. Gene acquisition from external sources (for example, homologous recombination, horizontal gene transfer (HGT)) is an important source of genetic variation in bacteria . A conserved core genome is traditionally thought to undergo lower rates of recombination and HGT relative to more flexible genomic regions (for example, genomic islands) , though the horizontal transfer of core genes may also be common in some taxa . A central limitation to shotgun sequencing datasets is that disparate sequences cannot be definitively linked to the same genome, making it challenging to evaluate the relative contributions of HGT, homologous recombination, and mutation to sequence divergence. Consequently, it is possible that the higher levels of sequence divergence observed in the non-expressed gene set are due in part to enhanced rates of HGT among the non-core genes that predominate in this gene set.
Functional patterns in expressed gene sets
The degree to which expressed gene sets share functional similarity across microbial communities from diverse habitats is unclear. Hewson et al.  observed shared functional gene content among metatranscriptome samples taken from the same depth zone (upper photic layer) at eight sites in the open ocean. Also, the four OMZ metatranscriptome datasets analyzed in this study have been shown to cluster separately from the corresponding metagenome datasets based on functional category abundances, suggesting similar expressed gene content across depths . However, this clustering was likely influenced in part by variation in per-gene sequence abundance (evenness) between the metagenomes and metatranscriptome, and did not explicitly compare expressed and non-expressed gene fractions. Here, we explored functional differences between expressed and non-expressed genes (as defined above) within metagenome (DNA) samples, for which the relative read copy number per gene is more uniform than for metatranscriptome samples. To do so, the proportional abundance of KEGG gene categories and functional pathways was examined for five samples representing contrasting environments: the oxycline and lower photic zone of the coastal OMZ (50 m), the suboxic, mesopelagic core of the OMZ (200 m), the upper photic zone in the oligotrophic North Pacific (HOT 25 m), the deep, mesopelagic zone (HOT 500 m), and the soil from Harvard Forest.
Counts and mean percentage identity of amino acid sequence clusters for four representative samples
Mean percentage identityd
OMZ 50 m
OMZ 200 m
HOT 75 m
HOT 500 m
Mean identity per cluster was consistently higher for DNA sequences in non-expressed clusters compared to DNA sequences from expressed clusters (mean difference 5.3%; Table 6). This pattern is opposite to that observed in comparisons of sequences to external reference databases (above). However, we argue that this inverse pattern is indeed consistent with our hypothesis that expressed genes are more likely to be part of a core set shared across taxa (Figure 4). If this hypothesis is true, then the DNA-only cluster set (non-expressed genes) will be relatively enriched in non-core genes, including those present in only one taxon/genome and lacking any known homologs (for example, orphans) [36, 37]. In environmental sequence sets, if these sequences appear multiple times, they are more likely to be identical, or nearly so, because they come from a single taxon population and therefore cluster only with themselves (homologs from other taxa are by definition absent and will not fall into the cluster).
GC content and amino acid usage differ between expressed and non-expressed genes
GC percentages (averaged over all reads) in open reading frames identified using Metagene
Proportional changea in amino acid usage in RNA datasets compared to DNA datasets
Among the ocean datasets, DNA-RNA shifts in amino acid frequency were strongly related to amino acid GC content (Figure 11a; see Materials and methods). Amino acids with an intermediate GC content (0.5) constituted equivalent fractions, 40% and 36%, of the total number of amino acid frequency increases and decreases, respectively. Strikingly, amino acids with GC content below 0.5 were significantly less abundant in the RNA, being involved in 61% of all decreasing DNA-RNA amino acid frequency changes. In contrast, frequency increases were dominated by amino acids enriched in GC: 50% of increases involved amino acids with GC greater than 0.5, significantly higher than the representation of these amino acids in changes involving a decrease (3%). A similar, but less dramatic, shift in amino acid usage is observed when the DNA reads were binned into expressed and non-expressed gene sets (Figure 11b). In general, the magnitude of the proportional shift in amino acid usage decreases with depth in the water column (Figure 14). This pattern may reflect an overall decrease in microbial activity with depth, such that the transcriptome, less weighted by highly expressed and highly conserved genes, more closely resembles the metagenome as activity declines. Together, these results suggest a significant shift towards GC-rich amino acids in the expressed gene pool.
Prior studies describe a relationship among gene expression level, sequence conservation, and amino acid usage [38–42]. Specifically, significant enrichment in GC-rich amino acids among highly expressed genes has been demonstrated for individual bacterial taxa, including Prochlorococcus [38, 39]. GC richness in expressed genes is potentially driven by a combination of factors, including selection against metabolically costly amino acids (for example, AT-enriched phenylalanine and tyrosine) , or selection against AT-richness in highly expressed genes. Alternatively, this pattern may stem from an overall enhanced conservation level in highly expressed genes . Assuming an underlying GC-to-AT mutational bias, which may be a universal trend in bacteria [41, 42], selectively constrained genes are predicted to retain a GC-rich signature relative to less-constrained genes. Therefore, the proportional increase in GC-enriched amino acids in expressed genes compared to non-expressed genes in this study is consistent with our observation of enhanced sequence conservation in the expressed community gene pool, and confirms a fundamental distinction in amino acid usage related to gene expression level.
Microbial metagenomes and metatranscriptomes are amalgams of thousands of taxonomically and functionally diverse microorganisms, each of which experiences unique evolutionary pressures. Such complexity might be expected to preclude the detection of bulk evolutionary signals in meta-omic data. Here, we show broad trends in protein coding sequence conservation that transcend variation in both taxonomic composition and habitat type.
Specifically, we confirm that the hypothesized positive relationship between gene expression level and sequence conservation, which has been well established for individual taxa under experimental conditions , is a universal trend across diverse microbial communities in both marine and terrestrial environments. Detecting this trend required binning genes into broad categories (expressed versus non-expressed) based on the detection of transcripts, which depended, in part, on the depth of sequencing per sample. Deeper sequencing would reveal a greater proportion of the expressed gene pool and potentially lead to more accurate measurements of expression level for low frequency genes [22, 23]. However, the tremendous taxonomic diversity inherent in microbial communities, as well as the temporal heterogeneity of the environment in which these communities exist, likely confounds any attempt to predict protein conservation based on transcript abundance on a gene-by-gene basis using meta-omic data. Nonetheless, the broad discrepancy in sequence conservation between expressed and non-expressed gene fractions is significant, operates consistently across diverse taxa (Figure 4), and confirms that expression level is a primary determinant of evolutionary rate in naturally occurring microorganisms.
The mechanism linking evolutionary rates and expression level is still debated. Sequence conservation in highly expressed proteins has been hypothesized to be driven by selection acting to minimize the costs of protein misfolding, which should increase in tandem with expression level (protein copy number per cell) , though the harmful effects of misfolding have been brought into question . This selection for 'translational robustness/accuracy' is predicted to be largely decoupled from a protein's functional importance [13, 14, 24]. Here, using mRNA abundance as a proxy for expression level, our results demonstrate broad commonalities in expressed gene content across communities in widely different habitats (ocean versus soil). These data indicate a trend toward genes of protein synthesis and energy metabolism in the more actively expressed gene fraction and toward genes of cell replication and growth in the less expressed fraction. Additionally, this finding, in the context of our results demonstrating enhanced sequence conservation among expressed genes, indirectly suggests that the expression-conservation relationship may partially be constrained by protein function. However, these data cannot be used to justify this conclusion, since both gene expression level and functional importance may independently co-vary with protein evolution rates, as has been demonstrated for isolates of Pseudomonas aeruginosa . Though characterizing the mechanism linking gene expression level and evolutionary rate is beyond the scope of this study, metatranscriptomic data may inform future studies exploring the relative effect of protein function on sequence conservation.
We show that the expressed gene set, compared to the non-expressed set, is more likely to contain genes that belong to an orthologous core genome shared across closely related sister taxa. This pattern was broadly consistent across the marine samples and the soil sample. Interestingly, the overrepresentation of expressed genes within the core set was not observed in the HOT 500 m sample (Figure 4), which we hypothesize may be related to an overall decline of metabolic activity at deeper depths within the water column. Microbial transcriptomes can vary significantly in response to the growth phase of the organism [44–46]. In less actively growing communities (for example, stationary phase), expression level might be more uniform across the genome (that is, background expression), with both core and non-core genes having a relatively equal probability of detection. In contrast, in actively growing communities, the distribution of transcripts might become dominated by a subset of highly expressed genes (for example, genes mediating energy metabolism, membrane transport), as we have observed in other samples. If such genes fall within the core genome, core genome representation in the expressed gene set would be predicted to be greater in more active communities.
Our datasets highlight similarities in gene expression and sequence evolution across very different microbial habitats, but differ markedly in other attributes. Notably, the soil community was a clear outlier with respect to functional gene content (Figures 6, 7, and 9) and amino acid usage (Figures 12 and 13), likely due to the distinct community composition of this habitat (Figure 4). However, given our analysis of a single soil metatranscriptome, and the use of different RNA extraction kits for soil versus marine samples (see Materials and methods), we urge caution when comparing microbial community composition between soil and marine datasets. A more comprehensive comparison of taxonomy and functional gene expression would involve extended metatranscriptome sampling across multiple soil types (and locations), as well as optimization of RNA extraction protocols to ensure unbiased lysis of all microorganisms. Such an analysis was not the focus of this study. However, the inclusion of the soil sample confirmed a positive relationship between expression level and sequence conservation at both the genome and community levels (Figures 3 and 4), as well as an overrepresentation of core genes within the highly expressed gene set (Figure 4). Though it is possible that such patterns may not be observed in other sample types, or following different extraction protocols, our results provide strong evidence for universal features of protein-coding gene evolution in natural microbial communities.
The composition of metatranscriptomic and metagenomic sequence datasets depends not only on intrinsic biological factors (for example, community composition, metabolic state) but also on the physical and chemical environment at the time of sampling. Furthermore, interpretation of the resulting data can vary based on the analytical method (for example, database-dependent versus -independent analyses, as shown here) and on the availability and biases of the reference sequences to which the data are compared. Here, we attempt to rule out potential database artifacts by analyses at both the community and genome level. In so doing, our results suggest that environmental meta-omic datasets, despite their inherent complexity, can inform theoretical evolutionary predictions and reveal universal trends across ecologically and phylogenetically diverse microbial communities.
Materials and methods
We examined protein-coding sequences in coupled microbial metagenomes and metatranscriptomes from multiple depths at three distinct oceanographic sites and from surface soil in a temperate forest (Table 1). Ocean datasets (excluding the HOT 110 m RNA sample, which was sequenced in this study) were generated in prior studies using the Roche 454 Genome Sequencer with FLX series chemistry and extracted from public databases (see Table 1 for accession numbers). Sequences from the soil sample were generated in this study (detailed below) and are available in the NCBI Sequence Read Archive under accession [SRA028811].
Soil sample collection and DNA/RNA isolation
Soil was collected from within a transition hardwood-white pine/hemlock forest in the Prospect Hill Tract of Harvard Forest (Massachusetts, USA; 42.54 N 72.18 W; elevation, 385 m) on 27 September 2010. Two cores were taken from 1 to 10 cm below the leaf horizon using a 10 mm diameter soil corer. These cores were homogenized, placed into 50 ml Falcon tubes, flash frozen in liquid nitrogen, and transported to the lab on dry ice. Visible pieces of plant material were removed from soil subsamples with sterile forceps. During plant removal, the subsamples were placed on a sterile surface, laid within a bed of dry ice. Total microbial DNA and RNA were then extracted from these subsamples using PowerSoil Total RNA and DNA isolation kits (MoBio, Carlsbad, CA, USA), according to the manufacturer's protocol. This extraction protocol differs from the method used to generate the marine datasets, which employed the mirVana™ miRNA Isolation kit (Ambion, Austin, TX, USA) for RNA isolation [17, 19, 23, 35]. It is possible that these kits may lyse different microbial taxa at varying efficiencies. As this possibility has not been assessed, we urge caution when comparing the compositions of soil and marine communities based on metatranscriptome data, though this was not the focus of our study. Total DNA was quantified and used directly for pyrosequencing. Total RNA was further processed, as described below.
rRNA subtraction, RNA amplification and cDNA synthesis
Total RNA was amplified and prepared for pyrosequencing using established protocols. Briefly, the proportion of ribosomal RNA transcripts (bacterial and archaeal 16S and 23S rRNA and eukaryotic 18S and 28S rRNA) in total soil RNA was reduced via a published subtractive hybridization protocol using sample-specific rRNA probes . Following rRNA subtraction, total RNA was amplified as described previously using a modification of the MessageAmp™ II-Bacteria kit (Ambion) [17, 19]. Briefly, total RNA was polyadenylated and converted to double-stranded cDNA via reverse transcription. cDNA was then transcribed in vitro (37°C, 12 to 14 h) to produce microgram quantities of single-stranded antisense RNA. The amplified products (approximately 5 to 10 μg aliquot) were converted back to double-stranded cDNA using the SuperScript® III First-Strand Synthesis System (Invitrogen, Carlsbad, CA, USA) for first-strand synthesis (random hexamer priming), and the SuperScript™ Double-Stranded cDNA synthesis kit (Invitrogen) for second-strand synthesis. cDNA was then purified (QIAquick PCR purification kit, Qiagen, Valencia, CA, USA), digested with BpmI (37°C, 3 h) to remove poly(A) tails, and used directly for pyrosequencing.
Soil DNA and cDNA were purified for sequencing via the Agencourt® AMPure® kit (Beckman Coulter Genomics, Danvers, MA, USA) and used for the generation of single-stranded DNA libraries and emulsion PCR according to established protocols (454 Life Sciences, Roche, Branford, CT, USA). Clonally amplified library fragments were sequenced with full plate runs on a Roche Genome Sequencer FLX instrument using Titanium chemistry. Pyrosequencing datasets generated during this study have been deposited in the NCBI Sequence Read Archive under accession numbers listed in Table 1.
Pyrosequencing reads matching ribosomal RNA genes were identified in cDNA and DNA datasets by BLASTN searches against a custom database of prokaryotic and eukaryotic small and large subunit rRNA sequences (5S, 16S, 18S, 23S and 28S rRNA) taken from microbial genomes and the ARB SILVA LSU and SSU databases . Reads matching rRNA with bit scores >50 were identified and removed from further analysis. Replicate sequences sharing 100% nucleotide similarity and length, which may represent artifacts generated by the pyrosequencing protocol [23, 48], were identified among non-rRNA sequences using the open-source program CD-HIT  and removed from each dataset. Non-replicate, non-rRNA sequences were characterized by BLASTX searches against NCBI-nr, KEGG, and assembled protein sequences taken from the GOS database. Protein-coding sequences were identified as the top reference (database) gene(s) matching each read above a bit score of 50. When reads matched multiple genes with equal bit score, each matching gene was considered a top hit, with its representation scaled proportionate to the number of genes sharing the same bit score.
Amino acid identities, relative taxon abundance, and expression ratio
Local amino acid identities were recovered from within the top HSP for each sequence having a significant match (bit score >50) to a reference gene in the nr database; gaps were not included in identity calculations. Identities of multiple reads matching the same reference gene were averaged to obtain per gene identities, and these values were then averaged across all reference genes per dataset. Relative taxon abundance per sample was determined as the number of sequence reads matching the same reference taxon as a top hit in BLASTX searches of NCBI-nr. The relative transcriptional activity (expression level) per expressed gene was normalized to account for variations in gene abundance in the DNA pool, as calculated by the expression ratio: (RNA reads per gene/Total RNA reads matching NCBI-nr)/(DNA reads per gene/Total DNA reads matching NCBI-nr)
The proportional representation of expressed and non-expressed genes in the orthologous gene set shared between taxa was determined separately for the top five most abundant organisms in each of the samples. The core genome of Prochlorococcus sp., which was common in our samples, was defined as those genes present across all 12 of the sequenced Prochlorococcus strains, based on the analysis of . For all other taxa, a core gene set was more broadly defined by reciprocal BLASTP searches against a closely related sister taxon (bit score cutoff = 50). Sister taxa used for core genome identification, and the number of genes shared between taxa, are listed in Table 5.
Average-linkage clustering was performed in Cluster 3.0 using Pearson coefficients (centered) derived from pairwise correlations of KEGG functional category abundances (Figures 6 and 9), amino acid proportions (Figure 12), or proportional changes in amino acid usage (Figure 13). KEGG category clustering was based on expressed and non-expressed gene sets within the DNA data only, as defined in the main text based on matches to nr reference proteins. Abundance per KEGG category was calculated as the total number of reads per category as a proportion of total reads matching the KEGG hierarchy at either the category (Figure 8) or pathway level (Figure 9).
Database-independent cluster analysis
Sequence divergence was determined for clusters of related sequences within pooled DNA + RNA datasets. To avoid spurious clustering of non-coding regions (for example, small RNAs (sRNAs)), amino acid sequences present in the DNA and RNA reads (those with significant matches to the nr database) were identified using the ab initio gene-finding program Metagene . (Comparisons of amino acid sequences determined from Metagene and sequences determined from BLAST HSP regions produced nearly identical results.) Sequences derived from DNA and RNA reads were pooled and clustered using the program CD-HIT  applying a low sequence identity threshold (55%) over the aligned region. Resulting clusters contained RNA-derived sequences only, DNA-derived sequences only, or both DNA- and RNA-derived sequences. For all clusters containing DNA-derived sequences only (non-expressed) or DNA + RNA-derived (expressed) sequences, the divergence between each DNA-derived sequence and the reference (longest) sequence per cluster was calculated along the length of the aligned region. These values were averaged to estimate divergence per cluster. Clusters containing only one sequence (singletons) and clusters in which the reference sequence was the only DNA sequence were excluded from further analysis. For clusters in which the reference sequence was RNA-derived, but that contained only one DNA sequence, the percentage identity of the sole DNA sequence was recorded. Divergence per cluster was then averaged for all expressed and non-expressed sequence/gene clusters.
Amino acid usage
We compared amino acid usage between DNA and RNA datasets and DNA datasets parsed into expressed versus non-expressed gene categories as above. Specifically, for each dataset, reads matching nr proteins (see above for BLASTX descriptions) were examined to assess amino acid usage and GC content, using two approaches. First, amino acids were counted across all HSPs in alignments between sequence reads and their closest match in the nr database. Second, amino acids were counted across open reading frames that were identified using Metagene . These two methods returned nearly identical results; we therefore restrict our discussion to the Metagene data. Amino acids were then categorized based on GC content, where GC content was measured by the relative proportion of Gs and Cs among the synonymous codons for each amino acid (for example, alanine has four synonymous codons; 10 of the 12 total bases (0.83) representing these codons are either a G or C).
Proportional changes in amino acid usage, KEGG gene category and pathway abundances, and core genome representation in expressed and non-expressed gene fractions were compared using chi-square tests. Mean amino acid identities and expression ratios were compared using standard two-tailed t-tests. Significance was determined at alpha <0.05 following Bonferroni corrections for multiple comparisons.
Bermuda Atlantic Time Series
Global Ocean Sampling
horizontal gene transfer
Hawaii Ocean Time Series
high-scoring segment pair
Kyoto Encyclopedia of Genes and Genomes
National Center for Biotechnology Information non-redundant protein database
oxygen minimum zone.
We thank Rachel Barry for her tireless work in preparing samples for pyrosequencing, and Anne Pringle and Christopher Sthultz for providing the Harvard Forest soil sample. Generous support for this study came from the Gordon and Betty Moore Foundation (EFD), the National Science foundation (NSF), and a gift from the Agouron Institute (EFD). This work is a contribution of the Center for Microbial Oceanography: Research and Education (C-MORE).
- Nei M: Selectionism and neutralism in molecular evolution. Mol Biol Evol. 2005, 22: 2318-2342. 10.1093/molbev/msi242.PubMedPubMed CentralView ArticleGoogle Scholar
- Yang ZH: Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol Evol. 1996, 11: 367-372. 10.1016/0169-5347(96)10041-0.PubMedView ArticleGoogle Scholar
- Betancourt AJ, Presgraves DC: Linkage limits the power of natural selection in Drosophila. Proc Natl Acad Sci USA. 2002, 99: 13616-13620. 10.1073/pnas.212277199.PubMedPubMed CentralView ArticleGoogle Scholar
- Hirsh AE, Fraser HB: Protein dispensability and rate of evolution. Nature. 2001, 411: 1046-1049. 10.1038/35082561.PubMedView ArticleGoogle Scholar
- Flynn KM, Vohr SH, Hatcher PJ, Cooper VS: Evolutionary rates and gene dispensability associate with replication timing in the archaeon Sulfolobus islandicus. Genome Biol Evol. 2010, 2: 859-869. 10.1093/gbe/evq068.PubMedPubMed CentralView ArticleGoogle Scholar
- Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW: Evolutionary rate in the protein interaction network. Science. 2002, 296: 750-752. 10.1126/science.1068696.PubMedView ArticleGoogle Scholar
- Fraser HB, Hirsh AE: Evolutionary rate depends on number of protein-protein interactions independently of gene expression level. BMC Evol Biol. 2004, 4: 13-10.1186/1471-2148-4-13.PubMedPubMed CentralView ArticleGoogle Scholar
- Dickerson RE: The structures of cytochrome c and the rates of molecular evolution. J Mol Evol. 1971, 1: 26-45. 10.1007/BF01659392.PubMedView ArticleGoogle Scholar
- Rocha EPC, Danchin A: An analysis of determinants of amino acids substitution rates in bacterial proteins. Mol Biol Evol. 2004, 21: 108-116. 10.1093/molbev/msh004.PubMedView ArticleGoogle Scholar
- Sharp PM: Determinants of DNA sequence divergence between Escherichia coli and Salmonella typhimurium: codon usage, map position, and concerted evolution. J Mol Evol. 1991, 33: 23-33. 10.1007/BF02100192.PubMedView ArticleGoogle Scholar
- Pal C, Papp B, Lercher MJ: An integrated view of protein evolution. Nat Rev Genet. 2006, 7: 337-348. 10.1038/nrg1838.PubMedView ArticleGoogle Scholar
- Herbeck JT, Wall DP, Wernegreen JJ: Gene expression level influences amino acid usage, but not codon usage, in the tsetse fly endosymbiont Wigglesworthia. Microbiol. 2003, 149: 2585-2596. 10.1099/mic.0.26381-0.View ArticleGoogle Scholar
- Drummond DA, Wilke CO: Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution. Cell. 2008, 134: 341-352. 10.1016/j.cell.2008.05.042.PubMedPubMed CentralView ArticleGoogle Scholar
- Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH: Why highly expressed proteins evolve slowly. Proc Natl Acad Sci USA. 2005, 102: 14338-14343. 10.1073/pnas.0504070102.PubMedPubMed CentralView ArticleGoogle Scholar
- Poretsky RS, Hewson I, Sun SL, Allen AE, Zehr JP, Moran MA: Comparative day/night metatranscriptomic analysis of microbial communities in the North Pacific subtropical gyre. Environ Microbiol. 2009, 11: 1358-1375. 10.1111/j.1462-2920.2008.01863.x.PubMedView ArticleGoogle Scholar
- Hewson I, Poretsky RS, Tripp HJ, Montoya JP, Zehr JP: Spatial patterns and light-driven variation of microbial population gene expression in surface waters of the oligotrophic open ocean. Environ Microbiol. 2010, 12: 1940-1956. 10.1111/j.1462-2920.2010.02198.x.PubMedView ArticleGoogle Scholar
- Shi YM, Tyson GW, DeLong EF: Metatranscriptomics reveals unique microbial small RNAs in the ocean's water column. Nature. 2009, 459: 266-269. 10.1038/nature08055.PubMedView ArticleGoogle Scholar
- McCarren J, Becker JW, Repeta DJ, Shi YM, Young CR, Malmstrom RR, Chisholm SW, DeLong EF: Microbial community transcriptomes reveal microbes and metabolic pathways associated with dissolved organic matter turnover in the sea. Proc Natl Acad Sci USA. 2010, 107: 16420-16427. 10.1073/pnas.1010732107.PubMedPubMed CentralView ArticleGoogle Scholar
- Frias-Lopez J, Shi Y, Tyson GW, Coleman ML, Schuster SC, Chisholm SW, DeLong EF: Microbial community gene expression in ocean surface waters. Proc Natl Acad Sci USA. 2008, 105: 3805-3810. 10.1073/pnas.0708897105.PubMedPubMed CentralView ArticleGoogle Scholar
- Urich T, Lanzen A, Qi J, Huson DH, Schleper C, Schuster SC: Simultaneous assessment of soil microbial community structure and function through analysis of the meta-transcriptome. PLoS ONE. 2008, 3: e2527-10.1371/journal.pone.0002527.PubMedPubMed CentralView ArticleGoogle Scholar
- Gilbert JA, Field D, Huang Y, Edwards R, Li WZ, Gilna P, Joint I: Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS ONE. 2008, 3: e3042-10.1371/journal.pone.0003042.PubMedPubMed CentralView ArticleGoogle Scholar
- Gifford SM, Sharma S, Rinta-Kanto JM, Moran MA: Quantitative analysis of a deeply sequenced marine microbial metatranscriptome. ISME J. 2011, 5: 461-472. 10.1038/ismej.2010.141.PubMedPubMed CentralView ArticleGoogle Scholar
- Stewart FJ, Ottesen EA, DeLong EF: Development and quantitative analyses of a universal rRNA-subtraction protocol for microbial metatranscriptomics. ISME J. 2010, 4: 896-907. 10.1038/ismej.2010.18.PubMedView ArticleGoogle Scholar
- Drummond DA, Raval A, Wilke CO: A single determinant dominates the rate of yeast protein evolution. Mol Biol Evol. 2006, 23: 327-337. 10.1093/molbev/msj038.PubMedView ArticleGoogle Scholar
- Jordan IK, Rogozin IB, Wolf YI, Koonin EV: Microevolutionary genomics of bacteria. Theor Popul Biol. 2002, 61: 435-447. 10.1006/tpbi.2002.1588.PubMedView ArticleGoogle Scholar
- Decottignies A, Sanchez-Perez I, Nurse P: Schizosaccharomyces pombe essential genes: a pilot study. Genome Res. 2003, 13: 399-406. 10.1101/gr.636103.PubMedPubMed CentralView ArticleGoogle Scholar
- Mata J, Bahler J: Correlations between gene expression and gene conservation in fission yeast. Genome Res. 2003, 13: 2686-2690. 10.1101/gr.1420903.PubMedPubMed CentralView ArticleGoogle Scholar
- Callister SJ, Mccue LA, Turse JE, Monroe ME, Auberry KJ, Smith RD, Adkins JN, Lipton MS: Comparative bacterial proteomics: analysis of the core genome concept. PLoS ONE. 2008, 3: e1542-10.1371/journal.pone.0001542.PubMedPubMed CentralView ArticleGoogle Scholar
- Ochman H, Lawrence JG, Groisman EA: Lateral gene transfer and the nature of bacterial innovation. Nature. 2000, 405: 299-304. 10.1038/35012500.PubMedView ArticleGoogle Scholar
- Coleman ML, Sullivan MB, Martiny AC, Steglich C, Barry K, DeLong EF, Chisholm SW: Genomic islands and the ecology and evolution of Prochlorococcus. Science. 2006, 311: 1768-1770. 10.1126/science.1122050.PubMedView ArticleGoogle Scholar
- Nicolas P, Bessieres P, Ehrlich SD, Maguin E, van de Guchte M: Extensive horizontal transfer of core genome genes between two Lactobacillus species found in the gastrointestinal tract. BMC Evol Biol. 2007, 7: 141-10.1186/1471-2148-7-141.PubMedPubMed CentralView ArticleGoogle Scholar
- Taniguchi Y, Choi PJ, Li GW, Chen HY, Babu M, Hearn J, Emili A, Xie XS: Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science. 2010, 329: 533-538. 10.1126/science.1188308.PubMedPubMed CentralView ArticleGoogle Scholar
- Dotsch A, Klawonn F, Jarek M, Scharfe M, Blocker H, Haussler S: Evolutionary conservation of essential and highly expressed genes in Pseudomonas aeruginosa. BMC Genomics. 2010, 11: 234-10.1186/1471-2164-11-234.PubMedPubMed CentralView ArticleGoogle Scholar
- Jordan IK, Rogozin IB, Wolf YI, Koonin EV: Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 2002, 12: 962-968.PubMedPubMed CentralView ArticleGoogle Scholar
- Stewart FJ, Ulloa O, DeLong EF: Microbial metatranscriptomics in a permanent marine oxygen minimum zone. Environ Microbiol. 2011,Google Scholar
- Oliver SG: From DNA sequence to biological function. Nature. 1996, 379: 597-600. 10.1038/379597a0.PubMedView ArticleGoogle Scholar
- Daubin V, Ochman H: Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli. Genome Res. 2004, 14: 1036-1042. 10.1101/gr.2231904.PubMedPubMed CentralView ArticleGoogle Scholar
- Palacios C, Wernegreen JJ: A strong effect of AT mutational bias on amino acid usage in Buchnera is mitigated at high-expression genes. Mol Biol Evol. 2002, 19: 1575-1584.PubMedView ArticleGoogle Scholar
- Banerjee T, Ghosh TC: Gene expression level shapes the amino acid usages in Prochlorococcus marinus MED4. J Biomol Struct Dyn. 2006, 23: 547-553.PubMedView ArticleGoogle Scholar
- Akashi H, Gojobori T: Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis. Proc Natl Acad Sci USA. 2002, 99: 3695-3700. 10.1073/pnas.062526999.PubMedPubMed CentralView ArticleGoogle Scholar
- Lind PA, Andersson DI: Whole-genome mutational biases in bacteria. Proc Natl Acad Sci USA. 2008, 105: 17878-17883. 10.1073/pnas.0804445105.PubMedPubMed CentralView ArticleGoogle Scholar
- Hershberg R, Petrov DA: Evidence that mutation is universally biased towards AT in bacteria. PLoS Genet. 2010, 6: e1001115-10.1371/journal.pgen.1001115.PubMedPubMed CentralView ArticleGoogle Scholar
- Plata G, Gottesman ME, Vitkup D: The rate of the molecular clock and the cost of gratuitous protein synthesis. Genome Biol. 2010, 11: R98-10.1186/gb-2010-11-9-r98.PubMedPubMed CentralView ArticleGoogle Scholar
- Peplinski K, Ehrenreich A, Doring C, Bomeke M, Reinecke F, Hutmacher C, Steinbuchel A: Genome-wide transcriptome analyses of the 'Knallgas' bacterium Ralstonia eutropha H16 with regard to polyhydroxyalkanoate metabolism. Microbiology. 2010, 156: 2136-2152. 10.1099/mic.0.038380-0.PubMedView ArticleGoogle Scholar
- Chaussee MA, Dmitriev AV, Callegari EA, Chaussee MS: Growth phase-associated changes in the transcriptome and proteome of Streptococcus pyogenes. Arch Microbiol. 2008, 189: 27-41. 10.1007/s00203-007-0290-1.PubMedView ArticleGoogle Scholar
- Waite RD, Papakonstantinopoulou A, Littler E, Curtis MA: Transcriptome analysis of Pseudomonas aeruginosa growth: comparison of gene expression in planktonic cultures and developing and mature biofilms. J Bacteriol. 2005, 187: 6571-6576. 10.1128/JB.187.18.6571-6576.2005.PubMedPubMed CentralView ArticleGoogle Scholar
- Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig WG, Peplies J, Glockner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007, 35: 7188-7196. 10.1093/nar/gkm864.PubMedPubMed CentralView ArticleGoogle Scholar
- Gomez-Alvarez V, Teal TK, Schmidt TM: Systematic artifacts in metagenomes from complex microbial communities. ISME J. 2009, 3: 1314-1317. 10.1038/ismej.2009.72.PubMedView ArticleGoogle Scholar
- Li WZ, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22: 1658-1659. 10.1093/bioinformatics/btl158.PubMedView ArticleGoogle Scholar
- Kettler GC, Martiny AC, Huang K, Zucker J, Coleman ML, Rodrigue S, Chen F, Lapidus A, Ferriera S, Johnson J, Steglich C, Church GM, Richardson P, Chisholm SW: Patterns and implications of gene gain and loss in the evolution of Prochlorococcus. PLoS Genet. 2007, 3: e231-10.1371/journal.pgen.0030231.PubMedPubMed CentralView ArticleGoogle Scholar
- Noguchi H, Park J, Takagi T: MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 2006, 34: 5623-5630. 10.1093/nar/gkl723.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.