Community-wide analysis of microbial genome sequence signatures
© Dick et al.; licensee BioMed Central Ltd. 2009
Received: 29 April 2009
Accepted: 21 August 2009
Published: 21 August 2009
Analyses of DNA sequences from cultivated microorganisms have revealed genome-wide, taxa-specific nucleotide compositional characteristics, referred to as genome signatures. These signatures have far-reaching implications for understanding genome evolution and potential application in classification of metagenomic sequence fragments. However, little is known regarding the distribution of genome signatures in natural microbial communities or the extent to which environmental factors shape them.
We analyzed metagenomic sequence data from two acidophilic biofilm communities, including composite genomes reconstructed for nine archaea, three bacteria, and numerous associated viruses, as well as thousands of unassigned fragments from strain variants and low-abundance organisms. Genome signatures, in the form of tetranucleotide frequencies analyzed by emergent self-organizing maps, segregated sequences from all known populations sharing < 50 to 60% average amino acid identity and revealed previously unknown genomic clusters corresponding to low-abundance organisms and a putative plasmid. Signatures were pervasive genome-wide. Clusters were resolved because intra-genome differences resulting from translational selection or protein adaptation to the intracellular (pH ~5) versus extracellular (pH ~1) environment were small relative to inter-genome differences. We found that these genome signatures stem from multiple influences but are primarily manifested through codon composition, which we propose is the result of genome-specific mutational biases.
An important conclusion is that shared environmental pressures and interactions among coevolving organisms do not obscure genome signatures in acid mine drainage communities. Thus, genome signatures can be used to assign sequence fragments to populations, an essential prerequisite if metagenomics is to provide ecological and biochemical insights into the functioning of microbial communities.
The age of genomics has opened up new perspectives on the natural microbial world, offering insights into organisms that drive geochemical cycles and are critical to human and environmental health. The prevalence of horizontal gene transfer, recombination, and population-level genomic diversity underscores the dynamic nature of bacterial and archaeal genomes and demands reconsideration of fundamental issues such as microbial taxonomy [1, 2] and the concept of microbial species [3, 4]. Application of genomics to uncultivated assemblages of microorganisms in natural environments ('metagenomics' or 'community genomics') has provided a new window into in situ microbial diversity and function [5–7]. To date, community genomics has revealed the form and extent of recombination and heterogeneity in gene content [8–11], elucidated virus-host interactions, redefined the extent of genetic and biochemical diversity in the oceans [13–15], uncovered new metabolic capabilities [16–19] and taxonomic groups, and shown how functions are distributed across environmental gradients.
An important approach to study evolutionary and ecological processes, pioneered by Karlin and others, is the analysis of nucleotide compositional characteristics of genomes. The simplest and most widely used measure of nucleotide composition, the abundance of guanine plus cytosine (%GC), is shaped by multiple factors encompassing both neutral and selective processes. Neutral factors include intrinsic properties of the replication, repair, and recombination machinery that result in mutational biases [23, 24]. Selective processes encompass both internal (for example, translation machinery) and external influences such as physical (temperature, pressure), chemical (salinity, pH) and ecological factors (competition for metabolic resources and niche complexity). Although the relative importance of these factors remains uncertain, it is clear that %GC varies widely between species but is relatively constant within species. Thus, %GC has been used to trace origins of DNA fragments within genomes and to assign fragmentary metagenomic sequences to candidate organisms. Such inferences must be made with caution: %GC simplifies nucleotide composition down to a single parameter with known limitations for investigating genome dynamics.
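To make the single-parameter nature of %GC concrete, the following Python sketch computes %GC for a whole sequence and for windows along it. The function names, window size, and step are assumptions chosen for illustration and are not taken from the methods of this study.

```python
def gc_percent(seq: str) -> float:
    """Return %GC, counting only unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total


def windowed_gc(seq: str, window: int, step: int) -> list:
    """Return %GC for each full-length window along the sequence."""
    return [
        gc_percent(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    toy = "GGGGAAAA"  # hypothetical sequence, for illustration only
    print(gc_percent(toy))
    print(windowed_gc(toy, window=4, step=4))
```

A windowed profile of this kind can flag regions of atypical composition, but two genomes with the same %GC remain indistinguishable by this measure alone.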
Oligonucleotide frequencies capture species-specific characteristics of nucleotide composition more effectively than %GC. Analyses of genome sequences from cultivated organisms have shown that the frequency at which oligonucleotides occur is unique between species while being conserved genome-wide within species [22, 30–34]. Taken together, the frequency of all oligonucleotides of a given length defines the 'genome signature' (for example, the frequency of all possible 256 tetranucleotides). Sequence signatures are evident in oligonucleotides ranging from di- (two-mers) to octanucleotides (eight-mers). While the specificity of genome signatures increases with oligonucleotide length, the number of possible oligomers increases exponentially with oligomer length, so signatures based on longer oligomers require calculations over larger genomic regions to achieve sufficient sampling. Genome signatures have been used to detect horizontally transferred DNA [36–39], reconstruct phylogenetic relationships [22, 32, 40] and infer lifestyles of bacteriophage [41, 42].
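The definition of a tetranucleotide genome signature can be sketched in a few lines of Python. This is a minimal illustration that counts overlapping tetranucleotides on one strand and skips windows containing ambiguous bases; the function names are hypothetical, and the normalization used in this study may differ.

```python
from collections import Counter
from itertools import product

# All 4^4 = 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def oligomer_count(k: int) -> int:
    """Number of distinct oligonucleotides of length k."""
    return 4 ** k


def tetranucleotide_frequencies(seq: str) -> list:
    """Return relative frequencies of the 256 tetranucleotides in seq."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if set(kmer) <= set("ACGT"):  # skip windows with N or other codes
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous tetranucleotides found")
    return [counts[t] / total for t in TETRAMERS]


if __name__ == "__main__":
    for k in (2, 4, 8):
        print(k, oligomer_count(k))
    freqs = tetranucleotide_frequencies("ACGTACGT")  # toy sequence
    print(freqs[TETRAMERS.index("ACGT")])
```

The loop over `oligomer_count` shows why longer oligomers need more sequence: an eight-mer signature has 65,536 dimensions to populate, compared with 256 for tetranucleotides.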
Genome signatures also offer a compelling means of assigning metagenomic sequence fragments to microbial taxa, a procedure termed 'binning'. This is a prerequisite for realizing some of the most valuable opportunities random shotgun metagenomics offers, including assignment of ecological and biogeochemical functions to particular community members and assessment of population-level genomic diversity and community structure. However, binning is a formidable challenge because: the inherent diversity of microbial communities typically limits genomic assembly, resulting in highly fragmentary data; there are few universally conserved phylogenetically informative markers, leaving the vast majority of metagenomic sequence fragments 'anonymous' with regard to their organism of origin; and current sequence databases grossly under-represent the microbial diversity in the natural world, limiting the utility of fragment recruitment or BLAST-based methods [13, 44, 45]. Consequently, it is important to develop methods that classify all genome sequence fragments independently of reference databases.
Genome signatures are a promising approach for sequence classification. However, it is important to understand the source of the signal and how environmental effects and evolutionary distance will compromise it. To date, sequence signatures have been explored using genomes from cultivated microbes [22, 30–34], and prospects for binning have been evaluated based largely on simulated datasets consisting of mixtures of isolate genomes [44, 46–48]. Although these studies are indispensable in that they allow theoretical evaluation of binning capability, they do not represent the diversity (community-wide and within population) and dynamics (for example, horizontal gene transfer, recombination, viruses) of real microbial communities. Further, they employ genomes derived from disparate environments and so do not address the extent to which environmental factors shape genome signatures. It has been reported that environment shapes nucleotide composition [26, 49–51]. If so, then genome signatures may not discriminate coexisting, coevolving organisms, especially where environmental pressures are extreme. On the other hand, binning results of real microbial communities [46, 48, 52] are inherently difficult to evaluate because the true identity of most sequence fragments is unknown. Thus, there remain fundamental questions regarding the forces and processes that give rise to and maintain genome signatures, and the extent to which these signatures are obscured by shared environmental pressures and community interactions such as horizontal gene transfer and broad host range viruses.
Here we present a comprehensive analysis of genome signatures in sequences derived from natural biofilms inhabiting a subsurface chemolithoautotrophic acid mine drainage (AMD) ecosystem in the Richmond Mine at Iron Mountain, CA. The biofilms are dominated by just a handful of organisms that are sustained primarily by the oxidation of Fe(II) derived from pyrite (FeS₂) dissolution. Due to this relatively low diversity, modest levels of shotgun sequencing (approximately 100 Mb per sample) have yielded deep genomic sampling (10 to 20× sequence coverage) of the dominant populations, enabling reconstruction of 12 near-complete genomes from three samples [16, 55, 56] (BJ Baker et al., submitted). These assembled composite genomes provide the organism affiliation of sequences with which binning accuracy can be evaluated. Therefore, the dataset allows assessment of binning performance while capturing sequence heterogeneity that is an intrinsic feature of natural microbial populations. We find that AMD biofilm microorganisms are indeed distinguished by population-specific genome signatures and show that sequence signatures can be used to identify and cluster sequences from low-abundance community members de novo, without reference genomes or reliance on databases. Our results have implications for metagenomic binning and provide new insights into the sources of genome signatures that distinguish coexisting populations.
Description of samples, community genomic sequencing and assembly
Deeply sampled composite genomes from Iron Mountain community genomic datasets used in binning analysis
UBA, UBA BS
UBA, UBA BS
UBA, UBA BS, UBA filtrate
Leptospirillum group II†
Leptospirillum group II‡
Leptospirillum group III†
Ferroplasma acidarmanus fer1†
UBA, UBA BS
Baker et al., submitted
Baker et al., submitted
Baker et al., submitted
UBA, UBA BS
Clustering sequences by tetranucleotide frequency and emergent self-organizing map
To quantitatively evaluate binning performance on sequence fragments of different lengths, tetra-ESOMs were run on the same dataset (including unassigned sequences and reconstructed composite genomes) but with sequences broken into various fragment sizes. Binning accuracy was calculated for a subset of genomes for which deeply sampled and manually curated assemblies are available (Additional data file 2). For sequence fragments 5 kb or larger, sensitivity (percentage of fragments from each genome correctly identified) and precision (percentage of fragments in each bin belonging to the correct genome) rates of > 90% were achieved (Additional data file 2). Sensitivity was somewhat lower for Leptospirillum groups II and III due to poor resolution of certain genomic regions between these two populations. When Leptospirillum was considered as a single group, binning sensitivity was comparable to the other reference genomes. Sensitivity decreased notably only when shorter (< 5 kb) sequence fragments were analyzed, but precision remained remarkably high even for 1,400-bp fragments (Additional data file 2). Lower sensitivity is due to sequence fragments that fall between clusters, beyond the borders of any bin. Notably, the tetra-ESOM correctly assigned sequence fragments as short as 500 bp, provided that some larger fragments were included in the analysis (Additional data file 2b). To address the question of how genome completeness influences performance, genomes randomly subsampled at different levels were analyzed by tetra-ESOM. Binning accuracy was maintained even when only 20% of the genome sequence was sampled; only at 10% subsampling was a notable decline observed, and even then only for certain genomes (Additional data file 3).
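The two accuracy measures defined above can be written out explicitly. The following Python sketch is an illustration under stated assumptions: each fragment carries a known genome of origin and a bin assignment, bins are named after the genome they represent, and fragments that fall outside every bin are recorded as `None`. The function name and toy labels are hypothetical.

```python
from collections import Counter


def binning_accuracy(true_genomes, assigned_bins):
    """Return {genome: (sensitivity %, precision %)}.

    sensitivity: fragments of a genome placed in the correct bin,
                 as a percentage of all fragments from that genome.
    precision:   fragments in a bin that belong to the correct genome,
                 as a percentage of all fragments in that bin.
    """
    per_genome = Counter(true_genomes)
    per_bin = Counter(b for b in assigned_bins if b is not None)
    correct = Counter(
        t for t, b in zip(true_genomes, assigned_bins) if t == b
    )
    result = {}
    for genome, n_true in per_genome.items():
        sensitivity = 100.0 * correct[genome] / n_true
        n_bin = per_bin[genome]
        precision = 100.0 * correct[genome] / n_bin if n_bin else float("nan")
        result[genome] = (sensitivity, precision)
    return result


if __name__ == "__main__":
    # Toy data: one fragment of A is unbinned, one fragment of B lands in A.
    truth = ["A", "A", "A", "A", "B", "B"]
    bins = ["A", "A", "A", None, "B", "A"]
    print(binning_accuracy(truth, bins))
```

In this formulation an unbinned fragment lowers sensitivity for its genome without affecting the precision of any bin, which is the pattern described above for short fragments that fall between clusters.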
Incorrectly assigned fragments often contained mobile elements or other features expected to have atypical nucleotide composition. The majority (54 of 94) of incorrectly binned fragments from all five reference genomes show evidence of transposons, prophage, or integrated plasmids. Other frequently unresolved genomic regions contain CRISPR elements and rRNA genes, both of which have constrained sequences and thus atypical tetranucleotide patterns. The region of the ESOM map containing a mixture of Leptospirillum groups II and III (Figure 3) was dominated by fragments (80 of 92) encoding mobile elements that may be exchangeable between the two Leptospirillum groups (for example, integrated plasmid-like sequence) and strain/group-unique regions believed to have been recently acquired (for example, prophage).
Interestingly, many strain-unique regions were correctly binned with their host genomes. There are 197 strain-unique genes between the fer1 and fer1(env) genomes, the majority of which occur in distinct genomic blocks of up to 24 genes with atypical %GC content inferred to be the result of prophage insertion. Ninety-six percent (22 of 23) of sequence fragments containing these genomic islands were accurately assigned as Ferroplasma in our binning analysis.
Genome signatures of low-abundance community members and viruses
The tetra-ESOM revealed large regions of the map that were devoid of sequence fragments of known organism affiliation (Figure 3, regions 11 to 17). We used mate pair linkage with rRNA gene-containing contigs, phylogenetic analysis, and/or close relatedness (synteny and identity) to other community members to identify these bins as follows: a new type of Leptospirillum most closely related to Leptospirillum ferrodiazotrophum (group III); several members of the Thermoplasmatales for which genomic sequence had not been previously obtained (C-plasma, D-plasma, and a divergent type of A-plasma); several Actinobacteria; and multiple more shallowly sampled populations, including a gammaproteobacterium and several Sulfobacillus-like organisms (Figures 2 and 3). A small, prominent region of the map adjacent to the Leptospirillum groups contained approximately 250 kb of composite sequence (Figure 3, region 11) inferred to be a Leptospirillum plasmid. Tetranucleotide usage patterns of this putative plasmid are quite distinct from those of either Leptospirillum group (Additional data file 4).
We calculated tetranucleotide frequencies for viral genomes that were recently reconstructed from the same genomic datasets and linked to their hosts via CRISPR viral resistance system sequences (Additional data file 4). Three of the viruses closely resemble their hosts' tetranucleotide usage (AMDV1, Leptospirillum groups II and III; AMDV4, E-plasma; AMDV3, A-/E-/G-plasma), a trend that has been observed previously for cultivated viruses and hosts [41, 63]. Interestingly, two viruses have very different tetranucleotide frequency patterns (AMDV2, E-plasma; AMDV5, I-plasma; Additional data file 4).
Characteristics of genome signatures
To assess the contributions of these potential sources of genome signature signal, we compared SOMs based on amino acid composition, codon composition, and tetranucleotide frequency. Amino acid composition alone distinguished certain genomes (Additional data file 5). This was especially true for phylogenetically distant organisms (for example, archaea versus bacteria), but some separation was also apparent among groups within some lineages such as Ferroplasma versus other Thermoplasmatales. SOMs based on codon composition were notably more accurate than amino acid composition and comparable to those based on tetranucleotide frequency (Additional data file 5).
The correlation of genome signatures with codon usage raises the question of whether they persist in intergenic regions. Thus, we extracted intergenic regions from assembled and annotated genomes and analyzed them with coding regions by tetra-ESOM (intergenic regions were concatenated to tally tetranucleotide frequencies but care was taken to avoid artifacts; see Materials and methods). Intergenic regions from each genome formed discrete, cohesive clusters that mapped adjacent to coding regions from the same genome but were separated by U-Matrix boundaries (Additional data file 7). Intergenic sequences from each genome were grouped based on length, concatenated, and analyzed by ESOM; all size classes of intergenic regions from the same genome clustered together regardless of length, from the shortest (4 to 20 bp) to longest (> 1,000 bp) (data not shown). The noncoding complement of each Thermoplasmatales genome formed a distinct cluster adjacent to noncoding regions of the other Thermoplasmatales. The only outlier to this trend was A-plasma, which has the highest %GC among these organisms. Based on U-Matrix background, the distance between noncoding sequences of different genomes is comparable to the distance between noncoding and coding sequences of the same genome. To determine if the presence of noncoding sequence influences binning accuracy in the initial experiments, we calculated the percentage of coding sequence on incorrectly binned fragments from the five reference genomes (5 kb and 1 kb window sizes). For many genomes, the incorrectly binned fragments do indeed have a smaller average percentage of coding sequence. However, this percentage varied widely on incorrectly binned fragments. Only a small fraction of such fragments had a percentage of coding sequence smaller than one standard deviation below the genome-wide average (Additional data file 8).
For sequence signatures to differentiate populations in a genome-wide manner, it is necessary that within-genome differences resulting from atypical regions of amino acid and/or synonymous codon usage are smaller than between-genome differences. This issue is especially relevant in AMD, where proteins are under diverse constraints depending on whether they function in the extracellular (around pH 1) or intracellular (around pH 5) environment. Indeed, proteins from the AMD populations in these two fractions have disparate isoelectric points owing to the unique amino acid composition of acid-stable proteins. We identified 106 Leptospirillum group II-UBA proteins that are consistently enriched in the extracellular fraction according to environmental shotgun proteomics data [55, 66] and compared sequence signatures of their genes with the other 2,522 Leptospirillum group II genes. No systematic differences were detected via tetra-ESOM, suggesting that genome signatures persist even when gene sequences are influenced by considerable protein-coding constraints (Additional data file 9).
Selection for codons that optimize translation rate may also influence codon usage. We analyzed genome signatures for the 50 Leptospirillum group II proteins most abundantly detected via environmental shotgun proteomics [55, 66]. With the exception of one subset of genes encoding mainly ribosomal proteins (which mapped into the mixed region between Leptospirillum groups II and III), highly expressed genes clustered with the rest of the genome (Additional data file 9).
Through analysis of a deeply sampled and extensively curated community genomic dataset, we have demonstrated that genome signatures can be used to differentiate coexisting microbial populations despite functional and environmental constraints, processes such as lateral gene transfer, and pressures imposed by viral predation that might have diminished them to the point that they are no longer diagnostic. The genome-wide nature of the signatures makes them potentially useful for classification of sequence fragments. Results from our AMD dataset show that the signal can be detected on fragments as small as 500 bp, and that genome clusters can be defined using fragments as short as 1,400 bp (Additional data file 2) and from only a small fraction of the genome (Additional data file 3). These findings suggest broad applicability of the tetra-ESOM approach for metagenomic studies. However, in order to understand and predict its utility for binning, it is important to identify sources of genome signatures as well as processes that are likely to diminish the signal.
Insights into the sources of distinctive genome signatures
It has been suggested that environmental constraints strongly shape nucleotide composition [26, 49–51]. If this were the case, two effects should be apparent in genome signatures of AMD populations. First, shared pressures deriving from the extreme AMD environment would drive genome signatures together, potentially obscuring differences between populations. Second, since each genome encodes proteins destined for diverse environments (that is, intracellular and extracellular), there should be prominent intra-genome variation of genome signature and scattering of fragments from the same genome into disparate regions of the SOM. Neither of these expectations is met in the AMD dataset. There are vast differences in nucleotide composition between populations, with genomic %GC ranging from 35% (ARMAN-4 and ARMAN-5) to 69% (low-abundance Actinobacteria) and genome signatures forming discrete clusters. Amino acid compositional constraints required for stability of proteins exposed to acidic solutions do not result in sequence signatures that are markedly distinct from the rest of the genome. In other words, within-population differences in genome signature are small relative to differences between populations. Although we do not rule out some environmental influence on genome signatures, we conclude that, in AMD, this influence is not strong enough to obscure differences between populations. Similar community-wide analyses need to be conducted in other systems to determine whether our findings extend to other extremophilic microbial communities.
Our results show that genome signatures are related to several traits, including %GC, amino acid composition, synonymous codon usage, and palindrome avoidance. These characteristics are interrelated and further connected to a host of biochemical, ecological, and evolutionary processes (Additional data file 10). Large differences in %GC and/or amino acid composition guarantee distinctive genome signatures but are not required to differentiate genomes. At finer evolutionary scales, where %GC and amino acid composition are not informative, populations can be readily distinguished through subtle differences in tetranucleotide frequency, which correlate with genome-specific synonymous codon usage. Tetra-ESOM analyses based on codon usage and tetranucleotide frequency displayed similar clustering resolution, indicating that little signal derives from longer-range characteristics such as codon pair bias. It should be noted, however, that using tetranucleotide frequency rather than codon composition has practical advantages for binning because it is independent of coding strand and reading frame and thus insensitive to errors in gene-calling or frame shifts due to poor quality sequence. These issues are particularly important for short, low-coverage sequence fragments.
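The practical advantage noted above, independence from coding strand and reading frame, can be illustrated with a short Python sketch. One common way to make strand independence explicit is to pool each tetranucleotide with its reverse complement, which reduces the 256 tetranucleotides to 136 strand-symmetric classes. Whether this pooling matches the exact procedure used in this study is an assumption; the function names are hypothetical.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Represent a k-mer and its reverse complement by one label."""
    return min(kmer, revcomp(kmer))


# 136 strand-symmetric tetranucleotide classes (120 pairs + 16 palindromes).
CANONICAL = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})


def strand_symmetric_profile(seq: str) -> dict:
    """Tetranucleotide frequencies with reverse complements pooled.

    Every overlapping window is counted, so no reading frame is needed.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous tetranucleotides found")
    return {k: counts[k] / total for k in CANONICAL}


if __name__ == "__main__":
    fragment = "ATGGCGTACGTTAGC"  # toy fragment, for illustration only
    same = strand_symmetric_profile(fragment) == strand_symmetric_profile(
        revcomp(fragment)
    )
    print(len(CANONICAL), same)
```

Because the profile is identical for a fragment and its reverse complement, no gene calls or strand assignments are needed before binning, in contrast to a codon-based profile.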
Although genome signatures are largely manifested through codon composition, the observation that population-specific signatures also occur in non-coding regions (Additional data file 7) suggests a mechanism of generation that is independent of protein coding. We hypothesize this underlying process is mutational bias associated with DNA replication and repair, which exerts directional pressure on nucleotide composition. In fact, between-genome codon biases can be predicted solely by %GC and context-dependent nucleotide biases (that is, mutation rates at each site are dependent on the identity of neighboring nucleotides) calculated from non-coding regions [67, 68]. It is interesting to note that non-coding regions mapped into discrete clusters, distinct from coding regions of the same genome or non-coding regions of different genomes, including those with identical %GC. Differences in genome signature of coding and non-coding sequences from the same genome are to be expected based on differing functional constraints on these regions (for example, coding amino acids versus small RNAs or regulatory elements such as promoters). The distinction of non-coding regions from different genomes is consistent with genome-specific mutational biases.
An alternative to the mutation bias hypothesis, at least for coding sequences, is that genome signatures are shaped by factors related to translation. Changes in codon usage can be driven by changes in the tRNA gene complement [69, 70] that may occur, for example, through interaction with plasmids and viruses. However, we found AMD genomes with distinct genome signatures, such as G-plasma, E-plasma, and Ferroplasma, that have only minor differences in tRNA gene content, and these differences do not correspond to observed differences in codon usage. In addition to tRNA gene complement, there may be changes in tRNA gene regulation, which can significantly impact cellular tRNA concentrations and have been correlated with changes in codon usage. Thus, although we cannot rule out a tRNA regulatory influence on genome signatures, our findings suggest that coevolution of tRNA gene content and codon usage is not a primary mechanism underlying the divergence of genome signatures in related AMD populations.
Codon bias can also arise as the result of selection for certain codons that are optimal for fast and/or accurate translation. This form of codon bias primarily influences the subset of genes encoding highly expressed proteins, is prevalent for fast-growing organisms [69, 74], and correlates with ecological strategy. In fact, a Leptospirillum group II genome fragment encoding nine ribosomal proteins and two translation elongation factors had distinctive tetranucleotide composition, indicating that this mode of codon bias occurs in AMD organisms. However, as commonly construed, translational selection would influence within-genome codon bias, not the genome-wide codon biases that differentiate populations as observed in our study. It is tempting to speculate that differences in ecological strategy (for example, response rate to resource availability) could have genome-wide influence on codon usage, but there is currently no evidence in our dataset to suggest that this is the case.
Finally, restriction avoidance places another selective genome-wide constraint on DNA composition that may contribute to genome signatures. Under-representation of palindromic tetranucleotides (Figure 6) has been attributed to avoidance of enzymes designed to recognize and degrade foreign DNA [22, 32, 46]. Our data show that palindrome avoidance contributes to the genome signature but is not the sole or even primary determinant. Most archaeal viruses and bacteriophage have sequence signatures that resemble their hosts, including avoidance of specific subsets of palindromes. However, mismatches between the tetranucleotide signatures of AMDV2 and AMDV5 and their respective hosts point to the lesser importance of palindrome avoidance in these organisms. In the case of AMDV5, other evidence suggests a recent alteration in host range. It is interesting to note that the genomes of archaeal AMD viruses encode several restriction modification (RM) system genes. These may have significance for virus-host interactions and for influencing genome signatures. Broad host range viruses or viruses that jump to new hosts can potentially drive changes in the host sequence signatures if they replace or supplement the restriction systems of the host. Alternatively, the degree of similarity in tetranucleotide signatures of viruses and their hosts may be a function of the extent to which the virus relies upon its host's replication and translation machinery (for example, associated with a lysogenic versus lytic lifestyle) [41, 42, 63].
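For readers unfamiliar with the term, a palindromic tetranucleotide is one that equals its own reverse complement (for example, GATC). The Python sketch below enumerates the 16 such tetranucleotides and computes a simple observed/expected ratio for any one of them. The expectation used here, based only on single-base frequencies, is an assumption chosen for simplicity and is not necessarily the statistic underlying Figure 6; the function names are hypothetical, and the input is assumed to contain only unambiguous bases.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


# Tetranucleotides identical to their own reverse complement.
PALINDROMES = sorted(
    kmer
    for kmer in ("".join(p) for p in product("ACGT", repeat=4))
    if kmer == reverse_complement(kmer)
)


def observed_over_expected(seq: str, kmer: str) -> float:
    """Ratio of observed k-mer count to the count expected from base
    composition alone (zero-order model; assumes only A, C, G, T)."""
    seq = seq.upper()
    n_windows = len(seq) - len(kmer) + 1
    observed = sum(
        1 for i in range(n_windows) if seq[i:i + len(kmer)] == kmer
    )
    expected = float(n_windows)
    for base in kmer:
        expected *= seq.count(base) / len(seq)
    return observed / expected if expected else float("nan")


if __name__ == "__main__":
    print(len(PALINDROMES), PALINDROMES)
    toy = "ACGTACGT"  # toy sequence, for illustration only
    print(observed_over_expected(toy, "ACGT"))
```

Ratios well below 1 for a given palindrome across a genome would be consistent with avoidance of that site, although a zero-order expectation does not separate restriction avoidance from other context-dependent biases.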
Implications for metagenomic, ecological, and evolutionary studies
Due to the high levels of diversity in most natural systems, random sequencing approaches yield fragmentary data, often comprising genomic sequences no more than a few kilobases in length. While more comprehensive coverage of individual organisms can be achieved by single cell genomics [78–80] or targeted, large-insert approaches [81, 82], random shotgun approaches retain two important advantages: the random nature provides insights that are unbiased by preconceived notions of community composition; and population-level variation is captured because each sequencing read derives from a different individual cell.
A key challenge for virtually all shotgun metagenomics investigations is the assignment of genome fragments to the organism they derive from. This step links organism to metabolism and function and is essential if we are to understand microbial community dynamics and predict ecosystem level impacts of changes in community membership and structure. Binning is particularly challenging for lower-abundance organisms, which may play keystone roles that are critical to ecosystem function. Thus, our finding that tetra-ESOM can resolve the phylogenetic affiliation of genome fragments on the scale of two mate-paired reads is of great significance. This approach has clear applicability to low-complexity datasets such as those derived from our AMD biofilms, bioreactors, and enrichment cultures. In fact, even for the relatively extensively analyzed AMD dataset, it revealed multiple new genomic clusters, including a near complete genome of a novel actinobacterium (GJ Dick et al., in preparation), a putative plasmid, and many discrete but less well-sampled populations.
Tetra-ESOM may also provide a powerful method for analysis of unassembled data from complex samples such as soil, seawater, and the human microbiome if representative isolate genomes are available. The feasibility of binning metagenomic sequences from complex samples using reference genomes will increase with current initiatives to fill in the phylogenetic tree with genome sequences from cultivated microorganisms.
An important advantage of unsupervised, compositional-based approaches such as tetra-ESOM is that gene sequences need not be represented in databases to be identified; only representation of the genome signature is required. This is in contrast to fragment recruitment and BLAST-based binning approaches that only work for homologous sequences. We found that clusters of a few hundred kilobases of sequence (as little as 20% of the genome) were resolved, suggesting that a few fosmids or bacterial artificial chromosomes linked to 16S rRNA genes can be sufficient to serve as a reference to define a bin. Thus, recent progress in using large-insert metagenomic libraries to link 16S rRNA genes to genomic sequence from diverse uncultivated microorganisms is very valuable in this regard.
Because the reach of composition-based approaches to binning extends beyond gene content of reference genomes, they hold great promise for identifying and classifying genes from the variable fraction of the pan-genome (present in only a subset of strains or species), an important determinant of pathogenicity and niche differentiation [86–88]. In AMD populations, genome reconstruction has shown that this strain-variable fraction often involves inserted plasmid and virus sequences [8, 9]. In the current study, these integrated elements clustered either with the host genome or in regions shared between different species or genera. Since horizontally transferred DNA is rapidly converted to the genome signature of its new host [22, 28, 89], the extent to which such genomic regions reflect the genome-wide signature of nucleotide composition is likely a function of the donor of the genetic material and how recently they were acquired. Recently acquired sequences with distinctive tetranucleotide patterns may bin incorrectly, and unexpected binning outcomes can be used to identify laterally transferred regions [62, 90].
Although the tetra-ESOM method works well to separate sequence fragments from organisms distinct at the genus or higher level, it has some limitations. Tetra-ESOM is generally unable to distinguish closely related species or strains. An important question, especially for more diverse samples, is whether limitations in genome sequence signature space will impose an inherent constraint on the number of populations that can be resolved. There are a staggering 6 × 10^222 ways to code for a typical protein in our samples (based on an average protein size of 467 amino acids and assuming an average of 3 possible ways to code for any amino acid). This richness of protein coding space suggests ample capacity for numerous genome signatures. To date, SOMs have shown promising results in resolving up to 81 complete genomes, in successfully classifying fragments of 1,502 genomes into phylogenetic groups, and in visualizing phylogenetic clustering of sequences in complex environmental samples. However, it remains difficult to assess the accuracy and phylogenetic resolution of oligonucleotide-based SOMs on metagenomic datasets from diverse natural microbial communities. Another concern is computational demand. Continued increases in processor speeds will likely need to be supplemented with more efficient and/or accurate algorithms such as the recently introduced hyperbolic SOM and growing SOM.
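The coding-space estimate in the preceding paragraph can be checked with a few lines of arithmetic. The following is only an illustrative sketch: it uses the two figures quoted in the text (467 amino acids, an average of 3 codons per amino acid), and the function name is hypothetical.

```python
import math


def coding_space_log10(n_residues: int, codons_per_residue: float) -> float:
    """log10 of the number of nucleotide sequences that encode one protein,
    assuming the same average number of synonymous codons at every residue."""
    return n_residues * math.log10(codons_per_residue)


# Figures quoted in the text: 467 amino acids, ~3 codons per amino acid.
log10_ways = coding_space_log10(467, 3)
exponent = math.floor(log10_ways)           # power of ten
mantissa = 10 ** (log10_ways - exponent)    # leading factor, between 1 and 10
print(f"about {mantissa:.1f} x 10^{exponent} possible coding sequences")
```

Working in log space avoids formatting a 223-digit integer, although Python could also evaluate `3 ** 467` exactly.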
Bacterial, archaeal, and viral populations in the AMD biofilm community have genome-wide signatures of nucleotide composition that are effectively captured and visualized through self-organizing maps of tetranucleotide frequency. We conclude that even under extremely acidic conditions, shared environmental pressure does not obscure genome signatures of nucleotide composition. Our data point to pervasive mechanisms of generating and maintaining genome signatures; although a variety of factors and processes contribute, we propose that mutational bias is the primary underlying mechanism driving the divergence of genome signature between closely related organisms. The resulting signal, evident through synonymous codon usage, is genome-wide and sufficiently diagnostic to classify fragmentary metagenomic data from coexisting populations of a natural microbial community at approximately the genus level. However, distinguishing features of genome signatures may be subtle, being masked by within-genome heterogeneity and the multidimensional nature of tetranucleotide frequency patterns. Tetra-ESOM is a key method for visualizing and exposing these potentially weak signals. Being unsupervised, it requires no database representation of the organisms present. Visualization of the data structure highlights differences between populations and reveals atypical regions corresponding to biologically meaningful genomic features such as mobile elements or previously unrecognized genotypes present at low abundance in the community. When employed in conjunction with complementary methods such as genomic assembly and analysis of phylogenetic marker genes, genome signatures offer powerful perspectives on metagenomic data.
Materials and methods
Sample collection, construction of genomic libraries, sequencing, and community genomic assembly
An overview of the samples and methodology used in this study is provided in Figure 1. Sample collection, DNA extraction, random fragmentation and cloning of approximately 3-kb fragments, Sanger sequencing, assembly, and curation of community genomics data were performed using the phred/phrap/consed package as detailed previously [12, 55]. The combined UBAs nonLeptos dataset was constructed by assembling sequencing reads derived from both the UBA BS and UBA biofilm samples (with UBA reads previously assigned to Leptospirillum spp. removed). This included 229,082 reads and approximately 210 Mb of total sequence, which assembled into 15,929 contigs and 36.6 Mb of composite sequence.
Calculation of tetranucleotide frequencies and clustering by ESOM
Tetranucleotide frequencies were determined for each assembled contig using a custom Perl script. Frequencies were calculated with a 1-bp sliding window and pairs of reverse complementary tetranucleotides were summed in order to avoid strand bias. Longer contigs and assembled genomes were split into 5-kb windows and only contigs longer than 2 kb were considered unless noted otherwise. To assess binning accuracy, data points (representing contigs/windows) are colored according to their genome of origin (when known), but this information is not available to the clustering process.
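The counting scheme described above can be sketched in Python. This is not the authors' Perl script: the function names are hypothetical, and two details are assumptions because the text does not state them (the trailing remainder of a contig is kept only if it is longer than 2 kb, and any window containing a non-ACGT character is skipped).

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Represent a k-mer and its reverse complement by a single key."""
    return min(kmer, revcomp(kmer))


# The 256 tetranucleotides collapse to 136 reverse-complement classes:
# 16 are their own reverse complement and the other 240 form 120 pairs.
TETRA_KEYS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})


def split_windows(seq: str, size: int = 5000, min_len: int = 2000) -> list[str]:
    """Cut a contig into consecutive windows and keep pieces longer than min_len.

    Assumption: the handling of the trailing remainder is not given in the text.
    """
    pieces = (seq[i:i + size] for i in range(0, len(seq), size))
    return [p for p in pieces if len(p) > min_len]


def tetra_frequencies(seq: str) -> list[float]:
    """Length-normalized frequencies of the 136 classes (1-bp sliding window)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if all(base in "ACGT" for base in kmer):  # skip windows containing N
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in TETRA_KEYS]
```

Because each tetranucleotide is folded onto its reverse complement before counting, a fragment and its reverse complement give the same vector, which is the strand independence the summation is meant to achieve.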
Contigs were clustered by tetranucleotide frequency utilizing Databionics ESOM Tools. The input for tetra-ESOM was a 136-dimensional vector (representing the frequencies of the 136 unique reverse complement tetranucleotide pairs, normalized for contig length) for each contig/window. These raw frequencies were transformed with the 'Robust ZT' option built into Databionics ESOM Tools, which normalizes the data using robust estimates of mean and variance. Data were permuted before each run to avoid errors due to sampling order. Maps were toroidal (borderless) with Euclidean grid distance and dimensions scaled from the default map size (50 × 82) as a function of the number of data points, to a ratio of approximately 5.5 map nodes per data point. For example, a typical clustering with approximately 7,500 data points was run on a map with dimensions 155 × 255. Training was conducted with the K-Batch algorithm (k = 0.15%) for 20 training epochs. The standard best match search method was used with a local best match search radius of 8. Other training parameters were as follows: Gaussian weight initialization method; Euclidean data space function; starting value for training radius of 50 with linear cooling to 1; starting value for learning rate of 0.5 with linear cooling to 0.1; Gaussian kernel function.
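The map-size rule and the linear cooling schedules can be illustrated as follows. This is a sketch under stated assumptions, not the Databionics ESOM Tools implementation: the function names are hypothetical, the grid is scaled strictly in proportion to the 50 × 82 default, and cooling is interpolated evenly from the first to the last epoch. Strict proportional scaling gives 159 × 260 for 7,500 data points, close to but not identical with the 155 × 255 map quoted in the text (about 5.3 nodes per data point), so the dimensions used in the study were presumably rounded differently.

```python
import math


def map_dimensions(n_points: int, nodes_per_point: float = 5.5,
                   base_rows: int = 50, base_cols: int = 82) -> tuple[int, int]:
    """Scale the default grid so that rows * cols ~ nodes_per_point * n_points."""
    scale = math.sqrt(nodes_per_point * n_points / (base_rows * base_cols))
    return round(base_rows * scale), round(base_cols * scale)


def linear_cooling(start: float, end: float, epochs: int) -> list[float]:
    """Values interpolated linearly from start (first epoch) to end (last epoch)."""
    step = (end - start) / (epochs - 1)
    return [start + step * epoch for epoch in range(epochs)]


rows, cols = map_dimensions(7500)
radii = linear_cooling(50, 1, 20)       # training radius per epoch
rates = linear_cooling(0.5, 0.1, 20)    # learning rate per epoch
print(rows, cols, rows * cols / 7500)
```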
Clustering resolution versus evolutionary distance
To quantify the degree of clustering between closely related genomes, we analyzed SOM maps using fixed point kernel densities. Spatial data from the SOM was imported into ArcGIS (ESRI Software) and clusters were defined using Hawth's Analysis Tools for ArcGIS. Cluster boundaries were determined using density estimators that captured 90% of data points from each genome (Additional data file 1). We then calculated separation between genomes as a percentage (Non-overlapping points/Total number of points) for two bins being compared. Average amino acid identity was calculated as described previously.
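The separation percentage defined above reduces to a simple ratio once each data point has been tested against the other genome's cluster boundary. The sketch below covers only that final step; the boundary test itself (done in ArcGIS in the study) is represented by precomputed booleans, and both the function name and the interpretation of "overlapping" as "lying inside the other genome's boundary" are assumptions.

```python
def separation_percent(a_inside_b: list[bool], b_inside_a: list[bool]) -> float:
    """Percentage of points from two bins that do not overlap the other bin.

    a_inside_b[i] is True when point i of genome A lies inside genome B's
    cluster boundary; b_inside_a is the converse. How overlap was scored in
    the study is an assumption here.
    """
    total = len(a_inside_b) + len(b_inside_a)
    if total == 0:
        raise ValueError("no data points to compare")
    overlapping = sum(a_inside_b) + sum(b_inside_a)
    return 100.0 * (total - overlapping) / total


# Hypothetical example: 10 of 100 points from A fall inside B's boundary,
# and none of B's 100 points fall inside A's.
example = separation_percent([True] * 10 + [False] * 90, [False] * 100)
```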
Predicted tetranucleotide frequency
The predicted frequency of each unique pair of reverse complementary tetranucleotides was calculated based on genome-wide frequencies of potentially contributing codons. As shown in Figure 5, for any given tetranucleotide there are 12 potentially associated codons depending on coding strand and reading frame. Four codons (numbers 3, 4, 9, and 10 in Figure 5) are fully captured by the tetranucleotide, four are partially captured at two of three positions (numbers 2, 5, 8, and 11), and four are partially captured at one of three positions (numbers 1, 6, 7, and 12). Each of these three classes is weighted according to its contribution: 1, 2/3, and 1/3, respectively. For partially captured codons, contributions of all possibilities were taken into account; for example, for codon number 5 in Figure 5 (TGX), there are four possible codons: TGA, TGT, TGC, and TGG.
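The enumeration of the 12 overlapping codons and their weights can be sketched as follows. This only lists the codon patterns and expands the wildcard positions; it does not reproduce how the weighted codon frequencies were combined into a predicted tetranucleotide frequency, which the text does not spell out. The function names are hypothetical and the output order does not follow the numbering in Figure 5.

```python
from fractions import Fraction
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def expand(pattern: str) -> list[str]:
    """Replace each wildcard X with every base, e.g. TGX -> TGA, TGC, TGG, TGT."""
    choices = [BASES if c == "X" else c for c in pattern]
    return ["".join(p) for p in product(*choices)]


def codon_patterns(tetra: str) -> list[tuple[str, Fraction]]:
    """List the 12 codon patterns overlapping a tetranucleotide, with weights.

    Both strands and all three reading frames are considered; the weight is
    the fraction of the codon's three positions covered by the tetranucleotide.
    """
    patterns = []
    for strand in (tetra, revcomp(tetra)):
        padded = "XX" + strand + "XX"       # X marks positions outside the 4-mer
        for start in range(6):              # six codon windows per strand
            codon = padded[start:start + 3]
            captured = sum(c != "X" for c in codon)
            patterns.append((codon, Fraction(captured, 3)))
    return patterns


for pattern, weight in codon_patterns("ATGC"):
    print(pattern, weight, expand(pattern))
```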
Binning performance on variable length sequence fragments and subsampled genomes
Sensitivity (percentage of fragments from each genome correctly identified) and precision (percentage of fragments in each bin belonging to the correct genome) of binning were calculated for a subset of assembled genomes that are deeply sampled and manually curated (Table 1; Additional data file 2). Fragment size was varied in two ways: all contigs were broken into a given size (2, 4, 6, or 10 kb); or 10% of each genome was randomly selected and fragmented (0.5, 1.0, 1.5, or 2.0 kb) while the remaining fraction of the genome was fragmented into 5-kb windows (Additional data file 2). Bin territories were defined manually, using boundaries apparent via distance-based background topology (U-Matrix) as guidelines. It is important to note this method allows data points between bins or near borders to remain unclassified. Analysis of subsampled genomes was conducted with assembled genomes only - unassigned fragments were excluded to prevent them from contributing to definition of bins. Genomes were fragmented into 5-kb sequences, which were then randomly selected to obtain the indicated percentage of the genome.
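Sensitivity and precision as defined above can be computed directly from per-fragment labels. The following is a minimal sketch assuming each bin is named after the genome it is meant to capture and that unclassified fragments are recorded as `None`; the function name and data layout are hypothetical.

```python
from collections import Counter
from typing import Optional


def binning_metrics(true_genomes: list[str],
                    assigned_bins: list[Optional[str]]) -> dict[str, tuple[float, float]]:
    """Return {genome: (sensitivity %, precision %)}.

    Sensitivity: share of a genome's fragments placed in its own bin.
    Precision: share of the fragments in a bin that come from the right genome.
    Fragments left between bins (None) lower sensitivity but not precision.
    """
    per_genome = Counter(true_genomes)
    per_bin = Counter(b for b in assigned_bins if b is not None)
    correct = Counter(t for t, b in zip(true_genomes, assigned_bins) if t == b)
    metrics = {}
    for genome, n_fragments in per_genome.items():
        sensitivity = 100.0 * correct[genome] / n_fragments
        in_bin = per_bin[genome]
        precision = 100.0 * correct[genome] / in_bin if in_bin else float("nan")
        metrics[genome] = (sensitivity, precision)
    return metrics


# Hypothetical example with two genomes and one unclassified fragment.
truth = ["A", "A", "A", "A", "B", "B"]
bins = ["A", "A", "A", None, "B", "A"]
result = binning_metrics(truth, bins)
```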
Sequence signatures in coding versus non-coding regions
Intergenic regions were extracted and concatenated, with 'N's inserted between regions to avoid generation of erroneous tetranucleotides. Intergenic regions were grouped by size (in 20-bp bins) to monitor variance in sequence signatures from intergenic regions of differing lengths. All coding sequences were similarly concatenated with interleaving 'N's. Concatenated coding and non-coding regions were then broken into 5-kb windows and run against the same background dataset of assembled genomes and unassigned sequences as usual.
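The purpose of the interleaved 'N's can be shown with a small example. This sketch assumes a single 'N' between regions and a counter that skips any tetranucleotide containing a non-ACGT character; the number of 'N's actually inserted is not stated in the text, and the function names are hypothetical.

```python
def concatenate_with_spacers(regions: list[str], spacer: str = "N") -> str:
    """Join separate regions, placing a spacer at every junction."""
    return spacer.join(regions)


def tetramers(seq: str) -> list[str]:
    """Tetranucleotides from a 1-bp sliding window, skipping any with non-ACGT."""
    windows = (seq[i:i + 4] for i in range(len(seq) - 3))
    return [w for w in windows if set(w) <= set("ACGT")]


regions = ["AAAC", "GTTT"]
joined_plain = "".join(regions)                     # junction creates AACG, ACGT, CGTT
joined_spaced = concatenate_with_spacers(regions)   # junction is masked by N

print(tetramers(joined_plain))
print(tetramers(joined_spaced))
```

Without the spacer, three tetranucleotides that exist in neither region are counted; with it, only the tetranucleotides internal to each region remain.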
Sequence signatures in extracellular and highly expressed protein-coding genes
Shotgun proteomics data were obtained for Leptospirillum group II extracellular and whole cell fractions from the ABend, ABfront, and UBA locations of the Richmond mine [55, 66]. Proteins were defined as enriched in the extracellular fraction if, in at least two of the three samples, they were only detected in the extracellular fraction, or the ratio of spectral counts from extracellular to intracellular fraction was > 2. The 50 most abundantly expressed proteins were identified on the basis of tandem mass spectrometry (MS/MS) spectral counts. ESOM analyses of genes encoding extracellular and highly expressed proteins were both conducted as described above; open reading frames were concatenated, interleaved with 'N's, then split into 5-kb windows and analyzed along with the full dataset.
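The enrichment rule above can be written as a small predicate. This is a sketch with a hypothetical function name and data layout: each sample is a pair of spectral counts (extracellular, intracellular), and "only detected in the extracellular fraction" is taken to mean a nonzero extracellular count with a zero intracellular count.

```python
def enriched_in_extracellular(samples: list[tuple[int, int]],
                              min_samples: int = 2,
                              min_ratio: float = 2.0) -> bool:
    """Apply the two-of-three rule to (extracellular, intracellular) counts."""
    hits = 0
    for extra, intra in samples:
        only_extracellular = extra > 0 and intra == 0
        ratio_above = intra > 0 and extra / intra > min_ratio
        if only_extracellular or ratio_above:
            hits += 1
    return hits >= min_samples


# Hypothetical counts for one protein at three sampling locations.
protein_a = [(10, 0), (9, 3), (1, 5)]   # exclusive, ratio 3.0, ratio 0.2
protein_b = [(4, 2), (0, 0), (1, 5)]    # ratio exactly 2.0 does not qualify
```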
Nucleotide sequence accession numbers
This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the project accessions ACXJ00000000 (unassigned contigs), ACXK00000000 (A-plasma), ACXL00000000 (E-plasma), ACXM00000000 (I-plasma), and ACVJ00000000 (ARMAN-2, described in detail in BJ Baker et al., in preparation). The versions described in this paper are the first versions, ACXJ01000000, ACXK01000000, ACXL01000000, ACXM01000000, and ACVJ01000000.
Additional data files
The following additional data are available with the online version of this paper: a figure showing automated clustering of tetra-ESOM data using fixed point kernel densities (Additional data file 1); an evaluation of binning accuracy based on deeply sampled metagenomes for which contigs are assigned to genomes with a high degree of confidence (Additional data file 2); binning accuracy calculated for genomes that were sampled to varying extents of completeness (10 to 100%) (Additional data file 3); a heat map of average genome-wide frequency of each tetranucleotide for each genome, including bacteria, archaea, viruses, and a putative plasmid (Additional data file 4); comparison of tetra-ESOMs of assembled genomes based on amino acid composition, codon composition, and tetranucleotide frequency (Additional data file 5); a figure showing that the observed difference in frequency of each tetranucleotide between pairs of genomes correlates with the difference predicted based on codon composition (Additional data file 6); a figure showing tetra-ESOM of deeply sampled genomes for which coding and noncoding regions were separated (Additional data file 7); a figure showing for incorrectly binned fragments the percentage of sequence coding for genes in comparison with the genome-wide coding percentage (Additional data file 8); a figure showing tetra-ESOM of Leptospirillum group II genes coding for highly expressed proteins or proteins enriched in the extracellular fraction analyzed as separate fractions from the rest of the genome (Additional data file 9); a schematic of processes and factors influencing genome signature (Additional data file 10).
AMD: acid mine drainage
ESOM: emergent self-organizing map
GC: percentage content of guanine plus cytosine
We thank Ms D Aliaga Goltsman, Dr V Denef, Ms C Sun, Dr R Hettich, Dr N VerBerkmoes, and Mr M Shah for data and bioinformatic assistance. We are grateful to Mrs M Kelly for guiding kernel density analysis, to Mr Rudy Carver for sampling assistance and Mr TW Arman, President, Iron Mountain Mines, and Dr R Sugarek for site access. The manuscript was significantly improved thanks to critical revisions from Mr D Soergel and Dr S Brenner and three anonymous reviewers. This work was supported by DOE Genomics:GTL project Grant No. DE-FG02-05ER64134 (Office of Science) and sequencing was done at the DOE Joint Genome Institute. AFA was supported by grants from the Swedish Research Council and Carl Tryggers Foundation.
- Konstantinidis KT, Tiedje JM: Towards a genome-based taxonomy for prokaryotes. J Bacteriol. 2005, 187: 6258-6264. 10.1128/JB.187.18.6258-6264.2005.
- Konstantinidis KT, Tiedje JM: Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci USA. 2005, 102: 2567-2572. 10.1073/pnas.0409727102.
- Achtman M, Wagner M: Microbial diversity and the genetic nature of microbial species. Nat Rev Microbiol. 2008, 6: 431-440. 10.1038/nrmicro1872.
- Doolittle WF: Phylogenetic classification and the universal tree. Science. 1999, 284: 2124-2128. 10.1126/science.284.5423.2124.
- Allen EE, Banfield JF: Community genomics in microbial ecology and evolution. Nat Rev Microbiol. 2005, 3: 489-498. 10.1038/nrmicro1157.
- DeLong EF: Microbial community genomics in the ocean. Nat Rev Microbiol. 2005, 3: 459-469. 10.1038/nrmicro1158.
- Handelsman J: Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004, 68: 669-685. 10.1128/MMBR.68.4.669-685.2004.
- Allen EE, Tyson GW, Whitaker RJ, Detter JC, Richardson PM, Banfield JF: Genome dynamics in a natural archaeal population. Proc Natl Acad Sci USA. 2007, 104: 1883-1888. 10.1073/pnas.0604851104.
- Simmons SL, DiBartolo G, Denef VJ, Goltsman DSA, Thelen MP, Banfield JF: Population genomic analysis of strain variation in Leptospirillum group II bacteria involved in acid mine drainage formation. PLoS Biology. 2008, 6: 10.1371/journal.pbio.0060177.
- Eppley JM, Tyson GW, Getz WM, Banfield JF: Genetic exchange across a species boundary in the archaeal genus Ferroplasma. Genetics. 2007, 177: 407-416. 10.1534/genetics.107.072892.
- Konstantinidis KT, DeLong EF: Genomic patterns of recombination, clonal divergence, and environment in marine microbial populations. ISME J. 2008, 2: 1052-1065. 10.1038/ismej.2008.62.
- Andersson AF, Banfield JF: Virus population dynamics and acquired virus resistance in natural microbial communities. Science. 2008, 320: 1047-1050. 10.1126/science.1157358.
- Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li K, Kravitz S, Heidelberg JF, Utterback T, Rogers Y-H, Falcón LI, Souza V, Bonilla-Rosso G, Eguiarte LE, Karl DM, Sathyendranath S, et al: The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biology. 2007, 5: e77-10.1371/journal.pbio.0050077.
- Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers Y, Smith HO: Environmental Genome Shotgun Sequencing of the Sargasso Sea. Science. 2004, 304: 66-74. 10.1126/science.1093857.
- Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al: The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biology. 2007, 5: e16-10.1371/journal.pbio.0050016.
- Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004, 428: 37-43. 10.1038/nature02340.
- Tyson GW, Lo I, Baker BJ, Allen EE, Hugenholtz P, Banfield JF: Genome-directed isolation of the key nitrogen fixer Leptospirillum ferrodiazotrophum sp. nov. from an acidophilic microbial community. Appl Environ Microbiol. 2005, 71: 6319-6324. 10.1128/AEM.71.10.6319-6324.2005.
- Schleper C, Jurgens G, Jonuscheit M: Genomic studies of uncultivated archaea. Nat Rev Microbiol. 2005, 3: 479-488. 10.1038/nrmicro1159.
- Béjà O, Aravind L, Koonin EV, Suzuki MT, Hadd A, Nguyen LP, Jovanovich SB, Gates CM, Feldman RA: Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science. 2000, 289: 1902-1906. 10.1126/science.289.5486.1902.
- Baker BJ, Tyson GW, Webb RI, Flanagan J, Hugenholtz P, Allen EE, Banfield JF: Lineages of acidophilic archaea revealed by community genomic analysis. Science. 2006, 314: 1933-1935. 10.1126/science.1132690.
- DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard N-U, Martinez A, Sullivan MB, Edwards R, Brito BR, Chisholm SW, Karl DM: Community genomics among stratified microbial assemblages in the ocean's interior. Science. 2006, 311: 496-503. 10.1126/science.1120250.
- Karlin S, Mrázek J, Campbell AM: Compositional biases of bacterial genomes and evolutionary implications. J Bacteriol. 1997, 179: 3899-3913.
- Sueoka N: Directional mutation pressure and neutral molecular evolution. Proc Natl Acad Sci USA. 1988, 85: 2653-2657. 10.1073/pnas.85.8.2653.
- Lynch M: The origins of genome architecture. 2007, Sunderland, MA: Sinauer Associates, Inc.
- Rocha EPC: Base composition might result from competition for metabolic resources. Trends Genet. 2002, 18: 291-294. 10.1016/S0168-9525(02)02690-2.
- Foerstner KU, von Mering C, Hooper SD, Bork P: Environments shape the nucleotide composition of genomes. EMBO Rep. 2005, 6: 1208-1213. 10.1038/sj.embor.7400538.
- Bentley SD, Parkhill J: Comparative genomic structure of prokaryotes. Annu Rev Genet. 2004, 38: 771-792. 10.1146/annurev.genet.38.072902.094318.
- Lawrence JG, Ochman H: Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci USA. 1998, 95: 9413-9417. 10.1073/pnas.95.16.9413.
- Koski LB, Morton RA, Golding GB: Codon bias and base composition are poor indicators of horizontally transferred genes. Mol Biol Evol. 2001, 18: 404-412.
- Sandberg R, Bränden C, Ernberg I, Cöster J: Quantifying the species-specificity in genomic signatures, synonymous codon choice, amino acid usage and G + C content. Gene. 2003, 311: 35-42. 10.1016/S0378-1119(03)00581-X.
- Nakashima H, Ota M, Nishikawa K, Ooi T: Genes from nine genomes are separated into their organisms in the dinucleotide composition space. DNA Res. 1998, 5: 251-259. 10.1093/dnares/5.5.251.
- Pride DT, Meinersmann RJ, Wassenaar TM, Blaser MJ: Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Res. 2003, 13: 145-158. 10.1101/gr.335003.
- Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, Ikemura T: Informatics for unveiling hidden genome signatures. Genome Res. 2003, 13: 693-702. 10.1101/gr.634603.
- Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B: Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol. 1999, 16: 1391-1399.
- Bohlin J, Skjerve E, Ussery DW: Investigations of oligonucleotide usage variance within and between prokaryotes. PLoS Comput Biol. 2008, 4: e1000057-10.1371/journal.pcbi.1000057.
- Dalevi D, Dubhashi D, Hermansson M: Bayesian classifiers for detecting HGT using fixed and variable order markov models of genomic signatures. Bioinformatics. 2006, 22: 517-522. 10.1093/bioinformatics/btk029.
- Sandberg R, Winberg G, Bränden C, Kaske A, Ernberg I, Cöster J: Capturing whole-genome characteristics in short sequences using a naive bayesian classifier. Genome Res. 2001, 11: 1404-1409. 10.1101/gr.186401.
- Scherer S, McPeek MS, Speed TP: Atypical regions in large genomic DNA sequences. Proc Natl Acad Sci USA. 1994, 91: 7134-7138. 10.1073/pnas.91.15.7134.
- Dufraigne C, Fertil B, Lespinats S, Giron A, Deschavanne P: Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Res. 2005, 33: e6-10.1093/nar/gni004.
- van Passel MWJ, Kuramae EE, Luyf ACM, Bart A, Boekhout T: The reach of the genome signature in prokaryotes. BMC Evol Biol. 2006, 6: 84-10.1186/1471-2148-6-84.
- Blaisdell BE, Campbell AM, Karlin S: Similarities and dissimilarities of phage genomes. Proc Natl Acad Sci USA. 1996, 93: 5854-5859. 10.1073/pnas.93.12.5854.
- Mrázek J, Karlin S: Distinctive features of large complex virus genomes and proteomes. Proc Natl Acad Sci USA. 2007, 104: 5127-5132. 10.1073/pnas.0700429104.
- McHardy AC, Rigoutsos I: What's in the mix: phylogenetic classification of metagenome sequence samples. Curr Opin Microbiol. 2007, 10: 499-503. 10.1016/j.mib.2007.08.004.
- Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, Lapidus A, Grigoriev I, Richardson P, Hugenholtz P, Kyrpides NC: Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007, 4: 495-500. 10.1038/nmeth1043.
- Pignatelli M, Aparicio G, Blanquer I, Hernández V, Moya A, Tamames J: Metagenomics reveals our incomplete knowledge of global diversity. Bioinformatics. 2008, 24: 2124-2125. 10.1093/bioinformatics/btn355.
- Abe T, Sugawara H, Kinouchi M, Kanaya S, Ikemura T: Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples. DNA Res. 2005, 12: 281-290. 10.1093/dnares/dsi015.
- Teeling H, Meyerdierks A, Bauer M, Amann R, Glöckner FO: Application of tetranucleotide frequencies for the assignment of genomic fragments. Environ Microbiol. 2004, 6: 938-947. 10.1111/j.1462-2920.2004.00624.x.
- McHardy AC, Martín HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007, 4: 63-72. 10.1038/nmeth976.
- Willenbrock H, Friis C, Juncker AS, Ussery DW: An environmental signature for 323 microbial genomes based on codon adaptation indices. Genome Biol. 2006, 7: R114-10.1186/gb-2006-7-12-r114.
- Raes J, Foerstner KU, Bork P: Get the most out of your metagenome: computational analysis of environmental sequence data. Curr Opin Microbiol. 2007, 10: 490-498. 10.1016/j.mib.2007.09.001.
- Paul S, Bag SK, Das S, Harvill ET, Dutta C: Molecular signature of hypersaline adaptation: insights from genome and proteome composition of halophilic prokaryotes. Genome Biol. 2008, 9: R70-10.1186/gb-2008-9-4-r70.
- Wilmes P, Andersson AF, Lefsrud MG, Wexler M, Shah M, Zhang B, Hettich RL, Bond PL, VerBerkmoes NC, Banfield JF: Community proteogenomics highlights microbial strain-variant protein expression within activated sludge performing enhanced biological phosphorus removal. ISME J. 2008, 2: 853-864. 10.1038/ismej.2008.38.
- Druschel GK, Baker BJ, Gihring TM, Banfield JF: Acid mine drainage biogeochemistry at Iron Mountain, California. Geochem Trans. 2004, 5: 13-32. 10.1186/1467-4866-5-13.
- Baker BJ, Banfield JF: Microbial communities in acid mine drainage. FEMS Microbiol Ecol. 2003, 44: 139-152. 10.1016/S0168-6496(03)00028-X.
- Lo I, Denef VJ, VerBerkmoes NC, Shah MB, Goltsman D, DiBartolo G, Tyson GW, Allen EE, Ram RJ, Detter JC, Richardson P, Thelen MP, Hettich RL, Banfield JF: Strain-resolved community proteomics reveals recombining genomes of acidophilic bacteria. Nature. 2007, 446: 537-541. 10.1038/nature05624.
- Goltsman DS, Denef VJ, Singer SW, VerBerkmoes NC, Lefsrud M, Mueller RS, Dick GJ, Sun CL, Wheeler KE, Zemla A, Baker BJ, Hauser L, Land M, Shah MB, Thelen MP, Hettich RL, Banfield JF: Community genomic and proteomic analyses of chemoautotrophic iron-oxidizing "Leptospirillum rubarum" (Group II) and "Leptospirillum ferrodiazotrophum" (Group III) bacteria in acid mine drainage biofilms. Appl Environ Microbiol. 2009, 75: 4599-4615. 10.1128/AEM.02943-08.
- Edwards KJ, Bond PL, Gihring TM, Banfield JF: An archaeal iron-oxidizing extreme acidophile important in acid mine drainage. Science. 2000, 287: 1796-1799. 10.1126/science.287.5459.1796.
- Kohonen T: Self-organizing maps. 1997, New York: Springer-Verlag.
- Chan CK, Hsu AL, Halgamuge SK, Tang SL: Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics. 2008, 9: 215-10.1186/1471-2105-9-215.
- Ultsch A, Moerchen F: ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM. Technical Report Department of Mathematics and Computer Science, University of Marburg, Germany. 2005, 46.
- Makarova KS, Grishin NV, Shabalina SA, Wolf YI, Koonin EV: A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol Direct. 2006, 1: 7-10.1186/1745-6150-1-7.PubMedPubMed CentralView ArticleGoogle Scholar
- Reva ON, Tummler B: Differentiation of regions with atypical oligonucleotide composition in bacterial genomes. BMC Bioinformatics. 2005, 6: 251-10.1186/1471-2105-6-251.PubMedPubMed CentralView ArticleGoogle Scholar
- Pride DT, Wassenaar TM, Ghose C, Blaser MJ: Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC genomics. 2006, 7: 8-10.1186/1471-2164-7-8.PubMedPubMed CentralView ArticleGoogle Scholar
- Coleman JR, Papamichail D, Skiena S, Futcher B, Wimmer E, Mueller S: Virus attenuation by genome-scale changes in codon pair bias. Science. 2008, 320: 1784-1787. 10.1126/science.1155761.PubMedPubMed CentralView ArticleGoogle Scholar
- Macalady JL, Vestling MM, Baumler D, Boekelheide N, Kaspar CW, Banfield JF: Tetraether-linked membrane monolayers in Ferroplasma spp.: a key to survival in acid. Extremophiles. 2004, 8: 411-419. 10.1007/s00792-004-0404-5.PubMedView ArticleGoogle Scholar
- Ram RJ, VerBerkmoes C, Thelen MP, Tyson GW, Baker BJ, Blake RC, Shah M, Hettich RL, Banfield JF: Community Proteomics of a Natural Microbial Biofilm. Science. 2005, 308: 1915-1920. 10.1126/science. 1109070.PubMedView ArticleGoogle Scholar
- Chen SL, Lee W, Hottes AK, Shapiro L, McAdams HH: Codon usage between genomes is constrained by genome-wide mutational processes. Proc Natl Acad Sci USA. 2004, 101: 3480-3485. 10.1073/pnas.0307827100.PubMedPubMed CentralView ArticleGoogle Scholar
- Knight RD, Freeland SJ, Landweber LF: A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol. 2001, 2:Google Scholar
- Rocha EP: Codon usage bias from tRNA's point of view: redundancy, specialization, and efficient decoding for translation optimization. Genome Res. 2004, 14: 2279-2286. 10.1101/gr.2896904.PubMedPubMed CentralView ArticleGoogle Scholar
- Ikemura T: Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J Mol Biol. 1981, 146: 1-21. 10.1016/0022-2836(81)90363-6.PubMedView ArticleGoogle Scholar
- Bailly-Bechet M, Vergassola M, Rocha E: Causes for the intriguing presence of tRNAs in phages. Genome Res. 2007, 17: 1486-1495. 10.1101/gr.6649807.PubMedPubMed CentralView ArticleGoogle Scholar
- Dong H, Nilsson L, Kurland CG: Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J Mol Biol. 1996, 260: 649-663. 10.1006/jmbi.1996.0428.PubMedView ArticleGoogle Scholar
- Bulmer M: The selection-mutation-drift theory of synonymous codon usage. Genetics. 1991, 129: 897-907.PubMedPubMed CentralGoogle Scholar
- Sharp PM, Bailes E, Grocock RJ, Peden JF, Sockett RE: Variation in strength of selected codon usage bias among bacteria. Nucleic Acids Res. 2005, 33: 1141-1153. 10.1093/nar/gki242.PubMedPubMed CentralView ArticleGoogle Scholar
- Dethlefsen L, Schmidt TM: Performance of the translational apparatus varies with the ecological strategies of bacteria. J Bacteriol. 2007, 189: 3237-3245. 10.1128/JB.01686-06.PubMedPubMed CentralView ArticleGoogle Scholar
- Klappenbach JA, Dunbar JM, Schmidt TM: rRNA operon copy number reflects ecological strategies of bacteria. Appl Environ Microbiol. 2000, 66: 1328-1333. 10.1128/AEM.66.4.1328-1333.2000.PubMedPubMed CentralView ArticleGoogle Scholar
- Kobayashi I: Behavior of restriction-modification systems as selfish mobile elements and their impact on genome evolution. Nucleic Acids Res. 2001, 29: 3742-3756. 10.1093/nar/29.18.3742.PubMedPubMed CentralView ArticleGoogle Scholar
- Pernthanler A, Dekas AE, Brown CT, Goffredi SK, Embaye T, Orphan VJ: Diverse syntrophic partnerships from deep-sea methane vents revealed by direct cell capture and metagenomics. Proc Natl Acad Sci USA. 2008, 105: 7052-7057. 10.1073/pnas.0711303105.View ArticleGoogle Scholar
- Podar M, Abulencia CB, Walcher M, Hutchison D, Zengler K, Garcia JA, Holland T, Cotton D, Hauser L, Keller M: Targeted access to the genomes of low-abundance organisms in complex microbial communities. Appl Environ Microbiol. 2007, 73: 3205-3214. 10.1128/AEM.02985-06.PubMedPubMed CentralView ArticleGoogle Scholar
- Stepanauskas R, Sieracki ME: Matching phylogeny and metabolism in the uncultured marine bacteria, one cell at a time. Proc Natl Acad Sci USA. 2007, 104: 9052-9057. 10.1073/pnas.0700496104.PubMedPubMed CentralView ArticleGoogle Scholar
- Hallam SJ, Konstantinidis KT, Putnam N, Schleper C, Watanabe Y, Sugahara J, Preston C, Torre J, Richardson PM, DeLong EF: Genomic analysis of the uncultivated marine crenarchaeote Cenarchaeum symbiosum. Proc Natl Acad Sci USA. 2006, 103: 18296-18301. 10.1073/pnas.0608549103.
- Pelletier E, Kreimeyer A, Bocs S, Rouy Z, Gyapay G, Chouari R, Rivière D, Ganesan A, Daegelen P, Sghir A, Cohen GN, Médigue C, Weissenbach J, Paslier DL: "Candidatus Cloacamonas acidaminovorans": Genome sequence reconstruction provides a first glimpse of a new bacterial division. J Bacteriol. 2008, 190: 2572-2579. 10.1128/JB.01248-07.
- Garcia Martin H, Ivanova N, Kunin V, Warnecke F, Barry KW, McHardy AC, Yeates C, He S, Salamov AA, Szeto E, Dalin E, Putnam NH, Shapiro HJ, Pangilinan JL, Rigoutsos I, Kyrpides NC, Blackall LL, McMahon KD, Hugenholtz P: Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat Biotechnol. 2006, 24: 1263-1269. 10.1038/nbt1247.
- Strous M, Pelletier E, Mangenot S, Rattei T, Lehner A, Taylor MW, Horn M, Daims H, Bartol-Mavel D, Wincker P, Barbe V, Fonknechten N, Vallenet D, Segurens B, Schenowitz-Truong C, Medigue C, Collingro A, Snel B, Dutilh BE, Op den Camp HJM, Drift van der C, Cirpus I, Pas-Schoonen van de KT, Harhangi HR, van Niftrik L, Schmid M, Keltjens J, Vossenberg van de J, Kartal B, Meier H, et al: Deciphering the evolution and metabolism of an anammox bacterium from a community genome. Nature. 2006, 440: 790-794. 10.1038/nature04647.
- Pham VD, Konstantinidis KT, Palden T, DeLong EF: Phylogenetic analyses of ribosomal DNA-containing bacterioplankton genome fragments from a 4000 m vertical profile in the North Pacific Subtropical Gyre. Environ Microbiol. 2008, 10: 2313-2330. 10.1111/j.1462-2920.2008.01657.x.
- Coleman ML, Sullivan MB, Martiny AC, Steglich C, Barry K, DeLong EF, Chisholm SW: Genomic islands and the ecology and evolution of Prochlorococcus. Science. 2006, 311: 1768-1770. 10.1126/science.1122050.
- Cuadros-Orellana S, Martin-Cuadrado AB, Legault B, D'Auria G, Zhaxybayeva O, Papke RT, Rodriguez-Valera F: Genomic plasticity in prokaryotes: the case of the square haloarchaeon. ISME J. 2007, 1: 235-245. 10.1038/ismej.2007.35.
- Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R: The microbial pan-genome. Curr Opin Genet Dev. 2005, 15: 589-594. 10.1016/j.gde.2005.09.006.
- Lawrence JG, Ochman H: Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol. 1997, 44: 383-397. 10.1007/PL00006158.
- Karlin S: Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol. 2001, 9: 335-343. 10.1016/S0966-842X(01)02079-0.
- Martin C, Diaz NN, Ontrup J, Nattkemper TW: Hyperbolic SOM-based clustering of DNA fragment features for taxonomic visualization and classification. Bioinformatics. 2008, 24: 1568-1574. 10.1093/bioinformatics/btn257.
- Ludwig W, Strunk O, Westram R, Richter L, Meier H, Yadhukumar, Buchner A, Lai T, Steppi S, Jobb G, Forster W, Brettske I, Gerber S, Ginhart AW, Gross O, Grumann S, Hermann S, Jost R, Konig A, Liss T, Lussmann R, May M, Nonhoff B, Reichel B, Strehlow R, Stamatakis A, Stuckmann N, Vilbig A, Lenke M, Ludwig T, et al: ARB: a software environment for sequence data. Nucleic Acids Res. 2004, 32: 1363-1371. 10.1093/nar/gkh293.
- Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glockner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007, 35: 7188-7196. 10.1093/nar/gkm864.
- Databionic ESOM Tools. [http://databionic-esom.sourceforge.net]
- Silverman BW: Density estimation for statistics and data analysis. 1986, London: CRC Press
- Hawth's Analysis Tools. [http://www.spatialecology.com/htools/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.