Single-cell genomics reveal low recombination frequencies in freshwater bacteria of the SAR11 clade
© Zaremba-Niedzwiedzka et al.; licensee BioMed Central Ltd. 2013
Received: 29 September 2013
Accepted: 28 November 2013
Published: 28 November 2013
The SAR11 group of Alphaproteobacteria is highly abundant in the oceans. It contains a recently diverged freshwater clade, which offers the opportunity to compare adaptations to salt- and freshwaters in a monophyletic bacterial group. However, there are no cultivated members of the freshwater SAR11 group and no genomes have been sequenced yet.
We isolated ten single SAR11 cells from three freshwater lakes and sequenced and assembled their genomes. A phylogeny based on 57 proteins indicates that the cells are organized into distinct microclusters. We show that the freshwater genomes have evolved primarily by the accumulation of nucleotide substitutions and that they have among the lowest ratio of recombination to mutation estimated for bacteria. In contrast, members of the marine SAR11 clade have one of the highest ratios. Additional metagenome reads from six lakes confirm low recombination frequencies for the genome overall and reveal lake-specific variations in microcluster abundances. We identify hypervariable regions with gene contents broadly similar to those in the hypervariable regions of the marine isolates, containing genes putatively coding for cell surface molecules.
We conclude that recombination rates differ dramatically in phylogenetic sister groups of the SAR11 clade adapted to freshwater and marine ecosystems. The results suggest that the transition from marine to freshwater systems has purged diversity and resulted in reduced opportunities for recombination with divergent members of the clade. The low recombination frequencies of the LD12 clade resemble the low genetic divergence of host-restricted pathogens that have recently shifted to a new host.
Microbial genomes change in gene content by duplications, deletions and horizontal gene transfers, and in sequence by nucleotide substitutions and homologous recombination. The relative contribution of recombination to sequence divergence has been determined in a wide range of microorganisms . Much of the results have been inferred from the analyses of rRNA genes and protein-coding genes used for multilocus sequence typing. Comparisons of such data between species have shown that the ratio at which a nucleotide becomes substituted as a result of recombination versus mutations (r/m) ranges from very low (<1) to extremely high (>10) . Because of these differences, some bacterial populations can be sorted into discrete subclusters, whereas others are best described as a continuum of sequence variants.
Bacteria with a clonal population structure, low levels of genetic diversity and low r/m ratios tend to be host-adapted. They have often experienced a population bottleneck associated with the adaptation to a single host and are referred to as ‘genetically monomorphic species’ . On the other extreme are environmental bacteria, such as the SAR11 group of Alphaproteobacteria, which can make up to 30% of the total marine bacterioplankton in the upper surface waters of the oceans [3, 4]. The r/m ratio for SAR11 isolates has been estimated to 63 [1, 5], which is one of the highest r/m ratios recorded for bacteria.
It has been hypothesized that a high recombination frequency in the marine SAR11 bacteria is the result of selection to ensure variability in the phage receptor protein sequences [1, 5]. Indeed, highly abundant SAR11-specific pelagiphages were recently identified in the oceans  and the presence of hypervariable regions (HVRs) in the marine SAR11 genomes that code for the biogenesis of outer membrane components  provides indirect support for co-evolutionary interactions with pelagiphages. The SAR11 bacteria are characterized by having very small genome sizes, in the 1.4 to 1.6 Mb range, and extremely small cell volumes, which is suggested to result from selection to enhance the surface to volume ratio [7–9].
The SAR11 group of aquatic bacteria contains a freshwater clade that has diverged relatively recently from the marine groups . The freshwater clade of the SAR11 group [10–12] was first identified in an Arctic Toolik lake, and is referred to as LD12 [13, 14]. Based on ribosomal RNA phylogenies, LD12 is currently classified as subtype IIIb of the SAR11 clade . Isotope trace studies have provided some information on organic substrate use  and single cell surveys indicate that LD12 constitutes 1% to 21% of freshwater bacterioplankton . The transition from marine to freshwater ecosystems was a unique event, which is thought to have happened only once in the evolutionary history of the SAR11 clade [12, 16]. Because of their close relatedness, freshwater and saltwater SAR11 lineages enable comparative studies of the population dynamics of bacteria that have shifted their ecological niche from the open oceans to physically constrained lake ecosystems. However, such studies have been hampered because there are no cultivation methods available for LD12, and thus, no genomes have yet been sequenced.
Single cell genomics is emerging as a new method to study the genomes of uncultivated microorganisms [17, 18]. Here, we report a comparative analysis of ten single-cell genomes from LD12 along with recruited metagenome reads from six lakes. The results indicate that the transition to freshwater has been associated with dramatic changes in the population dynamics of the SAR11 group of bacteria.
Results and discussion
Single cell genome sequencing of LD12
Details of the assembly and annotations
The LD12 phylogeny reveals distinct microclusters
Within subclade IIIb, three distinct micro-clusters were indicated with strong bootstrap support values, here referred to as the A, B and C groups (Figure 1). The A group contains the N17, L09 and D10 SAGs obtained from lakes Sparkling and Mendota, the B group the C06, J10 and L15 SAGs from lakes Mendota and Damariscotta, and the C group the P20 and M09 SAGs from lake Damariscotta. The SAG designated C07 is a sister taxa to the A group SAGs with 100% bootstrap support, but was not classified as an A group strain since the node suggesting this placement was very short. Finally, the analyses placed the B11 SAG from lake Damariscotta as the earliest diverging SAG of those examined here.
Functional categories of protein clusters inferred to have been gained and lost in the LD12 clade
Information storage and processing
RNA processing and modification
Translation, ribosomal structure and biogenesis
Chromatin structure and dynamics
Replication, recombination and repair
Coenzyme transport and metabolism
Nucleotide transport and metabolism
Secondary metabolites biosynthesis, transport and catabolism
Energy production and conversion
Inorganic ion transport and metabolism
Lipid transport and metabolism
Carbohydrate transport and metabolism
Amino acid transport and metabolism
Cellular processes and signaling
Posttranslational modification, protein turnover, chaperones
Signal transduction mechanisms
Cell wall/membrane/envelope biogenesis
Cell cycle control, cell division, chromosome partitioning
Intracellular trafficking, secretion, and vesicular transport
General function prediction only
Unmapped to the COG database
Mapped to the COG database
Low recombination rates in freshwater bacteria of the SAR11 clade
To quantify the influence of recombination to the divergences, we estimated the ratio at which recombination to mutations (r/m) contributes to the substitutions in the SAGs using ClonalFrame. For this analysis, we extracted sequences present in all SAGs except N17. After removal of segments with gaps using Gblocks, the alignment contained 31 genes in 25 kb of well aligned sequences (Table S2 in Additional file 1). Using this alignment, the r/m ratio was estimated to 0.14, which is at the lower end of r/m ratio estimates for bacteria. A re-analysis of 9 marine genomes using the same settings yielded a ratio of 61, consistent with the previously published estimate of 63 [1, 5]. The higher recombination rate estimate for the marine strains was not simply an effect of a higher divergence in the sequences used for the tests since the mean dS value of the freshwater dataset was several-fold higher than for the marine dataset.
Recruitment of lake metagenome sequences reveals the abundance profiles of the microclusters
List of metagenome data sets, sizes and fractions of LD12
Finally, we inferred the population recombination (rho) and mutation (theta) rates separately for the total set of reads recruited from each lake to the patchwork genome using a composite likelihood method specifically developed for assembled metagenome sequences . The analyses indicated a low, albeit measurable ratio of about 0.1 for the frequency of recombination relative to mutation events for each of the lakes. Such low genome-wide estimates for the LD12 lineage overall based on the metagenome data are consistent with the low r/m ratio estimated from the LD12 genome data.
Possible artifacts and biases when using SAGs
SAGs are useful to identify novel genes and lineages, but due to their partial natural and uneven coverage, we were worried that artifacts could be introduced that would make them unsuitable for population genomic analyses.
One concern was that the frequency of erroneously called base pairs in SAGs is much higher than in conventionally sequenced genomes, which could compromise mutation and recombination rate estimates from closely related pairs of SAGs. Unfortunately, since the LD12 cells have not been cultivated, no reference genomes are available that could be used to estimate the error rate. Instead, we turned to a benchmarking data set consisting of seven SAGs from Escherichia coli and eight SAGs from each of Meiothermus and Pedobacter, for which reference genomes sequenced by conventional whole-genome sequencing techniques were available. All SAGs were sequenced with the Illumina sequencing technology, and the LD12 SAGs were sequenced to a greater depth than the benchmark SAGs, which suggest that the datasets are comparable.
We aligned the assembled contigs of the benchmark SAGs to their corresponding reference genomes and estimated the error frequency to 5 × 10-6 SNPs per base pair on average (Figure S4 in Additional file 1). For comparison, the median frequency of polymorphisms for SAGs of the same microcluster types were in the range of 13 and 16 SNPs per kb, as inferred by aligning each SAG sequence to the corresponding consensus microcluster sequence for each of the 42 genes used to estimate the synonymous substitution frequencies (Figure S5 in Additional file 1) as well as for the 25 kb fragments used to estimate the r/m ratio (Figure S5 in Additional file 1). Thus, the overall sequence divergence for the most closely related pairs of SAGs was estimated to 1.5 × 10-2 SNPs per base pair on average, which is more than 10,000-fold higher than the error rate. The extremely low error rate of the benchmark data suggests that the mutation and recombination rate estimates of the SAR11 SAGs have not been inflated by sequencing errors.
Another concern was that missing data due to the uneven coverage might cause problems in the phylogenies. For example, 57 of the 58 proteins used to infer the phylogenetic relationships of the SAGs were identified in 6 SAGs on average, and in 9 SAGs at most (Table S3 in Additional file 1). This means that each protein alignment will have data missing from one or more SAGs. The missing sequences were managed by introducing gaps in the concatenated protein sequence alignment. For the single protein phylogenies we specifically selected genes for which sequence data were available for two or more SAGs from the A and B groups and from one or more SAGs of the C group. This procedure made it possible to compare the single protein topologies with the concatenated protein topology. The overall congruence in topologies indicated that missing data for one or more SAGs have not affected the diversification patterns. Moreover, our estimates of the relative frequency of recombination to mutation were based on sequence data present in all 10 SAGs, with no missing data. Thus, neither the tree topologies nor the mutation and recombination rate estimates have been affected by the uneven sequence coverage.
More seriously, missing data could compromise the gene flux analysis since missing sequences could incorrectly be interpreted as absent genes. To minimize this problem we used the total set of genes identified in all 10 SAGs to represent the LD12 core genome, excluding open reading frames present in single SAGs with no hits to any other genome. If we assume that the amplification bias is random, we expect most of the core genes to have been sampled at least once in these SAGs. But even so, we cannot exclude that a few core genes were not sampled due to biased amplification.
Thus, in the gene flux analysis, the terminal taxa called LD12 contains proteins identified in all 10 SAGs. This is different from the gene content of the other terminal taxa, in which the gene count represents all genes in one complete genome. Hence, the actual number of gains and losses on the branch to LD12 in the gene flux analyses are not directly comparable to the numbers on the other branches, nor are they proportional to the branch length to the LD12 clade. Moreover, the relative fraction of gained and lost genes depends on the penalties for gains and losses. To test how the penalties affected the gene flux analysis, we increased the penalty for gain in a stepwise manner from 2 to 5, upon which the number of gained genes gradually decreased from 238 to 134 clusters (Table S4 in Additional file 1). This suggests that 134 protein clusters have almost certainly been gained, while another 104 clusters have tentatively been gained once in the freshwater group or lost twice or more in the marine clades. Irrespective of this, it can be concluded that these clusters are variably present in the bacteria of the SAR11 clade, as also indicated by the observation that several of these genes were located in regions that were homologous to a hypervariable region in the marine SAR11 genomes.
Comparison of LD12 with human pathogen specialists
There is an interesting parallel between LD12 and human pathogen specialists in that both have low recombination frequencies and are nested within a group of much larger diversity. For example, Salmonella enterica has a broad host range of animals and a very high r/m ratio of 30.2 , that is, close to the r/m estimate for the marine SAR11 bacteria. In contrast, the human-restricted pathogen S. enterica serovar Typhi shows no evidence of recombination between isolates of Typhi , resembling the clonal population structure of the freshwater LD12 clade. Likewise, human pathogens such as Yersinia pestis, Bacillus anthracis, Burkholderia mallei, Mycobacterium tuberculosis and Bartonella quintana show extremely low levels of sequence diversity and are nested within genetically more diverse groups of bacteria with broader host ranges.
Most of the previously studied monomorphic lineages are pathogens, indicating a link between disease and transmission . As pointed out, however, the association between low genetic diversity and disease could reflect a discovery bias . Indeed, the identification of similar population dynamics in the SAR11 clade suggests that it is the bottleneck induced by the shift to a restricted growth niche, rather than the association with disease that purges the diversity in these populations irrespective of whether the adaptation is to a novel host or a new environment.
As these lineages adapt to their new niche, recombination with divergent members in the former environment is no longer possible. Recombination would then only happen within each adapting lineage, which would have little effect on genetic diversification, resulting in low r/m ratios. Moreover, each transmission event between host individuals may introduce new bottlenecks in case of the human pathogen specialists, depending on the mode of transmission. Likewise, geographic barriers and strong seasonal variability could potentially introduce additional bottlenecks after which recombination events will only occur between nearly identical genomes and may go undetected. This could lead to a situation in which each lake contains its own unique genotypes, with little mixing with strains from other lakes. The identification of a few highly abundant genotypes of microcluster type C in lake Damariscotta, not detected elsewhere, is consistent with this scenario. However, sequences affiliated with microclusters A and B could be identified in several geographically distant lakes, suggesting that dispersal occurs globally but that not all genotypes are present everywhere.
The overall sequence diversity of the classical monomorphic pathogens is in the range of only 0.1 to 2 SNPs per kilobase . In comparison, we identified about 15 SNPs per kilobase in core genomes for the most closely related LD12 strains. One explanation for this difference could be that the transition from salt to freshwater occurred earlier than the specialization to a single host. Indeed, the dN/dS ratios are much below one for genes in the LD12 core genome, suggesting that there has been enough time for selection to act on the accumulation of mutations after the shift.
Other characteristic features of strict host-specialization are reductions of genome size and low rates of horizontal gene transfer. Unfortunately, due to biased amplification and partial data for the individual SAGs, the frequency of deletions cannot be determined with any precision until cultivated strains of LD12 with sequenced genomes are available. However, the identification of a highly variable region in the SAGs that is homologous to the hypervariable regions of the marine SAR11 strains indicate that horizontal transfers have influenced the LD12 genomes. This may not be surprising since lakes contain many other groups of bacteria that provide a rich source of novel genes. As suggested for the marine SAR11 genomes, it is possible that the hypervariable regions in the freshwater SAR11 genomes evolve under selection for diversity of surface structures to evade phages and predators [6, 29]. If so, this suggests that selection for variability in surface structures need not necessarily be associated with high genome-wide recombination rates as in the marine SAR11 strains.
We have shown in this study that the relative contribution of recombination to the observed substitutions in fresh and saltwater bacteria of the SAR11 clade differs by more than two orders of magnitude. These results are remarkable in that the r/m ratio for the freshwater genomes is in the lower range of such estimates for bacteria, whereas the saltwater genomes represent the upper end of the spectrum . This implies that specialization to freshwater ecosystems has had a dramatic effect on the population dynamics of the SAR11 group of bacteria, much like the reduction in genetic diversity observed for human pathogen specialists.
Materials and methods
Sequencing and assembly
Single cells were sorted from freshwater samples and their DNA was amplified as described in Martinez-Garcia et al. . The LD12 single-cell genomes were identified through sequencing of the 16S rRNA genes using universal primers.
Both the LD12 and the benchmark single-cell genomes were generated at the US Department of Energy Joint Genome Institute (JGI) using Illumina technology. An Illumina standard shotgun library was constructed and sequenced using the Illumina v3 HiSeq 2000 platform, run mode 2 × 150 bp. All general aspects of library construction and sequencing performed at the JGI can be found at . Raw Illumina sequence data were filtered for known Illumina sequencing and library preparation artifacts and then screened and trimmed according to the k-mers present in the dataset. High-depth k-mers, presumably derived from the bias introduced by the multiple displacement amplification (MDA) reaction, cause problems in the assembly, especially if the k-mer depth varies by orders of magnitude for different regions of the genome. Reads representing highly abundant k-mers were removed such that no k-mers with a coverage of more than 30× were present after filtering. Reads with an average k-mer depth of less than 2× were removed.
The following steps were then performed for assembly. First, filtered Illumina reads were assembled using Velvet version 1.1.04 . The VelvetOptimiser script (version 2.1.7) was used with default optimization functions (n50 for k-mer choice, total number of base pairs in large contigs for cov_cutoff optimization). Second, 1 to 3 kbp simulated paired-end reads were created from Velvet contigs using the wgsim software. Third, the normalized Illumina reads were assembled together with simulated read pairs using Allpaths-LG (version 41043) . Parameters for assembly steps were: (i) VelvetOptimiser (--v --s 51 --e 71 --i 4 --t 1 --o ‘-ins_length 250 -min_contig_lgth 500’); (ii) wgsim (-e 0-1 100-2 100 -r 0 -R 0 -X 0); (iii) Allpaths-LG (prepareAllpathsParams: PHRED_64 = 1 PLOIDY = 1 FRAG_COVERAGE = 125 JUMP_COVERAGE = 25 LONG_JUMP_COV = 50, runAllpathsParams: THREADS = 8 RUN = std_pairs TARGETS = standard VAPI_WARN_ONLY = True OVERWRITE = True). The minimum coverage for a base to be called was 10 × .
Bioinformatic analysis of SAG sequences
The gene predictions from the JGI annotation (accession numbers in Table S1 in Additional file 1) were the basis of the analysis. The name and function annotations were verified for consistency within larger protein clusters, and can vary from the JGI annotation for individual SAGs. The SAG proteomes were clustered using orthoMCL  (following the default user guidelines) together with five SAR11 genomes (Ca. Pelagibacter ubique HTCC1062, HTCC1002, HTCC7211, IMCC9063 and HIMB114) and the alphaproteobacterial genomes used in Viklund et al. . Cluster annotations were based on ‘gene’ and ‘product’ NCBI annotations for all genes in the same cluster, and were used consistently for all cluster members.
Protein sequences of 57 alphaproteobacterial pan-orthologs from the SAGs and the SAR11 genomes were aligned with the linsi option of MAFFT v6.864b  and masked with Gblocks  (default settings except allowing for gaps in at most half of the taxa). For each protein the best model was chosen with ProtTest . Phylogenetic analysis of the concatenated protein alignment was performed with RAxML version 7.3.5 , using protein-specific protein models and the PROTCAT model with 100 quick bootstraps. Single gene trees were built with RAxML version 7.2.8  using a GTRGAMMA model, with 100 bootstraps.
Orthologous genes from orthoMCL present in at least two SAGs and two other SAR11 genomes were used for estimation of nucleotide sequence divergence levels in the LD12 population. The nucleotide sequences were aligned with the guidance of amino acid translations provided by TranslatorX  (−p F MAFFT v6.923  linsi option). The pairwise synonymous substitution frequencies (dS) were calculated by the yn00 program from PAML v4.5 [39, 40]. Sequences that covered less than 70% of the gene were removed, and also genes that had less than 70% of the alignment length used for the calculation were excluded, leaving a dataset of 775 genes.
Whole-genome alignments of nine LD12 SAGs, excluding N17, were performed with progressiveMauve (Mauve v2.3.1) with default settings  and well-aligned blocks were further extracted by Gblocks v. 0.91  with default stringent settings. The relative contribution of recombination and mutation to nucleotide substitutions (r/m) was estimated with ClonalFrame 1.2  (three chains with the following number of generations, burn in and sampling frequency, respectively: -x 1000000 -y 1000000 -z 100). Convergence was assessed for all parameters and the r/m ratio was the same for each chain. SAR11 results were obtained from the multilocus sequence typing data published , using ClonalFrame with the same settings. dS values were calculated for the genes in the ClonalFrame input datasets (20 freshwater, 9 marine) and mean values per gene were tested with Mann–Whitney U test, to verify that higher recombination in marine dataset is not due to higher sequence diversity.
The occurrence matrix (that is, presence/absence disregarding number of copies) of orthologous groups was made for each genome based on the clustering described above. The most parsimonious ancestral presence/absence patterns were inferred using PAUP* (acctran option, gain cost 2, loss cost 1). The patterns were plotted on the tree using Treegraph 2 . Gains and losses inferred on the LD12 branch were annotated with hmmer  (NCBI COG database as queries against the clusters database, e-value cutoff 10-10) choosing the best match for each cluster as the representative COG.
Bioinformatic analysis of the metagenomes
The largest scaffolds from different SAGs were checked for substantial overlaps to obtain the relative ordering. The starting point was the largest 250 kb scaffold from cell L15. Additional scaffolds were added in phylogenetic order, if there was a choice, resulting in the eight largest L15 scaffolds constituting the patchwork and complemented with sequence from the other SAGs. The process of joining overlapping scaffolds continued until all information from substantial overlaps (>10 kb) was utilized (Table S1 in Additional file 1). Non-overlapping or small scaffolds were placed at the end of each of the individual SAGs.
Metagenomic reads were recruited to the patchwork with nucmer  and further filtered (>75% identity and >150 bp alignment and >50% read length aligned). Cluster-based annotation of patchwork genes was used to obtain the corresponding SAG and other SAR11 sequences. Before phylogenetic analysis, metagenomic reads were cut to the gene boundary, removing reads with <150 bp overlap. The use of different SAGs as the reference for different regions was verified to have little effect on the recruitment results.
Nucmer alignments of the metagenomic reads were used for phylogenies. Reference SAG and other SAR11 nucleotide sequences were aligned with the guidance of amino acid translations by TranslatorX  (−p F MAFFT v6.92  linsi option). The metagenomic and the genomic alignment blocks were merged with mafft-profile . Maximum likelihood phylogenies with 100 bootstrap replicates were obtained using RAxML version 7.2.8 with the GTRGAMMA model .
Population recombination (rho) and mutation (theta) rates for the recruited reads were inferred by PIIM  (sequential mode, theta starting points 0.01 and 0.1). Xml input files for PIIM were generated from nucmer alignments and the corresponding read quality files.
The SAR11 SAGs listed in Table 1 have been deposited in GenBank under accession IDs: ATTB00000000, ATTD00000000, ATTC00000000, AQPD00000000, AQUE00000000, AQUF00000000, AQUG00000000, AQUH00000000, AQZA00000000, AZOF00000000. The benchmark data set is available on the JGI portal  under proposal ID 300869.
Alphaproteobacterial cluster of orthologous genes
Joint Genome Institute
Single amplified genome
single nucleotide polymorphism.
Financial support from the Swedish Research Council (grant numbers 349-2007-831, 621-2008-3259, 621-2011-4669-4669 to SGEA, 2009–3784, 2008–1923, 2012–3892 to SB), the European Union, the Göran Gustafsson Foundation, the Knut and Alice Wallenberg Foundation (grant numbers KAW-2011.0148, KAW-2012.0075 to SGEA), the National Science Foundation (DEB-841933, OCE-821374, EF-0633142 to RS) is acknowledged. Sequencing of single amplified genomes was conducted by the US Department of Energy Joint Genome Institute, supported by the Office of Science of the US Department of Energy under contract number DE-AC02-05CH11231. TW was also funded under DE-AC02-05CH11231. Computational resources were provided by SNIC through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) under Project p2006019 and p2009043.
- Vos M, Didelot X: A comparison of homologous recombination rates in bacteria and archaea. ISME J. 2009, 3: 199-208. 10.1038/ismej.2008.93.PubMedView ArticleGoogle Scholar
- Achtman M: Evolution, population structure, and phylogeography of genetically monomorphic bacterial pathogens. Annu Rev Microbiol. 2008, 62: 53-70. 10.1146/annurev.micro.62.081307.162832.PubMedView ArticleGoogle Scholar
- Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li K, Kravitz S, Heidelberg JF, Utterback T, Rogers YH, Falcón LI, Souza V, Bonilla-Rosso G, Eguiarte LE, Karl DM, Sathyendranath S, et al: The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007, 5: 398-431.View ArticleGoogle Scholar
- Biers EJ, Sun S, Howard EC: Prokaryotic genomes and diversity in surface ocean waters: interrogating the Global Ocean Sampling metagenome. Appl Environ Microbiol. 2009, 75: 2221-2229. 10.1128/AEM.02118-08.PubMedPubMed CentralView ArticleGoogle Scholar
- Vergin KL, Tripp HJ, Wilhelm LJ, Denver DR, Rappe MS, Giovannoni SJ: High intraspecific recombination rate in a native population of Candidatus Pelagibacter ubique (SAR11). Environ Microbiol. 2007, 9: 2430-2440. 10.1111/j.1462-2920.2007.01361.x.PubMedView ArticleGoogle Scholar
- Zhao Y, Temperton B, Thrash JC, Schwalbach MS, Vergin KL, Landry ZC, Ellisman M, Deerinck T, Sullivan MB, Giovannoni SJ: Abundant SAR11 viruses in the ocean. Nature. 2013, 494: 357-360. 10.1038/nature11921.PubMedView ArticleGoogle Scholar
- Grote J, Thrash JC, Huggett MJ, Landry ZC, Carini P, Giovannoni SJ, Rappe MS: Streamlining and core genome conservation among highly divergent members of the SAR11 clade. Mbio. 2012, 3: e00252-12.PubMedPubMed CentralView ArticleGoogle Scholar
- Giovannoni SJ, Bibbs L, Cho JC, Stapels MD, Desiderio R, Vergin KL, Rappe MS, Laney S, Wilhelm LJ, Tripp HJ, Mathur EJ, Barofsky DF: Proteorhodopsin in the ubiquitous marine bacterium SAR11. Nature. 2005, 438: 82-85. 10.1038/nature04032.PubMedView ArticleGoogle Scholar
- Viklund J, Ettema TJG, Andersson SGE: Independent genome reduction and phylogenetic reclassification of the oceanic SAR11 clade. Mol Biol Evol. 2012, 29: 599-615. 10.1093/molbev/msr203.PubMedView ArticleGoogle Scholar
- Zwart G, Crump BC, Agterveld M, Hagen F, Han SK: Typical freshwater bacteria: an analysis of available 16S rRNA gene sequences from plankton of lakes and rivers. Aquat Microb Ecol. 2002, 28: 141-155.View ArticleGoogle Scholar
- Kan J, Evans SE, Chen F, Suzuki MT: Novel estuarine bacterioplankton in rRNA operon libraries from the Chesapeake Bay. Aquat Microb Ecol. 2008, 51: 55-66.View ArticleGoogle Scholar
- Logares R, Brate J, Heinrich F, Shalchian-Tabrizi K, Bertilsson S: Infrequent transitions between saline and fresh waters in one of the most abundant microbial lineages (SAR11). Mol Biol Evol. 2010, 27: 347-357. 10.1093/molbev/msp239.PubMedView ArticleGoogle Scholar
- Bahr M, Hobbie JE, Sogin ML: Bacterial diversity in an arctic lake: a freshwater SAR11 cluster. Aquat Microb Ecol. 1996, 11: 271-277.View ArticleGoogle Scholar
- Zwart G, Hiorns WD, Methe BA, Van Agterveld MP, Huismans R, Nold SC, Zehr JP, Laanbroek HJ: Nearly identical 16S rRNA sequences recovered from lakes in North America and Europe indicate the existence of clades of globally distributed freshwater bacteria. Syst Appl Microbiol. 1998, 21: 546-556. 10.1016/S0723-2020(98)80067-2.PubMedView ArticleGoogle Scholar
- Salcher MM, Pernthaler J, Posch T: Seasonal bloom dynamics and ecophysiology of the freshwater sister clade of SAR11 bacteria ‘that rule the waves’ (LD12). ISME J. 2011, 5: 1242-1252. 10.1038/ismej.2011.8.PubMedPubMed CentralView ArticleGoogle Scholar
- Logares R, Brate J, Bertilsson S, Clasen JL, Shalchian-Tabrizi K, Rengefors K: Infrequent marine-freshwater transitions in the microbial world. Trends Microbiol. 2009, 17: 414-422. 10.1016/j.tim.2009.05.010.PubMedView ArticleGoogle Scholar
- Stepanauskas R: Single cell genomics: an individual look at microbes. Curr Opin Microbiol. 2012, 15: 613-620. 10.1016/j.mib.2012.09.001.PubMedView ArticleGoogle Scholar
- Lasken RS: Genomic sequencing of uncultured microorganisms from single cells. Nat Rev Microbiol. 2012, 10: 631-640. 10.1038/nrmicro2857.PubMedView ArticleGoogle Scholar
- Martinez-Garcia M, Swan BK, Poulton NJ, Gomez ML, Masland D, Sieracki ME, Stepanauskas R: High-throughput single-cell sequencing identifies photoheterotrophs and chemoautotrophs in freshwater bacterioplankton. ISME J. 2012, 6: 113-123. 10.1038/ismej.2011.84.PubMedPubMed CentralView ArticleGoogle Scholar
- Morris RM, Vergin KL, Cho JC, Rappe MS, Carlson CA, Giovannoni SJ: Temporal and spatial response of bacterioplankton lineages to annual convective overturn at the Bermuda Atlantic time-series study site. Limnol Oceanogr. 2005, 50: 1687-1696. 10.4319/lo.2005.50.5.1687.View ArticleGoogle Scholar
- Johnson PL, Slatkin M: Inference of microbial recombination rates from metagenomic data. PLoS Genet. 2009, 5: e1000674-10.1371/journal.pgen.1000674.PubMedPubMed CentralView ArticleGoogle Scholar
- Holt KE, Parkhill J, Mazzoni CJ, Roumagnac P, Weill F-X, Goodhead I, Rance R, Baker S, Maskell DJ, Wain J, Dolecek C, Achtman M, Dougan G: High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi. Nat Genet. 2008, 40: 987-993. 10.1038/ng.195.PubMedPubMed CentralView ArticleGoogle Scholar
- Achtman M, Zurth K, Morelli C, Torrea G, Guiyoule A, Carniel E: Yersinia pestis, the cause of plague, is a recently emerged clone of Yersinia pseudotuberculosis. Proc Natl Acad Sci U S A. 1999, 96: 14043-14048. 10.1073/pnas.96.24.14043.PubMedPubMed CentralView ArticleGoogle Scholar
- Didelot X, Barker M, Falush D, Priest FG: Evolution of pathogenicity in the Bacillus cereus group. Syst Appl Microbiol. 2009, 32: 81-90. 10.1016/j.syapm.2009.01.001.PubMedView ArticleGoogle Scholar
- Godoy D, Randle G, Simpson AJ, Aanensen DM, Pitt TL, Kinoshita R, Spratt BG: Multilocus sequence typing and evolutionary relationships among the causative agents of melioidosis and glanders, Burkholderia pseudomallei and Burkholderia mallei. J Clin Microbiol. 2003, 41: 2068-2079. 10.1128/JCM.41.5.2068-2079.2003.PubMedPubMed CentralView ArticleGoogle Scholar
- Comas I, Chakravartti J, Small PM, Galagan J, Niemann S, Kremer K, Ernst JD, Gagneux S: Human T cell epitopes of Mycobacterium tuberculosis are evolutionarily hyperconserved. Nat Genet. 2010, 42: 498-503. 10.1038/ng.590.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Bai J-Y, Wang L-Y, Zeng L, Shi Y-S, Qiu Z-L, Ye H-H, Zhang X-F, Lu Q-B, Kosoy M, Liu W, Cao WC: Genetic diversity of Bartonella quintana in macaques suggests zoonotic origin of trench fever. Mol Ecol. 2013, 22: 2118-2127. 10.1111/mec.12261.PubMedView ArticleGoogle Scholar
- Achtman M: Insights from genomic comparisons of genetically monomorphic bacterial pathogens. Phil Trans R Soc Lond B Biol Sci. 2012, 367: 860-867. 10.1098/rstb.2011.0303.View ArticleGoogle Scholar
- Rodriguez-Valera F, Martin-Cuadrado A-B, Rodriguez-Brito B, Pasic L, Thingstad TF, Rohwer F, Mira A: Explaining microbial population genomics through phage predation. Nat Rev Microbiol. 2009, 7: 828-836. 10.1038/nrmicro2235.PubMedView ArticleGoogle Scholar
- DOE Joint Genome Institute. http://www.jgi.doe.gov/,
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18: 821-829. 10.1101/gr.074492.107.PubMedPubMed CentralView ArticleGoogle Scholar
- Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, Berlin AM, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander ES, Jaffe DB: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A. 2011, 108: 1513-1518. 10.1073/pnas.1017351108.PubMedPubMed CentralView ArticleGoogle Scholar
- Li L, Stoeckert CJ, Roos DS: OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 2003, 13: 2178-2189. 10.1101/gr.1224503.PubMedPubMed CentralView ArticleGoogle Scholar
- Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005, 33: 511-518. 10.1093/nar/gki198.PubMedPubMed CentralView ArticleGoogle Scholar
- Castresana J: Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000, 17: 540-552. 10.1093/oxfordjournals.molbev.a026334.PubMedView ArticleGoogle Scholar
- Darriba D, Taboada GL, Doallo R, Posada D: ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics. 2011, 27: 1164-1165. 10.1093/bioinformatics/btr088.PubMedView ArticleGoogle Scholar
- Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22: 2688-2690. 10.1093/bioinformatics/btl446.PubMedView ArticleGoogle Scholar
- Abascal F, Zardoya R, Telford MJ: TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. 2010, 38: W7-W13. 10.1093/nar/gkq291.PubMedPubMed CentralView ArticleGoogle Scholar
- Yang ZH, Nielsen R: Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 2000, 17: 32-43. 10.1093/oxfordjournals.molbev.a026236.PubMedView ArticleGoogle Scholar
- Yang Z: PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007, 24: 1586-1591. 10.1093/molbev/msm088.PubMedView ArticleGoogle Scholar
- Darling AE, Mau B, Perna NT: progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One. 2010, 5: e10209-10.1371/journal.pone.0010209.PubMedPubMed CentralView ArticleGoogle Scholar
- Didelot X, Falush D: Inference of bacterial microevolution using multilocus sequence data. Genetics. 2007, 175: 1251-1266.PubMedPubMed CentralView ArticleGoogle Scholar
- Stoever BC, Mueller KF: TreeGraph 2: combining and visualizing evidence from different phylogenetic analyses. BMC Bioinformatics. 2010, 11: 7-10.1186/1471-2105-11-7.View ArticleGoogle Scholar
- Eddy SR: Accelerated profile HMM searches. PLoS Comp Biol. 2011, 7: e1002195-10.1371/journal.pcbi.1002195.View ArticleGoogle Scholar
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol. 2004, 5: R12-10.1186/gb-2004-5-2-r12.PubMedPubMed CentralView ArticleGoogle Scholar
- Katoh K, Toh H: Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform. 2008, 9: 286-298. 10.1093/bib/bbn013.PubMedView ArticleGoogle Scholar
- Markowitz VM, Chen IM, Palaniappan K, Chu K, Szeto E, Grechkin Y, Ratner A, Jacob B, Huang J, Williams P, Huntemann M, Anderson I, Mavromatis K, Ivanova NN, Kyrpides NC: IMG: the Integrated Microbial Genomes database and comparative analysis system. Nucleic Acids Res. 2012, 40: D115-D122. 10.1093/nar/gkr1044.PubMedPubMed CentralView ArticleGoogle Scholar
- Guy L, Kultima JR, Andersson SGE: genoPlotR: comparative gene and genome visualization in R. Bioinformatics. 2010, 26: 2334-2335. 10.1093/bioinformatics/btq413.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.