Bacterial epidemiology and biology - lessons from genome sequencing
© BioMed Central Ltd 2011
Published: 24 October 2011
Next-generation sequencing has ushered in a new era of microbial genomics, enabling the detailed historical and geographical tracing of bacteria. This is helping to shape our understanding of bacterial evolution.
The application of whole-genome sequencing affords the opportunity to generate bacterial nucleic acid sequence data of extraordinary resolution, making it possible to identify single base changes within entire genomes. The development of second (and third) generation sequencing technology has largely been driven by the desire to assess human genetic variation rapidly by mapping genome-wide single-nucleotide polymorphisms (SNPs). Several recent studies have applied similar analyses at the whole-genome level to the much smaller genomes of bacteria, providing data of very fine-scale resolution and enabling the evolutionary history of multiple strains within a clonal lineage to be determined [1–8]. Not surprisingly, these studies have focused on bacterial pathogens because of their importance in disease.
Distinguishing individual bacterial lineages within a species, initially by phenotypic and subsequently by genotypic typing techniques, has been the cornerstone of infectious disease epidemiology, allowing the identification and tracking of the organisms responsible for infection and disease. On a wider scale, molecular typing is central to determining the population structure and understanding the evolution of bacterial pathogens. To date, sequence-based typing approaches, such as multilocus sequence typing (MLST), have relied on variation within a few genes. Although such techniques are highly informative, they have limited resolution when applied to closely related isolates. Thus, they are often unsuitable for identifying fine-grained evolutionary events or for distinguishing clonal strains within a recent epidemic. This situation has now been improved by the application of next-generation sequencing to bacterial collections of known origin and provenance. We review recent examples in which such studies have informed our understanding of the epidemiology and evolution of bacterial pathogens.
Studies of the phylogeny of a species or of a clonal lineage within a species are highly dependent on the quantity and diversity of isolates sampled in the study. Recently, it has become possible to fully sequence significant numbers of isolates in strain collections, revealing new information on their temporal and spatial transmission dynamics.
The causative agent of the 'Black Death' has been a matter of considerable debate. Disease symptoms such as inflamed buboes led to the assumption that the 'Black Death' was a typical Yersinia pestis infection, but other etiologic agents, including hemorrhagic viruses, have been proposed on the basis of clinical and epidemiological information. The SNP analysis used to trace the phylohistory of Y. pestis is a development of an earlier SNP analysis that has already been used to provide information on ancient bacterial DNA. Specifically, it had been used to analyze DNA extracted from human dental pulp taken from medieval mass graves in several European locations. The analysis made it possible not only to amplify Y. pestis DNA taken from the victims but also to place these bacterial DNA sequences in the correct historical context through phylogenetic analysis. The results imply that at least two distinct ancient clones of Y. pestis were responsible for the 'Black Death', at least one of which appears to be extinct. These findings definitively show that Y. pestis was the causative agent of the 'Black Death'.
Cholera, caused by Vibrio cholerae, is another ancient scourge that has recently been studied by rapid genome sequencing. The current (seventh) cholera pandemic, recently manifest in Zimbabwe, Pakistan and Haiti, is caused by the El Tor biotype of serogroup O1 (El Tor O1) [13, 14]. Two clinical V. cholerae isolates from the current outbreak in Haiti were fully sequenced in less than 24 hours using third-generation sequencing with the PacBio RS sequencing system. Third-generation single-molecule real-time sequencing with this system involves direct observation of the DNA polymerase while it synthesizes a strand of DNA. It has advantages over second-generation sequencing in terms of increased read length and speed. Whole-genome sequence comparisons with reference strains confirmed that the Haitian cholera epidemic is clonal and El Tor O1, and suggested that the V. cholerae strain was introduced into Haiti by human activity from a distant geographic source. Although this was a limited study, the data suggest that a South Asian variant of V. cholerae El Tor O1 has recently been accidentally introduced into Haiti. This theory is consistent with the epidemiological evidence.
Genetic epidemiological studies are likely to prove particularly useful for tracing the routes of transmission and sources of infection for hospital-acquired infections. Whole-genome sequencing has been used to explore the phylogeny, horizontal gene transfer, recombination, and micro- and macroevolution of the major hospital-acquired pathogen Clostridium difficile [5, 17, 18]. This infection produces a wide range of symptoms from mild diarrhea to life-threatening pseudomembranous colitis, and characteristically occurs after treatment with broad-spectrum antibiotics. The hospital environment in which there are people undergoing antibiotic treatment provides a discrete ecosystem in which C. difficile persists and select virulent clones thrive. Consequently, C. difficile is the most frequent cause of nosocomial diarrhea worldwide [19, 20].
Phylogenetic analysis demonstrates that C. difficile is a genetically diverse species, with estimates of the date of the most recent common ancestor (MRCA) varying from 1.1 to 85 million years before present. By contrast, the disease-causing isolates (PCR-ribotypes 017s, 027s, and 078s) have arisen from multiple lineages over very short evolutionary timescales. This suggests that virulence has evolved independently in several highly epidemic lineages, and contradicts the notion that a single lineage evolved to become pathogenic. For example, the MRCA of the very recently emerged PCR-ribotype 027 hypervirulent lineage was confirmed at approximately 30 years ago, consistent with the dates of the transcontinental spread of C. difficile [5, 21, 22]. This has implications for the emergence of C. difficile as a human pathogen. Although C. difficile appears to be an ancient species, it was recognized as a pathogen only 30 years ago, indicating that genetic modifications, changes in interactions between host and pathogen, and factors such as human activity, hospital design and antibiotic use might have contributed to the recent emergence of C. difficile as a major pathogen.
The first large-scale whole-genome analysis of a clonal lineage within a species was undertaken for an epidemic sequence type (ST239) of the notorious hospital pathogen methicillin-resistant Staphylococcus aureus (MRSA). The collection of 63 geographically diverse ST239 representatives over four decades revealed clear evidence of both geographical grouping and intercontinental transmission over time. The resolution of the whole-genome SNP analysis was so fine that it revealed the microevolution of MRSA within a Thai hospital over a 7-month period. It was even able to determine when new strains were introduced from the community, or when they were acquired by person-to-person transmission within the hospital environment. Thus, very high-resolution phylogenetic analyses enable detailed epidemiological reconstructions at the global (that is, spread by modern transport) and local (that is, within hospital) levels, suggesting that such analyses are likely to be clinically applicable in the near future.
A similar whole-genome analysis of 240 Streptococcus pneumoniae isolates of the antibiotic-resistant PMEN1 clonal lineage (which is often serotype 23F) tracked the phylogeny of this recombinogenic lineage by SNP analysis, after accounting for the confounding effect of extensive recombination. The phylogeny of the PMEN1 clone confirms that it arose around 40 years ago. Analysis of genetic markers, together with the provenance of the strains, suggests that the introduction of the heptavalent PCV7 glycoconjugate vaccine in the US (designed against 7 of the 90 pneumococcal serotypes, including 23F) effectively led to the depletion of the resident 23F population. However, this selective pressure opened the niche to non-vaccine serotypes such as 19A. These studies strongly suggest that although some of these non-vaccine isolates were from the same lineage as the 23F serotype, they had acquired a new non-vaccine capsule type before the introduction of the vaccine. This demonstrates that, despite the remarkable adaptability of recombinogenic bacteria such as the pneumococcus, strong vaccine pressure can remove the population that expresses vaccine-type capsules before it can switch its capsule genes. The emergence of vaccine-escape strains such as the 19A serotype thus has important implications for the introduction of partial species coverage vaccines. Such interventions require vigilance and the genomic surveillance of all bacterial clones. Overall, this study shows the surprisingly rapid evolution of a recombinogenic bacterial pathogen that can be linked to clinical interventions such as antibiotic usage and the introduction of vaccines.
A novel aspect of this study was the use of information generated from whole-genome analysis to select representative strains for comparative transcriptome and mass spectroscopy SNP analyses [7, 23]. A comparative study of transcriptome expression in a subset of strains revealed that closely related strains, which are differentiated by apparently modest genetic changes, can have significantly divergent transcriptomes. Therefore, subtle genetic changes could have more significant phenotypic consequences than previously appreciated.
One advantage of whole-genome SNP analysis is that it could be used to identify non-synonymous mutations that are more likely to be influenced by evolutionary selective pressure than synonymous mutations or mutations in intergenic DNA. The identification of genes that have undergone non-synonymous mutations should provide clues to the evolutionary pressures experienced over time by a bacterial species or clonal lineages within a species. Traditionally, the selective forces acting on a bacterial genome were investigated by calculating the ratio of non-synonymous to synonymous substitutions (dN/dS) for a given species comparison. A ratio significantly less than 1 suggests strong purifying or stabilizing selection, whereas a ratio close to 1 suggests a neutral selection pressure, and a ratio greater than 1 indicates diversifying selection. For very closely related genomes (for example, those within clonal lineages), however, dN/dS can be close to 1 simply because there has been insufficient time for selection to act. This, combined with the very small number of mutations in individual genes within these lineages, actually makes the identification of genes under selection very difficult. Nevertheless, the increasing number of strains within bacterial collections that are currently being sequenced does allow some inferences relating to genetic selection.
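The dN/dS reasoning above can be made concrete with a minimal sketch. It assumes that the numbers of non-synonymous and synonymous substitutions, and of the corresponding sites, have already been counted; the function names, the example counts and the `tolerance` threshold are illustrative assumptions, not values or methods from the studies reviewed here. A real analysis would use a statistical test rather than a fixed tolerance, and would usually correct for multiple substitutions at the same site.

```python
def dn_ds(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Return the uncorrected dN/dS ratio from substitution and site counts."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    if syn_subs == 0:
        raise ValueError("dN/dS is undefined without synonymous substitutions")
    dn = nonsyn_subs / nonsyn_sites  # non-synonymous substitutions per non-synonymous site
    ds = syn_subs / syn_sites        # synonymous substitutions per synonymous site
    return dn / ds


def interpret(ratio, tolerance=0.1):
    """Label a dN/dS ratio; `tolerance` is an arbitrary illustrative cut-off."""
    if ratio < 1 - tolerance:
        return "purifying"
    if ratio > 1 + tolerance:
        return "diversifying"
    return "neutral"


if __name__ == "__main__":
    # Hypothetical counts for a single pairwise comparison.
    ratio = dn_ds(nonsyn_subs=30, nonsyn_sites=3000, syn_subs=50, syn_sites=1000)
    print(round(ratio, 3), interpret(ratio))
```

The guard on `syn_subs` reflects the practical problem described above: within very recent clonal lineages, individual genes often carry too few mutations for the ratio to be informative.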
The whole-genome sequencing of 19 isolates of Salmonella enterica serovar Typhi (S. Typhi), a human-restricted bacterial pathogen that causes typhoid fever, confirmed the very limited genetic variation within this species. The mean dN/dS of each isolate in comparison with the last common ancestor was approximately 0.66, suggesting that there has been a weak trend towards stabilizing selection since the occurrence of the MRCA of S. Typhi. Detailed analysis of the SNPs showed little evidence of diversifying selection, antigenic variation or recombination between isolates. Only 38% of genes had any sequence variation at all, and the occurrence of variants in almost all of these genes appeared to match random expectation. Nevertheless, evidence for selection could be found by looking for SNPs that occurred independently on different branches of the tree (that is, for homoplasy, or convergent evolution). Examples of such non-synonymous mutations include those in gyrA that are responsible for resistance to quinolone-based antibiotics. No genes besides gyrA contained multiple homoplasic SNPs, and very few genes had an excess of non-synonymous SNPs, showing that unlike the strong adaptive selection for mutations conferring antibiotic resistance, there is little evidence for selective pressure for antigenic variation driven by immune selection. This is consistent with the previous assertion that most of the variants in the Typhi genome accumulate by genetic drift. The adaptive mutations evident in the gyrA gene highlight the strong selective pressure on the S. Typhi genome associated with antibiotic use in the human population. This is not particularly surprising as the fitness advantage associated with increased antibiotic resistance is very strong. The paucity of similar evidence for other adaptive mutations suggests that S. Typhi is under relatively little selective pressure from its human host, consistent with its long-term carriage within a protected niche, the gall-bladder.
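The homoplasy search described above can be sketched as follows. The sketch assumes that each SNP has already been mapped to the tree branch on which it arose; a site is flagged as homoplasic when it changes on more than one branch. The input format, the function names, and the site positions, branch labels and gene names other than gyrA are hypothetical and are not taken from the S. Typhi study.

```python
from collections import defaultdict


def homoplasic_sites(events):
    """Map each site that mutated on more than one branch to its sorted branches.

    `events` is an iterable of (site, branch) pairs, one per inferred mutation.
    """
    branches_by_site = defaultdict(set)
    for site, branch in events:
        branches_by_site[site].add(branch)
    return {
        site: sorted(branches)
        for site, branches in branches_by_site.items()
        if len(branches) > 1
    }


def genes_with_multiple_homoplasies(homoplasies, gene_of_site, minimum=2):
    """Return genes that contain at least `minimum` homoplasic sites."""
    counts = defaultdict(int)
    for site in homoplasies:
        counts[gene_of_site.get(site, "intergenic")] += 1
    return sorted(gene for gene, n in counts.items() if n >= minimum)


if __name__ == "__main__":
    # Hypothetical mutation events: (genome position, branch label).
    events = [
        (101, "b1"), (101, "b4"),    # same site hit on two branches
        (105, "b2"), (105, "b5"),    # second recurrent site in the same gene
        (900, "b3"),                 # ordinary SNP, one branch only
        (2000, "b1"), (2000, "b6"),  # recurrent site in another gene
    ]
    gene_of_site = {101: "gyrA", 105: "gyrA", 900: "geneX", 2000: "geneY"}
    recurrent = homoplasic_sites(events)
    print(recurrent)
    print(genes_with_multiple_homoplasies(recurrent, gene_of_site))
```

Counting branches per site, rather than isolates per allele, is what separates convergent evolution from a single mutation inherited by many descendants.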
The technique of identifying genes with a significant excess of SNPs has also been used in the Group A Streptococcus data set: just under 5% of the variable genes were found to have a statistical excess of variants. These included several surface proteins and virulence factors, along with a regulator, ropB, that controls multiple virulence genes, indicating that selection by the host is a significant factor in the recent evolution of this organism.
For C. difficile, dN/dS was calculated for different clonal lineages. The data from deeply diverging lineages provided evidence of strong purifying selection. For example, the average dN/dS between the divergent lineage 078 and the other strains tested is ≈ 0.08. By contrast, for recently diverged lineages, such as the 027 ribotype, dN/dS is very close to 1, consistent with the delayed action of purifying selection and again making it difficult to identify genes that are under selection. The extensive full-genome data in this study did, however, permit the identification of genes that had significantly increased rates of non-synonymous nucleotide polymorphism, thereby providing clues about the operation of selective forces in the host.
A similar approach for identifying genes that are under selection (that is, searching for homoplasic SNPs within the background of random SNPs that delineate the tree) was used in the MRSA ST239 study. Again, there was a relatively small number of homoplasic SNPs (less than 1%) in the core genome, but around 30% of these could be directly linked to the evolution of resistance to antibiotics currently in use (for example, quinolones, rifampicin, mupirocin and trimethoprim). These findings confirm that antibiotic use is a major driving force in the evolution of MRSA, and that this technique can detect recent selection. Nevertheless, the majority of homoplasies were found in genes for which no reason for selective pressure could be clearly identified. Understanding how and why these mutations are selected could provide novel information on the emergence of multidrug-resistant clonal lineages.
The ability to distinguish vertically acquired substitutions from horizontally acquired sequences is crucial to reconstructing phylogenies for recombinogenic organisms. This was recently attempted for the S. pneumoniae data set, in which 88% of the variants were estimated to have been introduced by recombination. Despite this, the relative likelihood that a polymorphism was introduced through recombination rather than by point mutation (r/m ratio) was estimated to be 7.2, less than the previously calculated value of approximately 66 from MLST data. By removing recombination events from the phylogeny, the number of homoplasic sites was reduced by 97%, and the apparent rate of SNP accumulation was much more consistent within the tree, thus considerably strengthening the core phylogeny and the inferences that could be made from it.
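The link between the 88% figure and the r/m ratio can be checked with a short calculation. The sketch below assumes that r/m is simply the number of polymorphisms attributed to recombination divided by the number attributed to point mutation; the function names are illustrative. Using the rounded 88% gives a ratio of about 7.3, in line with the reported 7.2, and the small difference presumably reflects rounding of the percentage.

```python
def r_over_m(recombination_snps, mutation_snps):
    """Ratio of polymorphisms introduced by recombination to those from point mutation."""
    if mutation_snps <= 0:
        raise ValueError("need at least one polymorphism attributed to point mutation")
    return recombination_snps / mutation_snps


def r_over_m_from_fraction(fraction_recombinant):
    """Same ratio, starting from the fraction of variants introduced by recombination."""
    if not 0 <= fraction_recombinant < 1:
        raise ValueError("fraction must be in the range [0, 1)")
    return fraction_recombinant / (1 - fraction_recombinant)


if __name__ == "__main__":
    # 88% of variants introduced by recombination, as stated in the text.
    print(round(r_over_m_from_fraction(0.88), 1))
```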
The large majority of the SNPs identified in these studies of recent clones are effectively neutral over these timescales, and thus they can be used to give a good estimate of the current mutation rate. Curiously, this estimate for S. pneumoniae and S. aureus is roughly 1,000 times greater than that estimated from the synonymous substitution rate between deeper bacterial lineages (such as Escherichia coli and Salmonella). This apparent discrepancy can be reconciled by the fact that synonymous sites, commonly assumed to be neutral, are in fact under selection over longer time periods because of pressures such as G+C content and codon usage.
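One common way to turn dated isolates into a mutation-rate estimate is a root-to-tip regression: the number of SNPs separating each isolate from the root of the tree is regressed on its isolation date, the slope gives SNPs per genome per year, and the x-intercept gives an approximate date for the MRCA. The sketch below is an illustration of that general idea under stated assumptions, not the method used in the studies reviewed here; the function name, the isolation years, the SNP counts and the genome length are all hypothetical.

```python
def root_to_tip_rate(years, snp_distances):
    """Least-squares fit of root-to-tip SNP distance against isolation year.

    Returns (snps_per_year, estimated_mrca_year).
    """
    n = len(years)
    if n < 2 or n != len(snp_distances):
        raise ValueError("need two or more isolates with matching dates and distances")
    mean_x = sum(years) / n
    mean_y = sum(snp_distances) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("isolates must span more than one sampling date")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, snp_distances))
    slope = sxy / sxx
    if slope <= 0:
        raise ValueError("no positive temporal signal in the data")
    intercept = mean_y - slope * mean_x
    return slope, -intercept / slope  # x-intercept: date at which distance is zero


if __name__ == "__main__":
    # Hypothetical isolates: isolation year and SNPs separating each from the root.
    years = [1980, 1990, 2000, 2010]
    distances = [30, 60, 90, 120]
    rate, mrca_year = root_to_tip_rate(years, distances)
    genome_length = 2_000_000  # hypothetical genome size in base pairs
    print(rate, mrca_year, rate / genome_length)
```

Dividing the slope by the genome length gives a per-site rate, which is the quantity that would be compared with long-term synonymous substitution rates.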
Advances in sequencing technologies have enabled the whole-genome phylogenies of multiple clonal isolates to be determined readily, but as with all microbial epidemiological investigations, the size and composition of the strain collection investigated is crucial for subsequent biological interpretations. This is particularly relevant for bacterial pathogens that reside in multiple niches: isolates are frequently collected only from patients. This might bias the data sets and provide an incomplete picture of the true diversity of a bacterial species, which might have alternative niches and reservoirs to humans. The lack of representation of the true diversity results in gaps in our evolutionary knowledge of a given pathogen. For example, there appear to be huge evolutionary distances between current epidemic clonal lineages of C. difficile. Nevertheless, because so much data can now be derived from bacterial isolates, it is now more important than ever to engage appropriately with clinicians and strain collectors and to acquire accurate strain provenance.
Current sequencing approaches are limited for more divergent bacterial pathogens. As more representatives of clonal lineages within a bacterial species are sequenced and as sequencing technology continues to improve, however, more diverse and panmictic bacterial populations will be sequenced. This might be particularly revealing for organisms such as Helicobacter pylori, which is resident in the stomachs of half of the human population. MLST analysis of multiple H. pylori strains has been used to reconstruct and confirm human population expansion and migration in Africa, Europe and the Pacific. The higher resolution of whole-genome sequencing promises significant progress towards tracing human pre-history.
Notwithstanding these limitations, molecular epidemiology has clearly come of age for clonal bacterial pathogens. The major advantage of whole-genome sequencing is its power to discriminate between isolates, enabling the generation of robust phylogenies. This provides greater confidence in identifying the origins of infections and the routes of transmission, as demonstrated by the monitoring of patient-to-patient transfer of bacteria within a hospital [6, 29] or within a community. Such finely tuned transmission tracking will be vital in determining whether factors such as patient-to-patient transmission are important in the spread of the disease. In the future, this may facilitate the practice of proactive infectious disease surveillance to truncate or avert epidemics.
In contrast to most typing methods, whole-genome sequencing also facilitates the direct identification of gene losses and gene gains that can play a role in the evolution of a bacterial species or clonal lineage within a species. Such information has frequently identified the emergence of antibiotic resistance within populations, which is often associated with increased antibiotic usage. Nevertheless, other more subtle selective forces are also likely to be important in the emergence of bacterial pathogens, and our current knowledge in this area is lacking. High-throughput sequencing holds the promise of mapping more subtle associations between phenotype and genotype [30, 31]. The next few years will see an increase in the biological interpretation of such mutations using high-throughput in vitro assays and the selected testing of representative isolates in animal infection studies. This should further our understanding of the host, ecological, environmental and human forces that are important in the evolution of bacterial pathogens and enable further appropriate interventions to be made. Another area in which we lack knowledge is how and why some bacterial lineages appear to diminish, or even become extinct. For example, the S. pneumoniae lineage BM4200 is a multidrug-resistant serotype 23F isolate from 1978, but despite its similarity to the PMEN1 isolates, it is now seldom found. As genome sequencing, SNP detection and geospatial information become more accessible, these methods will continue to transform the way molecular epidemiology is used to study populations of bacterial pathogens.
We are now in a new era of high-throughput, sequence-based microbiology that will have important implications for health service providers working with infectious diseases. Rather than merely identifying a particular bacterium by culture, whole-genome sequencing will provide a better understanding of its origin and disease potential. As global positioning systems become more accessible and merge with molecular epidemiology, more accurate geospatial information on the origins of strains and outbreaks will become available. Just as the DNA fingerprinting of human microsatellites changed our lives through diverse applications from paternity testing to crime scene investigations, next-generation sequencing means that molecular epidemiology is set to be revolutionized for clonal bacterial pathogens. The next few years promise a voyage of discovery in terms of the attribution of sources and transmission tracking of bacteria, the understanding of how and why epidemic clones emerge or disappear, and ultimately the management and treatment of infectious diseases.
We wish to acknowledge research funding from The Wellcome Trust.