Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences

Background

A major goal of metagenomics is to characterize the taxonomic composition of an environment. The most popular approach relies on 16S rRNA sequencing; however, this approach can generate biased estimates owing to differences in the copy number of the gene, even between closely related organisms, and owing to PCR artifacts. Alternatively, the taxonomic composition can be determined from metagenomic shotgun sequences by matching reads against a database of reference sequences. A major limitation of the computational methods used for this purpose is that they apply a single, universal classification threshold to all genes at all taxonomic ranks.

Methods

We present a novel taxonomic profiler for metagenomic sequences, MetaPhyler [1], which relies on 31 phylogenetic marker genes as a taxonomic reference. Because genes can evolve at different rates and because shotgun reads contain gene fragments of different lengths, we propose that better classification results can be obtained by tuning the taxonomic classifier to the length of the gene fragment, to a particular gene and to the taxonomic rank. Our classifier uses different thresholds for each of these parameters, and these thresholds are automatically learned from the taxonomic structure of the reference database.
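The idea of learning a separate threshold for each gene, taxonomic rank and fragment length can be illustrated with a minimal Python sketch. This is not MetaPhyler's actual implementation: the function names, the fixed-width length bins, the use of a single similarity score per hit, and the rule of picking the cutoff that best separates same-taxon from different-taxon reference pairs are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rank order, most specific first.
RANKS = ["genus", "family", "order", "class", "phylum"]


def length_bin(fragment_len, width=30):
    """Bucket fragment lengths so that each bucket gets its own cutoff."""
    return fragment_len // width


def learn_thresholds(training_pairs):
    """Learn one score cutoff per (gene, rank, length bin).

    training_pairs: iterable of (gene, rank, fragment_len, score, same_taxon),
    where same_taxon says whether two reference sequences share a taxon at
    that rank. The cutoff chosen is the observed score that best separates
    same-taxon pairs from different-taxon pairs (an assumed learning rule).
    """
    grouped = defaultdict(list)
    for gene, rank, frag_len, score, same_taxon in training_pairs:
        grouped[(gene, rank, length_bin(frag_len))].append((score, same_taxon))

    thresholds = {}
    for key, pairs in grouped.items():
        candidates = sorted({score for score, _ in pairs})

        def n_correct(cutoff):
            # A pair is handled correctly if "score >= cutoff" matches its label.
            return sum((score >= cutoff) == same for score, same in pairs)

        thresholds[key] = max(candidates, key=n_correct)
    return thresholds


def classify(gene, fragment_len, score, hit_lineage, thresholds):
    """Return (rank, taxon) at the most specific rank whose cutoff is met."""
    for rank in RANKS:
        cutoff = thresholds.get((gene, rank, length_bin(fragment_len)))
        if cutoff is not None and score >= cutoff:
            return rank, hit_lineage[rank]
    return None  # leave the read unclassified


# Toy training data for one illustrative marker gene and 60 bp fragments.
training = [
    ("rpoB", "genus", 60, 90, True),
    ("rpoB", "genus", 60, 85, True),
    ("rpoB", "genus", 60, 70, False),
    ("rpoB", "genus", 60, 60, False),
    ("rpoB", "family", 60, 80, True),
    ("rpoB", "family", 60, 72, True),
    ("rpoB", "family", 60, 50, False),
]
thresholds = learn_thresholds(training)
lineage = {"genus": "Escherichia", "family": "Enterobacteriaceae"}
result = classify("rpoB", 60, 75, lineage, thresholds)
```

With this toy data, a 60 bp fragment scoring 75 falls below the learned genus cutoff but above the family cutoff, so it is assigned at the family level only. A single universal threshold could not express that distinction.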

Results

We randomly simulated about 300,000 DNA sequences of 60 bp and about 70,000 DNA sequences of 300 bp from phylogenetic marker genes. Table 1 shows the performance of the phylogenetic classifications from MetaPhyler, PhymmBL [2], MEGAN [3] and WebCARMA [4]. Each query sequence was removed from the reference dataset before the programs were run, so that no read could be matched to its own source sequence. The sensitivity of MetaPhyler is significantly higher than that of the other tools in all situations because our classifier is explicitly trained at each taxonomic rank.

Table 1 Comparison of sensitivity and precision.
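As a rough guide to how the two metrics named in Table 1 can be computed, the following Python sketch assumes one common pair of definitions: sensitivity as the fraction of all query reads that are correctly classified, and precision as the fraction of classified reads that are correct. The abstract does not state its exact definitions, so these, along with the function and variable names, are assumptions.

```python
def sensitivity_precision(predicted, truth):
    """Compute (sensitivity, precision) at a single taxonomic rank.

    predicted: dict mapping read id -> predicted taxon, or None if unclassified.
    truth:     dict mapping read id -> true taxon.
    Assumed definitions:
      sensitivity = correct / all reads
      precision   = correct / reads that received any classification
    """
    n_total = len(truth)
    n_classified = sum(1 for read in truth if predicted.get(read) is not None)
    n_correct = sum(
        1
        for read in truth
        if predicted.get(read) is not None and predicted[read] == truth[read]
    )
    sensitivity = n_correct / n_total if n_total else 0.0
    precision = n_correct / n_classified if n_classified else 0.0
    return sensitivity, precision


# Toy example: four reads, one left unclassified and one misclassified.
truth = {"r1": "Escherichia", "r2": "Bacillus", "r3": "Vibrio", "r4": "Bacillus"}
predicted = {"r1": "Escherichia", "r2": "Vibrio", "r3": None, "r4": "Bacillus"}
sens, prec = sensitivity_precision(predicted, truth)
```

Under these definitions, a tool that leaves more reads unclassified loses sensitivity without necessarily losing precision, which is why the two values are reported side by side.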

In addition, we created a simulated metagenomic sample comprising five genomes. Table 2 shows the taxonomic profiles estimated by the different approaches. In this setting, MetaPhyler also outperforms the other approaches, reconstructing the true taxonomic distribution more accurately.

Table 2 Comparison of taxonomic profile estimations.
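One simple way to turn per-read classifications into a taxonomic profile, and to compare that profile with a known distribution, is sketched below in Python. It assumes that every classified marker-gene read counts equally toward its taxon and uses the sum of absolute differences as the error measure. Both choices, and all names in the code, are illustrative assumptions, not the procedure used to produce Table 2.

```python
from collections import Counter


def taxonomic_profile(assignments):
    """Convert per-read taxon assignments into relative abundances.

    assignments: iterable of taxon names, with None for unclassified reads.
    Unclassified reads are ignored when normalizing.
    """
    counts = Counter(taxon for taxon in assignments if taxon is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}


def profile_error(estimated, true):
    """Sum of absolute abundance differences over all taxa in either profile."""
    taxa = set(estimated) | set(true)
    return sum(abs(estimated.get(t, 0.0) - true.get(t, 0.0)) for t in taxa)


# Toy example: five reads, one unclassified, compared with a known mixture.
assignments = ["A", "A", "B", None, "C"]
estimated = taxonomic_profile(assignments)
error = profile_error(estimated, {"A": 0.5, "B": 0.5})
```

An error of 0 would mean the estimated profile matches the true distribution exactly; spurious taxa and missing taxa both increase it.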

Conclusions

We have introduced a novel taxonomic classification method for analyzing microbial diversity in whole-metagenome shotgun sequences. Compared with previous approaches, MetaPhyler is more accurate at estimating the taxonomic profile, especially when taking into account the actual abundance of individual taxonomic groups.

References

  1. MetaPhyler Software [http://metaphyler.cbcb.umd.edu/]

  2. Brady A, Salzberg SL: Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods 2009, 6:673–676.

  3. Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res 2007, 17:377–386.

  4. Gerlach W, Junemann S, Tille F, Goesmann A, Stoye J: WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads. BMC Bioinformatics 2009, 10:430.

Author information

Correspondence to Mihai Pop.

Keywords

  • Taxonomic Composition
  • Shotgun Sequence
  • Individual Taxonomic Group
  • Taxonomic Rank
  • Metagenomic Sequence