- Open Access
Metagenomic biomarker discovery and explanation
Genome Biologyvolume 12, Article number: R60 (2011)
This study describes and validates a new method for metagenomic biomarker discovery by way of class comparison, tests of biological consistency and effect size estimation. This addresses the challenge of finding organisms, genes, or pathways that consistently explain the differences between two or more microbial communities, which is a central problem to the study of metagenomics. We extensively validate our method on several microbiomes and a convenient online interface for the method is provided at http://huttenhower.sph.harvard.edu/lefse/.
Biomarker discovery has proven to be one of the most broadly applicable and successful means of translating molecular and genomic data into clinical practice. Comparisons between healthy and diseased tissues have highlighted the importance of tasks such as class discovery (detecting novel subtypes of a disease) and class prediction (determining the subtype of a new sample) [1–4], and recent metagenomic assays have shown that human microbial communities can be used as biomarkers for host factors such as lifestyle [5–7] and disease [7–10]. As sequencing technology continues to develop and makes microbial biomarkers increasingly easily detected, this enables clinical diagnostic and microbiological applications through the comparison of microbial communities [11, 12].
The human microbiome, consisting of the total microbial complement associated with human hosts, is an important emerging area for metagenomic biomarker discovery [13, 14]. Changes in microbial abundances in the gut, oral cavity, and skin have been associated with disease states ranging from obesity [15–17] to psoriasis . More generally, the metagenomic study of microbial communities is an effective approach for identifying the microorganisms or microbial metabolic characteristics of any uncultured sample [19, 20]. Analyses of metagenomic data typically seek to identify the specific organisms, clades, operational taxonomic units, or pathways whose relative abundances differ between two or more groups of samples, and several features of microbial communities have been proposed as potential biomarkers for various disease states. For example, single pathogenic organisms can signal disease if present in a community [21, 22], and increases and decreases in community complexity have been observed in bacterial vaginosis  and Crohn's disease . Each of these different types of microbial biomarkers is correlated with disease phenotypes, but few bioinformatic methods exist to explain the class comparisons afforded by metagenomic data.
Identifying the most biologically informative features differentiating two or more phenotypes can be challenging in any genomics dataset, and this is particularly true for metagenomic biomarkers. Robust statistical tools are needed to ensure the reproducibility of conclusions drawn from metagenomic data, which is crucial for clinical application of the biological findings. Related challenges are associated with high-dimensional data regardless of the data type or experimental platform; the number of potential biomarkers, for example, is typically much higher than the number of samples [24–26]. Metagenomic analyses additionally present their own specific issues, including sequencing errors, chimeric reads [27, 28], and complex underlying biology; many microbial communities have been found to show remarkably high inter-subject variability. For example, large differences are detected even among the gut microbiomes of twins , and both human microbiomes and environmental communities are thought to be characterized by the presence of a long tail of rare organisms [30–32]. Moreover, simply identifying potential biomarkers without elucidating their biological consistency and roles is only a precursor to understanding the underlying mechanisms of microbe-microbe or host-microbe interactions . In many cases, it is necessary to explain not just how two biological samples differ, but why. This problem is referred to as class comparison: how can the differences between phenotypes such as tumor subtype or disease state be explained in terms of consistent biological pathways or molecular mechanisms?
A number of methods have been proposed for class discovery or comparison in metagenomic data. MEGAN  is a metagenomic analysis tool with recent additions for phylogenetic comparisons  and statistical analyses . MEGAN, however, can only compare single pairs of metagenomes, as is also the case with STAMP , which does introduce a concept of 'biological relevance' in the form of confidence intervals. UniFrac  compares sets of metagenomes at a strictly taxonomic level using phylogenetic distance, while MG-RAST , ShotgunFunctionalizeR , mothur , and METAREP  all process metagenomic data using standard statistical tests (mainly t-tests with some modifications). Most methods for community analysis from an ecological perspective rely on unsupervised cluster analyses based on principal component analysis  or principal coordinate analysis . These can successfully detect groups of related samples, but they fail to include prior knowledge of phenotypes or environmental conditions associated with the groups, and they generally do not identify the biological features responsible for group relationships. Metastats  is the only current method that explicitly couples statistical analysis (to assess whether metagenomes differ) with biomarker discovery (to detect features characterizing the differences) based on repeated t statistics and Fisher's tests on random permutations. However, none of these methods, even those offering nuanced analyses of metagenomic data, provide biological class explanations to establish statistical significance, biological consistency, and effect size estimation of predicted biomarkers.
In this work, we present the linear discriminant analysis (LDA) effect size (LEfSe) method to support high-dimensional class comparisons with a particular focus on metagenomic analyses. LEfSe determines the features (organisms, clades, operational taxonomic units, genes, or functions) most likely to explain differences between classes by coupling standard tests for statistical significance with additional tests encoding biological consistency and effect relevance. Class comparison methods typically predict biomarkers consisting of features that violate a null hypothesis of no difference between classes; we additionally detect the subset of features with abundance patterns compatible with an algorithmically encoded biological hypothesis and estimate the sizes of the significant variations. In particular, effect size provides an estimation of the magnitude of the observed phenomenon due to each characterizing feature and it is thus a valuable tool for ranking the relevance of different biological aspects and for addressing further investigations and analyses. The introduction of prior biological knowledge in the method contributes to constrain the analysis and thus to address the challenges traditionally connected with high-dimensional data mining. LEfSe thus aims to support biologists by suggesting biomarkers that explain most of the effect differentiating phenotypes of interest (two or more) in biomarker discovery comparative and hypothesis-driven investigations. The visualization of the discovered biomarkers on taxonomic trees provides an effective means for summarizing the results in a biologically meaningful way, as this both statistically and visually captures the hierarchical relationships inherent in 16S-based taxonomies/phylogenies or in ontologies of pathways and biomolecular functions.
We validated this approach using data from human microbiomes, a mouse model of ulcerative colitis, and environmental samples, in each case predicting groups of organisms or operational taxonomic units that concisely differentiate the classes being compared. We further evaluated LEfSe using synthetic data, observing that it achieves a substantially better false positive rate compared to standard statistical tests, at the price of a moderately increased false negative rate (that can be adjusted as needed by the user). An implementation of LEfSe including a convenient graphical interface incorporated in the Galaxy framework [46, 47] is provided online at .
Results and discussion
LEfSe is an algorithm for high-dimensional biomarker discovery and explanation that identifies genomic features (genes, pathways, or taxa) characterizing the differences between two or more biological conditions (or classes) (Figure 1). It emphasizes statistical significance, biological consistency and effect relevance, allowing researchers to identify differentially abundant features that are also consistent with biologically meaningful categories (subclasses; see Materials and methods). LEfSe first robustly identifies features that are statistically different among biological classes. It then performs additional tests to assess whether these differences are consistent with respect to expected biological behavior; for example, given some known population structure within a set of input samples, is a feature more abundant in all population subclasses or in just one? Specifically, we first use the non-parametric factorial Kruskal-Wallis (KW) sum-rank test  to detect features with significant differential abundance with respect to the class of interest; biological consistency is subsequently investigated using a set of pairwise tests among subclasses using the (unpaired) Wilcoxon rank-sum test [50, 51]. As a last step, LEfSe uses LDA  to estimate the effect size of each differentially abundant feature and, if desired by the investigator, to perform dimension reduction.
We have specifically designed LEfSe for biomarker discovery in metagenomic data. We thus summarize our results here from applying the tool to 16S rRNA gene and whole genome shotgun datasets to detect bacterial organisms and functional characteristics differentially abundant between two or more microbial environments. These include body sites within human microbiomes (mucosal surfaces and aerobic/anaerobic environments), adult and infant microbiomes, inflammatory bowel disease status in a mouse model, bacterial and viral environmental communities, and synthetic data for quantitative computational evaluation.
Taxa characterizing body sites within the human microbiome
Microbial community organization at multiple human body sites is an area of active current research, since both low- and high-throughput methods have shown both differences and overlaps among the microbiota of multiple body sites [53, 54]. We examined these differences in the 16S-based phylometagenomic dataset from 24 individuals enrolled in the Human Microbiome Project [13, 55]. A minimum of 5,000 16S rRNA gene sequences were obtained for 301 samples from 24 healthy subjects (12 male, 12 female) covering 18 body sites, including 6 main body site categories: the oral cavity (9 sub-sites sampled), the vagina (3 sub-sites sampled), the skin (2 sub-sites sampled), the retroauricular crease (2 sub-sites sampled), the nasal cavity (1 sample) and the gut (1 sample). We validated LEfSe by contrasting mucosal versus non-mucosal body site classes and by comparing three levels of aerobic environments (anaerobic, microaerobic, and aerobic). In both cases, the sub-sites within each class of body site were used as a biological subclass.
Mucosal surfaces are colonized by diverse bacteria; non-mucosal microbiomes are strongly enriched for Actinobacteria
Our first analysis focused on differences in microbiota composition between mucosal and non-mucosal body sites. The oral cavity, gut, and vaginal sites were classified as sources of mucosal communities and the anterior fossa (skin), nasal cavity, and retroauricular crease as non-mucosal. Mucosal environments differ greatly from the other body sites, characterized primarily by interaction with the human immune system, oxidative challenge, and hydration .
LEfSe provides three main outputs (Figure 2), describing the effect sizes of differences observed among mucosal/non-mucosal communities, the phylogenetic distribution of these differences based on the Ribosomal Database Project (RDP) bacterial taxonomy , and the raw data driving these effects. LEfSe detected 15 bacterial clades showing statistically significant and biologically consistent differences in non-mucosal body sites (Figure 2a).
The most differentially abundant bacterial taxa in non-mucosal body sites belong to phyla with prevalent aerobic members: Actinobacteria, Firmicutes, and Proteobacteria, including environmental organisms from the Betaproteobacteria and Gammaproteobacteria clades. Non-mucosal overrepresented genera include Propionibacterium, Staphylococcus (found exclusively in non-mucosal samples), Corynebacterium, and Pseudomonas. Also of note is the relevant representation of plastids from plant organisms (chloroplasts), for which the distribution of associated taxa varies, as some are limited to non-mucosal surfaces (environmental exposure and potentially cosmetic products) and others to the digestive track (ingested food). No clades are consistently present in all mucosal body sites, demonstrating the β-diversity of these communities (that is, differences among their population structure), but many taxa within Actinobacteria, Bacillales, and several other clades are relatively abundant at all non-mucosal sites. The within-subject β-diversity at all phylogenetic levels is highlighted in Additional file 1, quantifying the extent to which distances among different mucosal body sites are larger than the equivalent distances among non-mucosal sites. This leads to a lack of taxa common to all mucosal body sites, and therefore no taxa are determined by LEfSe to be characteristic of the mucosa as a whole.
The Actinomycetales are usually the most abundant phylogenetic unit (order level) in non-mucosal communities, with percentages higher than 90% in several skin samples and at most 20% in the great majority of the oral mucosal samples and substantially lower in the vagina and gut (Figure 2c). From a quantitative viewpoint, the taxonomic order Actinomycetales makes up essentially all of the detected members of the phylum Actinobacteria, except in the vaginal site, which reported a substantial Bifidobacteriales presence. Bifidobacteriales themselves are not detected as differentially abundant between mucosal and non-mucosal body sites, since this is a feature only of the vaginal samples and not of all mucosal body sites. The contrast of many clades' abundance versus distribution is striking; for example, the genera Alloscardovia, Parascardovia and Scardovia are present in all body sites at very low abundances, while Gardnerella is overrepresented only in vaginal samples, with over three orders of magnitude difference in abundance. A similar commonality of distribution was found for the Bacillales at an even lower abundance. At the genus level, Propionibacterium, Staphylococcus, Corynebacterium and Pseudomonas are differentiated by both distribution and abundance. The Staphylococcus genus in particular is detected by LEfSe with a very high LDA score (more than five orders of magnitude), reflecting marked abundance in non-mucosal sites (mean 10%, 18% and 21% in the skin, retroauricular crease and anterior nares body sites, respectively) and consistently low abundance in mucosal sites (mean less than 0.001%).
Classes with multiple levels: distinct aerobic, anaerobic, and microaerobic communities in the human microbiome
The roles of anaerobic metabolism in the commensal human microbiota have not yet been fully investigated due to the difficulty of studying these communities in culture. We thus further investigated the aerobicity characteristics of human microbial communities at a high level by grouping body sites into three classes with distinct levels of available molecular oxygen. The high-O2 exposure class includes body sites directly and permanently exposed to oxygen: skin, anterior nares and retroauricular crease. The mid-O2 exposure class includes the oral and vaginal body sites that can be directly, but not permanently, atmospherically exposed, and the low-O2 exposure class (the gut) is mainly anaerobic. The body sites included in the three classes may have other distinguishing features in addition to different oxygen exposure and, in general, these confounding factors can cause features unrelated with aerobiosis to be detected as biomarkers. However, the LEfSe biological consistency step assures that the detected biomarkers are characteristic of all the subclasses of a given class and with respect to all subclasses of the other classes. For example, the high-abundance of a bacterial clade in the mouth due to an oral-specific niche is not detected as a biomarker unless the same niche is also present in the vaginal samples (the other body site in the mid-O2 class) and not present in any high-O2 or low-O2 single body sites. So LEfSe will detect biomarkers more confidently connected with the aerobiosis characteristics than traditional methods that do not incorporate subclass information. Moreover, LEfSe is specifically able to analyze ordinal classes with multiple levels, and in agreement with established microbiology, we observed specific microbial clades ubiquitous within and characteristic to each of these three environments, detailed as follows (Figure 2d).
LEfSe allows ordinal classes with more than two levels to be analyzed in two different stringencies. The first requires significant taxa to differ between every pair of class values (that is, aerobicity in this example; see Materials and methods); the discovered biomarkers must accurately distinguish all individual classes (high-, mid-, and low-O2). In this example (Figure 2d; strict version), we detected 13 clades with LDA scores above 2, showing three distinct abundance levels. Alternatively, LEfSe can determine significant taxa differing in at least one (and possibly multiple) class value(s) (non-strict version); in other words, biomarkers that distinguish at least one individual class. Using this method (Figure 2e), we find 60 clades with LDA scores of at least 2.
Using either approach, each oxygen level is broadly characterized by a specific clade. The overall abundances of the Actinobacteria phylum are higher in body sites directly exposed to molecular oxygen with several members of the Actinomycetales order that colonize the skin. Actinomycetales includes the Propionibacterium genus, which is highly abundant on the skin, low in moderate-O2 environments, and absent from the gut. The Lactobacillales (primarily Bacilli) are specific to moderate O2 exposure levels, with conversely lower presences in the high-O2 exposure class, and are again absent from the gut. The Bacteroidaceae (particularly Bacteroides) are ubiquitous in anaerobic samples; interestingly, however, members of this family are more abundant in high oxygen availability conditions (particularly in skin and retroauricular crease) than in medium oxygen availability, showing the niche diversity within the phylogenetic branching. This is in agreement with observations that the microenvironment of many microbial consortia shows extreme biogeographical variation with respect to nutrients, metabolites, and oxygen availability [58, 59].
Bifidobacteria and additional clades are underrepresented in a mouse model of ulcerative colitis
Rodent models have been established to provide a uniquely accurate and tractable model for studying the gut microbiota, including the molecular and cellular mechanisms driving chronic intestinal inflammation [60–63]. In particular, mouse models of inflammatory bowel disease  facilitate a mechanistic evaluation of the contribution of the gut microbiota to the initiation and perpetuation of chronic intestinal inflammation, as occurs in human Crohn's disease and ulcerative colitis . One host molecular mechanism known to maintain the balance between immune regulation and the commensal microflora is T-bet, a transcription factor expressed in many immune cell subsets. Its loss in the absence of an adaptive immune system results in a highly penetrant and aggressive form of ulcerative colitis  that is specifically dependent on and transmissible through the gut flora. We thus sought to investigate the characteristics of the fecal microbiota in a mouse model of spontaneous colitis that occurs in a colony of Balb/c T-bet-/- × Rag2-/- mice using 16S rRNA gene metagenomic data [66, 67].
LEfSe was applied to the microbiota data of 20 T-bet-/- × Rag2-/- (case) and 10 Rag2-/- (control) mice (dataset provided in Additional File 10), finding 19 differentially abundant taxonomic clades (α = 0.01) with an LDA score higher than 2.0 (Figure 3). These differentially abundant clades were consonant with both our prior 16S rRNA-based sequence analysis using complete linkage hierarchical clustering and quantitative real time PCR-based experiments performed on the same fecal DNA samples . More specifically, the marked loss in Bifidobacteriaceae and Bifidobacterium associated with T-bet-/- × Rag2-/- we observed here may explain the positive responsiveness of this colitis to a Bifidobacterium animalis subsp. lactis fermented milk product validated with low-throughput approaches .
At the family level, the Rag2-/- enrichment of Bifidobacteriaceae, Porphyromonadaceae, Staphylococcaceae and the T-bet-/- × Rag2-/- enrichment of Lachnospiraceae confirm our reports in  using culture-based and quantitative real time PCR techniques. LEfSe's LDA score more informatively reorders these taxa relative to the P-values found for these families in our previous work, highlighting the Bifidobacteria and, interestingly, several clades within the Clostridia. These include the Rag2-/--specific Roseburia and Papillibacter genera belonging to T-bet-/- × Rag2-/--specific families (Lachnospiraceae and Ruminococcaceae). The significant presence of Metascardovia (Bifidobacteriaceae) in Rag2-/- mice is also interesting, as it may have a role similar to Bifidobacterium and because Metascardovia has been previously observed primarily in the oral cavity . This analysis both highlights the agreement of LEfSe's effect size estimation with respect to low-throughput confirmations and suggests additional clades to be further investigated experimentally.
A comparison with current metagenomic analysis tools using viral and microbial pathways from environmental data
We applied LEfSe to the environmental data of , a dataset with the goal of characterizing the functional roles of viromes (that is, viral metagenomes) versus microbiomes (that is, bacterial metagenomes). This task was used in  to characterize the Metastats algorithm on the same raw data. Among the 29 high-level functional roles (including unclassified roles) in the subsystem hierarchy of the SEED  and NMPDR  frameworks, LEfSe identifies only the 'Nucleosides and nucleotides' subsystem to be strictly differentially abundant among all environmental subclasses, specifically with higher levels in viromes than microbiomes. This is an accurate characterization of exactly the protein function most commonly encoded in viral genomes, whereas bacterial genomes of course encode a wide range of less specifically enriched functionality. When LEfSe is relaxed to detect significant variations consistent for at least one, rather than all, environmental subclasses, we additionally determine the 'Respiration' subsystem to be significantly enriched in microbiomes with respect to viromes, likely reflecting the uniformly aerobic bacterial metabolism captured by these data.
In addition to the Nucleosides and nucleotides and Respiration subsystems, Metastats  reports five other high-level functional roles as differentially abundant (P = 0.001). However, when taking the subclass structure into account across the sampled environments, these additional differences show much less consistent variation. This is demonstrated in Figure 4, which reports histograms of raw data for these cases and the different results of LEfSe, Metastats and the KW test alone. Moreover, since the subsystem framework is hierarchical (three levels), LEfSe's results include a cladogram showing the significant differences on each level (see Figure 4 for a two-level cladogram, and Additional file 2 for a three-level cladogram).
Considering all three levels of SEED functional specificity, LEfSe reports 59 subsystems to be more abundant in microbial metagenomes and only 7 that are more abundant in viral metagenomes (Additional file 3). Bacterial genomes encode a much greater quantity and diversity of biomolecular functionality than most viral genomes, and these differences are thus to be expected. However, they also highlight a consideration specific to most metagenomic (and, more generally, ecological) analyses, which typically analyze relative abundances. A few very common subsystems in viromes (that is, Nucleosides and nucleotides) will force the relative abundance of all other subsystems to decrease, resulting in apparent under-abundance. The subsystems detected to be virus-specific may thus show this trend in part due to the normalization of abundances in each sample. This issue is specific to neither LEfSe nor Metastats, however, and must be taken into account during interpretation of any relative abundance data, metagenomic or otherwise .
Functional activity within the infant and adult microbiota indicates post-weaning microbial specialization
Just as LEfSe can determine whether organisms or pathways are differentially abundant among several metagenomic samples, it can also focus on individual enzymes or orthologous groups. Kurokawa et al.  analyzed 13 gut metagenomes from nine adults and four unweaned infants in terms of the functions of orthologous gene families. They originally did this by comparing the COGs [74, 75] found in each metagenome to a reference database; later, White et al.  applied the Metastats algorithm to directly detect differences between infant and adult microbiomes. Using significance α values of 0.01 due to the low cardinality of the classes (in particular the infant class), LEfSe detected 366 COGs to be enriched in either adult or infant metagenomes, 17 of which have a LDA score higher than 3 (Additional file 4).
Among the 17 COG profiles with LEfSe scores higher than 3, 11 are also detected by Metastats. The six COGs not detected by Metastats (Additional file 5) are Outer membrane protein (COG1538) and Na+-driven multidrug efflux pump (COG0534), enriched in adults, and Transposase and inactivated derivatives (COG2801, COG2963), Transcriptional regulator/sugar kinase (COG1940) and Transcriptional regulator (COG1309), enriched in infants. All six COGs possess abundance profiles that are completely non-overlapping between infant and adult individuals (apart from COG1538, in which the lowest level in adults is slightly lower than the highest in infants) and are thus nominally quite discriminative. On the other hand, among the 192 COGs found by Metastats, 9 are not detected by LEfSe even at the lowest LDA score threshold (Additional file 6). All possess overlapping abundance values between infant and adult classes (at least two, and often more, of the highest samples in the less abundant class overlap the putatively more abundant class). This lack of discriminatory power precludes LEfSe from highlighting the differences as significant between adults and infants, particularly given the low number of infant samples.
Intriguingly, LEfSe's distinct list of functional activities in the core infant and adult microbiomes is suggestive of 'generalist' microbial activity during early life and specialization over time . In fact, inspecting the five differentially abundant COGs with the highest effect sizes for each class, we find for infants very high-level functional groups related to broad transcriptional regulation (COG1609, COG1940, COG1309 and COG3711). In adults, all five represent more specialized orthologous groups, including COG1629 (Outer membrane receptor proteins, mostly Fe transport), COG1595 (DNA-directed RNA polymerase specialized sigma subunit, sigma24 homolog), and COG4771 (Outer membrane receptor for ferrienterochelin and colicins). Since the number of differentially abundant COGs is very high (366), this observation was only highlighted at the top of the candidate biomarker list due to LEfSe's effect size quantification, which allows the most characteristic differences among classes to emerge. For the same reason, we can easily confirm that sugar metabolism plays a crucial role in the infant gut and iron metabolism in adults, as already stated in [45, 73]; the COGs with the highest LDA scores indeed possess sugar and glucose functional activities for infants and iron-related functionality for adults.
LEfSe achieves a very low false positive rate in synthetic data
We further investigated the ability of LEfSe to detect biomarkers using synthetic high-dimensional data (see Materials and methods for the description of the dataset) in comparison with the KW test alone (a non-parametric adaptation of the analysis of variance (ANOVA)) and with Metastats . The LDA effect size step of LEfSe is not considered here for simplicity, and the artificial data are detailed in Figure 5.
Theoretically, the settings of the first two experiments (Figure 5a,b) exactly match the application conditions for the KW test. The false positive rate (mean 2.5%, regardless of the distance between feature means and of the standard deviation of the normal distribution) is in fact consistent with the α value of 0.05, given that the negative features are half of the total. LEfSe behaved qualitatively very similar to KW, but with a considerably lower false positive rate (less than 0.5% in the great majority of the cases against a mean value of 2.5%) and a higher false negative rate. In biology, false positives are often perceived as more dramatic than false negatives [77–79]; this is often attributable to the fact that it is undesirable to invest in expensive experimental follow-up of false positives, whereas in high-throughput settings, a few true positives outweigh the false negatives that are left uninvestigated. With this motivation for minimizing false positives, we conclude that LEfSe performs at least as well as KW when no meaningful subclass structure is available. On the other hand, when subclasses can be identified internally to the classes and some of them do not agree with the trend among classes, LEfSe performs qualitatively and quantitatively much better than KW (Figure 5c). The false positives are in fact always substantially lower than KW, whereas the false negatives are higher only for very noisy features. Metastats  seems to achieve results very similar to KW (Additional file 7) with the same disadvantages with respect to LEfSe.
Gaining insight into the structure, organization, and function of microbial communities has been proposed as one of the major research challenges of the current decade , and it will be enabled by both experimental and computational metagenomic analyses. To this end, we have developed the LEfSe algorithm for comparative metagenomic studies, permitting the characterization of microbial taxa specific to an experimental or environmental condition, the detection of pathways and biological mechanisms over- or under-represented in different communities, and the identification of metagenomic biomarkers in mammalian microbiomes. LEfSe is shown here to be effective in detecting differentially abundant features in the human microbiome (characteristically mucosal or aerobic taxa) and in a mouse model of colitis. A comparison with existing statistical methods and state-of-the-art metagenomic analyses of environmental, infant gut microbiome, and synthetic data shows that LEfSe consistently provides lower false positive rates and can effectively aid in explaining the biology underlying differences in microbial communities.
These findings demonstrate that a concept of class explanation including both statistical and biological significance is highly beneficial in tackling the statistical challenges associated with high-dimensional biomarker discovery [28, 81, 82]. Specifically, LEfSe determines features potentially able to explain the differences among conditions rather than the features that simply possess uneven distributions among classes. This is distinct from most current statistical approaches  and akin to the incorporation of biological prior knowledge that has proven highly successful in recent genome-wide association studies [83–85]. Moreover, particularly in (often noisy) metagenomic datasets, effect size can serve as an orthogonal measure to complement ranking biomarkers based on P-values alone. Differences between classes can be very statistically significant (low P-value) but so small that they are unlikely to be biologically responsible for phenotypic differences. On the other hand, a biomarker with a relatively large P-value (for example, 0.01) may correspond to a large effect size, with statistical significance diminished by technical noise. LEfSe investigates both aspects computationally by testing both the consistency and the effect size of differences in feature abundance among classes with respect to the structure of the problem. This is performed subsequently to standard statistical significance tests and is integrated in LEfSe by assessing biologically meaningful groups of samples among subclasses within each condition. This coupling of statistical approaches with biological consistency and effect size estimation alleviates possible artifacts or statistical inhomogeneity known to be common in metagenomic data, for example, extreme variability among subjects or the presence of a long tail of rare organisms [32, 86]. Similarly, while multiple hypothesis corrected statistical significance speaks to the potential reproducibility of a result, estimation of effect size in high-dimensional settings is crucial for addressing biological consistency and interpretability.
The biology highlighted by these investigations speaks to the potential of metagenomics for both microbial ecology and translational applications. For example, certain bacterial clades are frequently detected as biomarkers even in diverse environments, suggesting that some species can adapt in surprisingly condition-specific manners. Staphylococcus and the Bacillales, for example, are discriminative for mucosal tissues, aerobic conditions, and murine colitis, whereas no Proteobacteria consistently characterize any of these conditions, even though they always represent a substantial portion of the communities. These observations may reflect extensive microenvironmental heterogeneity and the coexistence of generalist and specialist bacteria [87–89].
In addition to these insights into microbiology, metagenomic biomarkers, including the abundances of specific organisms, abundances of entire clades, or the presence/absence of specific organisms, can serve to describe host phenotypes, lifestyle, diet, and disease as well [5–10]. If the depletion of Bifidobacterium species in ulcerative colitis proves to occur early in human disease etiology, this and comparable shifts in the microbiota have potential applications in the detection of human disorders [90, 91], especially as shifts in some bacterial consortia can be detected easily and inexpensively. Oral microbial biomarkers, for example, can be easily acquired and analyzed with microarray chips targeted for bacterial profiling . These appear particularly promising for clinical applications , as the microbial communities in the saliva seem to represent one potential proxy for other human microbiota . Other important clinical applications of metagenomic analyses include probiotic treatments [94, 95] and microbiome transplantation [96–99] for gastrointestinal diseases.
LEfSe, the computational approach to biomarker class comparisons detailed here, thus contributes to the understanding of microbial communities and guides biologists in detecting novel metagenomic biomarkers. The algorithm's effectiveness on real and synthetic data has been highlighted by several experiments in which we successfully characterized both host-associated microbiota and environmental microbiomes in multiple contexts. To support ongoing metagenomic analyses, we have implemented LEfSe as a user-friendly web application that can provide both raw data and publication-ready graphical results, including reports of detected microbial variation on taxonomic trees for visual and biological summarization. LEfSe is freely available online in the Galaxy workflow framework [46, 47] at the following link .
Materials and methods
The LEfSe algorithm is introduced in overview in the Results section, and Figure 6 illustrates in detail the format of the input (a matrix with n rows and m columns) and the three steps performed by the computational tool: the KW rank sum test  on classes, the pairwise Wilcoxon test [50, 51] between subclasses of different classes, and the LDA  on the relevant features.
Each of the n features is represented with a positive-valued vector containing its abundances in the m samples, and each sample is associated with values describing its class and, optionally, subclass and/or originating subject. The factorial KW rank sum test is applied to each feature with respect to the class factor; the subclass and subject information are used as stratifying subgroups when present. Features that, according to the KW rank sum test, do not violate the null hypothesis of identical value distribution among classes (with default P-value, α = 0.05) are not further analyzed. The pairwise Wilcoxon test is applied to retained features belonging to subclasses of different classes. For each feature, the pairwise Wilcoxon test is not satisfied if at least one comparison between subclasses has a P-value higher than the chosen α or if the sign of variation is not equal among all comparisons. For example, if a feature appears in samples from two classes with three subclasses each, all nine comparisons between subclasses in different classes must violate the null hypothesis, and all signs of the differences between medians must be consistent. The features that pass the pairwise Wilcoxon test are considered successful biomarkers. An LDA model is finally built with the class as dependent variable and the remaining feature values, subclass, and subject values as independent variables. This model is used to estimate their effect sizes, which are obtained by averaging the differences between class means (using unmodified feature values) with the differences between class means along the first linear discriminant axis, which equally weights features' variability and discriminatory power. The LDA score for each biomarker is obtained computing the logarithm (base 10) of this value after being scaled in the [1,106] interval and, regardless of the absolute values of the LDA score, it induces the ranking of biomarker relevance. For robustness, LDA is additionally supported by bootstrapping (default 30-fold) and subsequent averaging.
LEfSe's first two steps employ non-parametric tests because of the nature of metagenomic data. Relative abundances will, in most cases, violate the main assumption of typical parametric tests (normal population in each class), whereas non-parametric tests are much more robust to the underlying distribution of the data since they are distribution-free approaches. The only assumption of the Wilcoxon and KW tests is that the distributions in each class are identically shaped with possible differences in the medians. For example, the bimodal or multimodal abundance distribution of an organism violates the assumptions of parametric tests but not those of non-parametric tests, unless the number of peaks in the distribution (or, more generally, the shape of the distribution) also changes among classes. LDA is used for effect size estimation as our experiments determined it to more accurately estimate biological consistency compared to approaches like differences in group means/medians or support vector machines (SVMs) . A comparison between LDA and SVM approaches for effect size estimation on the murine model of ulcerative colitis (for which low-throughput biological validations of biomarkers are available in ) is reported in our supplemental material (Additional files 8 and 9) and shows the advantages of LDA with respect to upranking features of potential biological interest. Theoretically, this is motivated by LDA's ability to find the axis of highest variance and SVM's focus on features' combined predictive power rather than single feature relevance. Note that as we are performing class comparison rather than class prediction, it is worth specifying that the effect size estimation accuracy of an algorithm is not directly connected with its predictive ability (for example, SVM approaches are generally considered more accurate than LDA for prediction).
Comparisons with more than two classes require special strategies for applying the Wilcoxon and LDA steps, whereas the factorial KW test is already appropriate for this setting. Our multiclass strategy for the Wilcoxon test depends on the problem-specific strategy chosen by the user to define features differentially distributed among the n classes. In the most stringent strategy, we require that all n abundance profiles of a feature are statistically significantly distinct among all n classes. This strategy, called 'strict', is implemented by requiring that all Wilcoxon tests between classes are significant. A more permissive strategy, called 'non-strict', considers a feature as a biomarker if at least one class is significantly different from all others. The more permissive strategy thus needs to satisfy only a subset of the Wilcoxon tests. Regardless of the strategy, the LDA step always reports the highest score detected among all pairwise class comparisons.
Subclass structure variants encoding different biological hypotheses
Different interpretations of the biomarker class comparison problem are implemented in LEfSe by modifying the requirements for pairwise Wilcoxon comparisons among subclasses. If classes contain subclasses that represent distinct strata, we test only comparisons within each identical subclass (Figure 4). For example, to assess the effect of a treatment on two sub-types of the same disease, we compare pre- and post-treatment levels within each subclass and require that the trend observed at the class level is significant independently for both subclasses. To implement this variant, LEfSe performs the Wilcoxon step only comparing subclasses with the same name. Alternatively, subclasses may represent covariates within which feature levels may vary but for which the problem does not dictate explicit stratification (Figure 2). In both settings, we explicitly require all the pairwise comparison to reject the null hypothesis for detecting the biomarker; thus, no multiple testing corrections are needed.
Subclasses containing few samples
When few samples are available, non-parametric tests like the Wilcoxon have reduced power to detect differences. This can affect LEfSe when subclasses are very small, preventing the overall test from even rejecting the null hypothesis. For this reason, small subclasses should be avoided when possible, for example, by excluding them from the problem or by grouping together all subclasses with small cardinalities. For cases in which removing or grouping subclasses is not possible or disrupts the biological consistency of the analysis, LEfSe substitutes the Wilcoxon test with a test to compare whether subclass medians differ with the expected sign. The user can choose the subclass cardinality threshold at which this median comparison is substituted for the Wilcoxon test.
Except as stated otherwise in Results, all experiments in this study were run with LEfSe's α parameter for pairwise tests set to 0.05 for both class normality and subclass tests, and the threshold on the logarithmic score of LDA analysis was set to 2.0. The stringency of these parameters is easily tunable (also through the web interface) and allows the user to detect biomarkers with lower P-values and/or higher effect size in order, for example, to prioritize additional biological experiments and validations. All LDA scores are determined by bootstrapping over 30 cycles, each sampling two-thirds of the data with replacement, with the maximum influence of the LDA coefficients in the LDA score of three orders of magnitude.
Except as stated otherwise, taxonomic abundances for 16S samples were generated from filtered sequence reads using the RDP classifier , with confidences below 80% rebinned to 'uncertain'. For all the datasets described below, the final input for LEfSe is a matrix of relative abundances obtained from the read counts with per-sample normalization to sum to one. Witten-Bell smoothing  was used to accommodate rare types, but due to LEfSe's non-parametric approach, this has minimal effect on the discovered biomarkers and on the LDA score. This also allows our biomarker discovery method to avoid most effects of sequence quality issues as long as any sequencing biases are homogeneous among different conditions, as no specific assumptions on the statistical distribution and noise model are made by the algorithm as is standard for non-parametric approaches.
Human microbiome data
The 16S rRNA-based phylometagenomic dataset of the normal (healthy) human microbiome was made available through the Human Microbiome Project , and consists of 454 FLX Titanium sequences spanning the V3 to V5 variable regions obtained for 301 samples from 24 healthy subjects (12 male, 12 female) enrolled at a single clinical site in Houston, TX. These samples cover 18 different body sites, including 6 main body site categories: the oral cavity (9 samples), the gut (1 sample), the vagina (3 samples), the retroauricular crease (2 samples), the nasal cavity (1 sample) and the skin (2 samples). Detailed protocols used for enrollment, sampling, DNA extraction, 16S amplification and sequencing are available on the Human Microbiome Project Data Analysis and Coordination Center website , and are also described elsewhere [55, 56]. In brief, genomic DNA was isolated using the Mo Bio PowerSoil kit  and subjected to 16S amplifications using primers designed incorporating the FLX Titanium adapters and a sample barcode sequence, allowing directional sequencing covering variable regions V5 to partial V3 (primers: 357F 5'-CCTACGGGAGGCAGCAG-3' and 926R 5'-CCGTCAATTCMTTTRAGT-3'). Resulting sequences were processed using a data curation pipeline implemented in mothur , which reduces the sequencing error rate to less than 0.06% as validated on a mock community. As part of the pipeline parameters, to pass the initial quality control step, one unambiguous mismatch to the sample barcode and two mismatches to the PCR amplification primers were allowed. Sequences with an ambiguous base call or a homopolymer longer than eight nucleotides were removed from subsequent analyses, as suggested previously . Based on the supplied quality scores, all sequences were trimmed when a base call with a score below 20 was encountered. All sequences were aligned using a NAST-based sequence aligner to a custom reference based on the SILVA alignment [106, 107]. Sequences that were shorter than 200 bp or that did not align to the anticipated region of the reference alignment were removed from further analysis. Chimeric sequences were identified using the mothur implementation of the ChimeraSlayer algorithm . Unique reads were classified with the MSU RDP classifier v2.2  using the taxonomy proposed by , maintained at the RDP (RDP 10 database, version 6). The 16S rRNA reads are available in the Sequence Read Archive at .
T-bet-/- × Rag2-/- and Rag2-/- mouse data
T-bet-/- × Rag2-/- and Rag2-/- mice, their husbandry, and their chow have been described in . The animal studies and experiments were approved and carried out according to Harvard University's Standing Committee on Animals as well as National Institutes of Health guidelines. Collection, processing, and extraction of DNA from fecal samples were performed as described in . The V5 and V6 regions of the 16S rRNA gene were targeted for amplification and multiplex pyrosequencing with error-correcting barcodes. Sequencing was performed using a Roche FLX Genome Sequencer at DNAVision (Charleroi, Belgium) and data were preprocessed to remove sequences with low-quality scores. There were 7,579 ± 2,379 high-quality 16S reads per sample with a mean read length of 278 bp.
Viral and microbial environmental data
We retrieved from the online supplemental material of  the 80 available metagenomes (42 viromes, 38 microbiomes). We identified three environments containing at least seven samples and grouped them into coral, hyper-saline, and marine subclasses; the fourth subclass, other, groups all environments with few samples.
Infant and adult microbiome data
The COG profiles of the nine adult and four unweaned infant microbiomes were obtained from the supplemental material of  and used unmodified in this study.
We built three collections of artificial datasets in order to compare LEfSe to KW and Metastats. All datasets have 1,000 features and 100 samples belonging evenly to two classes, and the values are sampled from a Gaussian normal distribution. The samples in the two classes are further organized in four subclasses (two per class) with equal cardinality. Of the 1,000 features, 500 features have different means across classes and should thus be detected as biomarkers (positive features), the other 500 features are evenly distributed among classes or among at least one subclass in both classes and should not be detected as discriminative (negative features). The methods are evaluated assessing the false positive rate (number of erroneously detected biomarkers with respect to the total number of features) and the false negative rate (number of correctly detected non-discriminant features with respect to the total number of features, that is, sensitivity). The three collections of datasets (graphically shown in Figure 5) differ in the distribution of values in the subclasses and in the mean/standard deviation of the normal distribution. (a) The subclasses in the same class have the same parameters (thus the subclass organization is meaningless). Negative features all have μ = 10,000 and σ = 100, whereas one class of the positive features has μ = 10,000 - t (σ = 100) and the other μ = 10,000 + t (σ = 100) where t is a parameter ranging from 1 to 150. The performances of all methods are assessed at regular steps of the t parameter. (b) Datasets in this collection are defined in the same way as collection (a) but with t = 1,000 for all datasets and σ ranging from 1,000 to 10,000. (c) The negative class in the third collection has different subclass distribution. In particular, the second subclass of the first class has the same mean of the first subclass of the second class. The other two subclasses have different means (μ = 10,000 - t and μ = 10,000 + t, t = 1,000), but the feature is not considered differential since the difference is not consistent between subclasses. The positive features are defined in the same way as dataset (b).
Implementation and availability of the method
LEfSe is implemented in Python and makes use of R statistical functions in the coin  and MASS  libraries through the rpy2 library  and of the matplotlib  library for graphical output. LEfSe is provided with a graphical interface in the Galaxy framework [46, 47], which allows the user to select parameters (the primary three stringency parameters, the multiclass setting, and other computational, statistical, and graphical preferences), to pipeline data between modules in a workflow framework, to generate publication-quality graphical outputs, and to combine these results with other statistical and metagenomic analyses. LEfSe is available at .
linear discriminant analysis
linear discriminant analysis effect size
polymerase chain reaction
Ribosomal Database Project
support vector machines.
Golub TR: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.
Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer GLOSSARY. Lancet. 2002, 359: 572-577. 10.1016/S0140-6736(02)07746-2.
Tothill RW, Tinker AV, George J, Brown R, Fox SB, Lade S, Johnson DS, Trivett MK, Etemadmoghadam D, Locandro B, Traficante N, Fereday S, Hung JA, Chiew YE, Haviv I, Australian Ovarian Cancer Study Group, Gertig D, DeFazio A, Bowtell DD: Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res. 2008, 14: 5198-5208. 10.1158/1078-0432.CCR-08-0196.
Wei X, Li K-C: Exploring the within- and between-class correlation distributions for tumor classification. Proc Natl Acad Sci USA. 2010, 107: 6737-6742. 10.1073/pnas.0910140107.
De Filippo C, Cavalieri D, Di Paola M, Ramazzotti M, Poullet JB, Massart S, Collini S, Pieraccini G, Lionetti P: Impact of diet in shaping gut microbiota revealed by a comparative study in children from Europe and rural Africa. Proc Natl Acad Sci USA. 2010, 107: 14691-14696. 10.1073/pnas.1005963107.
Turnbaugh PJ, Bäckhed F, Fulton L, Gordon JI: Diet-induced obesity is linked to marked but reversible alterations in the mouse distal gut microbiome. Cell Host Microbe. 2008, 3: 213-223. 10.1016/j.chom.2008.02.015.
Ley RE, Peterson Da, Gordon JI: Ecological and evolutionary forces shaping microbial diversity in the human intestine. Cell. 2006, 124: 837-848. 10.1016/j.cell.2006.02.017.
Manichanh C, Rigottier-Gois L, Bonnaud E, Gloux K, Pelletier E, Frangeul L, Nalin R, Jarrin C, Chardon P, Marteau P, Roca J, Dore J: Reduced diversity of faecal microbiota in Crohn's disease revealed by a metagenomic approach. Gut. 2006, 55: 205-211. 10.1136/gut.2005.073817.
Sokol H, Seksik P, Furet JP, Firmesse O, Nion-Larmurier I, Beaugerie L, Cosnes J, Corthier G, Marteau P, Doré J: Low counts of Faecalibacterium prausnitzii in colitis microbiota. Inflamm Bowel Dis. 2009, 15: 1183-1189. 10.1002/ibd.20903.
Ordovas JM, Mooser V: Metagenomics: the role of the microbiome in cardiovascular diseases. Curr Opin Lipidol. 2006, 17: 157-161. 10.1097/01.mol.0000217897.75068.ba.
Zhang L, Henson BS, Camargo PM, Wong DT: The clinical value of salivary biomarkers for periodontal disease. Periodontology 2000. 2009, 51: 25-37. 10.1111/j.1600-0757.2009.00315.x.
Zhang L, Farrell JJ, Zhou H, Elashoff D, Akin D, Park NH, Chia D, Wong DT: Salivary transcriptomic biomarkers for detection of resectable pancreatic cancer. Gastroenterology. 2010, 138: 949-957. 10.1053/j.gastro.2009.11.010. e947
NIH HMP Working Group, Peterson J, Garges S, Giovanni M, McInnes P, Wang L, Schloss JA, Bonazzi V, McEwen JE, Wetterstrand KA, Deal C, Baker CC, Di Francesco V, Howcroft TK, Karp RW, Lunsford RD, Wellington CR, Belachew T, Wright M, Giblin C, David H, Mills M, Salomon R, Mullins C, Akolkar B, Begg L, Davis C, Grandison L, Humble M, Khalsa J, et al: The NIH Human Microbiome Project. Genome Res. 2009, 19: 2317-2323.
Hamady M, Fraser-Liggett CM, Turnbaugh PJ, Ley RE, Knight R, Gordon JI: The Human Microbiome Project. Nature. 2007, 449: 804-810. 10.1038/nature06244.
Magrini V, Turnbaugh PJ, Ley RE, Mardis ER, Mahowald MA, Gordon JI: An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006, 444: 1027-1131. 10.1038/nature05414.
Duncan SH, Lobley GE, Holtrop G, Ince J, Johnstone aM, Louis P, Flint HJ: Human colonic microbiota associated with diet, obesity and weight loss. Int J Obesity (Lond). 2008, 32: 1720-1724. 10.1038/ijo.2008.155.
Turnbaugh PJ, Ridaura VK, Faith JJ, Rey FE, Knight R, Gordon JI: The effect of diet on the human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice. Sci Transl Med. 2009, 1: 6ra14-10.1126/scitranslmed.3000322.
Gao Z, Tseng C-h, Strober BE, Pei Z, Blaser MJ: Substantial alterations of the cutaneous bacterial biota in psoriatic lesions. PloS One. 2008, 3: e2719-10.1371/journal.pone.0002719.
Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM: Comparative metagenomics of microbial communities. Science. 2005, 308: 554-557. 10.1126/science.1107851.
Solovyev VV, Allen EE, Ram RJ, Rokhsar DS, Chapman J, Richardson PM, Tyson GW, Rubin EM, Banfield JF, Hugenholtz P: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004, 428: 37-43. 10.1038/nature02340.
Lecuit M, Lortholary O: Immunoproliferative small intestinal disease associated with Campylobacter jejuni. Med Mal Infect. 2005, 35 (Suppl 2): S56-58.
Relman DA, Schmidt TM, MacDermott RP, Falkow S: Identification of the uncultured bacillus of Whipple's disease. N Engl J Med. 1992, 327: 293-301. 10.1056/NEJM199207303270501.
Oakley BB, Fiedler TL, Marrazzo JM, Fredricks DN: Diversity of human vaginal bacterial communities and associations with clinically defined bacterial vaginosis. Appl Environ Microbiol. 2008, 74: 4898-4909. 10.1128/AEM.02884-07.
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.
Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3: Article3-
Clarke R, Ressom HW, Wang A, Xuan J, Liu MC, Gehan Ea, Wang Y: The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer. 2008, 8: 37-49. 10.1038/nrc2294.
Swan Ka, Curtis DE, McKusick KB, Voinov AV, Mapa Fa, Cancilla MR: High-throughput gene mapping in Caenorhabditis elegans. Genome Res. 2002, 12: 1100-1105.
Wooley JC, Ye Y: Metagenomics: facts and artifacts, and computational challenges*. J Comput Sci Technol. 2009, 25: 71-81.
Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP, Egholm M, Henrissat B, Heath AC, Knight R, Gordon JI: A core gut microbiome in obese and lean twins. Nature. 2009, 457: 480-484. 10.1038/nature07540.
Pedrós-Alió C: Marine microbial diversity: can it be determined?. Trends Microbiol. 2006, 14: 257-263. 10.1016/j.tim.2006.04.007.
Sogin ML, Morrison HG, Huber Ja, Welch D, Huse SM, Neal PR, Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci USA. 2006, 103: 12115-12120. 10.1073/pnas.0605127103.
Gobet A, Quince C, Ramette A: Multivariate Cutoff Level Analysis (MultiCoLA) of large community data sets. Nucleic Acids Res. 2010, 38: e155-10.1093/nar/gkq545.
Dethlefsen L, McFall-Ngai M, Relman DA: An ecological and evolutionary perspective on human-microbe mutualism and disease. Nature. 2007, 449: 811-818. 10.1038/nature06245.
Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res. 2007, 17: 377-386. 10.1101/gr.5969107.
Mitra S, Gilbert JA, Field D, Huson DH: Comparison of multiple metagenomes using phylogenetic networks based on ecological indices. ISME J. 2010, 4: 1236-1242. 10.1038/ismej.2010.51.
Mitra S, Klar B, Huson DH: Visual and statistical comparison of metagenomes. Bioinformatics. 2009, 25: 1849-1855. 10.1093/bioinformatics/btp341.
Parks DH, Beiko RG: Identifying biologically relevant differences between metagenomic communities. Bioinformatics. 2010, 26: 715-721. 10.1093/bioinformatics/btq041.
Lozupone C, Knight R: UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol. 2005, 71: 8228-8235. 10.1128/AEM.71.12.8228-8235.2005.
Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA: The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008, 9: 386-10.1186/1471-2105-9-386.
Kristiansson E, Hugenholtz P, Dalevi D: ShotgunFunctionalizeR: an R-package for functional comparison of metagenomes. Bioinformatics. 2009, 25: 2737-2738. 10.1093/bioinformatics/btp508.
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF: Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009, 75: 7537-7541. 10.1128/AEM.01541-09.
Goll J, Rusch D, Tanenbaum DM, Thiagarajan M, Li K, Methé BA, Yooseph S: METAREP: JCVI Metagenomics Reports - an open source tool for high-performance comparative metagenomics. Bioinformatics. 2010, 26: 2631-2632. 10.1093/bioinformatics/btq455.
Jolliffe IT: Principal Component Analysis. 1986, New York: Springer-Verlag
Gower JC: Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika. 1966, 53: 325-338.
White JR, Nagarajan N, Pop M: Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput Biol. 2009, 5: e1000352-10.1371/journal.pcbi.1000352.
Goecks J, Nekrutenko A, Taylor J: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010, 11: R86-10.1186/gb-2010-11-8-r86.
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J: Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol. 2010, Chapter 19: Unit 19.10.1-21-
Kruskal WH, Wallis WA: Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952, 47: 583-621. 10.2307/2280779.
Wilcoxon F: Individual comparisons by ranking methods. Biometrics. 1945, 1: 80-83. 10.2307/3001968.
Mann HB, Whitney DR: On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947, 18: 50-60. 10.1214/aoms/1177730491.
Fisher RA: The use of multiple measurements in taxonomic problems. Ann Eugenics. 1936, 7: 179-188. 10.1111/j.1469-1809.1936.tb02137.x.
Dal Bello F, Hertel C: Oral cavity as natural reservoir for intestinal lactobacilli. Syst Appl Microbiol. 2006, 29: 69-76. 10.1016/j.syapm.2005.07.002.
Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI, Knight R: Bacterial community variation in human body habitats across space and time. Science. 2009, 326: 1694-1697. 10.1126/science.1177486.
Human Microbiome Project clinical sampling protocol. [http://hmpdacc.org/micro_analysis/microbiome_sampling.php]
Turner JR: Intestinal mucosal barrier function in health and disease. Nat Rev Immunol. 2009, 9: 799-809. 10.1038/nri2653.
Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM: The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009, 37: D141-145. 10.1093/nar/gkn879.
Hilbert F, Scherwitzel M, Paulsen P, Szostak MP: Survival of Campylobacter jejuni under conditions of atmospheric oxygen tension with the support of Pseudomonas spp. Appl Environ Microbiol. 2010, 76: 5911-5917. 10.1128/AEM.01532-10.
Godon J-J, Morinière J, Moletta M, Gaillac M, Bru V, Delgènes J-P: Rarity associated with specific ecological niches in the bacterial world: the 'Synergistes' example. Environ Microbiol. 2005, 7: 213-224. 10.1111/j.1462-2920.2004.00693.x.
Shah Sa, Simpson SJ, Brown LF, Comiskey M, de Jong YP, Allen D, Terhorst C: Development of colonic adenocarcinomas in a mouse model of ulcerative colitis. Inflamm Bowel Dis. 1998, 4: 196-202.
Pizarro T: Mouse models for the study of Crohn's disease. Trends Mol Med. 2003, 9: 218-222. 10.1016/S1471-4914(03)00052-2.
Panwala CM, Jones JC, Viney JL: A novel model of inflammatory bowel disease: mice deficient for the multiple drug resistance gene, mdr1a, spontaneously develop colitis. J Immunol. 1998, 161: 5733-5744.
Wirtz S, Neurath MF: Mouse models of inflammatory bowel disease. Adv Drug Delivery Rev. 2007, 59: 1073-1083. 10.1016/j.addr.2007.07.003.
Sartor RB: Mechanisms of disease: pathogenesis of Crohn's disease and ulcerative colitis. Nat Clin Pract Gastroenterol Hepatol. 2006, 3: 390-407. 10.1038/ncpgasthep0528.
Garrett WS, Lord GM, Punit S, Lugo-Villarino G, Mazmanian SK, Ito S, Glickman JN, Glimcher LH: Communicable ulcerative colitis induced by T-bet deficiency in the innate immune system. Cell. 2007, 131: 33-45. 10.1016/j.cell.2007.08.017.
Garrett WS, Gallini CA, Yatsunenko T, Michaud M, DuBois A, Delaney ML, Punit S, Karlsson M, Bry L, Glickman JN, Gordon JI, Onderdonk AB, Glimcher LH: Enterobacteriaceae act in concert with the gut microbiota to induce spontaneous and maternally transmitted colitis. Cell Host Microbe. 2010, 8: 292-300. 10.1016/j.chom.2010.08.004.
Veiga P, Gallini CA, Beal C, Michaud M, Delaney ML, DuBois A, Khlebnikov A, van Hylckama Vlieg JE, Punit S, Glickman JN, Onderdonk A, Glimcher LH, Garrett WS: Bifidobacterium animalis subsp. lactis fermented milk product reduces inflammation by altering a niche for colitogenic microbes. Proc Natl Acad Sci USA. 2010, 107: 18132-18137. 10.1073/pnas.1011737107.
Masaaki O, Yoshimi B, Kai-P L, Nobuko M: Metascardovia criceti Gen. Nov., Sp. Nov., from hamster dental plaque. Microbiol Immunol. 2007, 51: 747-754.
Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M, Desnues C, Haynes M, Li L, McDaniel L, Moran MA, Nelson KE, Nilsson C, Olson R, Paul J, Brito BR, Ruan Y, Swan BK, Stevens R, Valentine DL, Thurber RV, Wegley L, White BA, Rohwer F: Functional metagenomic profiling of nine biomes. Nature. 2008, 452: 629-632. 10.1038/nature06810.
Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crécy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, et al: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005, 33: 5691-5702. 10.1093/nar/gki866.
Greene JM, Collins F, Lefkowitz EJ, Roos D, Scheuermann RH, Sobral B, Stevens R, White O, Di Francesco V: National Institute of Allergy and Infectious Diseases bioinformatics resource centers: new assets for pathogen informatics. Infect Immun. 2007, 75: 3212-3219. 10.1128/IAI.00105-07.
Krebs CJ: Ecology: The Experimental Analysis of Distribution and Abundance. 2008, Benjamin Cummings
Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, Takami H, Morita H, Sharma VK, Srivastava TP, Taylor TD, Noguchi H, Mori H, Ogura Y, Ehrlich DS, Itoh K, Takagi T, Sakaki Y, Hayashi T, Hattori M: Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 2007, 14: 169-181. 10.1093/dnares/dsm018.
Tatusov RL: A genomic perspective on protein families. Science. 1997, 278: 631-637. 10.1126/science.278.5338.631.
Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 2001, 29: 22-28. 10.1093/nar/29.1.22.
Turroni F, Foroni E, Pizzetti P, Giubellini V, Ribbera A, Merusi P, Cagnasso P, Bizzarri B, de'Angelis GL, Shanahan F, van Sinderen D, Ventura M: Exploring the diversity of the bifidobacterial population in the human intestinal tract. Appl Environ Microbiol. 2009, 75: 1534-1545. 10.1128/AEM.02216-08.
Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A: False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics. 2005, 21: 3017-3024. 10.1093/bioinformatics/bti448.
Suzuki Y, Nei M: False-positive selection identified by ML-based methods: examples from the Sig1 gene of the diatom Thalassiosira weissflogii and the tax gene of a human T-cell lymphotropic virus. Mol Biol Evol. 2004, 21: 914-921. 10.1093/molbev/msh098.
Boulesteix A-L: Over-optimism in bioinformatics research. Bioinformatics. 2010, 26: 437-439. 10.1093/bioinformatics/btp648.
2020 visions. Nature. 2010, 463: 26-32.
Hamady M, Knight R: Microbial community profiling for human microbiome projects: tools, techniques, and challenges. Genome Res. 2009, 19: 1141-1152. 10.1101/gr.085464.108.
Wooley JC, Godzik A, Friedberg I: A primer on metagenomics. PLoS Comput Biol. 2010, 6: e1000667-10.1371/journal.pcbi.1000667.
Ritchie MD: Using prior knowledge and genome-wide association to identify pathways involved in multiple sclerosis. Genome Med. 2009, 1: 65-10.1186/gm65.
Tintle N, Lantieri F, Lebrec J, Sohns M, Ballard D, Bickeböller H: Inclusion of a priori information in genome-wide association analysis. Genet Epidemiol. 2009, 33 (Suppl 1): S74-80.
Lin W-Y, Lee W-C: Incorporating prior knowledge to facilitate discoveries in a genome-wide association study on age-related macular degeneration. BMC Res Notes. 2010, 3: 26-10.1186/1756-0500-3-26.
Reeder J, Knight R: The 'rare biosphere': a reality check. Nat Methods. 2009, 6: 636-637. 10.1038/nmeth0909-636.
Taylor MW, Schupp PJ, Dahllof I, Kjelleberg S, Steinberg PD: Host specificity in marine sponge-associated bacteria, and potential implications for marine microbial diversity. Environ Microbiol. 2003, 6: 121-130. 10.1046/j.1462-2920.2003.00545.x.
Tamames J, Abellán JJ, Pignatelli M, Camacho A, Moya A: Environmental distribution of prokaryotic taxa. BMC Microbiol. 2010, 10: 85-10.1186/1471-2180-10-85.
Kassen R: The experimental evolution of specialists, generalists, and the maintenance of diversity. J Evol Biol. 2002, 15: 173-190. 10.1046/j.1420-9101.2002.00377.x.
Frank DN, Pace NR, Peterson DA, Gordon JI: Metagenomic approaches for defining the pathogenesis of inflammatory bowel diseases. Cell Host Microbe. 2008, 3: 417-427. 10.1016/j.chom.2008.05.001.
Young C, Sharma R, Handfield M, Mai V, Neu J: Biomarkers for infants at risk for necrotizing enterocolitis: clues to prevention?. Pediatric Res. 2009, 65: 91R-97R. 10.1203/PDR.0b013e31819dba7d.
Asikainen S, Doğan B, Turgut Z, Paster BJ, Bodur A, Oscarsson J: Specified species in gingival crevicular fluid predict bacterial diversity. PLoS ONE. 2010, 5: e13589-10.1371/journal.pone.0013589.
Wong D, Zhang L, Farrell J, Zhou H, Elashoff D, Gao K, Paster B: Salivary biomarkers for pancreatic cancer detection. J Clin Oncol. 2009, 27: 4630-
Culligan EP, Hill C, Sleator RD: Probiotics and gastrointestinal disease: successes, problems and future prospects. Gut Pathog. 2009, 1: 19-10.1186/1757-4749-1-19.
Preidis GA, Versalovic J: Targeting the human microbiome with antibiotics, probiotics, and prebiotics: gastroenterology enters the metagenomics era. Gastroenterology. 2009, 136: 2015-2031. 10.1053/j.gastro.2009.01.072.
Borody TJ, Warren EF, Leis S, Surace R, Ashman O: Treatment of ulcerative colitis using fecal bacteriotherapy. J Clin Gastroenterol. 2003, 37: 42-47. 10.1097/00004836-200307000-00012.
Khoruts A, Dicksved J, Jansson JK, Sadowsky MJ: Changes in the composition of the human fecal microbiome after bacteriotherapy for recurrent Clostridium difficile-associated diarrhea. J Clin Gastroenterol. 2010, 44: 354-360.
Manichanh C, Reeder J, Gibert P, Varela E, Llopis M, Antolin M, Guigo R, Knight R, Guarner F: Reshaping the gut microbiome with bacterial transplantation and antibiotic intake. Genome Res. 2010, 20: 1411-1419. 10.1101/gr.107987.110.
You D, Franzos MA: Successful treatment of fulminant Clostridium difficile infection with fecal bacteriotherapy. Ann Intern Med. 2008, 148: 632-633.
Chang Y-w, Lin C-j: Feature ranking using linear SVM. J Machine Learning Res. 2008, 3: 53-64.
Wang Q, Garrity GM, Tiedje JM, Cole JR: Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007, 73: 5261-5267. 10.1128/AEM.00062-07.
Bell TC, Cleary JG, Witten IH: Text Compression. 1990, Prentice-Hall, Inc
HMP Data Analysis and Coordination Center. [http://www.hmpdacc.org/tools_protocols/tools_protocols.php]
Mo Bio PowerSoil kit. [http://www.mobio.com/]
Huse SM, Huber Ja, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 2007, 8: R143-10.1186/gb-2007-8-7-r143.
Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007, 35: 7188-7196. 10.1093/nar/gkm864.
Schloss PD: A high-throughput DNA sequence aligner for microbial ecology studies. PloS ONE. 2009, 4: e8230-10.1371/journal.pone.0008230.
Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, Ciulla D, Tabbaa D, Highlander SK, Sodergren E, Methé B, DeSantis TZ, Human Microbiome Consortium, Petrosino JF, Knight R, Birren BW: Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res. 2011, 21: 494-504. 10.1101/gr.112730.110.
Garrity GM, Lilburn TG, Cole JR, Harrison SH, Euzeby J, Tindall BJ: Taxonomic Outline of the Bacteria and Archaea. 2007, [http://www.taxonomicoutline.org/index.php/toba/article/viewFile/190/223]
Sequence Read Archive: SRP002012 Human Microbiome Project 454 Clinical Production Pilot (PPS). [http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP002012#]
Hothorn TH, Hornik K, van De Wiel MA, Zeileis A: Implementing a class of permutation tests: the coin package. J Stat Software. 2008, 28: 1-23.
Venables WN, Ripley BD: Modern Applied Statistics with S. 2002, Springer, 4
Hunter JD: Matplotlib: a 2D graphics environment. Computing Sci Eng. 2007, 9: 90-95.
We would like to thank the entire Human Microbiome Project consortium, including the four sequencing centers (the Broad Institute, Washington University, Baylor College of Medicine, and the J Craig Venter Institute), associated investigators from many additional institutions, and the NIH Office of the Director Roadmap Initiative. This work was supported in part by grant DE017106 from the National Institute of Dental and Craniofacial Research (JI), NIH grants AI078942 (WSG) and Burroughs Wellcome Fund (WSG), and was funded by NIH 1R01HG005969 to CH.
NS and CH conceived the study; NS and LM implemented the methodology; NS: JI: LW: DG: WG: and CH analyzed the results; NS: JI: LW: DG: WG: and CH wrote the manuscript. All authors read and approved the manuscript in its final form.