 Method
 Open access
 Published:
tmap: an integrative framework based on topological data analysis for populationscale microbiome stratification and association studies
Genome Biology volume 20, Article number: 293 (2019)
Abstract
Untangling the complex variations of microbiome associated with largescale host phenotypes or environment types challenges the currently available analytic methods. Here, we present tmap, an integrative framework based on topological data analysis for populationscale microbiome stratification and association studies. The performance of tmap in detecting nonlinear patterns is validated by different scenarios of simulation, which clearly demonstrate its superiority over the most commonly used methods. Application of tmap to several populationscale microbiomes extensively demonstrates its strength in revealing microbiomeassociated host or environmental features and in understanding the systematic interrelations among their association patterns. tmap is available at https://github.com/GPZBioinfo/tmap.
Background
Microbiomewide association studies (MWAS) capture the variation and dynamics of microbiome associated with host phenotypes or environment types [1,2,3,4,5]. In order to identify underappreciated but significant microbiomeassociated host factors, comprehensive sample metadata of largescale populations are usually collected [6,7,8,9,10]. For instance, the gut microbiomes of a healthy population were shown to be associated with hosts’ stool consistency and medication, which may confound the identification of diseaserelated markers [7]; and links among microbiome, metabolome, and diet of individual hosts were characterized with a crowdsourced cohort of over 10,000 citizen scientists [6]. However, these analyses failed to account for different association patterns among subpopulations [11] without methods for mapping a myriad of host phenotypes to complex microbiome profiles.
Identifying association patterns in a highdimensional space of both populationscale host phenotypes and microbiome features poses challenges to existing analytic methods for microbiome analysis [12, 13]. Most of the currently available methods, such as metagenomeSeq [14], DESeq2 [15], Metastats [16], LEfSe [17], and MaAsLin [18], are mainly based on either statistical test for differential microbiome abundance or linear regression to identify associated covariates. Differential abundance testing examines each microbiome feature individually without considering correlations among taxa [14,15,16,17,18,19]. Alternatively, dimension reduction methods can be used to project highdimensional microbiome profiles to lowdimensional spaces for pattern discovery and association, such as principal coordinates analysis (PCoA) and principal component analysis (PCA) [13]. The resulted ordination axes can be utilized by linear regression to identify microbiomeassociated host covariates using the envfit method in the vegan package [20]. There are also methods using distance matrix of microbiome betadiversity for association analysis by fitting linear models to the distances or testing dissimilarities between groups, such as adonis and ANOSIM [21, 22]. These linear methods may not be able to capture nonlinear patterns of hostmicrobiome association in a highdimensional microbiome dataset.
Stratification of populationscale human microbiome has been used to reveal subgroups of hosts with distinct microbiome configurations such as the identification of enterotypes [23]. Current methods for microbiome stratification or clustering analysis include partitionbased clustering (for instance, kmeans and kmedoids clustering) and Dirichlet multinomial mixture (DMM) models [11, 23,24,25]. However, these methods have limitations when applied to highdimensional datasets. For example, kmeans clustering may fail to separate nonlinear local manifold structures imbedded in a highdimensional space [26], while the distribution assumption of DMM may not be fully met in real microbiome datasets [11]. Moreover, these two methods are unsupervised and divide microbiome samples into groups regardless of their associated metadata. Thus, to perform sophisticated stratification of populationscale microbiome samples, it is necessary to develop methods that can identify nonlinear local structures and can be supervised by host phenotypes.
Here, we present tmap, a method based on topological data analysis and network representation for stratification and association study of highdimensional microbiome data. This method is motivated by using advanced largescale data mining techniques to capture subtle and nonlinear patterns of highdimensional datasets [27,28,29,30]. We adopted the Mapper algorithm for topological data analysis (TDA) [31], which has demonstrated its powerful abilities in analyzing complex biological and medical data [32,33,34,35]. This algorithm allows us to construct an informative and compact network representation of highdimensional dataset. We developed tmap to extend the algorithm for microbiome study and utilize the network representation as an integrated framework for both association and stratification of populationscale microbiome data. This framework enables us to identify association of taxa or metadata within the entire network and to extract enrichment subnetworks of different association patterns. We validated the performance of tmap in detecting nonlinear hostmicrobiome association in different scenarios using synthetic microbiome data. Our method successfully identified most of the simulated nonlinear associations, which are hard to be detected with other methods (average ROC AUC of 0.95, 0.55, 0.89, and 0.63 for tmap, envfit, adonis, and ANOSIM respectively). Applying tmap to two populationscale human gut microbiome datasets of the Flemish Gut Flora Project (FGFP) [7] and American Gut Project (AGP) [6] extensively demonstrated its strengths in identifying nonlinear patterns and subpopulation enrichments of microbial taxa and in revealing microbiome stratifications associated with lifestyles. We also applied tmap to an even larger scale dataset of the Earth Microbiome Project (EMP) [5] to illustrate the multiscale interrelations among environment types.
Results
tmap was developed as an integrative framework for analyzing populationscale microbiome data to identify association of host phenotypes with highdimensional microbiome profiles and microbiome stratification enriched with host phenotypes or environment types. Underlying tmap is a bioinformatic pipeline of topological data analysis and network enrichment analysis for the discovery and visualization of the patterns of microbiome variation. The workflow of tmap consists of three major steps (Fig. 1). The first step uses the Mapper algorithm [31] (Fig. 2a, see the “Methods” section for details) to transform highdimensional microbiome profiles into a TDA network to represent the microbiome variation. A node in the network represents a group of samples with highly similar microbiome profiles, and a link between two nodes indicates that common samples are shared between the two groups of samples corresponding to the nodes. Therefore, a TDA network captures both local (as local connections between nodes) and global (as global connections between nodes) patterns of microbiome variation. The second step uses a modified version of the spatial analysis of functional enrichment (SAFE) algorithm [36] (Fig. 2b, see the “Methods” section for details) to map the values of a target variable (metadata or microbiome features) into the TDA network to generate its subnetwork enrichment scores (designated SAFE scores, one score on each node individually). For a given target variable, such as age, a SAFE score on a node quantifies, statistically, the enrichment level of its values in the samples belonging to the subnetwork centered around the node (determined by a threshold of network neighborhood, see the “Methods” section for details). Together, all the SAFE scores of a target variable form a vector of values, measuring all local enrichment levels on all the nodes in a TDA network. In the last step of tmap, vectors of SAFE scores for each metadata or microbiome features can be compared to each other to characterize their interrelations, reflecting the similarities of their enrichment patterns in the network. This step allows us to rank driver taxa of microbiome variation, to perform ordination analysis of SAFE scores, and to calculate coenrichment relations between metadata and microbiome features. With tmap, one can explore how microbiome and the host are associated at different scales, and analyze how different host factors are related to each other attributable to the corresponding microbiome variations. The utility of tmap framework in analyzing the datasets of synthetic microbiomes, human gut microbiomes, and the earth microbiome will be demonstrated and discussed as follows.
Detecting nonlinear enrichment and association patterns in synthetic microbiomes
We first applied tmap on synthetic datasets to evaluate its ability to detect microbiome enrichment and association patterns. In order to simulate microbiomes following the distribution of species diversity and abundance in real datasets, we used SparseDOSSA [37] to generate synthetic microbiome data, which is a Bayesian hierarchical model and estimates species abundance parameters based on training microbiomes [37, 38]. The synthetic datasets closely resembled their training microbiome data as shown in PCoA (BrayCurtis dissimilarity, Additional file 1: Figure S1). Based on these synthetic datasets, we simulated host factors associated with the microbiomes in different scenarios, including linear or nonlinear associations, or the mix of both of them (see the “Methods” section). After that, the performance of tmap in identifying the simulated associations was compared with the most commonly used methods (including envfit, adonis, and ANOSIM) to validate the advantage of tmap in nonlinear settings.
In the case of detecting linear associations, tmap and the other methods in comparison exhibited similar performance (average ROC AUC of 1.00 for all of them, Fig. 3d, e). We further evaluate the performance of tmap in detecting nonlinear associations via simulating different nonlinear enrichment patterns in a microbiome landscape. Both symmetric and asymmetric multiple (two or three) local enrichments were included in our simulation for the detection of nonlinear associations (Fig. 3a–c). The organization of the centers of the local enrichments in each simulated association determines the strength of either linear or nonlinear effect of the association. For instance, an association pattern of symmetric multiple local enrichments can be detected by nonlinear methods, while the projection of this pattern to any linear direction will result in an averaged linear effect of almost zero (Fig. 3a, c). In contrast, asymmetric multiple local enrichments can be detected by nonlinear methods and also by linear methods because of the averaged linear effect greater than zero (Fig. 3b). Different performances in detecting the simulated nonlinear associations were observed for the four methods (average ROC AUC of 0.95, 0.55, 0.89, and 0.63 for tmap, envfit, adonis, and ANOSIM respectively, Fig. 3e, Additional file 20: Table S1, Additional file 2: Figure S2). Although adonis achieved an average ROC AUC of 0.89, which is the best among the three methods in comparison, tmap still significantly improved upon adonis (average AUC of 0.95, p value = 8.11e^{−29}, Additional file 2: Figure S2). The weaker performance of adonis and the other two methods is most likely due to the linear regression technique used by these methods [20,21,22], which can only identify linear association between host factors and microbiome variation. In contrast, tmap is based on network enrichment analysis, which can analyze both linear and nonlinear associations. At last, we evaluated all the methods in a more realistic scenario that consists of both linear and nonlinear associations (see the “Methods” section). Like the above nonlinearonly scenario, tmap still had a significantly better outcome in this mixed scenario (average ROC AUC of 0.98, 0.82, 0.93, and 0.73 for tmap, envfit, adonis, and ANOSIM respectively, Fig. 3e, Additional file 20: Table S1, Additional file 2: Figure S2). Similar improvement by tmap over the other three methods was observed when we varied the number of simulated metadata associated with the microbiome (Additional file 2: Figure S2). Overall, tmap can detect both linear and nonlinear microbiome associations with comparable performance, based on network enrichment analysis rather than linear regression. As validated in the synthetic microbiomes, our method is capable of detecting various kinds of association microbiome patterns. Moreover, in addition to the patterns of multiple local enrichment, tmap is also capable of detecting other types of nonlinear patterns, such as circular or spiral enrichments (Additional file 19: Text S1, Additional file 3: Figure S3).
Improving identification of human gut microbiome stratifications associated with host covariates
Host factors may be associated with gut microbiome in different ways, local or global, linear or nonlinear. In the study of the Flemish Gut Flora Project (FGFP), it was shown that only about 7.63% of the microbiome variation can be explained by the identified host covariates using linear association models [7]. tmap was applied to analyze the FGFP cohort data aiming at improving the identification and interpretation of such populationscale microbiomehost associations, especially in discerning nonlinear and local patterns.
Overall, host covariates identified by tmap showed a consistent overlap with that identified by envfit, particularly for the covariates ranked in the top 19 (Kendall’s tau, cutoff p value = 0.05, R^{2} = 0.50, Fig. 4a). Like that in the original study, Bristol stool scale (BSS) score is the top associated covariate identified by tmap. However, there were differences in the ranking of several host covariates, including time since previous relief, mean corpuscular hemoglobin (MCH), and shift work (Fig. 4a). We speculate that the difference might be due to nonlinear association patterns that are captured by tmap but not by envfit. The observed differences were also supported by adonis and ANOSIM, except for time since previous relief, which showed a weak and nonsignificant association with the microbiomes in ANOSIM (R value = 0, p value = 0.10, Additional file 4: Figure S4, Additional file 21: Table S2). We also noticed that ANOSIM was unable to detect most of the host covariates (only six among the total 69 covariates, cutoff p value = 0.05, FDR corrected [39]) that are found to be significantly associated with the microbiomes by all the other methods (67 of total 69 covariates, cutoff p value = 0.05, FDR corrected, Additional file 21: Table S2). As shown in the TDA network, we found that the enrichment scores of time since previous relief are lower compared to other highly enriched covariates (Fig. 4e). These low scores can be explained by the large variance of time since previous relief observed among samples in the local subnetworks (Additional file 5: Figure S5). On the other hand, enrichment scores of MCH are comparable to that of BSS, consistent with their ranking by tmap although the ranking of MCH by envfit is much lower (Fig. 4b, c).
Associations identified by tmap can be further stratified into subgroups in the microbiome landscape to characterize subpopulationspecific microbiome features. For instance, pet past 3 months appeared to be enriched within two subgroups (Fig. 4d), characterized by different enriched genera. Salmonalla and Yersinia were found to be enriched in the first group, whereas Anaerofustis and Acetanaerobacterium were enriched in the second group (Additional file 22: Table S3). These observations are supported by previous studies, in which Salmonalla and Yersinia were often isolated from healthy cats and dogs [40]. Anaerofustis and Acetanaerobacterium were also found in healthy pets and human gut, but their influence is yet to be understood [41, 42]. Taken together, tmap allows the identification of host covariates with multiple enrichment subgroups and their related microbiome features, which may help to explain microbiome variability among subpopulations and identify specific biomarkers for disease diagnosis and treatment.
Systematic characterization of interrelations between host covariates and microbiome taxa based on more efficient stratification and association analyses
Systematic characterization of interrelations among host factors correlated to the gut microbiota is valuable in understanding hostmicrobiome interaction. By transforming the values of host factors into SAFE scores, we were able to quantify the relations between host factors and taxa. Principal component analysis showed that the overall enrichment patterns (represented by SAFE scores) are explained mainly by the top genera as identified by SAFE enriched scores, including Faecalibacterium, unclassified Ruminococcus, and Bacteroides (Fig. 5a, Additional file 23: Table S4). Therefore, mapping the taxa abundance to the TDA network (by SAFE scores of taxa) allows us to identify driver species that contribute to microbiome variation and to understand how they are related to each other by PCA. This analysis also confirmed the important host covariates identified in the original study, such as gender, hemoglobin, time since previous relief, and HDL cholesterol (Fig. 5a, Additional file 6: Figure S6). We further analyzed their interrelations via coenrichment network analysis of their SAFE scores (see the “Methods” section). The result showed that male was coenriched with Roseburia, Blautia, Faecalibacterium, and hemoglobin, whereas female was coenriched with unclassified Bifidobacteriaceae, unclassified Bacillales, Alistipes, and HDL cholesterol (Fig. 5b, c). These results are consistent with the welldocumented facts that serum hemoglobin concentration in healthy male is often higher than female, whereas healthy female tends to have higher HDL cholesterol [43, 44]. By including both taxa and the related host factors in coenrichment network analysis, a systematic view can be obtained to illustrate the association of blood parameters and gender with the gut microbiome, which would be valuable for understanding their possible interaction or confounding effect.
Disease and medication have been found to contribute significantly to gut microbiota variation [45, 46]. With tmap, we also explored how disease, medication, and microbiome would relate to each other in the FGFP cohort. The original study only found that βlactam antibiotic caused significant increase in the abundance of specific genera. Our analysis identified seven out of the 13 drugs significantly coenriched with three diseases, along with different associated microbiomes (Fig. 5d, e). For instance, both osmotic laxatives and βlactam antibiotic were coenriched with chronic fatigue syndrome, but with distinct subnetworks of the microbiome (Fig. 5d, e), highlighting the possibility of drugspecific microbiome response. We also found that commonly used drugs for ulcerative colitis (UC), such as the antiinflammatory drug, immune system suppressors, and βlactam antibiotics were coenriched with UC, and are associated with different microbiomes. These results demonstrated that tmap may improve systematic and integrative analysis of microbiome and host phenotypes based on more efficient stratification and association methods.
Indepth stratification of human gut microbiome associated with country and lifestyle
Human gut microbiome from different countries was shown to form clusters of distinct community compositions, which were proposed as enterotypes [23]. As a means of stratification of human gut microbiome, clustering approach has been used for enterotype analysis to identify microbiome configurations with distinct responses to drugs or diets [11, 23, 47]. Alternatively, tmap provides another stratification approach based on enrichment patterns of taxa abundance. We applied both approaches to the microbiome data from the American Gut Project (AGP), which comprises microbiome samples from over 10,000 citizen scientists [6]. The results showed that both approaches were able to reveal a global pattern of stratifications in the microbiome landscape, driven by different taxa (Fig. 6a, b). In addition, tmap also detected local indepth stratifications of samples and their driver taxa (Fig. 6a). For instance, the Firmicutes enterotype (ET F) was subdivided into several local stratifications enriched with different taxa of Firmicutes, including Roseburia, Clostridium cluster IV, unclassified Lachnospiraceae, Coprococcus, Ruminococcus, and unclassified Ruminococcaceae (Additional file 24: Table S5). In contrast to enterotype stratification requiring the number of clusters to be set before analysis [23], tmap automatically identifies stratifications of samples based on taxa SAFE scores and the TDA network of microbiome. Our approach to stratifying microbiome variations according to the enrichment patterns of taxa can help to alleviate the problem of inferring discretized enterotypes from the continuous changes of microbiome taxa [48]. Further examination of these ET F local stratifications revealed their associations with host covariates. Existence of ET F local stratifications was also observed in the FGFP microbiomes (Additional file 7: Figure S7). This implies that augmenting known enterotypes with these local stratification patterns will further dissect populationscale microbiome variations for the identification of stratificationspecific microbiome markers and their links with host phenotypes.
Stratification of a populationscale microbiome could be attributed to interactions between host phenotypes and the gut microbiota [11]. We performed ordination analysis of the SAFE scores to reveal the interrelations between the host covariates and taxa accounting for the variation of the AGP microbiomes (Additional file 8: Figure S8, see the “Methods” section). Two of the most prominent host covariates are countries (USA or UK), which were coenriched with different microbial taxa (Fig. 6c). For instance, USA samples were coenriched with Bacteroides, whereas UK samples were coenriched with unclassified Ruminococcaceae. These coenriched taxa have also been identified in the above enterotype and stratification analysis, indicating that the stratification is most likely associated with countries. As reported in previous studies, Bacteroides is an enterotypedriven genus and has been associated with a carnivorous dietary habit [23, 47]. With the available metadata on host lifestyles and dietary habits, we also performed coenrichment network analysis based on their SAFE scores (see the “Methods” section). The resulted networks showed that most of the host factors and taxa were coenriched with two hubs, corresponding to the two countries (Fig. 6d–f). The coenrichments reflect different lifestyles associated with the two countries. For instance, UK samples were coenriched with homecooked meals frequency (daily), milk cheese frequency (daily), whole grain frequency (daily), and vegetable frequency (daily). On the other hand, USA samples were coenriched with antibiotic history (6 months), multivitamin, and unspecified alcohol types. Together, the coenrichment networks indicate that the stratification of the AGP microbiomes can be further linked to lifestyles associated with different countries. This analysis demonstrated the strength of tmap in providing an integrative framework both for stratifying microbiomes and for illustrating the interrelations among host factors contributing to the stratification.
Illustrating the multiscale pattern of the earth microbiome and environment types
In addition to the human gut microbiome datasets analyzed above, we also applied tmap to the largescale microbiome samples from the Earth Microbiome Project (EMP) to extract their ecological patterns [5]. The original study elucidated a multiscale pattern of microbiome diversity of different environment types. In addition, via combination of wellestablished microbial ecology knowledge [49, 50] and the observed microbial diversity of the EMP samples, the EMP Ontology (EMPO) was proposed for the classification of environmental microbiome samples [5]. Our reanalysis found that the SAFE enriched scores of metadata were consistent with their EMPO levels; that is, larger scores (reflecting bigger enrichment subnetworks) correspond to lower levels, and vice versa (Fig. 7a, b). As expected, classes of EMPO level1 were ranked at the top, followed by classes of EMPO level2, and then by EMPO level3 (Fig. 7a). We also found that some of the Environment Ontology (ENVO) descriptors had SAFE enriched scores comparable to that of EMPO classes (Fig. 7b). For instance, the ENVO level1 descriptors of terrestrial and aquatic biome were ranked among the top, close to EMPO level1 classes. But these ENVO descriptors were associated with different subnetworks to that of EMPO (Additional file 9: Figure S9), indicating that they can classify environmental microbiomes in a different way. Furthermore, geographical covariates, such as elevation and latitude, were ranked among the classes of EMPO level2 or level3, suggesting their roles in characterizing environment types. It is also worth noting that technical indicators, such as extraction center, sequence length, and platform, were among classes of EMPO level2 and therefore needed to be considered as strong confounding factors in the metaanalysis of the EMP samples (Fig. 7a, b).
In addition, tmap was used to construct a coenrichment network of the EMPO metadata to reveal their interrelations. As shown in Fig. 7c, the EMPO classes were interconnected in a network whose configuration reflects their hierarchy levels, with nodes of lower level to have more connections with other nodes in the network. A further analysis of the hostassociated samples revealed finer interrelations between hosts and their microbiome stratification, which could complement the EMPO system with more detailed classifications (Fig. 7d, Additional file 10: Figure S10). We also tried to identify driver taxa of different environment types by including both metadata and taxa in a TDA network. The result showed that distinct groups of microbial taxa were enriched with different EMPO level1 classes (Fig. 7c, Additional file 9: Figure S9). Taken together, applying tmap to the largescale EMP dataset enabled us to effectively extract the hidden multiscale ecological patterns and interrelations of environment types associated with the earth microbiome and to identify their detailed stratification for finer classifications.
Discussion and conclusions
tmap is an integrative method for analyzing populationscale microbiome variation, allowing both stratification of complex microbiome landscape and association of metadata of hosts or environmental types. This method is powered by the Mapper algorithm [31] for topological data analysis, which has been shown effective in summarizing largescale highdimensional datasets and in capturing complex patterns. In contrast to other widely used linear regressionbased methods, including envfit, adonis, and ANOSIM, tmap is capable of identifying complex nonlinear patterns in both synthetic and real microbiome datasets, allowed by the employed subnetwork enrichment analysis. Furthermore, the subnetwork enrichment analysis enables tmap to calculate the SAFE scores for systematically mapping all host metadata onto the underlying microbiome variation to extract their associations and interrelations. The potential of tmap in populationscale microbiome studies was extensively demonstrated in our reanalysis of three published datasets [5,6,7], i.e., nonlinear trends and subpopulation enrichments of microbial taxa identified in the FGFP dataset, microbiome stratifications associated with countries and lifestyles revealed in the AGP dataset, and associations of microbiomes with earth environment types in different scales illustrated in the EMP dataset. These results indicated that tmap is capable of bringing more specific and comprehensive insights to the microbiome datasets with the related population metadata.
To achieve a better performance of tmap on a given microbiome dataset, we provide clustering and topological covering parameters that can be tuned, along with optimization functions for their choice (http://tmap.readthedocs.io). As the Mapper algorithm employs densitybased clustering and discards unclustered samples, tmap would perform better with a larger dataset that includes more data points from a microbiome landscape to construct a more faithful topological network representation. Another important aspect of the algorithm is the choice of the filter functions, which depends on the nature of the studied dataset and research questions to be addressed. In practice, dimension reduction methods are the most commonly used filters [30]. For microbiome data, one such method is PCoA, which has helped reveal many biological insights, such as microbiome variations among human body sites [51]. Major components of microbiome variation can be captured by the first two or three PCs of PCoA in microbiome studies [5,6,7], but there may still be a large quantity of variation remained in other PCs. For instance, there is a relatively large proportion of variance uncaptured by the first two PCs (67.91%, 87.10%, and 80.71% for the FGFP, AGP, and EMP datasets), reflecting the dimensional nature of these microbiome variations (Additional file 11: Figure S11). It is worth noting that tmap has an advantage of recovering the distance information from the original highdimensional space that is not captured by the PCs used as filters, such as the first two PCs of PCoA (Additional file 12: Figure S12). As illustrated in Additional file 13: Figure S13, the clustering step of tmap uses the original distance to calculate and recover distinct clusters of samples in their original highdimensional space, which otherwise might be superimposed on each other due to the loss of variation in the projection space. As a result, tmap is able to capture far more variation than that in the projection space by the PCs used as filters (tmap vs PCoA, regression R^{2} of 0.80 vs 0.51 for the FGFP dataset using PC1 and PC2 of PCoA, see the “Methods” section and Additional file 14: Figure S14 for more details). The recovered variance is also evident by comparing the projected distances between samples within a cover to their network distances constructed by tmap (tmap vs PCoA, CV of 1.90 vs 0.55 for the FGFP dataset, see Additional file 15: Figure S15 for other datasets).
In principle, tmap can use more than two PCs as filters, but increasing the number of filters (e.g., n) will lead to an exponential increase in the number of covers to be generated (e.g., 10^{n} covers for a resolution of 10 on each filter), which then will result in too few samples to be clustered or a TDA network too sparse to have sufficient connectivity among nodes (see the “Methods” section for more details on network sparseness). As shown in Additional file 14: Figure S14, using the first four PCs as filters resulted in a TDA network with high sparseness (e.g., 82.88% for the FGFP dataset in PCoA). To ensure that there are enough and dense samples to be binned within each cover for clustering analysis and to obtain a TDA network with sufficient connection between nodes, we used only the first two PCs of PCoA (or twodimensional projection in tSNE or UMAP) as filters in tmap.
Although regression improvement can be achieved by projection using more PCs (or components to be projected), the performance of tmap in capturing the variation of the original highdimensional space of microbiome samples appears robust to the number of PCs used as filters (tmap vs projection space, regression R^{2} of 0.64 ± 0.10 vs 0.46 ± 0.14, Additional file 14: Figure S14). Otherwise, the use of many PCs (such as four PCs) will lead tmap to generate a TDA network with few connections among nodes and therefore will compromise its ability to recover sufficient original microbiome variation (Additional file 14: Figure S14).
In the case that a set of target variables (e.g., a subset of metadata or microbiome features, rather than the overall microbiome variation) are of interest, we propose a supervised strategy to help choose the most suitable PCs to be used as filters. This strategy is based on the proportion of total variance explained by each PC [52], multiplied by the enriched SAFE scores of the target variables, to rank and select a subset of PCs as filters for final TDA network construction (see the “Methods” section and Additional file 16: Figure S16 for details). Applying this strategy to the FGFP dataset showed that the first two PCs (PC1 and PC2) actually gave the largest aggregated scores when all the metadata were taken as target variables (42.39% of the cumulative aggregated score). Moreover, given different target variables of interest, different PCs may be selected as the most informative filters (e.g., PC 8, 3 for age and PC 1, 2 for BSS in the FGFP dataset).
In addition to PCoA, which is widely used in microbiome analysis [13], other dimension reduction methods can also be used as filters, especially nonlinear and largescale methods, such as tSNE [29] and UMAP [27]. To examine the robustness of different dimension reduction methods as filters in tmap to detect hostmicrobiome associations, we applied PCoA, tSNE, and UMAP to the FGFP dataset and compared their results with that of envfit. All three methods shared a significant common subset of host covariates in their top 10’s (4 in top 10, permutation test p value = 1e^{−4}, see the “Methods” section for details, Additional file 17: Figure S17). Furthermore, all three methods consistently supported the observed differences in the effect size of association for the four host covariates as identified in the comparison between tmap and envfit (Additional file 17: Figure S17). In future research, one may expect that it is possible to combine the results from different parameters and filters, to construct an integrative TDA network for multiscale mapping [53] and for large collections of microbiome datasets.
Subnetwork enrichment analysis of the SAFE algorithm allows tmap to transform the values of a target variable into a vector of SAFE scores, corresponding to each node in a TDA network. Therefore, the association patterns of target variables can be compared quantitatively by their vectors of SAFE scores, using ordination or coenrichment analysis (Fig. 1). For each target variable, a networklevel association (designated SAFE enriched score) can be obtained by filtering and summing its SAFE scores of individual nodes (see the “Methods” for details). Like the Rsquared in linear regression, the SAFE enriched score can be used as an effect size to compare between different host covariates for their associations with microbiome variation. It is worth noting, however, that the SAFE score is different from the correlation coefficient in linear regression in two aspects. First, it is able to detect subtle and complex associations, both linear and nonlinear, as demonstrated in our analysis of synthetic and realworld datasets. Second, SAFE scores can form a vector of values, representing all local subnetwork associations, which can be subjected to further analysis of the interrelationships between metadata. In contrast, the correlation coefficient in linear regression is only a value of correlation, which cannot be used to analyze interrelations between the association patterns of metadata, as we have done with SAFE scores.
Furthermore, SAFE scores allow us to use coenrichment analysis to scrutinize whether interrelations between target variables represent confounding effects or biological associations with microbiome variations. For instance, a significant coenrichment between a host covariate (such as Gender) and a taxon (such as Roseburia) may represent the outcome of hostmicrobiome interactions. Instead, a coenrichment between a medication (such as βlactam antibiotic) and a disease (such as chronic fatigue syndrome) is likely due to a confounding effect. Therefore, although SAFE scores are calculated independently (via independent random shuffle) for each metadata or microbiome features, coenrichment analysis would capture both biologically meaningful intercorrelations and confounding effects. Interpretation of these interrelations should be based on our knowledge of hostmicrobiome interactions and the background of studies, as we have demonstrated in the analysis of the FGFP, AGP, and EMP datasets.
In conclusion, tmap is an integrative framework for analyzing populationscale microbiome variations and their association with hosts or environments. Based on topological data analysis, it is able to capture complex microbiome variations from highdimensional datasets and recover the lost variation in their lowdimensional projection or embedding. Moreover, TDA network representation and subnetwork enrichment analysis endows tmap with the ability to extract complex hostmicrobiome association patterns, especially nonlinear associations that are hard to detect with currently available methods. In microbiome research, given our inadequate knowledge of the dynamics and complexity of hostmicrobiome interactions, especially at populationscale, innovative datadriven methods for discovering complex patterns of hostmicrobiome association are urgently needed [54]. In this regard, tmap could provide insights from both microbiome stratification and association analysis to inform further hypothesisdriven microbiome studies. tmap is provided as a software freely available at https://github.com/GPZBioinfo/tmap, along with detailed tutorials and online documents (https://tmap.readthedocs.io).
Methods
Mapper algorithm for microbiome data analysis
tmap is based on the Mapper algorithm [31] for topological data analysis (TDA) to transform highdimensional microbiome profiles of individual samples into a network representation that captures both local and global topological patterns from the profiles (Fig. 2a). This algorithm begins with projection of highdimensional data points (representing microbiome profiles) into a lowdimensional space using filter functions. Usually, dimension reduction methods are used as filters to generate coordinates of data points in a lowdimensional space [30]. For instance, when PCA is used for dimension reduction, either one or two principal component(s) can be used as filter. If PC1 is used as filter, the coordinates of the points along PC1 will be generated. If both PC1 and PC2 are used as filter, the coordinates of the points in a twodimensional space will be generated. Other functions may also be used as filter, such as the eccentricity or density of a dataset, or even a subset of the original dimensions, as long as they can generate coordinates of data points in a lowdimensional space.
After projection of the data points into a lowdimensional space, the covering step of Mapper partitions the space into a number of overlapping covers with equal size. The purpose of covering is to use covers of the lowdimensional space to capture its topological properties, i.e., a cover represents a local neighborhood of the projected data points. Therefore, a collection of covers represents all neighborhood information of the projected data points. To connect neighborhoods that are close to each other, overlaps between covers are retained in the covering step of the algorithm. As a result, covers and their overlaps capture both local neighborhood of points and their global interconnections.
Information of original distances between data points might be lost after dimension reduction or projection. For example, two points that are far apart in the highdimensional space might be projected as close neighbors in the lowdimensional space. The clustering step of Mapper is performed to retain the distance information of the highdimensional space. For each cover, points binned within it are clustered into different clusters based on their distances in the original space rather than their projected distances. For instance, after applying PCoA to microbiome profiles, all the subsets of samples that fall within each cover (a twodimensional cover if the first two coordinates are used, such as PC1 and PC2) will be clustered independently. As a result, samples within a cover would be clustered into different clusters if they are quite different in their original microbiome profiles, although they might be close to each other in the projection space of PC1 and PC2.
The last step of Mapper is to generate a TDA network, in which a node represents a group of samples that are clustered together and a link occurs between two nodes if they share common samples in their clusters. The TDA network provides a compressive representation of highdimensional microbiome profiles for exploring microbiome variations and for stratification and association analysis.
SAFE algorithm and SAFE score
Once a TDA network of microbiome profiles is constructed, the values of metadata or microbiome features can be individually mapped to the network (Fig. 2b). For a given metadata, e.g., age, this mapping assigns a numerical attribute (called mapping score) to each node, by calculating its averaged values among samples in the node (e.g., mean age). We assume that a target variable nonrandomly distributed among the TDA network indicates that its association with the underlying microbiome profiles is significant. Intuitively, if the nodes with high mapping scores are neighbors, interconnected to each other in the network, a pattern of subnetwork enrichment of these nodes can be observed. The significance level of the observed pattern can be calculated by permuting the mapping scores along the whole network. A nonrandom enrichment pattern will have significantly higher scores for the subnetwork compared to the randomly permuted scores.
To implement the above idea, tmap adopts the spatial analysis of functional enrichment (SAFE) algorithm for network enrichment analysis [36]. The algorithm was developed as a systematic method for annotating biological network and examining their functional association. We modified the original algorithm to calculate an enrichment score (designated SAFE score) for each node in a TDA network for a given target variable (metadata or microbiome features), as described in the following steps (and as depicted in Fig. 2b):

1)
This algorithm starts with a TDA network and a target variable of metadata (e.g., age) or taxa. For each node u in the network, SAFE defines a local neighborhood of u by identifying any other nodes that are closer than or equal to a maximum distance threshold (d) to u. Node distance is measured as the shortest path length between nodes. By default, the maximum distance threshold d is set to be equal to the 0.5th percentile of all pairwise node distances in the network.

2)
For each node, SAFE sums the values of neighbor nodes for a target variable as an observed neighborhood score (S_{observed}). Meanwhile, permuted neighborhood scores (S_{permuted}) are obtained by randomly shuffling the target variable among nodes in the network. The enrichment significance of the observed neighborhood score (P) is measured as the probability that a random score will fall between the observed neighborhood score (S_{observed}) and the largest value of all scores (via ranking of both observed and permuted scores, as illustrated in Fig. 2b). Finally, the enrichment significance (P) is transformed into an enrichment score (O), designated as SAFE score, which is normalized in a range from 0 to 1 as below:
$$ {O}_u=\frac{{\log}_{10}\left(\max \left({P}_u,\frac{1}{n+1}\right)\right)}{{\log}_{10}\left(\frac{1}{n+1}\right)} $$
where n is the number of shuffles, P_{u} is the significance of enrichment of node u, and O_{u} is the SAFE score of node u. Random shuffle is performed independently for each target variable.

3)
A node is considered to be significantly enriched under a cutoff p value of 0.05 (which can be tuned in tmap). This cutoff p value can be translated to a cutoff SAFE score as below:
Under the above cutoff value (O_{cutoff}), SAFE enriched score is defined as the sum of SAFE scores of all significantly enriched nodes in a TDA network to measure the overall enrichment significance in the whole network, which can be used to filter or rank metadata or taxa.
Microbiome datasets and sample metadata
The FGFP, AGP, and EMP microbiomes were collected from the data repositories provided in their publications, along with sample metadata (host phenotypes or environment types). We used the available OTU/sOTU tables from the original studies to avoid bias of 16S rRNA sequence data processing for result comparison [55]. The FGFP dataset comprises 1106 fecal samples and 69 identified host covariates, which were classified into seven metadata categories (anthropometric, lifestyle, blood parameters, health, bowel habit, dietary habit, and medication). The AGP dataset comprises 9496 fecal samples and 451 selfreported metadata. The EMP dataset comprises 2000 samples, and their metadata was downloaded from the EMP FTP site (see the “Availability of data and materials” section). The original OTU/sOTU tables were rarified and normalized to obtain an equal number of reads for each sample before further analysis. Betadiversity (BrayCurtis or unweighted UniFrac) distance matrix was calculated with scikitbio (http://scikitbio.org), followed by principal coordinates analysis (PCoA). For the AGP and EMP dataset, in which representative sequences were available, we reannotated their taxonomy using the usearch sintax software (using a cutoff bootstrap value of 0.8) [56, 57]. We used genuslevel profiles for the analysis in this study. To perform network enrichment analysis with tmap, categorical metadata was transformed into onehot encoding with scikitlearn. Only metadata that is collected for at least 90% of samples was retained for further analysis. Missing values were filled with medians for all the retained metadata.
Parameters for topological data analysis and network enrichment analysis
We used the same betadiversity as the original studies (BrayCurtis distance matrix for FGFP, unweighted UniFrac distance matrix for AGP and EMP) for result comparison. tmap used the first two principal coordinates of PCoA as lenses (filters). Different topological and clustering parameters were chosen for the datasets depending on their sample size and microbiome variation (FGFP: overlap = 0.75, resolution = 40, and percentile eps threshold = 95th; AGP: overlap = 0.95, resolution = 120, and percentile eps threshold = 90th; EMP: overlap = 0.75, resolution = 45, and percentile eps threshold = 95th). An online guide is available on how to choose proper parameters for a given dataset (see the “Availability of data and materials” section). After obtaining a TDA network for a dataset, metadata or taxon abundance was individually mapped to the network by enrichment analysis, using the SAFE algorithm. p values were calculated for the observed SAFE scores for each node in the TDA network by permutation test (iterations = 5000) and were FDR corrected for all the nodes. Nodes with a p value of ≥ 0.05 (FDR corrected) were considered significantly enriched and were used to calculate the SAFE enriched score for metadata or taxa.
Identifying and ranking microbiomeassociated covariates
For the FGFP dataset, ranking of the target variables of microbiomeassociated host covariates was compared between envfit, adonis, ANOSIM, and tmap. Ten thousand permutations were used in envfit, adonis, and ANOSIM. Effect sizes were used to rank the covariates by these methods (Rsquared of envfit and adonis, R value of ANOSIM and SAFE enriched score of tmap). Kendall’s tau test was used to statistically compare the rankings of envfit and tmap. Significant p value can be obtained for a consistent ranking of a subset of covariates, which are examined in a stepwise test from top to bottom. Results of the first two stepwise test were absent because Kendall’s tau test is valid only for a ranking with more than two covariates.
Synthetic microbiomes and simulation of associations between microbiome and metadata
Synthetic microbiome datasets were generated with SparseDOSSA, using a Bayesian hierarchical lognormal distribution model to simulate species abundances [37]. Model parameters are estimated by fitting to a reference microbiome dataset. Four microbiome datasets, including the default template dataset of SparseDOSSA, FGFP dataset, AGP dataset, and EMP dataset, were used to train the model independently and the best one was chosen to further simulate associations of metadata (Additional file 1: Figure S1). Associations between metadata and microbiome were simulated by mapping values of metadata onto the PCoA spaces of microbiome variation (PC1 and PC2, using BrayCurtis distance matrix) via various functions. Both linear and nonlinear associations were simulated with the corresponding mapping functions as follows.
Linear associations were generated by the following function:
where the coefficients a and b are randomly chosen from the range of [− 1, 1] for each metadata; PC1 and PC2 are the coordinates of a microbiome sample in the twodimensional PCoA space.
Nonlinear associations of multiple local enrichments were simulated by mapping Gaussian mixtures onto the PCoA space, using the following function:
where n (that is 2 or 3 in our simulation) is the number of Gaussians to be simulated in the mixture; (μ_{i1}, μ_{i2}) is the center of the ith Gaussian in the PCoA space, and σ is the standard deviation; PC1 and PC2 are the coordinates of a microbiome sample in the twodimensional PCoA space.
In order to use ANOSIM for microbiome association analysis, we also simulated categorical variables with linear or nonlinear patterns of associations. Binary discretization of continuous variables (with simulated linear associations as described above) was performed to obtain categorical variables. Based on the median of continuous variable, data points (samples in a PCoA space, PC1 and PC2) were assigned to two categorical groups (labeled as “True” if larger than the median, labeled as “False” otherwise, Additional file 18: Figure S18). For the simulation of categorical variables with nonlinear associations, we used an approach similar to the above simulation of multiple local enrichments. Instead of Gaussian mixtures, this approach picks multiple circular areas from the PCoA space and assigns samples within the areas as “True” and other samples as “False.” First, a number of random samples were selected from the PCoA space to be used as centers. For each categorical variable, this number is randomly chosen in the range from 1 to 5. Second, for each area, the 50 samples that are closest to its center (including the center itself) were included, according to their Euclidean distances on the PCoA space. If a selected sample is already included in other circular areas, it will be skipped and the next closest one is considered. Therefore, the ratio of sample sizes between the two categorical groups (“True” or “False”) was kept in the range from 1:9 to 1:1, given that there were a total of 500 samples in our simulation. As in the case of continuous variables, a mixed simulation comprises both linear and nonlinear associations, in a ratio of 1:3 in their numbers of categorical variables.
We used the default template microbiome dataset and model parameters of SparseDOSSA to generate synthetic microbiomes consisted of 500 samples. Three scenarios were designed to compare the performance of tmap and other methods in detecting associated metadata, including scenarios of linear associations only, nonlinear associations only, and the mix of both of them. In the first two scenarios, 50 associated metadata were generated according to the above mapping functions as positive cases to be detected; 50 random shuffles of the generated metadata were used as negative cases. In the mixed scenarios, four kinds of associations (200 in total, 50 for each kind) were generated and mixed, including Gaussian mixture with three symmetric centers, Gaussian mixture with three asymmetric centers, Gaussian mixture with two centers, and linear distribution as described above. At the same time, random shuffles of the generated metadata were used as negative cases. Performance in detecting positive cases of associated metadata was compared between tmap (measured by SAFE enriched score) and other methods (measured by p value) via receiver operating characteristic curve (ROC) and area under the curve (AUC) score. Significant difference between AUC scores (100 repeats of simulations) was accessed by MannWhitney U test. More details and the codes for the simulation can be found in the online Jupyter notebook (see the “Availability of data and materials” section).
Stratification, enterotype, and ordination analysis with SAFE scores
Taxadriven stratification of the TDA network of microbiome variation was obtained by identifying the most significant enriched genus (with the highest SAFE score among all genera) for each node in the network. A cutoff value of SAFE score ≥ 0.35 (corresponding to a negative logtransformed p value of 0.05 with 5000 iterations by the SAFE algorithm) was used to filter out nodes with no significant genus. The stratification was visualized by coloring the TDA network according to the enriched genera, which resulted in enterotypelike clusters in the microbiome landscape. Each cluster was highlighted by a color specific to its enriched genus. For comparison, traditional enterotype analysis was also performed using the partitioning around medoids (PAM) method (JensenShannon divergence, and a preset number of three clusters) [23]. Ordination of the SAFE scores of metadata and taxa was done by PCA, to visualize how they relate to each other after mapping to the microbiome variation. Metadata or taxa that share similar enrichment subnetworks will be close to each other within the PCA space.
Coenrichment network analysis
SAFE scores of metadata or taxa contain information about their coenrichment patterns on a TDA network, which can be used to calculate their interrelations when accounting for their association with microbiome variation. First, for each feature (metadata or taxa), we separated all the nodes into two groups: one group of enriched nodes (as defined in the above SAFE algorithm) and another group of the remaining nodes. Therefore, for each pair of features, a contingency table can be obtained based on the combination of their node groups. Next, Fisher’s exact test was used to examine the independence of node groups, and its p value was used for coenrichment network construction. Only positive dependence of node groups was considered as a coenrichment relationship. The resulted network was filtered using a threshold of 0.5th percentile of the p values (FDR corrected). The negative logtransformed p value of the test was used as edge weight for each pair of coenriched features in a coenrichment network.
Selection and evaluation of filter functions in tmap for microbiome data analysis
We used PCoA for the projection of the FGFP microbiome profiles to a lowdimensional space and demonstrated a strategy on the selection and evaluation of PCs as filters in tmap given a chosen set of target variables of interest. First, each individual PC was used as the only filter in tmap to construct a TDA network. After that, SAFE enriched scores of the chosen target variables were calculated based on the network and were summed to quantify the overall association of all the variables. The summed score for each PC was then multiplied by the ratio of variance explained of the PC to calculate an aggregated score. At last, all the PCs were ranked according to their aggregated scores. Accumulation curve of the ranked aggregated scores was then used to determine how many and which PCs to be selected as filters in a final analysis given a specified threshold value (see Additional file 16: Figure S16 for a workflow of the steps). The specified threshold value determines how much of the cumulative aggregated scores to be kept by the selected PCs (e.g., 70%).
Different dimension reduction methods, including PCoA, tSNE, and UMAP, were compared to evaluate the robustness of tmap. In the comparison, two components were selected from each method to be used as filters in tmap. And all the methods used the same set of parameters (overlap = 0.75, resolution = 40, and percentile eps threshold = 95th). To assess the significance of the observed number of common covariates in the top 10’s of the rankings from each method, permutation test is used. In detail, three pseudorankings were obtained by permuting the 69 covariates three times independently for each iteration to calculate a random number of common covariates in the top 10’s from the pseudorankings. After 10,000 iteration, the observed value was compared with the random values to obtain its rank (r) in a descending order, and p value was calculated as r/10000.
Recovering complex microbiome variations from highdimensional space
tmap is able to recover the original complex microbiome variations that are lost in a lowdimensional projection space, such as in the PCoA space of PC1 and PC2 (Additional file 13: Figure S13). Linear regression analysis was performed to quantify the variation captured by tmap than that in the lowdimensional projection after using filters of dimension reduction methods. Rsquared (R^{2}) was obtained from the linear regression between the original distance in the original highdimensional microbiome profiles and the projected distance in the projection space, or the network distance in tmap, respectively. In this study, the original distance is the BrayCurtis distance between samples in their original highdimensional microbiome profiles. The projected distance is the Euclidean distance between samples in the projection space. Network distance was measured as the minimal number of edges to be traversed (or shortest path) between each pair of nodes. Because network distances were calculated between nodes, in which nodes are groups of samples, the corresponding original distances between samples from two nodes (u, v) are calculated as below:
where n and m are the number of samples in node u and v respectively; u_{i} is the ith sample in node u, and v_{j} the jth sample in node v; and d (u_{i}, v_{j}) is the BrayCurtis distance between sample u_{i} and v_{j}.
In this study, we defined a sparseness metric of a TDA network to quantify the overall connectivity among its nodes as below:
where n is the total number of nodes in the TDA network and u and v are two nodes in the network. Selfconnections (e.g., u = v) are also counted. The greater the value of the sparseness of a TDA network indicates that the larger number of node pairs that are not connected and therefore cannot be measured by the network distance.
Comparison between lowdimensional projection and tmap using the above regression analysis was performed for different dimension reduction methods (including PCoA, PCA, tSNE, UMAP) and also for different number of components used as filters (from top two to four components). We also compared samples that are binned within a same cover in the projection space to measure the difference in variance captured by different methods; that is, distances between samples or nodes from different covers were not included in the comparison. The obtained network distances and projected distances (from within each cover) were normalized into the range of [0, 1], by dividing the distance to the maximum distance from the overall network or the overall projection space respectively. Coefficient of variation (CV), the ratio of the standard deviation to the mean, was calculated for both the network distances and the projected distances to compare the variance captured by each of the methods (tmap vs dimension reduction) when constrained within individual covers.
Availability of data and materials
GitHub source of tmap is available under the GNU General Public License v3.0 at https://github.com/GPZBioinfo/tmap [58]. Online documents of tmap: https://tmap.readthedocs.io/. The version of the source code of tmap during the current study is available at https://doi.org/10.5281/zenodo.3228979 [59]. Jupyter notebooks of the source codes for data analysis, simulated datasets, microbiome datasets, and sample metadata: https://github.com/GPZBioinfo/tmap_notebook. Datasets used in this study were obtained from the following public sources: FGFP dataset [7] from ENA (https://www.ebi.ac.uk/ega/studies/EGAS00001001689) [60], AGP dataset [6] from ENA (https://www.ebi.ac.uk/ena/data/view/PRJEB11419) [61], and EMP dataset [5] from Earth Microbiome Project (ftp://ftp.microbio.me/emp/release1) [62].
References
Gilbert JA, Quinn RA, Debelius J, Xu ZZ, Morton J, Garg N, Jansson JK, Dorrestein PC, Knight R. Microbiomewide association studies link dynamic microbial consortia to disease. Nature. 2016;535:94–103.
Wang J, Jia H. Metagenomewide association studies: finemining the microbiome. Nat Rev Microbiol. 2016;14:508–22.
Gilbert JA, Blaser MJ, Caporaso JG, Jansson JK, Lynch SV, Knight R. Current understanding of the human microbiome. Nat Med. 2018;24:392–400.
Gilbert JA, Jansson JK, Knight R. Earth Microbiome Project and Global Systems Biology. mSystems. 2018;3:e00217–17.
Thompson LR, Sanders JG, McDonald D, Amir A, Ladau J, Locey KJ, Prill RJ, Tripathi A, Gibbons SM, Ackermann G, et al. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature. 2017;551:457–63.
McDonald D, Hyde E, Debelius JW, Morton JT, Gonzalez A, Ackermann G, Aksenov AA, Behsaz B, Brennan C, Chen Y, et al. American gut: an open platform for citizen science microbiome research. mSystems. 2018;3:e00031–18.
Falony G, Joossens M, VieiraSilva S, Wang J, Darzi Y, Faust K, Kurilshikov A, Bonder MJ, VallesColomer M, Vandeputte D, et al. Populationlevel analysis of gut microbiome variation. Science. 2016;352:560–4.
Zhernakova A, Kurilshikov A, Bonder MJ, Tigchelaar EF, Schirmer M, Vatanen T, Mujagic Z, Vila AV, Falony G, VieiraSilva S, et al. Populationbased metagenomics analysis reveals markers for gut microbiome composition and diversity. Science. 2016;352:565–9.
He Y, Wu W, Zheng HM, Li P, McDonald D, Sheng HF, Chen MX, Chen ZH, Ji GY, Zheng ZD, et al. Regional variation limits applications of healthy gut microbiome reference ranges and disease models. Nat Med. 2018;24:1532–5.
Deschasaux M, Bouter KE, Prodan A, Levin E, Groen AK, Herrema H, Tremaroli V, Bakker GJ, Attaye I, PintoSietsma SJ, et al. Depicting the composition of gut microbiota in a population with varied ethnic origins but shared geography. Nat Med. 2018;24:1526–31.
Costea PI, Hildebrand F, Arumugam M, Backhed F, Blaser MJ, Bushman FD, de Vos WM, Ehrlich SD, Fraser CM, Hattori M, et al. Enterotypes in the landscape of gut microbial community composition. Nat Microbiol. 2018;3:8–16.
Mallick H, Ma S, Franzosa EA, Vatanen T, Morgan XC, Huttenhower C. Experimental design and quantitative analysis of microbial community multiomics. Genome Biol. 2017;18:228.
Knight R, Vrbanac A, Taylor BC, Aksenov A, Callewaert C, Debelius J, Gonzalez A, Kosciolek T, McCall LI, McDonald D, et al. Best practices for analysing microbiomes. Nat Rev Microbiol. 2018;16:410–22.
Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial markergene surveys. Nat Methods. 2013;10:1200–2.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014;15:550.
Paulson JN, Pop M, Bravo HC. Metastats: an improved statistical method for analysis of metagenomic data. Genome Biol. 2011;12:12.
Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, Huttenhower C. Metagenomic biomarker discovery and explanation. Genome Biol. 2011;12:R60.
Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, Reyes JA, Shah SA, LeLeiko N, Snapper SB, et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012;13:R79.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.
Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O’hara R, Simpson GL, Solymos P, Stevens MHH. Wagner H: vegan: community ecology package. 2018. R package version 2.55. https://CRAN.Rproject.org/package=vegan. Accessed 16 Nov 2019.
Zapala MA, Schork NJ. Multivariate regression analysis of distance matrices for testing associations between gene expression patterns and related variables. Proc Natl Acad Sci U S A. 2006;103:19430–5.
Clarke KR. Nonparametric multivariate analyses of changes in community structure. Aust J Ecol. 1993;18:117–43.
Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, Mende DR, Fernandes GR, Tap J, Bruls T, Batto JM, et al. Enterotypes of the human gut microbiome. Nature. 2011;473:174–80.
Holmes I, Harris K, Quince C. Dirichlet multinomial mixtures: generative models for microbial metagenomics. PLoS One. 2012;7:e30126.
Ding T, Schloss PD. Dynamics and associations of microbial community types across the human body. Nature. 2014;509:357–60.
Chazal F, Guibas LJ, Oudot SY, Skraba P. Persistencebased clustering in Riemannian manifolds. J ACM. 2013;60:41.
McInnes L, Healy J: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:180203426 2018.
Amir el AD, Davis KL, Tadmor MD, Simonds EF, Levine JH, Bendall SC, Shenfeld DK, Krishnaswamy S, Nolan GP, Pe’er D: viSNE enables visualization of high dimensional singlecell data and reveals phenotypic heterogeneity of leukemia. Nat Biotechnol 2013, 31:545–552.
Lvd M, Hinton G. Visualizing data using tSNE. J Mach Learn Res. 2008;9:2579–605.
Lum PY, Singh G, Lehman A, Ishkanov T, VejdemoJohansson M, Alagappan M, Carlsson J, Carlsson G. Extracting insights from the shape of complex data using topology. Sci Rep. 2013;3:1236.
Singh G, Mémoli F, Carlsson GE. Topological methods for the analysis of high dimensional data sets and 3d object recognition. In: SPBG; 2007. p. 91–100.
Li L, Cheng WY, Glicksberg BS, Gottesman O, Tamler R, Chen R, Bottinger EP, Dudley JT. Identification of type 2 diabetes subgroups through topological analysis of patient similarity. Sci Transl Med. 2015;7:311ra174.
Chan JM, Carlsson G, Rabadan R. Topology of viral evolution. Proc Natl Acad Sci U S A. 2013;110:18566–71.
Rizvi AH, Camara PG, Kandror EK, Roberts TJ, Schieren I, Maniatis T, Rabadan R. Singlecell topological RNAseq analysis reveals insights into cellular differentiation and development. Nat Biotechnol. 2017;35:551–60.
Nicolau M, Levine AJ, Carlsson G. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proc Natl Acad Sci U S A. 2011;108:7265–70.
Baryshnikova A. Systematic functional annotation and visualization of biological networks. Cell Syst. 2016;2:412–21.
Ren B, Schwager E, Tickle T, Huttenhower C. sparseDOSSA: Sparse Data Observations for Simulating Synthetic Abundance. 2019. R package version 1.10.0. Accessed 16 Nov 2019.
Schwager E, Mallick H, Ventz S, Huttenhower C. A Bayesian method for detecting pairwise associations in compositional data. PLoS Comput Biol. 2017;13:e1005852.
Benjamini Y, Hochberg Y. Controlling the false discovery rate  a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol. 1995;57:289–300.
Kocabiyik AL, Cetin C, Dedicova D. Detection of Salmonella spp. in stray dogs in Bursa Province, Turkey: first isolation of Salmonella Corvallis from dogs. J Vet Med. 2006;53:194–6.
Gnanandarajah JS, Johnson TJ, Kim HB, Abrahante JE, Lulich JP, Murtaugh MP. Comparative faecal microbiota of dogs with and without calcium oxalate stones. J Appl Microbiol. 2012;113:745–56.
Kang DW, Park JG, Ilhan ZE, Wallstrom G, Labaer J, Adams JB, KrajmalnikBrown R. Reduced incidence of Prevotella and other fermenters in intestinal microflora of autistic children. PLoS One. 2013;8:e68322.
Johnson JL, Slentz CA, Duscha BD, Samsa GP, McCartney JS, Houmard JA, Kraus WE. Gender and racial differences in lipoprotein subclass distributions: the STRRIDE study. Atherosclerosis. 2004;176:371–7.
Murphy WG. The sex difference in haemoglobin levels in adults  mechanisms, causes, and consequences. Blood Rev. 2014;28:41–7.
Forslund K, Hildebrand F, Nielsen T, Falony G, Le Chatelier E, Sunagawa S, Prifti E, VieiraSilva S, Gudmundsdottir V, Pedersen HK, et al. Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota. Nature. 2015;528:262–6.
Maier L, Pruteanu M, Kuhn M, Zeller G, Telzerow A, Anderson EE, Brochado AR, Fernandez KC, Dose H, Mori H, et al. Extensive impact of nonantibiotic drugs on human gut bacteria. Nature. 2018;555:623–8.
Wu GD, Chen J, Hoffmann C, Bittinger K, Chen YY, Keilbaugh SA, Bewtra M, Knights D, Walters WA, Knight R, et al. Linking longterm dietary patterns with gut microbial enterotypes. Science. 2011;334:105–8.
Knights D, Ward TL, McKinlay CE, Miller H, Gonzalez A, McDonald D, Knight R. Rethinking “enterotypes”. Cell Host Microbe. 2014;16:433–7.
Lozupone CA, Knight R. Global patterns in bacterial diversity. Proc Natl Acad Sci U S A. 2007;104:11436–40.
Ley RE, Lozupone CA, Hamady M, Knight R, Gordon JI. Worlds within worlds: evolution of the vertebrate gut microbiota. Nat Rev Microbiol. 2008;6:776–88.
Huttenhower C, Gevers D, Knight R, et al. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486:207–14.
Kiers HAL. Modern multidimensional scaling: theory and applications. Psychometrika. 1999;64:95–7.
Dey TK, Mémoli F, Wang Y: Multiscale mapper: topological summarization via codomain covers. In Proceedings of the twentyseventh annual acmsiam symposium on discrete algorithms. SIAM; 2016: 997–1013.
Tripathi A, Marotz C, Gonzalez A, VazquezBaeza Y, Song SJ, Bouslimani A, McDonald D, Zhu Q, Sanders JG, Smarr L, et al. Are microbiome studies ready for hypothesisdriven research? Curr Opin Microbiol. 2018;44:61–9.
Sinha R, AbuAli G, Vogtmann E, Fodor AA, Ren B, Amir A, Schwager E, Crabtree J, Ma S, Microbiome quality control project C, et al. Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium. Nat Biotechnol. 2017;35:1077–86.
Edgar R. SINTAX: a simple nonBayesian taxonomy classifier for 16S and ITS sequences. BioRxiv. 2016; https://doi.org/10.1101/074161.
Edgar RC. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat Methods. 2013;10:996–8.
Liao TH, Wei YC, Zhou HK. tmap: an integrative framework based on topological data analysis for populationscale microbiome stratification and association studies. GitHub 2019. https://github.com/GPZBioinfo/tmap Release version 1.1. Accessed 16 Nov 2019.
Liao TH, Wei YC, Zhou HK. tmap: an integrative framework based on topological data analysis for populationscale microbiome stratification and association studies. Zenodo. 2019; https://doi.org/10.5281/zenodo.3228979. Release version 1.1. Accessed 16 Nov 2019.
Falony G, Joossens M, VieiraSilva S, Wang J, Darzi Y, Faust K, Kurilshikov A, Bonder MJ, VallesColomer M, Vandeputte D, et al: Populationlevel analysis of gut microbiome variation. Datasets. Eur Nucleotide Arch 2016. https://www.ebi.ac.uk/ega/studies/EGAS00001001689 Accessed 1 Oct 2019.
McDonald D, Hyde E, Debelius JW, Morton JT, Gonzalez A, Ackermann G, Aksenov AA, Behsaz B, Brennan C, Chen YF, et al: American Gut Project. Datasets. Eur Nucleotide Arch 2018. https://www.ebi.ac.uk/ena/data/view/PRJEB11419 Accessed 1 Oct 2019.
Thompson LR, Sanders JG, McDonald D, Amir A, Ladau J, Locey KJ, Prill RJ, Tripathi A, Gibbons SM, Ackermann G, et al: A communal catalogue reveals Earth’s multiscale microbial diversity. Datasets. Earth Microbiome Project 2017. ftp://ftp.microbio.me/emp/release1 Accessed 1 Oct 2019.
Acknowledgements
We thank Huansheng Gao and Peijian Qiu for their help in improving the Python codes for network plotting and coloring and Xuechen Chen for her help in improving the manuscript. We also thank the HPC computational support from BioMed Big Data Center of PICB of SINH.
Review history
The review history is available as Additional file 25.
Additional information
Peer review information: Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Funding
This work was supported by the Shenzhen Science and Technology Innovation Committee (Basic Science Research Grant JCYJ20170818154941048 to ZHOU Haokui), National Key R&D Program of China (2017YFC0907505 and 2018YFA0900704), Key Research Program of the Chinese Academy of Sciences (KFZDSW2195), and International Partnership Program of Chinese Academy of Sciences (153D31KYSB20170121). This work was also funded by the Shenzhen Peacock Team Plan (KQTD2015033117210153 and KQTD2016112915000294). Financial support was also provided by the BioMed Big Data Center of PICB of SINH, and Shandong Academician Workstation Program #170401 (to ZHAO GuoPing), Shandong Children’s Microbiome Center, Qilu Children’s Hospital of Shandong University.
Author information
Authors and Affiliations
Contributions
HKZ and GPZ designed the study. THL, YCW, and HKZ developed the software and performed the analysis. All authors wrote, read, and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1: Figure S1.
Comparison between simulated data and training data for different microbiome datasets via principal coordinate analysis (PCoA). BrayCurtis distance matrix is used in the analysis. PCoA plots show the similarity between simulated data and training data of (a) FGFP, (b) EMP, (c) AGP and (d) the demo data of SparseDOSSA, respectively.
Additional file 2: Figure S2.
Performance of tmap in detecting linear and nonlinear patterns of simulated microbiome associations for different number of metadata. Receiver operating characteristic (ROC) curves are used to compare the performance between (a) tmap and adonis, (b) tmap and ANOSIM, (c) tmap and envfit , in detecting microbiomeassociated metadata. Categorical metadata are used for the comparison between tmap and ANOSIM. Continuous metadata are used in other cases. Three scenarios of association with different number of metadata are examined (including linearonly, nonlinearonly and a mix of both). The shaded areas indicate 95% confidence intervals (100 repeats). (d) Onesided (greater) ttest is used to test the significance of improved area under the curve (AUC) scores of tmap over the other three methods (envfit, adonis and ANOSIM respectively).
Additional file 3: Figure S3.
Illustrations of tmap in the detection of associations of simulated metadata. Color legend (from blue to red) indicates values of metadata (from small to large). Network color represents SAFE scores on each node.
Additional file 4: Figure S4.
Comparison of rankings of host covariates associated with the FGFP microbiomes using envfit, adonis, ANOSIM and tmap.
Additional file 5: Figure S5.
Example of large variances of a host covariate in a local subnetwork that lead to low SAFE scores. Left, PCoA plot of samples colored according to the host covariate of time since previous relief. Right, TDA network colored according to the SAFE scores of time since previous relief. The zoomed area shows a local subnetwork with a large variance of the covariate, which results in low SAFE scores. Node colors are based on their SAFE scores, from red (large values) to blue (small values).
Additional file 6: Figure S6.
Illustrations of TDA network enrichment analysis of metadata compared with PCoA. (a,c,e) PCoA plots of microbiome samples of the FGFP cohort, colored according to the covariates of Gender:F, Time since previous relief and HDL cholesterol, respectively. (b,d,f) TDA network enrichment scores (SAFE scores) of the covariates of Gender:F, Time since previous relief and HDL cholesterol, respectively. Colors are based on their values, from red (large values) to blue (small values).
Additional file 7: Figure S7.
Indepth stratification of the FGFP microbiomes. (a) Stratification based on traditional enterotype analysis. Nodes are colored based on enterotype driver taxa. (b) Stratification based on the most enriched taxon, which is identified from all taxa by comparing their SAFE scores on each node. Node colors are based on the identified taxon. Only taxa enriched in more than 100 nodes are highlighted. Remaining unstratified nodes (with no enriched taxa) are colored in gray.
Additional file 8: Figure S8.
PCA of the SAFE scores of taxa and host covariates shows the overall pattern of their associations with the AGP microbiomes. The top 10 covariates and taxa identified by SAFE enriched scores are highlighted (markers with edge color of gray) and annotated with their names. Host covariates are colored based on metadata category, and taxa are in red. Marker size is scaled according to the SAFE enriched score of metadata or taxa.
Additional file 9: Figure S9.
Comparison of TDA network enrichment patterns between classes of EMPO level1 and ENVO_biome level1. Enriched subnetworks of the EMP microbiomes are identified and colored based on the classes of EMPO level1 (a) and classes of ENVO_biome level1 (b), respectively. Only enriched nodes are colored and showed in the network. The remaining nodes are colored in gray.
Additional file 10: Figure S10.
TDA network enrichment patterns of hostassociated microbiomes. Nodes in the TDA network of the EMP microbiomes are colored based on their enriched host. Classification of the hosts are curated manually. Only enriched nodes are colored and showed in the network. The remaining nodes are colored in gray.
Additional file 11: Figure S11.
Proportion of total variance explained by each PC in PCoA of the FGFP, AGP and EMP datasets.
Additional file 12: Figure S12.
Regression between original distance and projected distance or network distance for the FGFP, AGP and EMP datasets respectively. PC1 and PC2 of PCoA are used as filters for tmap analysis.
Additional file 13: Figure S13.
Recovering microbiome variations from their original high dimensional space by tmap. After the projection of high dimensional microbiome profiles into a low dimensional space (such as PC1 and PC2 of PCoA), original variations might be lost. For example, two distinct clusters of samples (colored orange and blue) from the original space are superimposed on each other in the projection space. The clustering step of tmap performs clustering analysis using their original distances to recover the separated clusters from the original space. The recovered variation is captured by tmap in its network representation.
Additional file 14: Figure S14.
Better regression performance of tmap in capturing microbiome variations in high dimensional space than dimension reduction methods. Each panel shows the linear regressions between the original distance and the projected distance (at the left), or the network distance (at the right), for different dimension reduction methods (panels along the row), and for different number of components used as filters (panels along the column). Rsquared (R^{2}) is shown for each regression. The proportion of pairs of nodes that are not connected is indicated by the TDA network sparseness metric.
Additional file 15: Figure S15.
More variance captured by TDA network distance in tmap than the projected distance for samples within a same cover. For each pair of nodes constructed from the same cover, projected distances (colored blue) and network distances (colored red), both normalized into the range of [0, 1], are plotted against their original distances. Coefficient of variation (CV) is shown to indicate the variances captured for different datasets: (a) the AGP dataset, (b) the EMP dataset, (c) the FGFP dataset.
Additional file 16: Figure S16.
Workflow of evaluation and selection of PCs as filters in tmap for a subset of target variables. The workflow begins with principal coordinates analysis (PCoA) of a microbiome dataset to obtain individual PCs and the proportion of total variance explained by each PC. Aggregated scores of the chosen target variables are then calculated for each PC by using the PC as filter in tmap. Ranking and accumulation curve of the aggregated scores is then employed to select the most suitable PCs for a final tmap analysis, according to a specified threshold of the cumulative aggregated scores. Details of score calculation are shown alongside each step.
Additional file 17: Figure S17.
Comparison of different reduction methods as filters in tmap. Bristol stool scale (BSS), time since previous relief, mean corpuscular hemoglobin (MCH) and shift work are indicated in red. The four common host covariates in the top 10’s of the rankings from different methods are shown in bold text.
Additional file 18: Figure S18.
Illustrations of simulated nonlinear associations for categorical metadata.
Additional file 19: Text S1.
Descriptions of the simulations of circular and spiral association patterns.
Additional file 20: Table S1.
Comparison of the performances in detecting simulated metadata between envfit, adonis, ANOSIM and tmap.
Additional file 21: Table S2
. Detection of host covariates significantly associated with the FGFP microbiomes using envfit, adonis, ANOSIM and tmap.
Additional file 22: Table S3.
Coenrichment subnetworks of pet past 3 months and its coenriched features of the FGFP microbiomes.
Additional file 23: Table S4.
Ordination results of the SAFE scores of metadata and taxa for the datasets of FGFP, AGP and EMP.
Additional file 24: Table S5.
Comparison of the stratification of the AGP microbiomes between tmap and PAM based clustering.
Additional file 25:
Review history.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Liao, T., Wei, Y., Luo, M. et al. tmap: an integrative framework based on topological data analysis for populationscale microbiome stratification and association studies. Genome Biol 20, 293 (2019). https://doi.org/10.1186/s1305901918714
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1305901918714