Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox

The human microbiome is increasingly mined for diagnostic and therapeutic biomarkers using machine learning (ML). However, metagenomics-specific software is scarce, and overoptimistic evaluation and limited cross-study generalization are prevailing issues. To address these issues, we developed SIAMCAT, a versatile R toolbox for ML-based comparative metagenomics. We demonstrate its capabilities in a meta-analysis of fecal metagenomic studies (10,803 samples). When naively transferred across studies, ML models lost accuracy and disease specificity, which could however be resolved by a novel training set augmentation strategy. This reveals some biomarkers to be disease-specific, with others shared across multiple conditions. SIAMCAT is freely available from siamcat.embl.de. Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-021-02306-1.

The dataset from Forslund et al., which included samples with type 2 diabetes (T2D) and non-diabetic controls, also contains information about metformin treatment. (a) Output of the check.confounders function for the data from Forslund et al. shows that only T2D cases were treated with metformin, suggesting that metformin treatment could confound the associations between microbiome features and T2D. (b) Analysis of variance (using ranked abundance data) shows that many species differ as much (or more) by metformin treatment as by T2D status; extreme cases of confounding are highlighted. Dot size is proportional to the mean relative abundance across samples. (c) Relative abundances of Enterobacteriaceae spp. are significantly larger for metformin-treated (metformin+) T2D cases compared to metformin-negative (metformin-) T2D cases or controls (P-values from Wilcoxon test). (d) SIAMCAT models can easily distinguish between metformin+ T2D cases and controls, and between metformin+ T2D and metformin- T2D cases. On the other hand, metformin- T2D cases and controls are harder to distinguish (see Figure 1 in Forslund et al. as reference).

Figure S3: Metagenomic samples are more similar within subjects than across subjects. To illustrate that metagenomic measurements are more similar within subjects than across subjects, all pairwise Bray-Curtis dissimilarities were calculated for in-house datasets with repeated measurements for different subjects. Dissimilarity values are displayed as boxplots and coloured depending on whether the two samples came from the same subject or from two different subjects. The dissimilarities for samples from the same subject are significantly lower than across subjects for all datasets (P-values from Wilcoxon test). Boxes denote the IQR across all values, with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR.

(a) For many classification tasks, our analysis indicates no separation of groups based on the Bray-Curtis distance, whereas machine learning models can be trained to accurately distinguish between the two groups.
(b) Principal coordinate plots based on the Bray-Curtis distance for the classification tasks highlighted in (a). Control samples are shown as grey dots and disease samples are shown in red, irrespective of the disease. For the classification tasks in the upper row, there is a good separation both based on the Bray-Curtis distance and in the machine learning analysis. For the tasks in the lower row, however, accurate machine learning models can be trained, but there is no apparent separation based on the Bray-Curtis distance. (c) Equivalent plot as in (a), but based on the Euclidean distance after log-transformation. (d) Equivalent plots as in (b), but based on the Euclidean distance after log-transformation.

Whereas the choice of feature selection cutoff is less important for profiles generated with the RDP profiler, the other data types seem to profit the more features are included, especially in the case of HUMAnN profiles. Boxes denote the IQR across all values, with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR.

Accuracy as measured by AUROC is displayed for all parameter combinations and all datasets of different input types, broken down by the different normalization methods included in the parameter exploration. The resulting accuracy is barely impacted by the choice of normalization method when the model is trained with the random forest classifier. For the other two machine learning algorithms, however, the naive total sum scaling normalization is not sufficient for optimal performance. Boxes denote the IQR across all values, with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR.
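The Bray-Curtis dissimilarities and principal coordinate projections discussed in these legends can be computed in a few lines. The sketch below is an illustrative Python version (SIAMCAT itself is an R package, and the toy relative-abundance matrix `X` is invented for illustration): pairwise Bray-Curtis dissimilarities followed by classical multidimensional scaling (PCoA) via Gower double centering.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy relative-abundance matrix: 4 samples x 3 taxa (rows sum to 1);
# samples 0-1 and 2-3 have similar compositions.
X = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
])

# Pairwise Bray-Curtis dissimilarities: sum|u_i - v_i| / sum(u_i + v_i).
D = squareform(pdist(X, metric="braycurtis"))

# Classical MDS (principal coordinate analysis) on the distance matrix:
# double-center the squared distances, then take the top eigenvectors.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
B = -0.5 * J @ (D ** 2) @ J                      # Gower's double centering
eigval, eigvec = np.linalg.eigh(B)
order = np.argsort(eigval)[::-1]                 # eigenvalues, descending
coords = eigvec[:, order[:2]] * np.sqrt(np.maximum(eigval[order[:2]], 0))
```

Compositionally similar samples end up with small pairwise dissimilarities and nearby principal coordinates, which is exactly the separation (or lack of it) that the ordination panels visualize.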
false-positive rate (FPR) in cross-validation, that is, at which cutoff a fixed fraction of the control samples would be incorrectly classified as diseased. When the trained ML model is applied to samples from other datasets, two situations can arise: either the external test set contains cases from the same disease (DIS-TE) in addition to corresponding controls (CTR-TE) (top box), or the external test set contains cases (DIS-TE) from a different disease (bottom box). In the former case, standard cross-study evaluations can be conducted (see Additional File 1: Fig. S9). For a general cross-study evaluation of model performance that is also applicable across different diseases, we introduce two additional measures. First, we calculate cross-study portability from an AUROC analysis between a true-positive rate estimated from cross-validation cases (DIS-TR) and a false-positive rate estimated from external controls (CTR-TE), which we rescale to the interval between 0 and 1 for convenience. Analogously to a standard AUC, this measure captures how well external controls (CTR-TE) can be separated from the cases contained in the cross-validation data set (DIS-TR). Low cross-study portability values indicate that there is no separation between cases and external controls, meaning that the model would show an increased false-positive rate on control samples from other datasets. Second, we calculate the prediction rate for external cases (DIS-TE) at a prediction cutoff that corresponds to a fixed false-positive rate adjusted on the cross-validation data set. For data sets with cases from the same disease, this evaluation amounts to assessing the prediction rate (of the same disease, i.e., the true-positive rate) across data sets. In contrast, if the external study is for a different disease than the cross-validation data set, this measure quantifies to which extent the model exhibits an elevated false-positive rate for other diseases. This could be due to technical differences between studies (which would also be reflected in a low cross-study portability
) or due to biological similarity between diseases (if the same microbial markers are enriched in both diseases, one would expect an elevated prediction on the other disease as well). This measure is thus a proxy for the disease-specificity of the ML model. (b), (c), and (d) illustrate the within-study and test-set predictions for selected examples from our ML meta-analysis, for transfer across data sets for the same (b) or different diseases (c), (d), with (d) presenting an extreme example of issues with both cross-study portability and disease-specificity. Numbers indicate the type of comparison and evaluation measure taken across data sets (as indicated on the right-hand side of the plots).
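The two transfer measures introduced above can be made concrete in a short sketch. This is a Python illustration, not SIAMCAT's R implementation: the rescaling of the portability AUC to the interval [0, 1] is implemented here as 2*AUC - 1 clipped at zero (one plausible reading of the legend), and the false-positive rate of 0.1 is an arbitrary example value, not taken from the source.

```python
import numpy as np

def auroc(pos, neg):
    """Probability that a random positive scores higher than a random
    negative (ties count half) - the area under the ROC curve."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def portability(scores_cv_cases, scores_ext_controls):
    """Cross-study portability: AUROC between cross-validation cases
    (DIS-TR) and external controls (CTR-TE), rescaled so that 0 means
    no separation and 1 means perfect separation (assumed rescaling)."""
    return max(2.0 * auroc(scores_cv_cases, scores_ext_controls) - 1.0, 0.0)

def prediction_rate(scores_cv_controls, scores_ext_cases, fpr=0.1):
    """Fraction of external cases (DIS-TE) called positive at the cutoff
    that yields the desired false-positive rate on CV controls."""
    cutoff = np.quantile(np.asarray(scores_cv_controls, float), 1.0 - fpr)
    return float(np.mean(np.asarray(scores_ext_cases, float) > cutoff))
```

For the same disease, `prediction_rate` is a cross-study true-positive rate; for a different disease, a high value flags a model that is not disease-specific.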

Figure S1: SIAMCAT reproduces the results of previous machine learning meta-analyses. To show how SIAMCAT can reproduce the results of previous meta-analyses, we reanalyzed the data from Duvallet et al. (a) and Pasolli et al. (b) (see references in the main text). SIAMCAT workflows were implemented to fully recapitulate the machine learning workflows as described in the respective publications, using the randomForest ML algorithm in both cases. Cross-validation performance quantified by AUC for discriminating between diseased patients and controls is indicated by diamonds with black borders (confidence intervals denoted by horizontal lines) for the SIAMCAT reproduction. The AUC values reported in the publications are indicated by diamonds without borders. The sample size of each dataset is given as an additional panel (cut at the axis limit and given by numbers instead). For all classification tasks, the reported results fall within the confidence interval of the SIAMCAT results.
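The evaluation scheme described in this legend (repeated cross-validation, out-of-fold predictions scored by AUROC, summarized with an interval) can be sketched as follows. This Python toy uses a simple nearest-centroid scorer in place of the randomForest models from the publications, and all data are simulated; it only illustrates the shape of the procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 40 samples x 5 features; cases are shifted in feature 0.
n_per_group = 20
X = rng.normal(size=(2 * n_per_group, 5))
X[:n_per_group, 0] += 1.5                       # cases
y = np.array([1] * n_per_group + [0] * n_per_group)

def auroc(scores, labels):
    pos, neg = scores[labels == 1], scores[labels == 0]
    return ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

def repeated_cv_auroc(X, y, k=5, repeats=10):
    """Repeated k-fold CV with a nearest-centroid score; returns one
    AUROC per repeat, computed on the pooled out-of-fold scores."""
    aucs = []
    for _ in range(repeats):
        idx = rng.permutation(len(y))
        scores = np.empty(len(y))
        for fold in range(k):
            test = idx[fold::k]                  # every k-th index -> folds partition the data
            train = np.setdiff1d(idx, test)
            mu1 = X[train][y[train] == 1].mean(axis=0)
            mu0 = X[train][y[train] == 0].mean(axis=0)
            # score = difference of distances to the two class centroids
            scores[test] = (np.linalg.norm(X[test] - mu0, axis=1)
                            - np.linalg.norm(X[test] - mu1, axis=1))
        aucs.append(auroc(scores, y))
    return np.array(aucs)

aucs = repeated_cv_auroc(X, y)
lo, hi = np.quantile(aucs, [0.025, 0.975])       # empirical 95% interval
print(f"AUROC {aucs.mean():.2f} [{lo:.2f}, {hi:.2f}]")
```

The interval across repeats plays the role of the horizontal confidence-interval lines in the figure.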

Figure S4: Large-scale application of the SIAMCAT machine learning workflow to human gut metagenomic disease association studies in the curatedMetagenomicData package. (a) Application of SIAMCAT machine learning workflows to taxonomic profiles generated from fecal shotgun metagenomes using MetaPhlAn2, as available from curatedMetagenomicData (Pasolli et al. 2017). Cross-validation performance for discriminating between diseased patients and controls, quantified by the area under the ROC curve (AUROC), is indicated by diamonds (confidence intervals denoted by horizontal lines), with sample size per dataset given as an additional panel (cut at the axis limit and given by numbers instead). See Table 1 and the supplementary tables for information about the included datasets and a key for disease abbreviations. (b) Application of SIAMCAT machine learning workflows to functional profiles obtained from HUMAnN2, as provided by curatedMetagenomicData (Pasolli et al. 2017), for the same datasets as in (a).

Figure S7: Classification accuracy is not impacted by choice of profiler. (a) Model accuracy as measured by AUROC is plotted for the best-performing parameter set (see Figure 4 in the main text and Additional File 1: Fig. S4) for taxonomic and functional profiles derived from the same dataset. Profiles from mOTUs and eggNOG are plotted against each other, as are profiles from MetaPhlAn and HUMAnN. Overall, the accuracies are very well correlated (Pearson's r; P-value from correlation test), indicating that taxonomic and functional profiles lead to very similar model performances across a wide range of classification tasks. Dot size is proportional to the number of samples per classification task, and dots are colored according to the disease. See Table 1 for a key of the disease abbreviations. (b) For those classification tasks that involve the same dataset and the same disease, and for which both mOTUs and MetaPhlAn profiles are available, all AUROC values from the complete parameter set exploration are shown as boxplots, with the color indicating the two different profilers. The AUROC values for the best-performing parameter set (see Figure 4 in the main text and Additional File 1: Fig. S4) are indicated by dots and triangles, respectively. Although there are differences between mOTUs and MetaPhlAn on individual datasets, there is no clear trend towards either method, indicating that the choice of taxonomic profiler does not significantly impact the resulting model accuracy (P-value from a paired Wilcoxon test on the best-performing AUROC values). Boxes denote the IQR across all values, with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR.
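The two comparisons in this legend, a correlation of best-performing AUROC values across tasks and a paired Wilcoxon signed-rank test between profilers, can be sketched like this. The AUROC vectors below are invented for illustration; in the actual analysis they come from the parameter exploration.

```python
import numpy as np
from scipy.stats import pearsonr, wilcoxon

# Hypothetical best-AUROC values per classification task for two profilers
# (one entry per task; numbers are invented for illustration).
auroc_motus = np.array([0.92, 0.75, 0.81, 0.66, 0.88, 0.71, 0.79, 0.84])
auroc_metaphlan = np.array([0.90, 0.78, 0.80, 0.69, 0.87, 0.72, 0.77, 0.85])

# Correlation across tasks (as in panel a): do the two profilers rank
# the tasks similarly?
r, p_corr = pearsonr(auroc_motus, auroc_metaphlan)

# Paired Wilcoxon signed-rank test (as in panel b): are the paired
# per-task differences systematically shifted towards one profiler?
stat, p_paired = wilcoxon(auroc_motus, auroc_metaphlan)
```

A high `r` together with a non-significant `p_paired` matches the legend's conclusion: strong agreement, no systematic advantage for either profiler.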

Figure S9: Baseline evaluation of cross-study transfer of machine learning models via AUROC and false-positive rate. (a) For the three conditions in our meta-analysis that were represented by three or more data sets each, namely colorectal cancer (CRC), Crohn's disease (CD), and ulcerative colitis (UC), we conducted a classical evaluation of the SIAMCAT models on the external datasets within the same disease using AUROC analysis. Bar height corresponds to the mean AUROC of models trained on the other datasets and evaluated on the one indicated at the bottom (points indicate individual model performances and error bars show the standard deviation). For a detailed description of the control-augmentation approach, see the main text and Methods. (b) False-positive rates are shown for models in application to data from different diseases. The evaluations on disease cases and controls are summarized in the top and bottom panels, respectively. The horizontal black line corresponds to the false-positive rate to which all models were calibrated using their respective cross-validation dataset (see also Additional File 1). While a false-positive rate below this threshold is maintained, with few exceptions, by the control-augmented models, the naively transferred models largely fail to properly control the false-positive rate on cases of different diseases as well as on the controls from these studies. See Table 1 for disease abbreviations.
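The panel (a) summary, mean and standard deviation of external AUROCs for models trained on the other datasets, amounts to averaging the off-diagonal entries of a train-by-test AUROC matrix. A minimal sketch with an invented matrix (dataset names and all values are hypothetical):

```python
import numpy as np

datasets = ["A", "B", "C", "D"]
# Hypothetical cross-study AUROC matrix: entry [i, j] is the performance
# of a model trained on dataset i and evaluated on dataset j.
auc = np.array([
    [0.90, 0.78, 0.74, 0.70],
    [0.80, 0.88, 0.76, 0.72],
    [0.75, 0.74, 0.86, 0.69],
    [0.71, 0.70, 0.68, 0.84],
])

off_diag = ~np.eye(len(datasets), dtype=bool)
means = {}
for j, name in enumerate(datasets):
    external = auc[off_diag[:, j], j]   # models trained on the *other* datasets
    means[name] = external.mean()
    print(f"{name}: mean external AUROC {external.mean():.3f} "
          f"(sd {external.std(ddof=1):.3f}, within-study {auc[j, j]:.2f})")
```

The gap between the diagonal (within-study) and the column means (cross-study) is what the bar plot in panel (a) visualizes.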

Figure S12: Naive machine learning models make a high level of false predictions when transferred across datasets. Detection rates for other diseases (red color scheme) or the same disease (green color scheme) are shown for the naive ML models. True-positive rates of the models, when applied to the training set in cross-validation, are indicated by boxes around the tiles. Detection rates over 1% are labeled. When compared to the corresponding figure in the main text, the naive ML models show dramatically higher detection rates on other diseases compared to control-augmented ML models.

Figure S13: Control-augmentation strategy generally improves model transfer without strong dependence on the type and number of control samples. Cross-study portability and prediction rate on the same disease (upper row) and on other diseases (lower row) are compared for all models between the reference method for control augmentation (as described and displayed in the main text) and variations of that approach. Here, it is important to acknowledge that "control" is not a clear concept and its definition varies greatly across studies. Nonetheless, it is useful at an operational level to enrich for asymptomatic individuals, thereby reducing bias that could result from unintended comparisons to patients with a different disease. The reference method consisted in the addition of five times the number of controls, sampled randomly in each cross-validation split from a set of three large (>250 samples) cohort studies (see Methods in the main text). The other approaches are defined as follows: The naive models are models without augmentation, shown as a baseline to visualize the improvements in model transfer achieved by control-augmentation (reference method). While cross-study portability and disease-specificity generally improve, the prediction rate on the same disease (e.g.
for different studies including cases of colorectal cancer or Crohn's disease) is sometimes reduced by control-augmentation, reflecting a general tradeoff between sensitivity and specificity. Control-augmentation with similar datasets was used for a subset of datasets in our ML meta-analysis that clustered together (all datasets included samples from the same population and were generated in the same laboratory; Yu et al. Gut 2017, Jie et al. Nat Commun 2017, He et al. GigaScience 2017, and Qin et al. Nature 2012). When training on those datasets, controls were randomly sampled from the other datasets listed above. Using similar datasets in the control-augmentation does not lead to the same improvements that are seen when using a more diverse set of studies to augment. Control-augmentation with two times the number of control samples is very similar to the reference method in that controls are sampled from the same cohort studies as in the reference method, but a lower number of controls is added (twice the number instead of five times the number of controls). The method performs very similarly, albeit slightly worse, compared to the reference method. For the control-augmentation with other datasets, we sampled twice the number of control samples from another pool of datasets (Danish samples from Nielsen et al. Nat Biotech 2014, samples from mothers in Backhed et al. Cell Host Microbe 2015, Vincent et al. Microbiome 2016, Zhu et al.
Microbiome 2018, and Poyet et al. Nat Med 2019); more data sets were necessary here to obtain a sufficient number of control samples than for the reference control-augmentation approach. We used only samples reported as controls and filtered out repeated samples from the same subject, whenever applicable. The results of this method are also similar to those of the reference approach, but lead to a further decrease in prediction on the same disease. Lastly, we used control-augmentation with random datasets, for which we randomly sampled five datasets out of the meta-analysis set and used their control samples to augment the training set. The resulting augmented model was not evaluated for model transfer on the datasets which were used for augmentation. This method behaves similarly to the control-augmentation with other datasets, with only minor differences to the reference method.
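The reference control-augmentation procedure described above (add five times the number of training controls, sampled from an external control pool, independently in each cross-validation split) can be sketched as follows. Function and variable names are hypothetical and the data are simulated; this is a Python illustration of the strategy, not SIAMCAT's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment_with_external_controls(X_train, y_train, external_controls,
                                   factor=5):
    """Control augmentation (sketch): add `factor` times the number of
    training controls, sampled at random from a pool of external
    controls and labelled as controls (y = 0). In the full procedure,
    this is applied independently within each cross-validation split."""
    n_add = factor * int((y_train == 0).sum())
    pick = rng.choice(len(external_controls), size=n_add,
                      replace=n_add > len(external_controls))
    X_aug = np.vstack([X_train, external_controls[pick]])
    y_aug = np.concatenate([y_train, np.zeros(n_add, dtype=int)])
    return X_aug, y_aug

# Toy usage: 10 cases + 10 controls, pool of 300 external controls.
X_train = rng.normal(size=(20, 5))
y_train = np.array([1] * 10 + [0] * 10)
pool = rng.normal(size=(300, 5))
X_aug, y_aug = augment_with_external_controls(X_train, y_train, pool)
```

The diverse external controls force the model to rely on disease-associated signal rather than study-specific background, which is the mechanism behind the improved portability and specificity.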
Figure S14: Datasets cluster by disease type when considering machine learning model weights or associations. (a) Principal coordinate (PCo) analysis based on Canberra distances between relative model weights for naive ML models. Each dot represents a trained model from the repeated cross-validation. Datasets are indicated by 90% density ellipses. For more convenient labeling, the CRC datasets are abbreviated by their first letter. (b) PCo analysis based on Canberra distances between relative model weights for control-augmented ML models. Each dot represents a trained model from the repeated cross-validation, and datasets are again indicated by 90% density ellipses. (c) PCo analysis based on the Canberra distances between genus-level generalized fold changes for each dataset.
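The dataset-wise clustering in Figure S14 rests on Canberra distances between per-model weight vectors. A minimal Python sketch with invented weight vectors shows how within-dataset distances end up much smaller than between-dataset distances, which is what the density ellipses in the ordination visualize.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical relative weight vectors for 6 trained models (rows) over
# 4 features; models 0-2 come from dataset "X", models 3-5 from "Y"
# (all numbers invented for illustration).
W = np.array([
    [0.50, 0.30, 0.15, 0.05],
    [0.48, 0.32, 0.12, 0.08],
    [0.52, 0.28, 0.14, 0.06],
    [0.10, 0.15, 0.45, 0.30],
    [0.12, 0.13, 0.50, 0.25],
    [0.09, 0.16, 0.48, 0.27],
])
labels = np.array(["X", "X", "X", "Y", "Y", "Y"])

# Canberra distance: sum_i |u_i - v_i| / (|u_i| + |v_i|), sensitive to
# relative differences in small weights.
D = squareform(pdist(W, metric="canberra"))

same = labels[:, None] == labels[None, :]
off = ~np.eye(len(W), dtype=bool)
within = D[same & off].mean()    # pairs of models from the same dataset
between = D[~same].mean()        # pairs of models from different datasets
# Dataset-wise clustering shows up as within << between.
```

The distance matrix `D` is what would be fed into the principal coordinate analysis for the ordination panels.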
Cross-study portability on the control portion of external studies (see Methods) is shown as a heatmap for naive models (a) and control-augmented models (b). The heatmap only includes models with an AUROC of 0.75 or higher (see the main text); values equal to or smaller than 0.5 are highlighted.