Fig. 1 | Genome Biology

Fig. 1

From: Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox

SIAMCAT statistical and machine learning approaches model differences between groups of microbiome samples. a Each step in the SIAMCAT workflow (green boxes) is implemented by a function in the R/Bioconductor package (see SIAMCAT vignettes). Functions producing graphical output (red boxes) are illustrated in b–e for an exemplary analysis using a dataset from Nielsen et al. [27], which contains ulcerative colitis (UC) patients and non-UC controls. b Visualization of the univariate association testing results. The left panel shows the abundance distributions of the microbial features that differ significantly between the groups. Significance (after multiple testing correction) is displayed in the middle panel as horizontal bars. The right panel shows the generalized fold change as a non-parametric measure of effect size [37]. c SIAMCAT offers statistical tests and diagnostic visualizations to identify potential confounders by testing for associations between meta-variables, treated as covariates, and the disease label. The example shows a comparison of body mass index (BMI) between the study groups. The similar distributions between cases and controls suggest that BMI is unlikely to confound UC associations in this dataset. Boxes denote the interquartile range (IQR) across all values, with the median as a thick black line and the whiskers extending to the most extreme points within 1.5 times the IQR. d The model evaluation function displays the cross-validation error as a receiver operating characteristic (ROC) curve, with a 95% confidence interval shaded in gray and the area under the receiver operating characteristic curve (AUROC) given below the curve. e Finally, SIAMCAT generates visualizations that facilitate the interpretation of the machine learning models and their classification performance. These include a barplot of feature importance (for penalized logistic regression models, bar width corresponds to coefficient values) for the features that are included in the majority of models fitted during cross-validation (percentages indicate the fraction of models containing each feature). A heatmap displays the normalized values of these features across all samples (as used for model fitting), along with the classification result (test predictions) and user-defined meta-variables (bottom).
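The generalized fold change shown in the right panel of b can be illustrated with a small sketch. The caption only describes it as a non-parametric effect size, so the details here are assumptions rather than SIAMCAT's own R code: the sketch takes it to be the mean difference between matching quantiles of the log-transformed abundances in the two groups. The function names, the quantile grid (0.10 to 0.90 in steps of 0.05), and the pseudocount are all illustrative choices.

```python
import math


def quantile(sorted_vals, p):
    """Quantile by linear interpolation between the two nearest ranks."""
    pos = p * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def generalized_fold_change(cases, controls, pseudocount=1e-5, probs=None):
    """Mean difference of log10-abundance quantiles (cases minus controls).

    Assumption: quantile grid and pseudocount are illustrative defaults,
    not values confirmed by the figure caption.
    """
    if probs is None:
        # 17 quantiles from 0.10 to 0.90 in steps of 0.05
        probs = [0.1 + 0.05 * i for i in range(17)]
    # The pseudocount avoids log10(0) for features absent in a sample
    log_cases = sorted(math.log10(x + pseudocount) for x in cases)
    log_controls = sorted(math.log10(x + pseudocount) for x in controls)
    diffs = [quantile(log_cases, p) - quantile(log_controls, p) for p in probs]
    return sum(diffs) / len(diffs)


# Hypothetical relative abundances of one feature in each group
controls = [0.001, 0.002, 0.004, 0.008, 0.016]
cases = [10 * x for x in controls]
gfc = generalized_fold_change(cases, controls, pseudocount=0.0)
```

Because the comparison runs across many quantiles rather than only the medians, a shift that affects just part of the distribution (for example, a feature that is absent in most controls) still contributes to the effect size. Both groups must be non-empty, and a pseudocount of zero is only valid when every abundance is positive.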
