Fig. 5 | Genome Biology

Fig. 5

From: Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox

Control augmentation improves ML model disease specificity and reveals shared and distinct predictors. a Schematic of the control augmentation procedure: control samples from external cohort studies are added to the individual cross-validation folds during model training. Trained models are applied to external studies (of either a different or the same disease) to determine cross-study portability (defined as maintenance of type I error control on external control samples) and cross-disease predictions (i.e., false detection of samples from a different disease). b Cross-study portability compared between naive and control-augmented models, showing consistent improvements due to control augmentation. c Boxplots depicting cross-study portability (left) and prediction rate for other diseases (right) of naive and control-augmented models (see Fig. 1 for the definition of boxplots). d Heatmap showing prediction rates for other diseases (red color scheme) and for the same disease (green color scheme) for control-augmented models on all external datasets. True-positive rates of the models from cross-validation on the original study are indicated by boxes around the tiles. Prediction rates over 10% are labeled. e Principal coordinate (PCo) analysis of models based on the Canberra distance between model weights. Diamonds represent the mean per dataset in PCo space across cross-validation splits, and lines show the standard deviation. f Visualization of the main selected model weights (predictors corresponding to mOTUs, see the “Methods” section for the definition of cutoffs) by genus and disease. Absolute model weights are shown as a dot plot on top, grouped by genus (including only genera with unambiguous NCBI taxonomy annotation). Below, the number of selected weights per genus is shown as a bar graph, colored by disease (see e for color key). Genus labels at the bottom include the number of mOTUs with at least one selected weight, followed by the number of mOTUs in the complete model weight matrix belonging to the respective genus.
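
The quantities named in this caption can be illustrated with a minimal Python sketch. It is not the SIAMCAT implementation (SIAMCAT is an R package), and every function name, score, and weight below is hypothetical. It assumes that a decision cutoff is fixed on the training-study controls at a chosen false-positive rate (10% here, an assumption not stated in the caption), that the same cutoff is then applied to external controls (cross-study portability) and to samples from another disease (cross-disease prediction rate), and that models are compared via the Canberra distance between their weight vectors.

```python
from typing import Sequence


def threshold_at_fpr(control_scores: Sequence[float], max_fpr: float = 0.1) -> float:
    """Cutoff such that at most max_fpr of the given controls score strictly above it."""
    ranked = sorted(control_scores, reverse=True)
    # Number of controls allowed above the cutoff, capped to stay within the list.
    allowed = min(int(max_fpr * len(ranked)), len(ranked) - 1)
    return ranked[allowed]


def positive_rate(scores: Sequence[float], cutoff: float) -> float:
    """Fraction of samples whose score is strictly above the cutoff."""
    return sum(s > cutoff for s in scores) / len(scores)


def canberra(u: Sequence[float], v: Sequence[float]) -> float:
    """Canberra distance; features that are zero in both vectors contribute nothing."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


# Hypothetical prediction scores (higher = more disease-like).
train_controls = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
external_controls = [0.05, 0.95, 0.3, 0.2]
other_disease_cases = [0.95, 0.99, 0.1, 0.2]

cutoff = threshold_at_fpr(train_controls, max_fpr=0.1)

# False-positive rate on external controls: low values mean the model stays portable.
external_fpr = positive_rate(external_controls, cutoff)

# Prediction rate on another disease: low values mean the model is disease-specific.
cross_disease_rate = positive_rate(other_disease_cases, cutoff)

# Hypothetical weight vectors of two models over the same three features.
weight_distance = canberra([1.0, 0.0, -2.0], [1.0, 0.0, 2.0])
```

In this sketch, control augmentation would correspond to training with additional external control samples before computing the scores; the evaluation step shown here stays the same for naive and control-augmented models.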
