mi-Mic: a novel multi-layer statistical test for microbiota-disease associations

mi-Mic, a novel approach for microbiome differential abundance analysis, tackles the key challenges of such statistical tests: a large number of tests, sparsity, varying abundance scales, and taxonomic relationships. mi-Mic first converts microbial counts to a cladogram of means. It then applies a priori tests on the upper levels of the cladogram to detect overall relationships. Finally, it performs a Mann-Whitney test on paths that are consistently significant along the cladogram or on the leaves. mi-Mic has much higher true-to-false-positive ratios than existing tests, as measured by a new real-to-shuffle positive score.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-024-03256-0.

1. Import miMic and additional packages.
    from mimic_da import apply_mimic
    import pandas as pd
2. Load the raw ASVs table in the following format:
• The first column is named "ID".
• Each row represents a sample and each column represents an ASV.
• The last row contains the taxonomy information, named "taxonomy".
    df = pd.read_csv("example_data/for_process.csv")
• Note: 'for_process.csv' is a file that contains the raw ASVs table in the required format; an example file is available in the 'example_data' folder on GitHub.
3. Load a tag table as csv, such that the tag column is named "Tag".
    tag = pd.read_csv("example_data/tag.csv", index_col=0)
• Note: 'tag.csv' is a file that contains the tag table in the required format; an example tag file is available in the 'example_data' folder on GitHub.
4. Specify a folder to save the output of the miMic test.
    folder = "example_data/2D_images"
• Note: '2D_images' is a folder that will be created in your current working directory, and the output of the miMic test will be saved there.
• MIPMLP is applied with its default parameters; see the 'Note' section below for details.
• correct_first: whether to apply FDR correction to the starting taxonomy level according to the sis parameter [True, False]. The default is True.
• save: whether to save the corrs_df of the miMic test to the computer [True, False]. The default is True.
- In "test" mode the default value is "None".
- In "plot" mode the tax is set automatically to the selected taxonomy of the "test" mode [1, 2, 3, "noAnova"].
- "noAnova", where the a priori nested ANOVA test is not significant.
- "nosignificant", where the a priori nested ANOVA test is not significant and miMic did not find any significant taxa in the leaves. In this case, the post hoc test will not be applied.
• colorful: whether to apply colorful mode on the plots [True, False]. The default is True.
• threshold_p: the threshold for significant values. The default is 0.05.
• THRESHOLD_edge: the threshold for having an edge in the "interaction" plot. The default is 0.5.
• processed: the processed data from the previous step. The default is None.
• apply_samba: whether to apply SAMBA or not (Boolean). The default is True.
• samba_output: if you already have SAMBA outputs, miMic will read them from the folder you specified; otherwise, miMic will apply SAMBA and samba_output is set to None.
    if processed is not None:
        taxonomy_selected, samba_output = apply_mimic(folder, tag, eval="man", ...)
        if taxonomy_selected is not None:
            apply_mimic(folder, tag, mode="plot", tax=taxonomy_selected, eval="man", sis='fdr_bh', samba_output=samba_output, save=False, threshold_p=0.05, THRESHOLD_edge=0.5)
7. Note: if apply_samba is set to True, miMic will apply samba-metric. If save is set to True, the output will be saved to the folder you specified. See SAMBA PyPi https://pypi.org/project/samba-metric/ for more explanations.
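The value sis='fdr_bh' used above is the name commonly given to the Benjamini-Hochberg FDR procedure. As an illustration of what such a correction does to a list of p-values, here is a minimal pure-Python sketch. It is not miMic's internal code; the function name benjamini_hochberg and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Illustrative sketch only; not taken from the miMic source code.
    """
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four taxa.
p = [0.001, 0.02, 0.03, 0.6]
adj = benjamini_hochberg(p)
# Compare against the significance threshold (threshold_p defaults to 0.05).
significant = [q <= 0.05 for q in adj]
```

In this example the first three taxa remain significant after correction and the fourth does not.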
1.3 miMic output
miMic will output the following:
• If save is set to True, SAMBA's outputs and the following 'csv' files will be saved to your specified folder:
- corrs_df: a data frame containing the results of the miMic test (including Utest results).
- just_mimic: a data frame containing the results of the miMic test without the Utest results.
- u_test_without_mimic: a data frame containing the results of the Utest without the miMic results.
- miMic&Utest: a data frame containing the joint results of miMic and Utest tests.
• If mode is set to "plot", plots will be saved in the folder named 'plots' in your current working directory. The following plots will be saved:
1. tax_vs_rp_sp_anova_p: Bar plot illustrating the taxonomy levels in the miMic test vs. the number of significant findings in a real run (RP) shown in blue, and in a shuffled run (SP) shown in red. The highest bar plot represents the actual RP vs. SP of the selected taxonomy level of miMic combined with the leaves test as explained in the Methods. Taxonomy levels used for the a priori nested ANOVA test are shaded in grey. The number of RP significantly exceeds the number of SP (see Fig. 6 A).
2. rsp_vs_beta: Representation of the RSP(β) score as a function of the confidence level beta. An RSP score of 1 indicates the presence of only RP without any SP (see Supp. Mat. Fig. S7).
3. hist: Histogram of the distribution of logged abundances within each level of taxonomy on the cladogram of means. Different line styles and line weights are assigned to each taxonomy level for distinction (see Supp. Mat. Fig. S6).
4. corrs_within_family: Analysis of significant positive and negative relations within taxonomic families. The y-axis displays significant families in the cohort (defined by a family that has at least 1 significant descendant), while the x-axis shows the count of positive relations within a family in blue or the count of negative relations within a family in red. Each family is colored according to its color in the interaction network in Fig. 6 (B) and the cladogram of correlations in Fig. 5 (see Fig. 6 C).
5. interaction: Interaction between significant taxa found in miMic. Each taxon is colored according to its significant family color, similar to Fig. 5 above. Each node shape represents the taxon's order. An edge is drawn between two nodes if their Spearman correlation coefficient (SCC) is above 0.3 (user-adjustable) and its p-value < 0.05. The width of the edge corresponds to its SCC. A blue edge represents a positive relation, while a red edge represents a negative one (see Fig. 6 B).
6. correlations_tree: Differential abundance analysis results are visualized on a cladogram for the IBD cohort. Each color represents the sign of the Mann-Whitney score (blue for positive scores, red for negative scores, and grey for non-significant taxa). The node size corresponds to -log10(p-value) from the Mann-Whitney test in miMic. The node shape represents its origin of significance: spheres were identified by both miMic and the Mann-Whitney test on leaves, circles were identified by miMic only, and squares were identified by only the Mann-Whitney test. The colors represent the taxonomic family of each node (see Fig. 5).
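The edge rule of the "interaction" plot can be sketched as a simple filter over precomputed pairwise correlations. This is an illustration, not miMic's plotting code: the function name interaction_edges, the input layout, and the taxon names are hypothetical, and comparing the absolute SCC with the threshold is an assumption made here so that negative (red) edges can pass the filter.

```python
def interaction_edges(correlations, scc_threshold, p_threshold=0.05):
    """Select edges for an interaction network.

    correlations maps a pair of taxa to (SCC, p-value).
    Assumption: the threshold is applied to |SCC|, so that strong
    negative relations are kept as well as strong positive ones.
    """
    edges = []
    for (taxon_a, taxon_b), (scc, p_value) in correlations.items():
        if abs(scc) > scc_threshold and p_value < p_threshold:
            edges.append({
                "nodes": (taxon_a, taxon_b),
                "width": abs(scc),                     # edge width follows the SCC
                "color": "blue" if scc > 0 else "red"  # sign of the relation
            })
    return edges


# Hypothetical correlations between four significant taxa.
example = {
    ("taxon_a", "taxon_b"): (0.72, 0.001),
    ("taxon_a", "taxon_c"): (-0.55, 0.01),
    ("taxon_b", "taxon_c"): (0.20, 0.30),   # too weak and not significant
    ("taxon_c", "taxon_d"): (0.80, 0.20),   # strong but not significant
}
edges = interaction_edges(example, scc_threshold=0.5)
```

Only the first two pairs survive the filter: one blue edge and one red edge.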
2 How to apply miMic via the micrOS website - Running Example

How to apply miMic
1. Load the raw ASVs table via the "Select OTU table" button (see below) in the following format:
• The first column is named "ID".
• Each row represents a sample and each column represents an ASV.
• The last row contains the taxonomy information, named "taxonomy".
2. Load a tag table as csv via the "Select tag file" button (see below), such that the tag column is named "Tag".
3. Tick the checkbox of the miMic option (see below).
4. Choose the running parameters (see image below). For an explanation of each parameter, see step 6 of "How to apply miMic" in the section "How to apply miMic via PyPi - Running Example".
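Before uploading, the format requirements of steps 1 and 2 can be checked locally. The following is a minimal sketch using only the Python standard library; the helper names check_asv_table and check_tag_table and the tiny example tables are hypothetical and are not part of miMic or the micrOS website.

```python
import csv
import io


def check_asv_table(csv_text):
    """Return a list of format problems found in a raw ASVs table."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ["table is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    if header[0] != "ID":
        problems.append('first column must be named "ID"')
    if not body or body[-1][0] != "taxonomy":
        problems.append('last row must be named "taxonomy"')
    return problems


def check_tag_table(csv_text):
    """Return a list of format problems found in a tag table."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [] if "Tag" in header else ['tag column must be named "Tag"']


# Hypothetical two-sample, two-ASV table in the required layout.
asv_csv = (
    "ID,ASV1,ASV2\n"
    "S1,10,0\n"
    "S2,3,7\n"
    "taxonomy,k__Bacteria;p__Firmicutes,k__Bacteria;p__Bacteroidetes\n"
)
tag_csv = "ID,Tag\nS1,0\nS2,1\n"

asv_problems = check_asv_table(asv_csv)
tag_problems = check_tag_table(tag_csv)
```

Both lists are empty for the example tables; a table whose first column is not "ID" or whose last row is not "taxonomy" yields a message for each violated rule.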

miMic output
miMic will output the following:
• corrs_df: a data frame containing the results of the miMic test (including Utest results).
• The following plots will be presented:
1. tax_vs_rp_sp_anova_p: Bar plot illustrating the taxonomy levels in the miMic test vs. the number of significant findings in a real run (RP) shown in blue, and in a shuffled run (SP) shown in red. The highest bar plot represents the actual RP vs. SP of the selected taxonomy level of miMic combined with the leaves test as explained in the Methods. Taxonomy levels used for the a priori nested ANOVA test are shaded in grey. The number of RP significantly exceeds the number of SP (see Fig. 6 A).
2. rsp_vs_beta: Representation of the RSP(β) score as a function of the confidence level beta. An RSP score of 1 indicates the presence of only RP without any SP (see Supp. Mat. Fig. S7).
3. hist: Histogram of the distribution of logged abundances within each level of taxonomy on the cladogram of means. Different line styles and line weights are assigned to each taxonomy level for distinction (see Supp. Mat. Fig. S6).
4. corrs_within_family: Analysis of significant positive and negative relations within taxonomic families. The y-axis displays significant families in the cohort (defined by a family that has at least 1 significant descendant), while the x-axis shows the count of positive relations within a family in blue or the count of negative relations within a family in red. Each family is colored according to its color in the interaction network in Fig. 6 (B) and the cladogram of correlations in Fig. 5 (see Fig. 6 C).
5. interaction: Interaction between significant taxa found in miMic. Each taxon is colored according to its significant family color, similar to Fig. 5 above. Each node shape represents the taxon's order. An edge is drawn between two nodes if their Spearman correlation coefficient (SCC) is above 0.3 (user-adjustable) and its p-value < 0.05. The width of the edge corresponds to its SCC. A blue edge represents a positive relation, while a red edge represents a negative one (see Fig. 6 B).
6. correlations_tree: Differential abundance analysis results are visualized on a cladogram for the IBD cohort. Each color represents the sign of the Mann-Whitney score (blue for positive scores, red for negative scores, and grey for non-significant taxa). The node size corresponds to -log10(p-value) from the Mann-Whitney test in miMic. The node shape represents its origin of significance: spheres were identified by both miMic and the Mann-Whitney test on leaves, circles were identified by miMic only, and squares were identified by only the Mann-Whitney test. The colors represent the taxonomic family of each node (see Fig. 5).
Click on each plot to download it to your computer.
3 Supplementary Figures
Fig. S3: Within-study differential abundance consistency analysis across multiple tools. The percentage of total significant features is plotted against the number of tools that identified the feature as significant. The total number of significant features identified by each tool is provided in the legend.
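The quantity plotted in Fig. S3 can be sketched as follows: for each feature, count how many tools call it significant, then report what percentage of all significant features falls at each count. This is a minimal illustration; the function name consistency_profile, the tool names, and the feature names are hypothetical, and the sketch is not the analysis code behind the figure.

```python
from collections import Counter


def consistency_profile(significant_by_tool):
    """Map 'number of tools' to the percentage of significant features
    that were identified by exactly that many tools."""
    tools_per_feature = Counter()
    for features in significant_by_tool.values():
        tools_per_feature.update(set(features))
    total = len(tools_per_feature)
    if total == 0:
        return {}
    features_per_count = Counter(tools_per_feature.values())
    return {k: 100.0 * v / total for k, v in sorted(features_per_count.items())}


# Hypothetical significant features reported by three tools.
example_tools = {
    "tool_a": {"f1", "f2", "f3", "f4"},
    "tool_b": {"f1", "f2"},
    "tool_c": {"f1"},
}
profile = consistency_profile(example_tools)
```

Here four features are significant in at least one tool: two of them in one tool only, one in two tools, and one in all three.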
Fig. S4: Cross-study consistency analysis of differential abundance. The percentage of significant species is plotted against the number of studies where each species was identified as significant, conducted on five inflammatory bowel disease (IBD) cohorts. Observed results for each tool are depicted in the tool color, while the expected results are presented in black (see Methods). Additionally, a parallel analysis on shuffled labels is provided for the ANCOM-BC2 model (green) within (P). The models' performance exceeds that of the expected random model. However, certain tools, such as ANCOM-BC2, exhibit artificially consistent results, as indicated in (P). Note that LINDA-C's plot is empty since there were no common significant species between the studies.

4 Supplementary Tables

    processed = apply_mimic(folder=folder, tag=tag, mode="preprocess", preprocess=True, rawData=df, taxnomy_group='sub PCA')

Fig. S1: Microbiome distributions over 2 IBD cohorts (referred to as IBD, A-C, and Jacob, D-F) under different preprocessing schemes (grouping and normalization): mean grouping + log normalization (A, D), Sub-PCA grouping + log normalization (B, E), and mean grouping + relative normalization (C, F).

Fig. S5: Sensitivity robustness assessment. A-B. The heatmaps illustrate Spearman correlation coefficients (SCCs) between each generic dataset characteristic and the percentage of significant taxa identified by each tool per dataset in 16S (A) and WGS (B) cohorts. C-D. The heatmaps illustrate Spearman correlation coefficients (SCCs) between each generic dataset characteristic and the RSP(1) score per dataset in 16S (C) and WGS (D) cohorts. Positive correlations are depicted in red, while negative correlations are shown in blue. Stars indicate a significant correlation (p-value < 0.05). miMic demonstrates robustness across all tested generic features in 16S and WGS datasets.

Fig. S6: Histogram of the distribution of logged abundances within each level of taxonomy on the cladogram of means. Different line styles and line weights are assigned to each taxonomy level for distinction. This histogram is based on the IBD cohort.

Fig. S7: Representation of the RSP(β) score as a function of the confidence level beta on the IBD cohort. An RSP score of 1 indicates the presence of only RP without any SP.
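To make the shape of such a curve concrete, the sketch below evaluates a score of the form RP / (RP + β·SP) over several values of β. This functional form is an assumption chosen here because it equals 1 whenever SP is 0, in line with the caption; the exact definition of RSP(β) is given in the Methods and may differ. The function name rsp_score and the counts are hypothetical.

```python
def rsp_score(rp, sp, beta=1.0):
    """Real-to-shuffle positive score under an ASSUMED form:
    RP / (RP + beta * SP). See the Methods for the actual definition.
    """
    denominator = rp + beta * sp
    if denominator == 0:
        # No positives in either the real or the shuffled run.
        return float("nan")
    return rp / denominator


# Hypothetical counts: 30 real positives and 5 shuffled positives.
betas = [0.5, 1.0, 2.0, 5.0]
curve = [(beta, rsp_score(30, 5, beta)) for beta in betas]
```

Under this assumed form the score decreases as β grows, since a larger β penalizes shuffled positives more heavily, and it stays at 1 for any β when SP is 0.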

• Note: MIPMLP is a package that is used to preprocess the raw ASVs table; see MIPMLP PyPi https://pypi.org/project/MIPMLP/ or the MIPMLP website https://mip-mlp.math.biu.ac.il/Home for more explanations. If you have your own processed data, set preprocess to False, and use your processed data as input for the processed parameter in the next step.

Table S3: P-values of statistical comparison between models and datasets by two-way ANOVA followed by a one-sided t-test between the two best models.