LinDA: linear models for differential abundance analysis of microbiome compositional data

Differential abundance analysis is at the core of statistical analysis of microbiome data. The compositional nature of microbiome sequencing data makes false positive control challenging. Here, we show that the compositional effects can be addressed by a simple, yet highly flexible and scalable, approach. The proposed method, LinDA, only requires fitting linear regression models on the centered log-ratio transformed data, and correcting the bias due to compositional effects. We show that LinDA enjoys asymptotic FDR control and can be extended to mixed-effect models for correlated microbiome data. Using simulations and real examples, we demonstrate the effectiveness of LinDA. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-022-02655-5).


Background
The role of the human microbiome in health and disease has been intensively studied over the past few years, see, e.g., [1,2], for several reviews. Potentially pathogenic or probiotic microorganisms can be identified by analyzing their abundances in a microbial ecosystem (e.g., the human gut) with respect to some covariate of interest such as disease status.
Current prevailing technologies for studying the human microbiome use metagenomic sequencing, where either the DNA of a taxonomically informative gene (e.g. 16S rRNA) or all the genomic DNA in the microbial genome is sequenced. After obtaining the raw sequencing reads, the reads can be clustered into operational taxonomic units (OTUs), denoised into amplicon sequence variants (ASVs), or mapped to a microbial reference database (taxa) using existing bioinformatics pipelines such as UPARSE, DADA2, and MetaPhlAn [3,4,5]. For simplicity, we use the term taxon (pl. taxa) to represent any taxonomic unit (OTU/ASV/taxon) from a bioinformatics pipeline. Therefore, after bioinformatics processing, we have an abundance table recording the frequencies of detected taxa in the samples, together with a meta data table capturing the sample-level information. Differential abundance analysis is then carried out based on the abundance and meta data table.
Ideally, we want to measure the absolute abundance of the microorganisms, i.e., the number of microorganisms per unit area/volume at the microbial ecosystem, and differential abundance analysis is performed on the absolute abundance data. However, the data from a sequencing experiment only captures the relative abundance (compositional) information since the total sequence read count, also known as sequencing depth or library size, does not reflect the total microbial load in the specimen due to the complex chemistry involved in sequencing [6,7]. Although there are several experimental techniques such as qPCR, spike-in and flow cytometry that can be used to achieve absolute abundance measurement, the severe limitations of these techniques prevent their wide adoption [8].
Therefore, the prevailing sequencing protocol is still only able to measure the relative abun-dances. Drawing inferences about the changes on the unknown absolute abundance based on the measured relative abundance data is challenging due to missing the total microbial load information. The increase or decrease in the abundance of some taxa with respect to a covariate of interest automatically results in changes in the relative abundances of all other taxa, a statistical phenomenon known as compositional effects. Therefore, using the standard statistical techniques such as two-sample t-test, Wilcoxon rank sum test, and linear regression analysis ignoring the compositional nature of the data could lead to a large number of false discoveries.
For the differential abundance problem to be well defined, one has to make assumptions.
One major assumption is that the differential signal is sparse, i.e., only a small proportion of taxa are associated with the covariate of interest. Although many studies have supported the sparse signal assumption, there are also studies support dense signal hypotheses, where a large number of taxa are differential with small effect sizes [9, 10]. Therefore, the validity of a method and the definition of true or false positive depends on the specific assumption one is willing to accept. Here our goal is to provide a statistical tool that could be potentially useful for pinpointing top candidate taxa for further biological validation.
To address compositional effects in differential analysis, one popular approach is robust normalization. It involves calculating a normalizing factor (scale factor), which is robust to a small number of differential taxa and could well capture the sequencing effort for the non-differential part. Therefore, dividing by such a normalizing factor will bring the abundance of the non-differential taxa to the same scale while retaining the differences for those differential ones. Assuming the number of differential taxa is small, different strategies have been used to calculate a robust normalizing factor including TMM, RLE, CSS, and GMPR [11,12,13,14]. We list these methods in Additional file 1: Table S1. In contrast, the naive total sum scaling (TSS) normalization, which divides the counts by the library size, is not a robust normalization method [14].
These normalization techniques can be combined with different statistical procedures in differential abundance analysis. For example, we can divide the counts by the normalizing factor from the normalization techniques in Additional file 1: Table S1 and then apply standard statistical tools based on the normalized data. The normalizing factor could also be included as an offset in regression models such as edgeR [15], DESeq2 [16], MicrobiomeDDA [17], and metagenomeSeq [13], where the TMM, RLE, GMPR, and CSS normalization are the accompanying normalization methods. The recently developed MaAsLin2 [18] uses log linear models on the normalized abundance data. Different normalization approaches including TSS, TMM, CSS, and CLR are options in MaAsLin2. A variant to the robust normalization approach is to find a reference taxon or a set of reference taxa, which are assumed to be non-differential with respect to the covariate of interest. The data are then normalized by the count of the reference taxon (or the sum of the counts of the reference taxa). This strategy was used in RAIDA [19] and DACOMP [20].
Another line of methods to tackle the compositional effect uses (log) ratio approach since only ratios are well defined for compositional data [21]. The ALDEx2 method by [22] uses the centered log-ratio (CLR) transformation, where the counts of a sample are divided by their geometric mean before taking logarithms. Differential abundance analysis is then performed using Wilcoxon rank sum test or t-test based on the CLR transformed data. In the CLR approach, the geometric mean can also be regarded as a robust normalizing factor. The ANCOM proposed by [23] computes the pairwise ratios of the relative abundances and identifies the taxa with the most differential ratios. This is based on the observation that the abundance ratios for those differential taxa to other taxa are all differential assuming distinct effect sizes while the ratios for those non-differential taxa are mostly non-differential. Therefore, by analyzing the pattern of the pairwise ratios, one could distinguish the differential taxa from a background of non-differential taxa with high accuracy. Recently, a bias-corrected version of ANCOM (called ANCOM-BC) has been proposed [24], which uses a linear regression framework based on log-transformed taxa count and estimates the unknown bias term due to the compositional effect through an EM algorithm.
In the work of [25,26], the authors evaluated several popular methods in differential abundance analysis (ANCOM-BC/MaAsLin2 not included) and showed that the inflation of the false discovery rate (FDR) is still a ubiquitous problem, and no method is satisfactory in all aspects. A method that is computationally efficient, relatively robust and powerful, and flexible enough to allow covariate adjustment and application to correlated microbiome data is still lacking in the field. In this paper, we propose a linear regression framework for differential abundance analysis (LinDA) to fill the methodological gap. LinDA involves three simple steps that can be carried out efficiently. First, it runs linear regressions using the CLR transformed abundance data as the response. Then it identifies a bias term due to the compositional effect and corrects for the bias using the mode of the regression coefficients across different taxa. Finally, it computes the p-values based on the bias-corrected regression coefficients and applies the Benjamini-Hochberg (BH) procedure to control the FDR. We rigorously prove the asymptotic FDR control of the proposed method, making it the first procedure that enjoys a theoretical FDR control guarantee. Our approach is related to ANCOM-BC but differs in several aspects. (i) Our derivation provides a clear interpretation of the bias term and suggests a simple way to correct it. (ii) Our procedure does not involve the EM-algorithm and can be 100-1000 times faster than ANCOM-BC in our numerical studies. (iii) Our method can be directly extended to the mixed-effect models. Longitudinal and repeated measurement-based microbiome studies have been increasingly common [27, 28] but statistical tools for correlated microbiome data analysis remain scarce. LinDA can analyze the correlated microbiome data using the classic linear mixed effects models. Through extensive simulation studies and real data analyses, we show that the new method outperforms the state-of-the-art approaches in terms of FDR control and power.

Numerical studies
Setups We conducted comprehensive simulations to evaluate the performance of the proposed method under different setups. We set m = 500 as the baseline for the number of taxa, which is similar to the number of tests at the species level for a typical microbiome study. We investigated the sample size n = 50, 200 representing small and large sample sizes, respectively. More combinations of m and n were studied in additional settings. We generated the absolute abundances from the log normal distribution and considered three cases for the covariate of interest and confounders: the covariate of interest follows the Bernoulli distribution and no confounder (denoted as C0), the covariate of interest follows the standard normal distribution and no confounder (C1), and the covariate of interest follows the Bernoulli distribution and two confounders (C2). In addition to the basic setting (log normal abundance distribution, denoted as S0), we investigated other settings to study the robustness of the proposed method including zero-inflated absolute abundances (S1), correlated absolute abundances (S2), gamma abundance distribution (S3), smaller m (S4), smaller n (S5), 10-fold difference in library size (S6), negative binomial abundance distribution (S7) and correlated microbiome data generated by mixed-effect model (S8). See section 5.6 for more details.
Competing methods We compared our method with ANCOM-BC, ALDEx2, DESeq2, edgeR, metagenomeSeq and MaAsLin2. For DESeq2 and edgeR, we replaced their native normalization methods with GMPR normalization, which was shown to improve the power and false positive control in differential abundance analysis [14]. For metagenomeSeq, there are two implementations, fitZig and fitFeatureModel, in the R Bioconductor package metagenomeSeq. Currently, fitFeatureModel is only applicable to binary covariate case (C0). We use metagenomeSeq2 and metagenomeSeq to denote the fitFeatureModel and fitZig procedures, respectively. We also compared with the standard non-parametric methods: Wilcoxon rank sum test for case C0 (a binary covariate) and Spearman correlation test for case C1 (a continuous covariate), both with the GMPR normalized data.
For the proposed method, we considered two zero-handling approaches. The first approach adds a pseudo count of 0.5 to all the counts, which is widely used in microbiome data analysis on the log scale. However, it has been shown to be problematic under certain situations [20]. We thus designed a new imputation-based approach, where the zeros were imputed by N s /(max k:Y ik =0 N k ) for ith taxon in sth sample, where N s denotes the library size (sequencing depth) of sth sample and Y ik denotes the read count of ith taxon in kth sample. In other words, zeros were imputed differently according to the library size of the sample, and zeros in the sample with a larger library size were replaced with larger fractions. In this imputation approach, we treat zeros as left-censored missing data. Suppose we only know the library sizes, then a natural strategy is to impute zeros in proportion to the library size with the sample of the largest library size receiving a fractional count close to 1 (in our approach, we simply set it as 1). The purpose of the imputation strategy is to reduce false positives when the library size is correlated with the covariate of interest.
As shown in the simulation studies, the pseudo-count approach worked sufficiently well in most settings except the setting S6, where the library size between the groups differed by 10 folds. In contrast, the imputation approach reduced the false positive rate extensively for the setting S6 (Additional file 2: Fig. S1). However, it was slightly less powerful than the pseudo-count approach when the library size was a not confounder (Additional file 2: Fig. S2). Thus, in the implementation, we used an adaptive approach: we first tested the association between the covariate of interest and the library size. If the p-value was smaller than 0.1, we used the imputation approach conservatively; otherwise, we used the pseudo-count approach. Additional file 2: Fig. S1 and S2 show that the adaptive method controls the false positives when the library sizes are very different among groups while retaining the power when the library sizes are similar.
The proposed LinDA method can be viewed as a three-step procedure: CLR-OLS-BC (OLS stands for ordinary least squares and BC stands for bias correction), which can be   easily extended to the linear mixed-effects model using CLR-LMM-BC (LMM stands for linear mixed-effect model). In the setting S8 (correlated microbiome data), we compared CLR-LMM-BC to CLR-OLS-BC, CLR-OLS, and CLR-LMM to demonstrate the utility of LinDA for correlated microbiome data analysis.

Results
We use S0C0 (log normal abundance distribution, a binary covariate) to denote the setting S0 (log normal abundance distribution) with the covariate design C0 (a binary covariate) and likewise for other setups. For S0, we studied all the three covariate designs (C0-C2), and for S1-S8, we only performed C0 for demonstration. We found that We first point out that MaAsLin2 with CLR normalization is essentially the same as the CLR-OLS procedure we described earlier. Additional file 2: Fig. S3 compares LinDA, CLR-OLS, MaAsLin2-TSS, MaAsLin2-TMM, MaAsLin2-CSS and MaAsLin2-CLR under the setting S0C0 (log normal abundance distribution, a binary covariate). We can see that MaAsLin2-CLR is close to CLR-OLS, both of which suffer from FDR inflation. We included MaAsLin2 with its default configuration (i.e., TSS normalization) in our comparisons below. both methods control the FDR around the target level, and ANCOM-BC is slightly more powerful than LinDA when the sample size is small. However, under C1 (Additional file 2: Fig. S4), LinDA controls FDR at the target level at both sample sizes while ANCOM-BC has slight FDR inflation when the sample size is small. LinDA is also slightly more powerful than ANCOM-BC at a small sample size. The Wilcoxon rank sum test based on GMPR normalized data and MaAsLin2 perform well under C0 (a binary covariate, Fig. 1) with slightly inflated FDR at larger effect sizes and reasonable power across settings. In contrast, for a continuous covariate (C1, Additional file 2: Fig. S4), the Spearman rank correlation test and MaAsLin2 have large FDR inflation when the signal is dense. When there are confounders (C2, Additional file 2: Fig. S5), Wilcoxon has severe FDR inflation when the sample size is large due to its inability to adjust for confounders, while MaAsLin2 provides acceptable results as under C0. ALDEx2 is a conservative method, which offers the strongest FDR control but is much less powerful. metagenomeSeq2 performs well when the signal is sparse but fails to control the FDR when the signal is dense. We also studied the effect of zero inflation and the correlations among taxa (S1C0 and S2C0, Additional file 2: Fig. S6 and S7), where we observed similar patterns such that LinDA and ANCOM-BC had overall the best performance among the compared methods.
Since LinDA assumes a log normal distribution of the absolute abundance, it is interesting to evaluate its performance when the log normal assumption is violated. We thus simulated the absolute abundance data using a gamma distribution (S3C0), and the results are depicted in Fig. 2. It shows that LinDA controls the FDR close to the target level and has the highest power. When the signal is dense (20%), ANCOM-BC has a noticeable FDR inflation while ALDEx2, metagenomeSeq2, MaAsLin2 and Wilcoxon have severe FDR inflation when the signal is dense.
With a smaller number of taxa (m = 50, S4C0, Additional file 2: Fig. S8), ANCOM-BC controls the FDR and the power is also high. LinDA is the most powerful but it has slight FDR inflation. metagenomeSeq2, MaAsLin2 and Wilcoxon control the FDR but are less powerful in the case of sparse signal. However, when the signal is dense, they could not control the FDR properly. When the sample size is very small (n = 20 or 30, S5C0), LinDA controls the FDR around the target level and maintains high power (Additional file 2: Fig. S9). ANCOM-BC and metagenomeSeq2 have large FDR inflation and the inflation seems to increase as the sample size gets smaller. MaAsLin2 and Wilcoxon are much less powerful and ALDEx2 has virtually no power. Under the setting S6C0, where the sequencing depth differs by 10 folds, only ALDEx2 and our proposed method with adaptive zero-handling approach are able to control the FDR (Additional file 2: Fig. S10).
LinDA achieves a better performance in both the FDR control and power than ALDEx2.
It is interesting that ALDEx2 performs better under S6C0 than under other settings. We point out here that when we implemented ANCOM-BC, we disabled its zero treatment.
To further investigate whether its zero treatment option improves its performance, we also run the procedure enabling its zero treatment (zero cut = 0.9, lib cut = 1000, struc zero = TRUE), and found the results were very similar (S6C0, Additional file 2: Fig. S11).
Under the previous simulation settings, we found that DESeq2 and edgeR had the worst false positive control (Additional file 2: Fig. S20-S28). As the two methods assume negative binomial distribution for the counts, it is interesting to see their performance when the data are generated by their assumed model (S7C0). Additional file 2: Fig.   S29 shows that DESeq2 and edgeR (and metagenomeSeq) remain to have serious FDR inflation, indicating that the normalization approach to address the compositional effect is not sufficient. In contrast, LinDA and ANCOM-BC perform the best among competitors as in other settings, and ANCOM-BC achieves higher power than LinDA (Additional file 2: Fig. S12).
Finally, we applied LinDA to correlated microbiome data (S8C0), where the other competing methods except MaAsLin2 are not applicable to correlated samples. Additional file 2: Fig. S13 and S14 compare the methods CLR-LMM-BC (LinDA-LMM), CLR-OLS-BC (LinDA-OLS), CLR-LMM, CLR-OLS and MaAsLin2 for correlated data. In the scenario of comparing the pre-treatment and post-treatment samples (S8.1, Additional file 2: Fig.   S13), we could clearly see that ignoring the bias tremendously increases the FDR level especially under dense signals (LinDA-LMM vs CLR-LMM). In addition, LinDA-LMM is more powerful than LinDA-OLS due to its ability to exploit the correlation between preand post-treatment samples. Under the replicate sampling setting (S8.2, Additional file 2: Fig. S14), we see that the LinDA-OLS has significant FDR inflation by treating the replicate samples as independent ones. In contrast, LinDA-LMM controls the FDR at the target level. MaAsLin2 control the FDR under both settings but is less powerful than LinDA-LMM.
Based on the presented simulation settings, we summarize that LinDA and ANCOM-BC have overall the most robust performance among the methods evaluated. However, ANCOM-BC is computationally intensive. As shown in Table 1, LinDA could be 100-1000 times faster than ANCOM-BC, making LinDA a highly scalable method in practice.
In addition, the extension of LinDA to the mixed-effect models is easily carried out and performs well. with the study ID 1460 and 524. All the datasets have binary phenotypes. Antibiotics use is the confounder for the IBD dataset (p = 0.03 and OR = 0) while sex is the confounder for the SMOKE dataset (p = 0.02 and OR= 2.26). They will be adjusted in methods that are capable of covariate adjustment. We excluded samples with less than 1000 read counts and taxa which appear in less than 10% of the samples. The basic characteristics for the four filtered datasets are summarized in Table 2. We compared the detection power as well as their overlap patterns for LinDA, ANCOM-BC, ALDEx2, metagenomeSeq2, MaAsLin2 and Wilcoxon. Specifically, we compared the number of discoveries at different FDR levels (0.01-0.25) and used UpSet plot [34] to show the overlap at the target FDR of 0.1. We used winsorization at quantile 0.97 to reduce the impact of potential outliers as recommended in [17].

Real data applications
Results For the CDI dataset, LinDA or MaAsLin2 made the most discoveries at different FDR levels (Fig. 3). At 10% FDR, LinDA discovered eight and MaAsLin2 discovered six taxa associated with CDI. In contrast, ANCOM-BC, ALDEx2, and Wilcoxon discovered three while metagenomeSeq2 discovered two. As discussed in [29], subjects with CDI were more likely to have the bacterial family Lachnospiraceae and Erysipelotrichaceae. LinDA found one more taxon belonging to Lachnospiraceae than other methods (blue bars in Fig. 4). Besides, LinDA, MaAsLin2 and Wilcoxon found one differential taxon belonging to Erysipelotrichaceae while the other three methods did not identify any (orange bar in Finally, we applied LinDA-LMM and MaAsLin2 to the SMOKE dataset, where each subject has two replicate samples from the throat. The aim is to identify smoking-associated taxa adjusting for the sex. To account for the correlation between the replicate samples, we included a subject-level random intercept in LinDA-LMM. As a comparison, we also applied LinDA-OLS and MaAsLin2 to the right and left throat samples separately. LinDA-OLS based on the left or right throat samples alone discovered 12 and 15 differential taxa at 10% FDR. When both left and right samples were used in LinDA-LMM, 21 differential taxa were identified, covering the majority of the taxa identified based on the left or right throat samples alone (Fig. 4). In addition, LinDA-LMM detected five taxa that were missed by analyzing the left or right samples separately. Compared to MaAsLin2, LinDA-LMM discovered seven more differential taxa. Therefore, LinDA-LMM provides a convenient way to analyze correlated microbiome datasets and enjoys the power improvement by analyzing all samples together.
To visualize the results, LinDA provides a function to generate the effect size plot and volcano plot for differential taxa. Additional file 2: Fig. S16-S19 display the effect size plots of differential taxa at FDR level of 0.1 and volcano plots for the four datasets, respectively.
In the effect size plots (Additional file 2: Fig. S16A-S19A), the taxa in black are detected by LinDA and taxa in red are detected solely by LinDA. In Additional file 2: Fig. S16A (CDI), the taxa in blue are missed by LinDA but detected by one or more other methods.
In Additional file 2: Fig. S17A (IBD) and S18A (RA), the taxa in blue are missed by LinDA but detected by two or more other methods. No taxa are detected by MaAsLin2 but missed by LinDA-LMM for the SMOKE dataset. Based on the effect size plots for the CDI, IBD and RA datasets, we can see that, for the taxa solely detected by LinDA, the effect sizes tend to be underestimated without bias correction and bias correction improves the power in these cases. On the contrary, for the taxa missed by LinDA, the effect sizes tend to be overestimated without bias correction. In addition, we observe that the differential taxa for the IBD, CDI and RA datasets are more unbalanced (i.e., more negative or positive associations) while the differential taxa for the SMOKE dataset are relatively balanced (i.e., similar numbers of negative and positive associations). Indeed, the effect size plots, where we plot both the debiased and un-debiased coefficients, revealed larger biases for the IBD, CDI, and RA datasets.

Discussion
Differential abundance analysis is at the core of the statistical analysis of microbiome data.
Microbiome data are compositional in nature and all we know are the relative abundances, making the identification of differentially abundant taxa at the ecological site particularly challenging [6, 7]. Numerous differential abundance analysis methods have been proposed focusing on addressing the compositional effects [15,16,17,13,19,20,22,23,24]. Among all the competing methods, ANCOM-BC is the state-of-the-art method, it has been demonstrated to be more robust and powerful than the competing methods. However, there are two drawbacks of ANCOM-BC. First, it is computationally intensive for large-scale microbiome datasets such as the AmericanGut dataset. Due to the huge inter-subject variation, large-scale microbiome studies have been increasingly popular, resulting in larger sample sizes. On the other hand, metagenomic sequencing has become increasingly deeper to have a high-resolution view of the microbiome, leading to an unprecedented number of new microbial features. To meet the analysis need for such large-scale datasets, a computationally efficient tool is much needed. Secondly, ANCOM-BC is not applicable to correlated/clustered microbiome datasets such as those from family/longitudinal microbiome studies or studies with paired and repeated measurements [27,28]. Longitudinal microbiome analysis, which enables the study of the trajectory of the microbiome as well as controls for potential confounders, has been increasingly employed in human microbiome studies. Unfortunately, statistical tools for longitudinal microbiome studies are scarce. In contrast, LinDA is computationally efficient since it only involves fitting regular linear regression models and could be easily scaled to hundreds of thousands of taxa. Moreover, the extension of LinDA to linear mixed effects models (LMM) is straightforward and we have highly efficient tools such as the R lme4 package [35] for fitting LMM. Therefore, differential abundance analysis of correlated/clustered microbiome datasets could be easily performed using LinDA. Our framework also gives more insights into the CLR-based approach, which has been widely used in compositional data analysis [21]. However, the bias of CLR regression models has not been formally recognized to our best knowledge. Our framework justifies the use of CLR regression and provides a solution to correct the bias associated with CLR regression.
In the simulation, we found that Wilcoxon rank sum test and MaAsLin2 showed similar FDR/power curves and performed fairly well in most settings. As we mentioned earlier, MaAsLin2 is based on log linear models on the normalized count data, and it is essentially a two-sample t-test when no confounders are included, which explains why MaAsLin2 is close to Wilcoxon rank sum test. However, when we simulated an even stronger compositional effect by drawing the differential taxa from the top 25% most abundant taxa, we found that Wilcoxon rank sum test and MaAsLin2 began to break down (Additional file 2: Fig.   S15). ANCOM-BC was overall robust and powerful, but it had inflated type I error at small sample sizes. metagenomeSeq2 did not perform well when the signal was dense and was generally less powerful than ANCOM-BC and LinDA. ALDEx2 was the most conservative method: its strong FDR control was at the expense of statistical power. LinDA was as competitive as ANCOM-BC in most settings. It had better FDR control than ANCOM-BC when the sample size was small or the covariate of interest was continuous. However, LinDA had some FDR inflation when the number of taxa was small. Under a very strong compositional effect (Additional file 2: Fig. S15), LinDA also showed some FDR inflation but overall it had the best FDR and power trade-off.
When the library size was associated with the covariate of interest, all existing methods had severe type I error inflation. Fortunately, such association is detectable and if we see a significant association, rarefaction should be used for those methods. Although rarefaction controls the effect of uneven library sizes, it discards a significant portion of the reads and thus loses much information in the data. When there are many samples with small library sizes, the users have to decide whether to retain more reads or more samples. In LinDA, we implemented a heuristic imputation method, where the imputed values are proportional to the library sizes. This procedure makes the imputed data after CLR transformation independent of the library size and substantially reduces the inflated type I error due to library size confounding.
Although the presented simulation settings could give basic insights into the performance of various methods, such model-based simulations might not be able to capture the full characteristics of the real microbiome data. It is very likely that the performance of the compared methods will change using a different simulation framework. Moreover, our simulation strategy purposely creates strong compositional effects, where all differential taxa show the same direction of change. Such setting is used to test the limit of the various methods in addressing the compositional effects. However, in real data, the compositional effects may not be always strong, and the FDR inflation of many methods could be very moderate. Therefore, a future benchmarking study, which uses real data-based simulation strategy and investigates all biologically plausible differential settings, is much needed to have a comprehensive and objective evaluation of existing differential abundance analysis methods.
As for all model-based approaches, LinDA has several assumptions and limitations.
First, LinDA relies on the assumption that there is a mode at 0 for the regression coefficients (Condition (vi) in Theorem 1). This assumption is easy to be met if the signal is sparse. In the simulation, we show that when the signal density is around 20%, LinDA is still very robust. However, when the signal is extremely dense, LinDA could fail. Second, LinDA assumes a log linear model on the absolute abundance. Although this is a reasonable assumption, which has been widely adopted in the analysis of abundance data, the interaction between the host and the microbiome could be more complex than the simple log linear relationship. Analysis of the residuals from the CLR regression could provide evidence about whether the assumption is reasonable. If the model assumption is violated, a permutation test or transformation of the variables may be performed. Finally, although LinDA provides asymptotic FDR control, its finite-sample FDR control is not guaranteed.
Based on numerical simulations, we found that LinDA performed well under small sample sizes. However, we did observe some FDR inflation under a small feature size due to inefficiency in mode estimation with few features. Therefore, we do not recommend applying LinDA to datasets with small feature sizes (e.g., m < 50) such as phylum-level abundance data.
LinDA uses the relative abundance data and does not model the sampling variability of the read counts. This could reduce the statistical power for those less abundant taxa, whose sampling variability is larger than those abundant taxa. To remedy the power loss, another multinomial sampling layer could be imposed on top of LinDA. However, the computational complexity will be increased significantly and breaking the simplicity of LinDA. Another Besides microbiome data, LinDA could be applied to other sequencing data such as RNA-Seq data since all sequencing data are compositional in nature [37]. Thus, LinDA could be an alternative for differential expression analysis if there are strong compositional effects, for example, when the highly abundant genes are differential with the same direction of change.
Finally, we comment that addressing compostionality is more relevant when analyzing individual microbial features such as differential abundance analysis, since the major interest to biomedical investigators is to find those truly differential features ("driver") instead of those driven by the compositional effect ("passenger"). However, for community-wide analysis such as distance-based analysis [38,39], addressing the compositionality may not be necessary in order to control the type I error. This is because that compositional effect is only relevant under the alternative hypotheses. Considering compositionality in the community-wide analysis has also been found to have small effects on the statistical power [40,25]. Additionally, in microbiome-based predictive models [41], the relative abundances and/or their ratios could already be informative features for prediction and addressing compositionality may not necessarily increase the prediction accuracy significantly. Therefore, whether to address compositionality depends on the specific problems.

Conclusions
In summary, we proposed LinDA for differential abundance analysis of microbiome compositional data. LinDA identified a bias associated with traditional linear regression models based on CLR transformed abundance data and proposed a strategy to estimate and correct the bias. LinDA can be extended to linear mixed effects model for analysis of correlated microbiome data. As a general methodology, LinDA can be applied to differential abundance analysis of other high-dimensional compositional data.

Setup
We use C, C 1 , and C 2 to denote positive constants, which can be different from line to line.
As summarized in the background, there are two ways to tackle the compositional effects in differential abundance analysis, namely normalization and log-ratio transformation. In this paper, we adopt the CLR transformation and develop a bias-correction procedure to address the compositional effects. Denote the absolute abundance and the observed read count of the ith taxon in the sth sample by X is and Y is , respectively. For the sth sample, the total read count of all taxa, N s = m i=1 Y is , is determined by the sequencing depth and DNA materials. Given N s , it is natural to model the stratified count data over m taxa through a multinomial distribution as where e is denotes the estimation error, which is expected to diminish as N s gets large.

OLS estimation
We consider the log linear model on the absolute abundance where c s = (c s1 , ..., c sd ) is the d-dimensional covariates to be adjusted, u s is the covariate of interest, and is is the error term. Our goal is to discover taxa that are differentially abundant with respect to u s . Statistically, we want to simultaneously test the following m (2) and (3), the CLR-transformed data satisfies the following linear model (4), we can see that the OLS estimator for α based on the CLR transformed data is biased with the bias where z s = (u s , 1, c s ) . We respectively let var z (·) and cov z (·, ·) denote the variance and covariance computed conditional on z 1 , ..., z n . It can be shown that whereρ is the (1, 1)th element of (n −1 n s=1 z s z s ) −1 .

Bias correction
In many applications, it is reasonable to assume that there is only a small portion of differential taxa, i.e., most α i 's are equal to 0. Under this assumption, asα i is an unbiased estimator forᾱ i = α i −ᾱ, the mode ofα i is expected to be close to −ᾱ. This observation motivates us to estimate −ᾱ by In (6), K is a non-negative even function with ∞ −∞ K(y)dy = 1, and h is the bandwidth parameter. Under some regular conditions, we have as m, n → ∞ (see the supplementary material for the proof). Therefore, one can estimate α i by the bias-corrected estimatorα i =α i +α.

Testing procedure
To construct a statistic for testing H 0,i , we need to find a proper estimator for the variance ofα i . To this end, we note that Since var z (α i ) isρσ 2 i /n, it dominates var z (α) and cov z (α i ,α) as n, m → ∞ under mild conditions. Thus, we estimate the variance ofα i byρσ 2 i /n. As shown in the next section, the studentized statistic T i := √ nα i / ρσ 2 i is asymptotically normal. However, for small sample, we found that t distribution provides a better approximation to the sampling distribution of T i . We define the p-value for testing H 0,i as where F n−d−2 (·) denotes the cumulative distribution function of t distribution with n−d−2 degrees of freedom. Based on the p-values in (7), we can use the BH procedure to control the FDR. The above discussion leads to the following Algorithm 1.
Algorithm 1 Linear models for differential abundance analysis (LinDA) 1.
Step 1: Run OLS based on the CLR transformed observations and calculateα i and σ 2 i as in (5).

3.
Step 3: Calculate the p-values as in (7) and run the BH procedure.
Remark 1. Built upon the linear regression framework, our method could be easily extended to the mixed-effect model: where γ i is the random effect and r s is the corresponding design. Mixed effects can be used to analyze correlated microbiome data from studies involving replicates or spatial sampling as well as family-based and longitudinal microbiome studies. We suggest using the R function lmer to estimate the parameters for the CLR-transformed data. Denote byα i,lmer ,σ 2 i,lmer , and df i,lmer the estimations forᾱ i , the variance ofα i,lmer , and the degrees of freedom ofα i,lmer from the lmer function. We compute the bias-corrected estimateŝ α i,lmer =α i,lmer +α lmer , whereα lmer is obtained as the same procedure used in (6). Similarly, we let T i,lmer =α i,lmer /σ i,lmer and p i,lmer = 2F df i,lmer (−|T i,lmer |). The BH procedure on p i,lmer is finally used to control the FDR.
Remark 2. Compared to the existing methods based on either normalization or CLR transformation, our method is computationally much more efficient and can be easily scaled to problems with tens of thousands of taxa. Table 1 compares the computation time of LinDA and ANCOM-BC based on simulated datasets. We observe that our method is 100-1000 times faster than ANCOM-BC. We also tested on a massive dataset of the similar scale of the AmericanGut project [42] (m = 5000 and n = 10000). ANCOM-BC completed the analysis in 85 minutes compared to 28 seconds for our method (see the column of S0C0 in Table 1). Large-scale microbiome studies have been increasingly common to overcome the large inter-subject variability, making our method practically useful for the analysis of big microbiome datasets.

Asymptotic FDR control
Suppose the target FDR controlling level is q. The BH procedure is equivalent to finding Here I denotes the indicator function. To show the asymptotic FDR control as m, n → ∞, we take a Bayesian perspective by assuming that the parameters α i 's are independently generated from a common distribution. The key result is summarized in the following theorem and technical details can be found in the supplementary note. (ii) σ i 's are i.i.d. and P(C 1 < σ i < C 2 ) = 1.
(vi) Denote by f n (·; a) the density function of √ nα i + √ aε is for any a > 0. For large enough n, the density f n (·; ρ) has a unique mode at 0, i.e., arg max x∈R f n (x; ρ) = 0; for any > 0, .
There exists t 0 such that for large enough n, Under the above conditions, we have Conditions (i)-(v) help prove the consistency of the variance estimators and the mode of the regression coefficients. By assuming that the errors follow the normal distributions (Condition (iii)), we can integrate all the relevant covariate information in a single parameterρ, which facilitates the establishment of the consistency of the kernel density estimation and hence the estimator of mode. In the simulation studies, we also investigated the scenario of non-normal distribution. We use an example to illustrate Condition (vi). In particular, we assume that √ nα i follows a discrete distribution with P( √ nα i = a n,l ) = π l for l = 0, 1, where a n,0 = 0, a n,1 = 0, π l > 0, and π 0 + π 1 = 1. To reflect the sparsity, π 0 is set to be 0.8. We choose a n,1 = 2 and 5 representing weak and strong signals, respectively.
We consider two cases for the error variance:

Detailed setups for numerical studies
The differential taxa were randomly drawn from the entire set. In particular, let H i = 0 if the ith taxon is differentially abundant and H i = 1 otherwise. The underlying truth H i was generated from We simulated two levels of signal density (i.e., percentage of differential taxa) γ = 5%, 20%, roughly corresponding to sparse and dense signals. We assumed that the baseline absolute and correspondingly the absolute abundance X is were draw based on represents the coefficients of the confounders, i = 1, . . . , m. Let The observed OTUs data were simulated by Multinomial(N s , π 1s , . . . , π ms ).
To create a power curve, we included six effect sizes labeled as {1, 2, ..., 6} in the figures.
We made the effect sizes have the same signs for differential taxa (i.e., the differential taxa have the same direction of change), creating a relatively strong compositional effect. Since low-abundance taxa have much less statistical power, we up-weighted their effects so that the power will not be dominated by those abundant ones. Specifically, for a randomly drawn differential taxon i, we set where µ is equally spaced on [1.05, 2] andπ is /n. We considered three cases for the covariate and confounders: In the above, c s1 is specified to independently follow the Rademacher distribution and c s2 ∼ i.i. d. N (0, 1). The corresponding coefficients of the confounders β i ) , i = 1, . . . , m, were independently generated from a 2-dimensional normal distribution with mean (1, 2) and variance I 2 , where I 2 denotes the 2 by 2 identity matrix.
The parameters β (0) i , σ 2 i , and N s were generated based on the estimation for a real dataset (COMBO) from the study of the gut microbiota in a general population [45], which consists of 98 samples and 6674 taxa. We only used its 500 most abundant taxa. Since β (0) i and σ 2 i were not directly estimable using the relative abundance data, we estimated β We assume that the library size for each sample follows the negative binomial distribution where the mean and dispersion parameters were estimated based on the combo data. The resulting sparsity (percent of zeros) of the count matrix is around 65%-75%.
In addition to the basic setting (S0, log normal abundance distribution), we designed seven other settings to study the robustness of the proposed method. Specifically, on top of S0 and C0 (a binary covariate), we studied S1. zero-inflated absolute abundances. The microbiome data contains excessive zeros and many zeros in the microbiome data can be explained by insufficient sampling [46] since majority of the taxa are of low-abundance. However, it is also possible that zeros are due to physical absence of the taxa [47]. To study the effect of zero inflation on differential abundance analysis, we randomly forced 30% of the absolute abundance data to be 0.
S2. Correlated absolute abundances. Existing differential abundance analysis methods assume independence among taxa. However, in practice, taxa are interconnected forming networks [48]. It is interesting to see if the methods compared are robust to the correlations among the taxa. In this setting, we simulated block-correlation structure by dividing the 500 taxa into 25 equal-sized blocks. Within each block, we further divided the block into 2 by 2 sub-blocks and simulated equal positive correlations (0.5) within each sub-block and equal negative correlations (−0.5) between the two sub-blocks. This mimics the scenario that there are mutualistic relationships within the group and competitive relationships between groups.
S3. Gamma abundance distribution. Although the log normal distribution has been widely used for modeling species abundance data, other models such as gamma distribution are also possible [49]. We thus did additional simulation studies using the . Similarly, we estimated η S5. Smaller n. In pilot microbiome studies, the sample sizes are usually small. It is interesting to study the performance of the methods at a much smaller sample size.
We studied n = 20, 30, and used the same effect size as n = 50.
S6. 10-fold difference in library size. When the microbiome samples are not fully randomized in sequencing, it is likely that samples of the two groups end up in two separate sequencing runs leading to very different library sizes for the two groups.
Since the presence/absence of a taxon strongly depends on the library size, the differential library size will confound the two-sample comparison, especially for those rare taxa [25]. To create differential library sizes, we generated the library size from N s ∼ NB(5000, 5.3) and N s ∼ NB(50000, 5.3) for the two groups, respectively.
S7. Negative bionomial abundance distribution. DESeq2 and edgeR assume negative binomial distribution for the counts, thus we included one more simulation setting, where we generated the counts from the negative binomial distribution. Let X i ). Similarly, we estimated the κ where r s has one element equal to 1 and all the others equal to 0 indicating the subject ID of sample s. Each element of γ i follows N (0, τ 2 i ) independently, where we let τ 2 i = a i σ 2 i with a i ∼ Unif([0, 1]).

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
LinDA is implemented as the linda function in the CRAN R package MicrobiomeStat
[21] Aitchison J. (1986). The statistical analysis of compositional data. Chapman and Hall.       Supplementary notes for "LinDA: linear models for differential abundance analysis of microbiome compositional data" S1 Normalization approaches  Note that the first number in the square brackets represents the reference number in the manuscript.

S2 Technical details
In the following, we use F X (·) to denote the cumulative distribution function of a random Then for any 0 < t 0 < ∞, Proof of Lemma S1. Proof of Lemma S2. Throughout the proof, we shall assume that ε is /σ i is C-sub-Gaussian, which is indeed slightly weaker than Condition (iii). For any λ > 0, we have Thusε is conditional on {σ i } is sub-Gaussian by Condition (ii). Letθ i = (ᾱ i ,β i ) and and for any δ > 0, For the first term, we have For the second term, it can be shown that with δ 1 > 0 being a small enough constant. In the above, the last inequality is due to the condition σ min {E(z s z s )} > C and Lemma S8 of [50, Zhou et al., 2021]. We conclude that |σ 2 i −σ 2 i | has an exponential tail of the order O(e −C 1 n ) by using the Chernoff bound and the fact that the product of two sub-Gaussian variables is sub-exponential [51, Vershynin, 2018]. Thus by the union bound and Condition (ix), we have max i |σ 2 i −σ 2 i | = o P (1). Observing that we obtain the desired result that max i |σ 2 Proof of Lemma S3. We have andη is the first row of (n −1 n s=1 z s z s ) −1 . We first prove that U = o P (1). Using similar arguments as in the proof of Lemma S1, we have |η − η| = o Pn (1), where η is the first row where denotes the Hadamard product (element-wise product). The above implies that It is not hard to see that mode for any a, which may be related to m but is independent of i. We then have Therefore, we only need to show thatM : Given Condition (vi), we have that for large enough n, and then it boils down to show that Note that for any a > 0, where φ(·) denotes the density function of the standard normal distribution.
It implies that f n (x; a) is uniformly continuous and bounded uniformly over n and a > C.
For (S2), note that U i 's have the same distribution as √ρ ε is and are independent given The inverse Fourier transformation provides After plugging this expression into the definition of f m,h , it shows that where the last equality is because k(u) is even. This result further implies Using the above inequality, the Cauchy-Schwartz inequality, and Euler's identity (i.e., |e ıx | = 1), it shows that the left hand side of (S2) satisfies where the result of converging to 0 is due to Conditions (vii) and (viii). Therefore, (S2) is satisfied, which completes the proof.
Proof of Lemma S4. In the following, we focus on showing sup 0<t<t 0 |m −1 S m,n (t) − S ∞,n (t)| = o P (1). The proof of the second statement can be obtained by similar arguments, and thus is omitted.
The goal is to show Recall in the proof of Lemma S3, we have for any positive constants δ, δ 1 , and δ 2 , where the last step is due to ρ > C, σ i > C, and the results from Lemmas S1-S3. Thus we only need to show that for any δ > 0 and > 0, there exist ξ > 0, δ 1 = 0 and δ 2 = 0 such that for large enough m, and sup n, |ρ−ρ|<ξ, 0<t<t 0 First, (S4) is a direct result of applying the Glivenko-Cantelli theorem [52, Wainwright, 2019]. For (S5), we note that the cumulative distribution function of E + α i / aσ 2 i /n for any a > 0, denoted by G n (·; a), can be expressed as where Φ(·) represents the cumulative distribution function of the standard normal distribution. Thus G n (x; a) is equicontinuous uniformly over n and a > 0. In other words, for any > 0, there exists a δ > 0 such that sup n, a>0, which verifies the (S5). Further, sup n, |ρ−ρ|<ξ, |x|<t 0 |G n (x;ρ) − G n (x; ρ)| can be arbitrarily small as long as ξ is small enough, which confirms (S6).
Proof of Theorem 1. Observe that where S m,n (t) and S ∞,n (t) are defined in Lemma S4. Together with Lemma S4 and Condition (x), we deduce that there exists some t 0 such that t * < t 0 for large enough n, The conclusion follows by using Lemma 8.3 of [53, Cao et al., 2021].