Simulation framework (microbiome data)
We simulate samples for two groups: control (C) and treatment (T), and generate OTU counts (\( \mathbf {x}_{j}^{T}\) or \(\mathbf {x}_{j}^{C}\)) in a sample j from a Dirichletmultinomial (DM) distribution with parameters estimated from a real microbial dataset, as has been suggested in several articles [16, 17]. The real throat data, throat_v35, is subset from V35 that is provided in the R package HMP16SData [37], by taking 153 samples collected from throat and 956 OTUs (operational taxonomic units) with nonzero count in more than 25% of samples. In particular, we sample:
$$ \begin{aligned} \mathbf{x}_{j}^{C} & \sim \text{DM}\left(n_{j}, \mathbf{\alpha}^{C}\right) \\ \mathbf{x}_{j}^{T} & \sim \text{DM}\left(n_{j}, \mathbf{\alpha}^{T}\right) \\ \end{aligned} $$
(1)
where \(\mathbf {x}_{j}^{C} =\left (x^{C}_{1j}, \dots, x^{C}_{{Kj}}\right)\) and \(\mathbf {x}_{j}^{T} =\left (x^{T}_{1j}, \dots, x^{T}_{{Kj}}\right)\) are counts of K=956 OTUs in a sample j that belongs to control or treatment group, respectively; n_{j} is the total count of sample j that is randomly sampled from sequencing depths of 153 samples in throat_v35; \(\mathbf {\alpha }^{C} =\left (\alpha ^{C}_{1}, \dots, \alpha ^{C}_{K}\right)\) and \(\mathbf {\alpha }^{T} =\left (\alpha ^{T}_{1}, \dots, \alpha ^{T}_{K}\right)\) are parameters storing information about the relative abundance (proportion) and dispersion of OTUs in the control and treatment group, respectively. We estimate α^{C} using the R package dirmult [38] that reparameterizes α^{C} with \(\mathbf {\pi ^{C}} = \left (\pi ^{C}_{1},..., \pi ^{C}_{K}\right)\) and θ, where \(\pi ^{C}_{k}\) is the expected proportion of OTU k in a sample belonging to the control group, and θ is a parameter about OTU correlation. In short, \(\alpha ^{C}_{k} = \pi ^{C}_{k} \frac {(1\theta)}{\theta }\). In our simulation, θ is estimated from throat_v35 to apply in both control and treatment groups, and π^{C} and π^{T} are manipulated to create three scenarios: BS, US, and SS (see Fig. 2a and Additional file 1: Fig. S8). The simulated data (in the control group) is shown to have similar meanvariance relationship but a bit less random zeros when compared to the real data using countSimQC [39] (see Additional file 2).
In BS, signals are simulated on two randomly selected branches (A and B) by swapping their proportions in the treatment group as Eq. 2; US and SS are in Additional file 1: Supplementary Note 1.
$$ \left\{\begin{array}{ll} {\hat{\pi}}^{T}_{k} =\hat{\pi}^{C}_{k}; & k \notin {A, B}\\ {\hat{\pi}}^{T}_{k} = r\hat{\pi}^{C}_{k}; & k \in {A} \\ {\hat{\pi}}^{T}_{k} =\frac{1}{r}\hat{\pi}^{C}_{k}; & k \in {B} \end{array}\right. $$
(2)
where \(r = \frac {\sum _{k \in B} \hat {\pi }^{C}_{k}}{\sum _{k \in A} \hat {\pi }^{C}_{k}}\) is the fold change; \(\hat {\pi }^{C}_{k}\) is the estimated proportion of OTU k from throat_v35. In other words, π^{C} is estimated from throat_v35, and π^{T} is obtained based on π^{C} by changing values of selected OTUs.
Description of treeclimbR methodology
Data aggregation
Here, the aggregation is shown in Eqs. 3 and 4 for the DA and DS case, respectively. Depending on the dataset and method used in the differential analysis, the mean or median might be used instead of sum. In the DA case, counts of K entities in J samples are observed, and a tree on entities is constructed such that each entity can be mapped to a leaf. Data is aggregated in a way that the count of node i in sample j, Y_{ij} is generated as:
$$ Y_{{ij}} = \sum_{k \in b(i)} Y_{{kj}} \quad \text{and} \quad i = 1, 2,..., M; \quad j = 1, 2,..., J; \quad k = 1, 2,..., K $$
(3)
where b(i) represents the descendant leaves of node i (see tree notations in Fig. 1); M is the total number of nodes on the tree; J is the number of samples; K is the number of entities observed.
In the DS case, we have values of G features observed on each cell from J samples, and a tree about cell subpopulations (entities) is constructed such that multiple cells are mapped to a leaf. Samples are collected from different experiment conditions. The value of feature g on node (cell subpopulation) i in sample j, \(Y^{g}_{{ij}}\) is aggregated from cells as:
$$ Y^{g}_{{ij}} = \sum_{k \in (j \cap i)} Y^{g}_{k} \quad \text{and} \quad i = 1, 2,..., M; \quad j = 1, 2,..., J; \quad k = 1, 2,..., K $$
(4)
where k∈(j∩i) means that a cell k is from sample j and belongs to subpopulation i (cell k is mapped to the descendant leaves of node i, or k∈b(i)); M, J, and K correspond to the total number of nodes, samples, and cells, respectively.
Differential analysis
Differential analysis is performed at all nodes of the tree. For the parametric synthetic microbial data and AMLsim data, we use edgeR to model the count data with negative binomial distribution and obtain P values via likelihood ratio tests for the following methods: BH, HFDR, minP, StructFDR, diffcyt, and treeclimbR. miLineage has its own way to calculate P values. For nonparametric synthetic microbial datasets, the nonparametric Wilcoxon rank sum test is used to compare the taxa’s abundance between two groups, which generates P values for all benchmarked methods. For BCRXLsim, the median transformed expressions of cell state markers on each node (cell subpopulation) of the tree are compared between groups using limma [40], which generates P values for diffcyt, minP, and treeclimbR. Three real datasets (Infant gut microbiota, mouse miRNA, and mouse cortex scRNA) are all count data, and edgeR is used for the differential analysis.
The generation of candidates
Candidates are used to capture the latent signal pattern on the tree. The search for candidates is based on a U score defined as Eq. 5:
$$ \begin{aligned} q_{k}(t) &= \mathbf{sign}(\theta_{k})\mathbf{I}(p_{k} \leq t) \\ U_{i}(t) &= \left\frac{\sum_{k \in B(i)}q_{k}(t)}{n_{B}}\right \\ \end{aligned} $$
(5)
Here, q_{k}(t) is a score of node k, derived from its P value p_{k} and estimated direction sign(θ_{k}), under a tuning parameter t. When p_{k}≤t,q_{k}(t)=1 with sign(θ_{k}); otherwise, q_{k}(t)=0. The U score of node i at t, U_{i}(t), is the absolute average q scores over nodes in B(i) that includes node i and its descendant nodes. n_{B} is the number of nodes in B(i). The U score could be considered as a measure of coordinate change within a branch. It achieves 1 when a consistent pattern, which includes both signs in the same direction and P values below t, is observed, and it is close to 0 when nodes in a branch highly disagree on either the sign or P value. With a suitable t value, we might expect signal branches are in a consistent pattern while others that have P values following a uniform distribution [0,1] and directions arbitrary up or down on leaves are not. Since signal branches are unknown in reality, we cannot directly determine the value of t. To suggest different candidates of signal branches, the tree is explored by tuning t in the range [0,1] (see Additional file 1: Fig. S17).
A candidate at t is obtained using the procedure below:

1
It starts from the root and moves toward leaves along edges.

2
For each path, it stops when a node i having U_{i}(t)=1 and p_{i}<0.05 appears or the leaf is reached.
If a branch without signal by chance has the same direction, its branch node might reach U=1 at high t (e.g., t=1). In branches without signals, to keep candidate close to the leaf level, we hinder the selection of an internal node with a restriction p_{i}<0.05. This means the probability of representing a threeleaf branch, without signals, using an internal node is around 0.01, and is much lower for a larger branch. P values selected in such a procedure are unbiased at different t for branches without signal and follow a uniform distribution (see Additional file 1: Fig. S16).
If multiple features exist, the procedure is carried out separately for each feature, and the global candidate at t, C(t), is defined as:
$$ \mathbf{C}(t) = \bigcup_{g \in \mathbf{G}} C_{g}(t) $$
(6)
where C_{g}(t) is the candidate of feature g generated at t, and G includes all features.
The selection of candidates
Correction for multiple testing is performed separately on each candidate, but FDR is controlled on the leaf level by limiting t in the range as below (see Additional file 1: Supplementary Note 2 and Fig. S15).
$$ t \in [0, 2 \alpha (r1)] $$
(7)
where α is the nominal FDR; r is the average size of signal branches identified at FDR=α. The branch size is the number of leaves in a branch. If r=1, signals do not cluster on the tree, and the leaf level (t=0) should be used. In real data, r is unknown and is estimated for a candidate C(t) as:
$$ \hat{r} =\frac{l}{s} $$
where s is the number of nodes with H_{0} rejected on the candidate C(t), and l is the number of descendant leaves of those rejected nodes.
Candidates that are generated with \(t \notin [0, 2 \alpha (\hat {r_{t}}1)]\) are firstly discarded to control FDR. Those that have reported the highest number of leaves with the lowest number of nodes are then selected to increase power while keeping results as short as possible.
The preprocessing and analysis of datasets
Available methods
For LEfSe, the default settings of LEfSe that is installed with conda in python 2.7 are used. For miLineage, we have applied both onepart (miLineage1) and twopart analysis (miLineage2) using the R package miLineagev2.1. For lasso, we build lassoregularized logistic regression models, which consider values of features (e.g., abundance or expression) on all nodes of the tree as the explanatory variables, and the sample information (e.g, control or treatment group) as the response variable, with R package glmnet2.018 and chose model that gives the minimum mean crossvalidated error. For diffcyt (v1.6.0), we use diffcyt’s testDA_edgeR and testDS_limma to analyze AMLsim and BCRXLsim datasets, respectively. For StructFDR and HFDR, R packages StructFDRv1.3 and structSSIv1.1.1 are used, respectively. Inputs on nodes (e.g., P values) required by methods StructFDR, HFDR, treeclimbR, and minP (see Additional file 1: Supplementary Note 3) are estimated by edgeRv3.28.0 (treeclimbR’s runDA function) in all datasets, except that diffcyt’s testDS_limma was used in BCRXLsim datasets. Unless specified, the default settings provided in R packages are used for all methods.
Parametric synthetic microbial data
To evaluate performance of methods on different signal patterns, datasets are simulated under three scenarios (BS, US, and SS) on two randomly selected branches using the R package treeclimbR’s simData function. More simulations with varying signal branches are provided to introduce signals on branches with different characteristics (see Additional file 1: Supplementary Note 1). Due to the swap of relative abundances between branches, the absolute logFC in BS, SS, and US are 1.45, 2.26, and in the range [0.02,2.13], respectively. For each scenario, 100 repetitions that are on the same signal branches but different counts on OTUs are made. To perform DA analysis, data was aggregated using Eq. 3.
AMLsim and BCRXLsim
Datasets were downloaded from the HDCytoData [41] R package. According to cell type markers, cells were first grouped into a large number of clusters (400,900,1600 in AMLsim datasets and 100, 400, 900 in BCRXLsim datasets) using FlowSOM [42]. Then, among clusters, pairwise euclidean distances were computed using their median expressions of type markers to generate a dissimilarity matrix. Finally, the hierarchical clustering from stats’s hclust [43] was applied on the matrix to create a tree on clusters.
Infant gut microbiota data
The data was downloaded from the curatedMetagenomicData [44] package that provides uniformly processed human microbiome data. Only samples from babies were used. This includes a count matrix with 464 metaOTUs in rows and 285 samples in columns, and a phylogenetic tree that has 464 leaves (metaOTUs) and 463 internal nodes. Samples belong to four time points: 4 days (0M), 4 months (4M), and 12 month (12M). At each time point, there are 15 samples from the Csection group and about 80 samples (80 in 0M, 81 in 4M, and 79 in 12M) from the vaginal group. Data was aggregated according to Eq. 3.
Mouse miRNA data
The data is from KokkonenSimon et al. [30], and 10 samples, including 5 receiving TOC and 5 receiving Sham surgery, are used. The trimming, alignment, and quantification of miRNA sequences were processed using sports [45], which ended up with 6375 miRNA sequences with counts in more than one sample. The tree was constructed based on the origins of the miRNA sequences: the miRNAs were grouped by primary transcript using the miRBase v22.1 annotation, and primary transcripts less than 10kb apart were further grouped into genomic clusters. It has 774 internal nodes and 6375 leaves. A leaf represents a unique sequence, and an internal node represents multiple sequences that share the same biological origin on a specific level. Data was aggregated as Eq. 3, and edgeR [46] was used to compare abundance between mice receiving TOC and mice receiving Sham surgery.
Mouse cortex scRNAseq data
We followed the preprocessing done by Crowell et al. [8] that annotates cells with 8 cell types. To obtain cell type markers, expressions of genes among cell types were first compared using FindAllMarkers (from Seuratv3.1.1) separately in each vehicletreated sample to avoid selecting LPSrelated state genes. For each cell type, the top 20 genes (ranked by absolute logFC) with absolute logFC above 0.5 were then selected; We further removed markers that were only identified in one sample and finally obtained 125 marker genes. Based on 135 unique marker genes (13 canonical type marker genes and 125 computationally identified marker genes), a tree that encodes information of cell subpopulations at different resolutions was constructed using Seurat’sFindClusters (resolution at 6) and BuildClusterTree. The tree has 66 leaves, each of them representing a cell subpopulation. To perform DS analysis, data was aggregated as Eq. 4.