A

The gene expression matrix, signature vector, and gene categories (not shown in Figure 1) are entered

B

For each category E_{
i
}, the genes in the microarray (X) are split into those in E_{
i
}and those not in E_{
i
}(denoted, respectively, by two blue and 12 green rows of the gene expression matrix)

C

A negative control set consisting of C × n_{
i
}genes in X but not in E_{
i
}is selected at random. Those genes are noted as elevated rows in Figure 1, which was arbitrarily drawn for C = 3. The integer constant C is used to mitigate issues of statistical imprecision associated with small categories. The default value C = 6 creates better runtorun reproducibility and was used in the actual LeFE calculations

D

A composite matrix of gene expression X_{
iter
}is assembled by selecting the rows in X that correspond to the n_{
i
}genes in E_{
i
}along with the negative control genes selected in step C. The resulting data structure, X_{
iter
}, has dimension (C × n_{
i
}) + n_{
i
}rows by n_{
s
}columns

E

A random forest with 400 trees is built on X_{
iter
}to model the signature vector, using the default value for the random forest parameter mTry. The random forest is given no information for distinguishing between category genes and negative control genes

F

A vector I of standardized importance scores of each gene in X_{
iter
}is computed internally from the random forest. I is then divided into two sets of importance scores, I_{
E
}and I_{
notE
}, for the genes in E_{
i
}and the negative control set, respectively

G

The statistical significance of the gene expression evidence for rejection of the null hypothesis that the mean of I_{
E
}is less than or equal to the mean of I_{
notE
}(that the category genes are, on average, more important to the Random Forest model than the negative control genes) is determined by a onesided permutation ttest. Because that test compares observed tstatistics with a null distribution of tstatistics of permuted data, it avoids using the parameterized tdistribution and is therefore nonparametric. For statistical robustness, steps C to G are repeated n_{
r
}times with different, randomly selected sets of negative control genes. The covariate structures of the E and notE genes are likely to differ, but the negative control genes, selected at random from notE, are unbiased with respect to the ranking of categories in the next step and with respect to the calculation of false discovery rates

H

Applying the above procedure to a single gene category creates three outputs. The first (denoted i, in Figure 1) is a median importance score for each of the n_{
i
}genes in the category. Because the median importance scores are computed within the category's multivariate Random Forest models, they reflect the genes' importance within its complex biological context. The second output (denoted ii) is the entire category's median P value from the n_{
r
}permutation ttests. The third output (denoted iii) is an importance plot, which compares the distributions of importance scores of the genes in the category and the negative control genes
