Skip to main content

Table 4 Steps in the LeFE algorithm

From: The LeFE algorithm: embracing the complexity of gene expression in the interpretation of microarray data

Step

Details

A

The gene expression matrix, signature vector, and gene categories (not shown in Figure 1) are entered

B

For each category E i , the genes in the microarray (X) are split into those in E i and those not in E i (denoted, respectively, by two blue and 12 green rows of the gene expression matrix)

C

A negative control set consisting of C × n i genes in X but not in E i is selected at random. Those genes are noted as elevated rows in Figure 1, which was arbitrarily drawn for C = 3. The integer constant C is used to mitigate issues of statistical imprecision associated with small categories. The default value C = 6 creates better run-to-run reproducibility and was used in the actual LeFE calculations

D

A composite matrix of gene expression X iter is assembled by selecting the rows in X that correspond to the n i genes in E i along with the negative control genes selected in step C. The resulting data structure, X iter , has dimension (C × n i ) + n i rows by n s columns

E

A random forest with 400 trees is built on X iter to model the signature vector, using the default value for the random forest parameter mTry. The random forest is given no information for distinguishing between category genes and negative control genes

F

A vector I of standardized importance scores of each gene in X iter is computed internally from the random forest. I is then divided into two sets of importance scores, I E and I notE , for the genes in E i and the negative control set, respectively

G

The statistical significance of the gene expression evidence for rejection of the null hypothesis that the mean of I E is less than or equal to the mean of I notE (that the category genes are, on average, more important to the Random Forest model than the negative control genes) is determined by a one-sided permutation t-test. Because that test compares observed t-statistics with a null distribution of t-statistics of permuted data, it avoids using the parameterized t-distribution and is therefore nonparametric. For statistical robustness, steps C to G are repeated n r times with different, randomly selected sets of negative control genes. The covariate structures of the E and notE genes are likely to differ, but the negative control genes, selected at random from notE, are unbiased with respect to the ranking of categories in the next step and with respect to the calculation of false discovery rates

H

Applying the above procedure to a single gene category creates three outputs. The first (denoted i, in Figure 1) is a median importance score for each of the n i genes in the category. Because the median importance scores are computed within the category's multivariate Random Forest models, they reflect the genes' importance within its complex biological context. The second output (denoted ii) is the entire category's median P value from the n r permutation t-tests. The third output (denoted iii) is an importance plot, which compares the distributions of importance scores of the genes in the category and the negative control genes

  1. Shown are the steps in the Learner of Functional Enrichment (LeFE) algorithm, with the letters in the left-hand column corresponding to those in Figure 1.