Fig. 2 | Genome Biology

Fig. 2

From: Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome


Optimizing and training a multi-layer perceptron. We used a number of features to predict chromatin factor binding in each bin. These include (a) expression score (Fig. 1d, e), (b) the number of training cell types with binding of that chromatin factor, (c) chromatin accessibility, (d) PhastCons genomic conservation in placental mammals, and (e) any sequence motif corresponding to that TF in the JASPAR database. In JASPAR, some chromatin factors have no sequence motifs, while others have up to seven different sequence motifs. This led to a number of features p ∈ [4, 11], excluding features from HINT footprints or CREAM peaks not used in the main model. (f) For each chromatin factor, we trained a multi-layer perceptron using these features for selected bins in four chromosomes (5, 10, 15, and 20). Specifically, we selected bins with accessible chromatin or ChIP-seq signal in at least one training cell type (selected regions marked with vertical blue bars are for illustration purposes). To optimize hyperparameters, we repeated the training process with different hyperparameters using fourfold cross-validation, excluding one chromosome at a time. For each chromatin factor, we performed a grid search over (g) activation function (sigmoid, tanh, and rectifier), (h) number of hidden units per layer (2(p+1), 50, or 100), (i) number of hidden layers (2, 5, 10, or 50), and (j) L2 regularization penalty (0.0001, 0.001, or 0.01). We chose the quadruple of hyperparameters which resulted in the highest mean Matthews correlation coefficient (MCC) over all four chromosomes. (k) Schematic of the matrix of input features for training the multi-layer perceptron. (l) Cross-validation MCC for chr5 as a function of the number of hidden layers, for several other hyperparameter values. Size indicates the number of hidden units within each layer. Shape indicates activation function: logistic (circle) or rectifier (triangle). Color indicates L2 regularization penalty: 0.01 (turquoise) or 0.0001 (orange)
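
The grid search described in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`hyperparameter_grid`, `mcc`, `select_hyperparameters`), the `fit_predict` callback, and the `bins_by_chrom` data layout are all hypothetical. The sketch enumerates the 3 × 3 × 4 × 3 = 108 hyperparameter quadruples listed in the caption, holds out one chromosome at a time, and keeps the quadruple with the highest mean MCC.

```python
from itertools import product
from math import sqrt
from statistics import mean


def hyperparameter_grid(p):
    """All quadruples from the caption; p is the number of input features."""
    activations = ("sigmoid", "tanh", "rectifier")
    hidden_units = (2 * (p + 1), 50, 100)
    hidden_layers = (2, 5, 10, 50)
    l2_penalties = (0.0001, 0.001, 0.01)
    return list(product(activations, hidden_units, hidden_layers, l2_penalties))


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, q in pairs if t == 1 and q == 1)
    tn = sum(1 for t, q in pairs if t == 0 and q == 0)
    fp = sum(1 for t, q in pairs if t == 0 and q == 1)
    fn = sum(1 for t, q in pairs if t == 1 and q == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def select_hyperparameters(p, fit_predict, bins_by_chrom):
    """Return (best quadruple, mean MCC) over held-out chromosomes.

    bins_by_chrom: hypothetical dict mapping chromosome -> (features, labels).
    fit_predict:   hypothetical callback (params, training_sets, test_features)
                   -> predicted labels; it stands in for training the
                   multi-layer perceptron.
    """
    best_params, best_score = None, float("-inf")
    for params in hyperparameter_grid(p):
        fold_scores = []
        for held_out, (test_x, test_y) in bins_by_chrom.items():
            # Train on every chromosome except the held-out one.
            training_sets = [
                data for chrom, data in bins_by_chrom.items() if chrom != held_out
            ]
            predictions = fit_predict(params, training_sets, test_x)
            fold_scores.append(mcc(test_y, predictions))
        score = mean(fold_scores)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice, `fit_predict` would wrap whatever neural network library is used to train the perceptron; the selection logic itself does not depend on that choice.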
