A pitfall for machine learning methods aiming to predict across cell types

Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data become available, future projects will increasingly risk suffering from this issue.

Supplementary Information: The online version contains supplementary material available at doi:10.1186/s13059-020-02177-y.


Model architectures
We evaluated the performance of a variety of neural network models for our tasks. For models that used only epigenomic signal as input, we considered all models that had between 1 and 5 layers and all powers of 2 between 1 and 4096 neurons per layer.
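As a sketch, this search space can be enumerated directly. We assume here that every hidden layer within a given model shares the same width, which is how such grids are usually defined but is not stated explicitly above:

```python
from itertools import product

depths = range(1, 6)                  # 1 to 5 hidden layers
widths = [2 ** i for i in range(13)]  # powers of 2 from 1 to 4096 neurons

# Assumption: every hidden layer in a given model uses the same width,
# so each architecture is a (depth, width) pair.
architectures = list(product(depths, widths))
```

Under that assumption the grid contains 5 × 13 = 65 candidate architectures.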
For models that used only nucleotide sequence as input, we considered two different types of models. The first are fully dense networks similar to those that used only epigenomic signal. These models had between 1 and 3 layers with all powers of 2 between 1 and 1024 neurons per layer. The second are convolutional models that are composed of a variable number of convolutional layers followed by max pooling layers and ending with a single dense layer. These convolutional models had between 1 and 3 convolutional layers, between 1 and 256 filters per convolutional layer, and between 1 and 1024 nodes in the final dense layer. The convolutional layers used a kernel of size 8 and a stride of 1. The max pooling layers had a kernel of size 4 and a stride of 4.
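To make the layer sizing concrete, the length of the representation after each convolution and pooling stage follows from the kernel and stride. The sketch below assumes "valid" (no padding) convolutions and a hypothetical input of 1,000 positions, neither of which is stated in the text:

```python
def conv_out(length, kernel=8, stride=1):
    """Output length of a 'valid' (no padding) 1-D convolution."""
    return (length - kernel) // stride + 1

def pool_out(length, kernel=4, stride=4):
    """Output length of a 'valid' 1-D max pooling layer."""
    return (length - kernel) // stride + 1

# Hypothetical input of 1,000 positions through 3 conv + pool stages
length = 1000
for _ in range(3):
    length = pool_out(conv_out(length))
# length shrinks 1000 -> 248 -> 60 -> 13
```

Each pooling stage divides the sequence length roughly by four, so three stages reduce 1,000 positions to 13 before the final dense layer.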
The models that used both nucleotide sequence and epigenomic signal were composed of one of the nucleotide models above and one of the epigenomic models. The final hidden layers of the two models were concatenated together and fed through an additional hidden layer before the output. Rather than consider all potential model architectures that utilized nucleotide sequence, we limited our evaluation to only 100 randomly selected model architectures for computational reasons.
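The random subsampling of joint architectures can be sketched as below. The two component grids are stand-ins for the nucleotide and epigenomic search spaces described above (with the per-layer widths assumed shared within a model):

```python
import random
from itertools import product

# Hypothetical component grids: (depth, width) pairs for each sub-model
seq_archs = list(product(range(1, 4), [2 ** i for i in range(11)]))   # sequence
epi_archs = list(product(range(1, 6), [2 ** i for i in range(13)]))   # signal

# Joint space of (sequence architecture, epigenomic architecture) pairs
joint = list(product(seq_archs, epi_archs))

# Evaluate only 100 randomly selected joint architectures
random.seed(0)  # seed chosen arbitrarily for reproducibility
sampled = random.sample(joint, 100)
```

`random.sample` draws without replacement, so the 100 selected joint architectures are distinct.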
In all models, both the convolutional layers and the hidden dense layers used ReLU activations, where f(x) = max(0, x).

Model training
The neural network models were trained using the Adam optimizer [2] and a binary cross-entropy loss. All model hyperparameters were set to their defaults as specified by Keras version 2.0.8 (https://keras.io), and no additional regularization was used. The models were trained on balanced mini-batches of size 32, and an epoch was defined as 400 mini-batches. Training proceeded for 100 epochs but was stopped early if performance on a balanced validation mini-batch of size 3,200 did not improve for five consecutive epochs.
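The early-stopping schedule above can be sketched as a plain training loop. The `train_epoch` and `validate` callables are placeholders for 400 mini-batches of training and one balanced validation evaluation, respectively:

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=100, patience=5):
    """Stop when validation performance fails to improve for `patience`
    consecutive epochs, mirroring the schedule described above."""
    best, stale = float("-inf"), 0
    for epoch in range(max_epochs):
        train_epoch()       # e.g., 400 mini-batches of size 32
        score = validate()  # e.g., score on a balanced mini-batch of 3,200
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, epoch + 1
```

With a patience of 5, training halts on the fifth consecutive epoch without improvement, even if fewer than 100 epochs have run.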
The gradient boosted decision tree models were trained using XGBoost [3]. The default values were used for all parameters, except that training progressed for 300 iterations instead of 100, and the maximum depth of each tree was set to 6 instead of 3. The models were trained using a binary logistic loss and an L2 regularization strength of 1. A single model was trained for each input feature set. These models were then evaluated using only the first N trees, with N ranging from 1 to 300, to measure the performance of models of varying complexity. Because subsampling is not used, this procedure is identical to independently training models of varying sizes.
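The configuration above corresponds to a parameter set like the following sketch, written with XGBoost's native (non-sklearn) parameter names:

```python
# Sketch of the XGBoost configuration described above (parameter names
# follow XGBoost's native API; values match the text, not a verified run).
params = {
    "objective": "binary:logistic",  # binary logistic loss
    "max_depth": 6,                  # instead of the default 3
    "lambda": 1.0,                   # L2 regularization strength
}
num_boost_round = 300                # 300 boosting iterations instead of 100
```

At evaluation time, XGBoost can score using only the first N trees of a trained booster (recent versions via `predict(..., iteration_range=(0, N))`, older versions via `ntree_limit`), which is what makes the single-model, varying-N evaluation possible.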
The training, validation, and test sets consisted of different genomic loci depending on the model evaluation setting. In the cross-chromosomal setting, the validation set was derived from Chromosome 2 and the test set was derived from Chromosome 1 for both tasks. For the gene expression task, the training set consisted of all genes in Chromosomes 3 through 22, while for the TAD boundary prediction task, it consisted of all 40 kbp bins in Chromosome 3. In the cross-cell type setting, the training, validation, and test sets were derived from Chromosomes 2 through 22 in the gene expression task and Chromosomes 2 and 3 in the TAD boundary prediction task. In the hybrid setting, the training and validation sets were the same as in the cross-cell type setting, but the test set for both tasks was derived from Chromosome 1. We chose to hold the training set constant between the cross-cell type and hybrid approaches, rather than the test set, in order to demonstrate that models trained on the same data exhibit markedly different trends with respect to model complexity depending on the evaluation set.
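A chromosome-based split of this kind can be sketched with a small helper. The constants below follow the cross-chromosomal setting for the gene expression task (test = Chromosome 1, validation = Chromosome 2, training = Chromosomes 3 through 22); the locus representation as (chromosome, start) tuples is our assumption:

```python
# Hypothetical split following the cross-chromosomal setting described above
TEST_CHROMS = {"chr1"}
VAL_CHROMS = {"chr2"}
TRAIN_CHROMS = {f"chr{i}" for i in range(3, 23)}

def split_by_chromosome(loci):
    """Partition (chromosome, start) loci into train/val/test sets."""
    sets = {"train": [], "val": [], "test": []}
    for locus in loci:
        chrom = locus[0]
        if chrom in TEST_CHROMS:
            sets["test"].append(locus)
        elif chrom in VAL_CHROMS:
            sets["val"].append(locus)
        elif chrom in TRAIN_CHROMS:
            sets["train"].append(locus)
    return sets
```

Because the partition is by chromosome, no genomic locus can appear in more than one of the three sets.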
Depending on the evaluation setting, these models were also trained on either a single cell type or multiple cell types. In all cases, models were evaluated on data derived from the H1 cell line (E003). In the cross-chromosomal setting, models for both tasks were also trained on data from the H1 cell line (E003). For the gene expression task in both other settings, samples drawn from spleen (E113), H1 BMP4-derived mesendoderm cultured cells (E004), CD4 memory primary cells (E037), and sigmoid colon (E106) were used as the validation set, and all other cell types (excluding the H1 cell line) were used as the training set (see Additional file 1: Table S1). For predicting TAD boundaries, the validation set was drawn from GM12878 (E116) and the training set consisted of all other cell lines (excluding the H1 cell line).

Average activity
The average activity is the value of some form of biological signal averaged, at each genomic locus, over all cell types in the training set. Formally, for a training set with m cell types, n genomic loci, and measurements of some signal S ∈ ℝ^(m×n), the average activity A ∈ ℝ^n is calculated at each locus j as

A_j = (1/m) Σ_{i=1}^{m} S_{i,j}
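In code, the average activity is simply a column-wise mean of the signal matrix, averaging over cell types at each locus. A minimal NumPy sketch with toy values:

```python
import numpy as np

# Toy signal matrix S: m = 2 cell types (rows) x n = 3 genomic loci (columns)
S = np.array([[1.0, 0.0, 2.0],
              [3.0, 0.0, 4.0]])

# Average activity: mean over cell types (rows) at each locus
A = S.mean(axis=0)
```

A model that memorizes `A` can appear accurate whenever the same loci occur in both training and test sets, which is precisely the pitfall described above.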

Average precision
For each task we evaluated model performance using the average precision score. Average precision summarizes the precision-recall curve as a weighted mean of the precision at each threshold, with the increase in recall from the previous threshold used as the weight. It is an estimate of the area under the precision-recall curve that is not overly optimistic, as can be the case when one linearly interpolates between points on the curve. The score is calculated as

AP = Σ_n (Recall_n − Recall_{n−1}) Precision_n

where Recall_n and Precision_n are the recall and the precision at the n-th calculated threshold, with one threshold for each data point.
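The step-wise sum above can be implemented directly; the sketch below sorts predictions by decreasing score and uses each data point as a threshold (tie handling is simplified):

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP = sum over thresholds of (Recall_n - Recall_{n-1}) * Precision_n,
    with one threshold per data point (sorted by decreasing score)."""
    order = np.argsort(-np.asarray(y_score, dtype=float))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                            # true positives at each cutoff
    precision = tp / np.arange(1, len(y) + 1)    # TP / predicted positives
    recall = tp / y.sum()                        # TP / actual positives
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))
```

For example, with labels [1, 0, 1] scored [0.9, 0.8, 0.7], the thresholds give precisions 1, 1/2, 2/3 and recalls 1/2, 1/2, 1, so AP = 1/2 · 1 + 0 · 1/2 + 1/2 · 2/3 = 5/6.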