We have studied a large number of annotated NGS files to characterize data quality and to create machine learning models that automatically predict quality from the raw data. Our goal is to provide an alternative to manual quality control of NGS data files, which currently requires high-level expertise, depends strongly on assays and experimental conditions, and is prone to human biases. Interestingly, the minimum numbers of uniquely mapped reads and usable fragments mentioned in the ENCODE guidelines cannot be used to categorize the ENCODE data with respect to quality (Additional file 1: Fig. S14 A, C, and E).
Workflow
Our study is based on 2642 human and mouse raw NGS data files labeled as high- or low-quality from the ENCODE data portal (Fig. 1). From these files, we extracted different quality features using software tools commonly used in the NGS field. Each tool represents a different stage or view of the NGS data analysis, providing feature sets related to raw sequencing reads (RAW), mapping to a reference genome (MAP), genomic localizations of the reads (LOC), and spatial distribution of the reads near transcription start sites (TSS) (see the “Methods” section for details).
Our approach was first to derive statistical guidelines from the detailed study of individual quality features, and then to systematically benchmark 10 popular machine learning algorithms in a grid search to predict the quality of NGS data files based on combinations of quality features and various sets of parameters (see grid search specifications in Additional file 2). In total, we evaluated 4.6 M predictive models covering different data subsets: either generic (including all species and/or all assays) or specialized in particular species and assays (e.g., human ChIP-seq or mouse DNase-seq).
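As an illustration, such a benchmark can be organized with scikit-learn as sketched below; the feature matrix X, quality labels y, candidate algorithms, and parameter grids shown here are placeholders, and the actual grid search specifications are listed in Additional file 2.

```python
# Minimal sketch of a model benchmark over several classifiers and parameter
# grids, assuming features X and binary quality labels y are already prepared.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

candidates = {
    "random_forest": (RandomForestClassifier(),
                      {"n_estimators": [100, 500, 1000]}),
    "gradient_boosting": (GradientBoostingClassifier(),
                          {"learning_rate": [0.01, 0.1], "n_estimators": [100, 500]}),
    "mlp": (MLPClassifier(max_iter=2000),
            {"solver": ["lbfgs", "adam"], "hidden_layer_sizes": [(50,), (50, 50)]}),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
best = {}
for name, (clf, grid) in candidates.items():
    # Exhaustive search over the parameter grid, scored by auROC.
    search = GridSearchCV(clf, grid, scoring="roc_auc", cv=cv, n_jobs=-1)
    search.fit(X, y)  # X: feature matrix, y: high/low quality labels (placeholders)
    best[name] = (search.best_score_, search.best_params_)
```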
Quality prediction
One-feature quality predictions as baseline and guidelines
Before evaluating machine learning algorithms, we evaluated the predictive power of each quality feature independently (Fig. 2a). This first analysis, which does not involve machine learning, gave us an overview of their performance across the data subsets and a baseline for the evaluation of the machine learning algorithms. We compared the results using the area under the receiver operating characteristic curve (auROC), which ranges from 0.5 (not predictive) to 1 (completely predictive). The predictive power of the features strongly depends on the data subset, ranging from poorly predictive (e.g., genomic localization in “1st exon” for human RNA-seq; auROC = 0.50) to highly predictive (e.g., “overall mapping” for mouse ChIP-seq files; auROC = 0.94).
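As an illustration, the predictive power of a single numeric quality feature can be estimated by using its values directly as a score for the auROC; the sketch below assumes a feature vector feature_values and binary quality labels labels (placeholder names).

```python
# Sketch: auROC of a single quality feature used directly as a score,
# assuming feature_values (numeric) and labels (1 = low quality) are given.
from sklearn.metrics import roc_auc_score

auroc = roc_auc_score(labels, feature_values)
# A feature may be anti-correlated with the label; report the orientation
# that is at least as good as random (auROC in [0.5, 1]).
auroc = max(auroc, 1 - auroc)
print(f"auROC = {auroc:.2f}")
```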
Some quality features are of broad interest because of their good performance across all data subsets, especially all MAP features and the following two RAW features: “Overrepresented sequences” and “Per sequence GC content” (auROC up to 0.89 and 0.78, respectively). Other quality features are less interesting because of their poor performance in all subsets (e.g., “Sequence Length Distribution” or “TSS − 2500”; auROC ≤ 0.62). Comparing species, DNase-seq results showed more differences between human and mouse than ChIP-seq results. This is probably due to the large difference in composition between human and mouse DNase-seq samples (Additional file 1: Fig. S7). Analyzing narrow and broad peak samples from the ChIP-seq results separately, we observed an under-representation of broad peak samples (only 82 human and 46 mouse samples) and differences between peak types (Additional file 1: Fig. S9).
From this analysis, we have derived detailed statistical views that confirm the specificity to data subsets observed above (Additional file 1: Fig. S1). By providing an easy way to compare quality features of new data files to thousands of ENCODE files, these results can serve as statistical guidelines to NGS specialists.
Multi-feature quality predictions by machine learning
With the performance of each single quality feature as baseline, we evaluated the potential improvement of using combinations of features through machine learning algorithms to predict the quality of NGS data files. We evaluated machine learning models trained with the different data subsets for each feature set (Fig. 2b). The models, tuned by a grid search that systematically explores performance over the parameter space, outperformed the baseline. Each model may use a different algorithm or set of parameters. For example, the tuned generic model using all features is based on a Random Forest classifier using 1000 estimators, while the tuned human ChIP-seq model for single-end read experiments is based on a multilayer perceptron with 2 hidden layers (see Additional file 3 for details about the parameters used by each tuned model).
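For illustration, the two tuned configurations mentioned above correspond approximately to the following scikit-learn estimators; hyperparameters other than those named in the text are assumptions (defaults or placeholders), and the full parameter sets are given in Additional file 3.

```python
# Sketch of the two tuned configurations named above; hyperparameters not
# mentioned in the text are left at scikit-learn defaults or are placeholders.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Tuned generic model (all features): Random Forest with 1000 estimators.
generic_model = RandomForestClassifier(n_estimators=1000, random_state=0)

# Tuned human ChIP-seq single-end model: multilayer perceptron with 2 hidden
# layers (the layer sizes here are placeholders, not the published values).
chipseq_se_model = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=2000)
```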
The quality of data files from each subset can be predicted with high performance, especially when using all features (auROC > 0.9; Fig. 2b). Within each data subset, the different feature sets led to comparable results, although performance was more variable with RAW features in the DNase-seq subsets, and the LOC and TSS feature sets performed less well for the ChIP-seq subsets. Results for all combinations of feature sets and other performance measures such as area under the precision-recall curve, accuracy, or F1 are shown in Additional file 1: Fig. S2 and S5A.
This analysis confirms the expectation that sequencing reads from the ChIP-seq subset (including narrow and broad peak types) are not strongly biased towards specific genomic localizations or positions relative to TSSs, in contrast to RNA-seq, for which these two feature sets enable very high-performing models (auROC = 1 and 0.988, respectively). Nevertheless, we also evaluated separate models for narrow-peak and broad-peak ChIP-seq samples (Additional file 1: Fig. S10). Considering all features, human and mouse narrow-peak models performed similarly to the generic or less specialized ChIP-seq models described above. Broad-peak models outperformed the other models, although we could not exclude a bias due to the smaller sample size of the underlying training sets.
Taken together, these results show that the tuned generic and specialized models mostly achieve their best performance when using all quality features. Therefore, we considered the tuned models using all features as optimal and used them for the further analyses below (see Additional file 4 for details about the parameters of the 7 optimal models).
Within-experiment analysis
To estimate the extent to which individual NGS researchers would benefit from using machine learning models in their QC procedures, we analyzed the results on replicate files produced within the same experiment. In our dataset derived from ENCODE, some experiments gather high- and low-quality files from different replicate and control samples. For the different species-assay combinations, we extracted these experiments with the corresponding files (Fig. 3a).
Focusing on the probability of each file to be of low quality, as provided by the optimal specialized models across the different experiments, our models clearly differentiated which files would be considered high or low quality after manual QC (Fig. 3b). Within each experiment, we mostly observed a clear cut or a meaningful ordering by actual quality (Fig. 3c and Additional file 1: Fig. S3), with an average difference between low- and high-quality files ranging from 0.5 to 1 (Fig. 3d). From these observations, we conclude that the optimal specialized models were not biased towards particular experiments. These results suggest that, early in the sequencing analysis pipeline, researchers can already assess whether their data files are likely to be considered of sufficient quality for publication and can take decisions accordingly, potentially saving a substantial amount of time and resources for further analyses, storage, or manual reviews. For instance, out of the 23 mouse ChIP-seq experiments in our dataset, 41 (53%) files could have been identified early as being of low quality and thus not submitted, stored, processed, and manually reviewed by the ENCODE database curators.
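As a sketch, the per-file low-quality probabilities used in this analysis can be obtained from a trained classifier and summarized per experiment; the fitted model, the feature matrix X_files, and the array of experiment accessions are placeholder names.

```python
# Sketch: per-file low-quality probabilities grouped by experiment, assuming a
# fitted binary classifier `model` whose positive class is "low quality".
import pandas as pd

proba_low = model.predict_proba(X_files)[:, 1]  # probability of low quality per file
per_experiment = (
    pd.DataFrame({"experiment": experiments, "p_low": proba_low})
      .groupby("experiment")["p_low"]
      .describe()  # spread of predicted probabilities within each experiment
)
print(per_experiment)
```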
Top machine learning algorithms and parameters
In our model tuning strategy, which systematically tested 10 different algorithms and numerous parameter sets, different algorithms could be found among the models optimal for each data subset and/or feature set. Given the heterogeneity of the data, composed of numerical and categorical values, and the moderate dataset size, we expected decision-tree-based algorithms to be appropriate for the task. To test this hypothesis, we summarized the results across the data subsets and combinations of feature sets. As expected, decision-tree-based algorithms (random forest, gradient boosting, and XGBoost) often performed better than others, as did the multilayer perceptron, a deep learning classifier based on artificial neural networks (Fig. 4a). In general, there were only minor differences between the top algorithms. Nevertheless, the choice of their parameter values proved to be critical (Fig. 4b). For example, the multilayer perceptron clearly benefited from the quasi-Newton method (lbfgs) to solve the weight optimization (> 80% of the best models), but the optimal sizes of the hidden layers depended more on other parameters. The choice of an automatic feature selection could also be important, especially with gradient boosting algorithms or support vector machines (Fig. 4c).
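For illustration, the two parameter effects highlighted above can be expressed as follows; the hidden layer sizes, the feature selection scheme, and the number of selected features are placeholders, not necessarily those used in the grid search.

```python
# Sketch: an MLP using the lbfgs solver, and a gradient boosting classifier
# preceded by automatic feature selection, assuming features X and labels y.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Quasi-Newton (lbfgs) weight optimization; layer sizes are placeholders.
mlp = MLPClassifier(solver="lbfgs", hidden_layer_sizes=(50, 50), max_iter=5000)

# One possible automatic feature selection step before gradient boosting.
gb_with_selection = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=20)),  # k is a placeholder
    ("clf", GradientBoostingClassifier()),
])
```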
Cross-species generalization
A main limitation of this study is the availability of labeled data only for human and mouse. To assess whether the models could potentially generalize to other species, we conducted a test in which models were trained with human data and tested on mouse data unseen during model training, and vice versa (Fig. 5a). The tests were performed with ChIP-seq and DNase-seq data, which were available for both species. The results showed that ChIP-seq models trained on one species can predict the quality of files from both the same and the other, unseen, species with comparable performance. This was not the case for DNase-seq, for which prediction performance dropped substantially when predicting file quality for the unseen species. ROC curves derived from the different feature sets were consistent with this observation (Additional file 1: Fig. S4). Therefore, cross-species generalizability of the models can be considered assay dependent.
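A minimal sketch of the cross-species test, assuming per-species feature matrices and labels (placeholder names): a model is trained on one species and scored on the unseen species.

```python
# Sketch: train on one species, test on the other, report auROC.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def cross_species_auroc(X_train, y_train, X_test, y_test):
    # Model choice and hyperparameters are assumptions for illustration.
    model = RandomForestClassifier(n_estimators=1000, random_state=0)
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

human_to_mouse = cross_species_auroc(X_human, y_human, X_mouse, y_mouse)
mouse_to_human = cross_species_auroc(X_mouse, y_mouse, X_human, y_human)
```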
Generic model
A model of particular importance to us was the optimal generic model, trained on files from different species and different assays of the full dataset. Although its performance was high during cross-validation (Fig. 2b), we wanted to know whether it could be biased towards data files from a particular subset, especially because the human ChIP-seq subset is overrepresented (52% of the dataset). After tuning and cross-validation of the generic model, we detailed its performance for each data subset (Fig. 5b and Additional file 1: Fig. S5B). The generic model was able to predict the quality of files from each subset with high performance. Still, specialized models, specifically trained on each subset, performed slightly better. We compared the predictions across different ChIP-seq protein targets and were unable to observe any bias towards particular targets or any effect of including LOC and TSS features in the model (Additional file 1: Fig. S11). Moreover, the predictions were consistent with the real annotations when we examined the distributions of uniquely mapped reads and usable fragments in relation to the ENCODE guidelines (Additional file 1: Fig. S14 A-F).
Evaluation in independent datasets
We have shown above that machine learning models trained on appropriate data from the same source (ENCODE) are powerful and unbiased predictors of NGS file quality. Testing the models on independent datasets from other sources, such as the GEO [23] and Cistrome [24] databases, allows us to assess the generalizability of our approach and consequently demonstrate its usability in different applications.
Application to independent diagnostic studies
The potential of the models to filter low-quality files from diagnostic studies was evaluated on 90 samples from 6 independent gene expression studies related to the following diseases: Alzheimer’s disease (GEO Series GSE125583), Crohn’s disease (GSE66207), diabetes (GSE92724), sporadic amyotrophic lateral sclerosis (GSE76220), liver cancer (GSE25599), and thyroid papillary carcinoma (GSE63511). Assuming that quality has a large effect on gene expression data, automatically identifying files of low quality could help prevent patients from receiving inappropriate medications or treatments. We analyzed the datasets to mark samples predicted to be of low quality as potential outliers (Fig. 6 and Additional file 1: Fig. S12). Such samples could be expected to confound the comparison of samples by group, for example, in a two-dimensional projection derived from the gene expression profiles. Thus, we applied a principal component analysis (PCA) before and after the removal of outliers and compared the resulting clusters based on the first two principal components. The clustering of the samples was substantially improved in 4 datasets out of 6 (minor or no improvement in the 2 other datasets). Although selected here automatically and only for testing purposes, such samples could be potential outliers that should be reviewed to decide their relevance for downstream analysis. As we observed that the probabilities returned by the models may not be directly comparable between studies, for a similar application we would recommend training the models on similar clinical data labeled as low and high quality, when available.
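As a sketch, the before/after comparison can be performed as follows, assuming an expression matrix expr (samples by genes) and a boolean mask of samples predicted as low quality (both placeholder names).

```python
# Sketch: PCA of expression profiles before and after removing samples
# predicted as low quality; expr (samples x genes) and the boolean mask
# predicted_low_quality are placeholders.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def first_two_pcs(matrix):
    # Standardize genes, then project samples onto the first two components.
    scaled = StandardScaler().fit_transform(matrix)
    return PCA(n_components=2).fit_transform(scaled)

pcs_before = first_two_pcs(expr)
pcs_after = first_two_pcs(expr[~predicted_low_quality])
```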
Assessing the generalizability of models on independent datasets
The Cistrome database provides quality information for independent ChIP-seq, DNase-seq, and ATAC-seq samples that are not in ENCODE and were thus not used to train our models. In contrast to the quality status used in ENCODE to label our training sets, the Cistrome quality metrics are produced automatically, and simple cutoff values given in Cistrome’s guidelines flag each metric as indicating a good or bad sample. Nevertheless, we used these data to test whether our models would generalize to guidelines derived by another database and to ATAC-seq samples.
We downloaded and analyzed 610 ChIP-seq, DNase-seq, and ATAC-seq samples from 32 datasets annotated in Cistrome. For the human and mouse ChIP-seq and DNase-seq subsets, there was a high positive correlation (from 0.42 to 0.76) between the number of Cistrome’s bad-quality flags and the low-quality probabilities derived by the optimal generic model (Fig. 7). Similar correlation results were obtained with the optimal specialized models used where appropriate, but more variance could be observed (Additional file 1: Fig. S13).
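As a sketch, such a correlation can be computed as follows; the use of Spearman’s rank correlation here is an assumption for illustration, not necessarily the measure used for Fig. 7, and both input arrays are placeholders.

```python
# Sketch: correlation between the number of Cistrome bad-quality flags and the
# predicted low-quality probabilities. Spearman's rank correlation is an
# assumption here; n_bad_flags and p_low are placeholder arrays.
from scipy.stats import spearmanr

rho, pval = spearmanr(n_bad_flags, p_low)
print(f"rho = {rho:.2f}, p = {pval:.2g}")
```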
On the ATAC-seq data, the generic model also produced probabilities that positively correlated with the flags. However, the correlation was high with mouse data (0.61) and only weak with human data (0.15). Applying the specialized DNase-seq models to ATAC-seq data also performed well with mouse data (0.44) and substantially improved the results for human data (0.46). Although this is not an external or independent validation, we note that a Random Forest model directly trained on ATAC-seq data showed high performance in 10-fold cross-validations (auROC of 0.910 and auPRC of 0.932; see the “Methods” section).
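As a sketch, such a cross-validation can be set up as follows, assuming an ATAC-seq feature matrix X_atac and labels y_atac (placeholder names); the Random Forest hyperparameters shown are assumptions.

```python
# Sketch: 10-fold cross-validation of a Random Forest on ATAC-seq features,
# reporting auROC and auPRC; X_atac and y_atac are placeholders.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, StratifiedKFold

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(
    RandomForestClassifier(n_estimators=1000, random_state=0),
    X_atac, y_atac, cv=cv,
    scoring={"auROC": "roc_auc", "auPRC": "average_precision"},
)
print(scores["test_auROC"].mean(), scores["test_auPRC"].mean())
```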
Taking these results together, the optimal generic model generalizes well to independent datasets representing various data subsets. ATAC-seq data can also be analyzed by this model, but optimal specialized models for DNase-seq or custom models directly trained on ATAC-seq data may be more relevant.