Fig. 2 | Genome Biology


From: satmut_utils: a simulation and variant calling package for multiplexed assays of variant effect


Machine learning models for error correction. Negative control (NC) alignments for “sim” dataset A (NextSeq 500) arose from the human CBS coding sequence after functional complementation in yeast [10]. Alignments for “sim” datasets B–D (NovaSeq 6000, HiSeq 2500, HiSeq 4000) and MiSeq runs arose from HEK293T endogenous CBS cDNA, and alignments for HiSeq X datasets arose from a CBS plasmid.

A Error proportions in negative control libraries. Proportion of each error substitution across NC libraries from various sources. The shape of each point indicates an independent NC library.

B Model selection. To compare models, dataset A (3802 variants, 7859 true mismatches, 6463 false mismatches) was used. Up to 19 satmut_utils call quality features were selected to train binary classifiers (“Methods”).

C Random forest performance. Random forests (RF) were trained on all four “sim” datasets, and cross-validation performance was calculated across different platforms.

D Feature importance for RF models. A RF was trained on a combined dataset (all “sim” datasets A–D), and the top fifteen most important features, as measured by mean decrease in accuracy (“Methods”), are plotted.

E Cross-generalization of RF models. Pairwise train–test regimes were carried out with all “sim” datasets to assess model generalization across sequencing libraries and platforms.

F Error correction impact on variant calls in NC libraries. satmut_utils variant calls from each NC library were filtered by the RF models. The number of error mismatches before and after filtering is plotted for each NC library.

NC: negative control; GBM: gradient boosted machine; GLM-elasticnet: generalized linear model with elastic net regularization; kNN: k-nearest neighbors; RF: random forest; SVC: support vector classifier
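The workflow in panels B–D (training a binary classifier on call quality features, scoring it by cross-validation, and ranking features by the accuracy lost when each is perturbed) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the feature matrix, labels, and model settings are invented, and scikit-learn's permutation importance is used as a stand-in for the mean-decrease-in-accuracy measure cited in the legend.

```python
# Hypothetical sketch of the panel B-D workflow: train a random forest to
# separate true from false mismatches, estimate accuracy by cross-validation,
# and rank features by permutation importance. All data here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-ins for call quality features (e.g. base quality, read position)
X = rng.normal(size=(n, 5))
# Labels depend on the first two features, so the classifier has signal to learn
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Cross-validation accuracy on the training split (panel C analog)
cv_acc = cross_val_score(rf, X_tr, y_tr, cv=5).mean()

# Permutation importance: mean accuracy drop on held-out data when a feature
# is shuffled (analogous to mean decrease in accuracy, panel D)
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(f"CV accuracy: {cv_acc:.2f}, most important feature: {imp.importances_mean.argmax()}")
```

In this toy setup the classifier should recover feature 0, the strongest contributor to the labels, as the top-ranked feature.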
