Fig. 2 | Genome Biology

Fig. 2

From: SSBlazer: a genome-wide nucleotide-resolution model for predicting single-strand break sites

Model performance evaluation and the effect of different imbalance ratios in the dataset. a Performance comparison of the proposed models (in AUROC and AUPRC). SSBlazer-NC refers to SSBlazer without the center feature processing module. SSBlazer-LM refers to a language model version of SSBlazer. b AUROC and AUPRC values across input sequence lengths ranging from 51 bp to 501 bp, used to determine the optimal context length. c, d Cross-species evaluation indicates that SSBlazer generalizes across species. SSBlazer was first trained on dataset II-A (Homo sapiens) and evaluated on dataset II-B (Mus musculus); the reverse experiment was then performed (II-B for training, II-A for evaluation). e–h Profile heatmaps on 1250 ground truth SSB sites illustrate the impact of introducing imbalanced datasets (Q = 1, Q = 10, Q = 100, and Q = 1000) on the 151 bp region around the SSB sites of chromosome 1 of the human genome (hg19). These signal-to-noise landscapes show that introducing imbalance can substantially reduce false positives. i Prediction scores for a specific ground truth SSB site region (Human chr1: 871,507–871,686) from the Q = 1 model (red), Q = 10 model (purple), Q = 100 model (orange), and Q = 1000 model (brown). The model trained on the balanced dataset (Q = 1) shows a high false-positive rate in the flanking regions. The model trained on the imbalanced dataset with Q = 100 has a markedly narrower peak at the ground truth SSB site and a relatively low signal in the flanking regions.
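
As a rough illustration of the imbalance ratios in panels e–i, the sketch below builds toy training sets with Q negative examples per positive example. It assumes that Q denotes the negative-to-positive ratio, which the caption does not state explicitly. The function name `build_imbalanced_dataset` and the placeholder examples are hypothetical and are not taken from SSBlazer.

```python
import random


def build_imbalanced_dataset(positives, negatives, q, seed=0):
    """Return shuffled (example, label) pairs with q negatives per positive.

    Assumption: Q is read as the negative-to-positive ratio, so Q = 1
    gives a balanced dataset and larger Q gives a more imbalanced one.
    """
    n_neg = q * len(positives)
    if n_neg > len(negatives):
        raise ValueError("not enough negative examples for the requested ratio")
    rng = random.Random(seed)
    # Draw negatives without replacement so no example is duplicated.
    sampled_negatives = rng.sample(negatives, n_neg)
    data = [(x, 1) for x in positives] + [(x, 0) for x in sampled_negatives]
    rng.shuffle(data)
    return data


# Toy placeholders standing in for sequence windows (hypothetical data).
positives = [f"pos_{i}" for i in range(5)]
negatives = [f"neg_{i}" for i in range(5000)]

for q in (1, 10, 100, 1000):
    data = build_imbalanced_dataset(positives, negatives, q)
    n_pos = sum(label for _, label in data)
    n_neg = len(data) - n_pos
    print(f"Q = {q}: {n_pos} positives, {n_neg} negatives")
```

Sampling without replacement keeps every negative example unique, which is why the function raises an error when the pool of negatives is too small for the requested ratio.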
