Fig. 3 | Genome Biology

Fig. 3

From: Species-aware DNA language models capture regulatory elements and their evolution

Reconstruction of motifs depends on the context and predicts whether a motif instance will be bound in vivo. A Actual nucleotide biases and those predicted by the 5′ species LM, as a function of the distance to the TSS (imputed using CAGE data). The model keeps track of local variations in nucleotide biases. B Reconstruction fidelity (log-likelihood of the individual observed nucleotides, averaged per motif instance, according to the 3′ species LM) of instances of the Puf3 motif (TGTAAATA), as a function of the distance to the end of the annotated 3′ UTR. The predictions of the model for masked nucleotides are indicated for two instances of the motif (blue circles). Reconstruction fidelity is notably degraded beyond the 3′ UTR end (P = 2.2 × 10⁻¹⁵, Mann–Whitney U test). C ROC curve evaluating to what extent the reconstruction fidelity of our 3′ LMs, as well as the phastCons conservation score, can serve as a predictor of whether a Puf3 motif instance is within or beyond the 3′ UTR boundary. The LMs greatly outperform the conservation score. D Reconstruction fidelity (log-likelihood of the observed nucleotides according to the 5′ species LM) of instances of the Tbf1 consensus motif (ARCCCTA), as a function of the distance to the closest 3′ TSS (imputed using CAGE data). Blue indicates that the motif instance was bound in vivo according to ChIP-exo data. Motif instances located around −100 to −250 nt relative to the TSS are better reconstructed than those further away or in the 5′ UTR (P = 1.2 × 10⁻¹¹, Mann–Whitney U test). E ROC curve evaluating to what extent the reconstruction fidelity of our 5′ LMs, as well as the phastCons conservation score and an expert-curated PWM, can serve as a predictor of whether a Tbf1 motif instance is bound in vivo. The LMs again greatly outperform the alternative methods.
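
The two quantities used throughout this caption, reconstruction fidelity and the area under a ROC curve, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: it assumes the language model has already produced, for each motif position masked in turn, a probability for each of the four nucleotides. The function names `reconstruction_fidelity`, `roc_auc` and `peaked_probs` are hypothetical, and the toy probabilities are made up for the example.

```python
import math


def reconstruction_fidelity(motif_instance, predicted_probs):
    """Mean log-likelihood of the observed nucleotides of one motif instance.

    motif_instance: observed sequence, e.g. "TGTAAATA".
    predicted_probs: one dict per position mapping nucleotide -> probability,
        assumed to come from a masked LM with that position masked.
    """
    if len(motif_instance) != len(predicted_probs):
        raise ValueError("need exactly one probability dict per motif position")
    log_liks = [math.log(probs[nt]) for nt, probs in zip(motif_instance, predicted_probs)]
    return sum(log_liks) / len(log_liks)


def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the normalised Mann-Whitney U.

    Equals the fraction of (positive, negative) pairs in which the positive
    instance has the higher score; ties count as half a pair.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def peaked_probs(observed, p=0.85):
    """Hypothetical toy prediction: probability p on the observed nucleotide,
    with the remainder split evenly over the other three."""
    rest = (1.0 - p) / 3.0
    return {nt: (p if nt == observed else rest) for nt in "ACGT"}


# Toy comparison: a confidently reconstructed instance versus an
# uninformative (uniform) prediction for the same Puf3 motif.
motif = "TGTAAATA"
well_reconstructed = reconstruction_fidelity(motif, [peaked_probs(nt) for nt in motif])
poorly_reconstructed = reconstruction_fidelity(motif, [peaked_probs(nt, p=0.25) for nt in motif])
```

Scoring each motif instance with `reconstruction_fidelity` and passing the scores to `roc_auc` together with binary labels (for example, within versus beyond the 3′ UTR, or bound versus unbound) gives the kind of summary shown in panels C and E. The rank-based formulation is used here because it makes the link to the Mann–Whitney U statistic explicit.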
