Fig. 4 | Genome Biology

Fig. 4

From: Species-aware DNA language models capture regulatory elements and their evolution

LMs can trace the movement and disappearance of motifs across species and they account for the evolution of the transcription initiation mechanism. A We applied the 3′ species LM to the 3′ region of CBP3 homologs in a number of fungal species (compare Fig. 1A). Darker color indicates that the model assigns a higher probability to the correct nucleotide at that position. In most species, the Puf3 motif instance (delineated with black bars) is notably reconstructed better than the remaining nucleotides. Star indicates that this difference in reconstruction is significant (P < 0.05, Mann–Whitney U). Species with gray background were held out during LM training. B We computed the reconstruction fidelity (log-likelihood) achieved by the species 5′ LM for the S. cerevisiae consensus Rap1 motif (CAYCCRTACAY) instances and for instances matching shuffled versions of this motif in 60 fungal species. The difference in reconstruction between the true and shuffled motif instances, expressed as log2 fold change, is plotted against the −log10 P-value of this difference, computed using a Mann–Whitney U test. We observe that in species that have no BLAST match to S. cerevisiae Rap1p, the reconstruction fidelity of the S. cerevisiae Rap1 motif is generally not much better than that of shuffled versions thereof, indicating that the model correctly accounts for species context when reconstructing motifs. C Reconstruction fidelity (log-likelihood of the observed nucleotides according to the 5′ Species LM) of instances of the TATA-box (TATAWAWR), as a function of the distance to the closest 3′ TSS (imputed using CAGE data). Positive values indicate that the TATA-box instance is located in the 5′ UTR. We observe that in S. pombe, the TATA-box is best reconstructed when located ca. −30 bp to the TSS. In S. cerevisiae, which uses a scanning mechanism to initiate transcription and therefore allows more flexible positioning of the TATA, the model reconstructs TATA well overall, but somewhat better when located 50 to 120 bp 5′ from the TSS
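The quantities named in panels B and C can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the model's output is available as a list with one probability per position (the probability assigned to the observed nucleotide), and the function names are hypothetical. The caption does not state how the log2 fold change is formed; here it is assumed to be the log2 ratio of mean per-nucleotide probabilities. The Mann–Whitney U test is omitted; it could be run on the two groups of per-instance scores with a statistics library.

```python
import math
import re

# IUPAC nucleotide codes, enough for the consensus motifs in the caption
# (CAYCCRTACAY for Rap1, TATAWAWR for the TATA-box).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def find_instances(seq, consensus):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    body = "".join(IUPAC[c] for c in consensus)
    # A lookahead makes the regex report overlapping instances as well.
    pattern = re.compile("(?=(" + body + "))")
    return [m.start() for m in pattern.finditer(seq)]


def instance_log_likelihood(probs_of_observed, start, length):
    """Sum of log-probabilities of the observed nucleotides in one instance.

    probs_of_observed is an assumed input: one model probability per position.
    """
    return sum(math.log(p) for p in probs_of_observed[start:start + length])


def log2_fold_change(true_probs, shuffled_probs):
    """Assumed definition: log2 ratio of the two mean probabilities."""
    mean_true = sum(true_probs) / len(true_probs)
    mean_shuffled = sum(shuffled_probs) / len(shuffled_probs)
    return math.log2(mean_true / mean_shuffled)


if __name__ == "__main__":
    seq = "GGTATAAAAGCC"
    starts = find_instances(seq, "TATAWAWR")
    print(starts)
    # Toy probabilities, made up purely for illustration.
    toy_probs = [0.25, 0.25] + [0.9] * 8 + [0.25, 0.25]
    print(instance_log_likelihood(toy_probs, starts[0], len("TATAWAWR")))
```

A value near zero from `log2_fold_change` corresponds to the caption's case where the true motif is reconstructed no better than its shuffled versions.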
