Fig. 2 | Genome Biology

Fig. 2

From: Species-aware DNA language models capture regulatory elements and their evolution

Language models reconstruct likely regulatory sequences in a held-out species and recover known binding motifs. A Reconstruction accuracy for nucleotides within instances of RNA-binding protein consensus motifs and across all nucleotides in S. cerevisiae 3′ UTR sequences (sequences longer than 300 bp were truncated). We compare the agnostic and species 3′ LMs to a variety of baselines. The dashed line represents the accuracy achieved by the intra-genus alignment. A star indicates that the species LM significantly (P < 0.05, binomial test) outperforms the best baseline. For Puf3 and Whi3, Modisco clustering on the species LM reconstructions recovers the motif (depicted above the respective plots). B A sample of known transcription factor motifs recovered by applying Modisco clustering to the 5′ species LM reconstructions, each manually matched to the respective high-confidence PWM from the YeTFaSCo [34] database.
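
The quantities in panel A can be illustrated with a minimal Python sketch. It is not the authors' code: the caption does not say how the binomial test was set up, so the sketch assumes a one-sided test with the best baseline's accuracy as the null success probability. The function names, the toy sequence, the toy consensus string, and the baseline accuracy of 0.25 are all hypothetical.

```python
import re
from math import comb


def motif_positions(seq, motif_regex):
    """Return the set of 0-based positions covered by any motif instance."""
    positions = set()
    # A lookahead wrapper lets overlapping instances all be reported.
    for m in re.finditer(f"(?=({motif_regex}))", seq):
        positions.update(range(m.start(1), m.end(1)))
    return positions


def reconstruction_counts(seq, predicted, positions=None):
    """Count correctly reconstructed nucleotides at the given positions.

    With positions=None, every nucleotide in the sequence is scored.
    Returns (number correct, number scored).
    """
    if positions is None:
        positions = range(len(seq))
    positions = list(positions)
    correct = sum(seq[i] == predicted[i] for i in positions)
    return correct, len(positions)


def binomial_p_greater(k, n, p0):
    """One-sided P(X >= k) for X ~ Binomial(n, p0).

    Assumption: p0 is the accuracy of the best baseline.
    """
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# Toy data (hypothetical): a true 3' UTR fragment and a model reconstruction.
true_seq = "ACGTTGTAAATAGC"
pred_seq = "ACGATGTAAATTGC"

in_motif = motif_positions(true_seq, "TGTAAATA")
k_motif, n_motif = reconstruction_counts(true_seq, pred_seq, in_motif)
k_all, n_all = reconstruction_counts(true_seq, pred_seq)

print(f"motif accuracy:   {k_motif}/{n_motif}")
print(f"overall accuracy: {k_all}/{n_all}")
print(f"P(X >= {k_motif} | p0 = 0.25) = {binomial_p_greater(k_motif, n_motif, 0.25):.6f}")
```

Scoring motif positions separately from all positions mirrors the two quantities named in the caption. In a real analysis the counts would be pooled over every motif instance in all 3′ UTR sequences before testing.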
