Fig. 5 | Genome Biology

Fig. 5

From: Species-aware DNA language models capture regulatory elements and their evolution

Sequence representations of the species LM outperform other methods on a variety of downstream tasks. A Performance (R²) of linear models trained on language model embeddings, compared with state-of-the-art models and k-mer count regressions; for the k-mer regressions, the best k from {3, 4, 5} is shown. A star indicates that the species LM significantly (P < 0.05) outperforms the second-best method. B Effect of mutations in 3′ sequences on expression. Observed log2 fold changes, as measured by Shalem et al. [55], are well predicted by the species LM representation. C Observed and predicted effects of mutations on expression as a function of distance to the stop codon for the YDR131C 3′ sequence. D Motifs recovered through in silico mutagenesis followed by Modisco clustering on our linear model for the S. cerevisiae half-life task. Motifs with a negative effect on half-life are depicted upside down. We recover two of the four motifs found by Cheng et al. [50]: the Puf3 motif and the Whi3 motif. Additionally, we find two motifs not found in that analysis, the Puf4 motif and the efficiency element, both of which have known effects on RNA stability.
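
The in silico mutagenesis mentioned for panels B to D can be illustrated with a minimal Python sketch. It is not the authors' pipeline: for simplicity it swaps the language model embeddings for k-mer count features, and the weights, the toy sequence, and the function names (`kmer_counts`, `predict`, `in_silico_mutagenesis`) are all hypothetical. The idea is only to show how each single-base substitution is scored as the change in a linear model's prediction relative to the unmutated sequence.

```python
from collections import Counter

BASES = "ACGT"

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def predict(seq, weights, k, intercept=0.0):
    """Linear model: intercept plus weight * count for each k-mer."""
    counts = kmer_counts(seq, k)
    return intercept + sum(weights.get(kmer, 0.0) * n for kmer, n in counts.items())

def in_silico_mutagenesis(seq, weights, k):
    """Map (position, alt_base) to the predicted change for each substitution."""
    ref = predict(seq, weights, k)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = predict(mutant, weights, k) - ref
    return effects

# Hypothetical weights: one 4-mer that lowers the predicted value.
toy_weights = {"TGTA": -1.0}
toy_seq = "AATGTACC"
effects = in_silico_mutagenesis(toy_seq, toy_weights, k=4)
```

In this toy setup, substitutions that disrupt the weighted 4-mer raise the prediction, while substitutions outside it leave the prediction unchanged. Position-wise scores of this kind are what a clustering step such as Modisco would then group into motifs.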