Fig. 1 | Genome Biology

From: Species-aware DNA language models capture regulatory elements and their evolution

Masked language modeling can serve as an alternative to alignments, which struggle to capture the conservation of regulatory elements over large evolutionary distances. A BLAST hits of S. cerevisiae CDS and 3′ UTRs in other fungal species. B Regions 3′ to the stop codon of CBP3 orthologues in different fungal species. Instances of Puf3-like motifs (TGTA*ATA) are indicated in red, and a star indicates experimental evidence of Puf3 binding. It appears that Puf3 binding is conserved, whereas the location of the motif is not. C Masked language models are neural networks trained to reconstruct masked nucleotides from context. We illustrate this with the example of a Puf3 motif (TGTAAATA), where the second-to-last T has been masked. Since this motif is highly conserved, the model may learn that, given this context, a T is most likely. For each masked nucleotide, the model returns a probability distribution over the letters A, C, G, and T. We can extract sequence representations from the model by pooling the hidden representations of its last four layers. The architecture of the LM corresponds to DNABERT [14], with the modification that we make the model species-aware by providing a token denoting the species from which the sequence originates. D We train language models on hundreds of highly diverged fungi. In each genome, we locate the annotated coding sequences and extract the non-coding sequences immediately before the start codon (5′ region) and immediately after the stop codon (3′ region). We train separate models for each region. Each model is trained on more than 10 million sequences.

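As an illustration of the mechanism sketched in panel C, the following is a minimal, hypothetical PyTorch example, not the authors' code: a transformer encoder that prepends a species embedding to the input, predicts a probability distribution over A, C, G, and T at a masked position, and derives a sequence representation by mean-pooling the hidden states of its last four layers. The single-nucleotide vocabulary, model sizes, and species count are placeholders; the published model follows DNABERT [14] and its k-mer tokenization.

```python
# Minimal, illustrative sketch of species-aware masked nucleotide modeling.
# All names and hyperparameters here are placeholders, not the published model.
import torch
import torch.nn as nn

VOCAB = {"[PAD]": 0, "[MASK]": 1, "A": 2, "C": 3, "G": 4, "T": 5}
N_SPECIES = 800  # placeholder for the number of fungal genomes

class SpeciesAwareMLM(nn.Module):
    def __init__(self, d_model=128, n_heads=8, n_layers=6, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(len(VOCAB), d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.species_emb = nn.Embedding(N_SPECIES, d_model)  # species token
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.mlm_head = nn.Linear(d_model, len(VOCAB))

    def forward(self, seq_ids, species_id):
        pos = torch.arange(seq_ids.size(1), device=seq_ids.device)
        x = self.tok_emb(seq_ids) + self.pos_emb(pos)
        # Prepend the species embedding so every layer can condition on the species.
        x = torch.cat([self.species_emb(species_id).unsqueeze(1), x], dim=1)
        hidden_states = []
        for layer in self.layers:
            x = layer(x)
            hidden_states.append(x)
        # Per-position logits over the vocabulary (drop the species slot).
        logits = self.mlm_head(x[:, 1:])
        # Sequence representation: mean-pool the hidden states of the last four layers.
        representation = torch.stack(hidden_states[-4:]).mean(dim=(0, 2))
        return logits, representation

# Example: mask the second-to-last T of a Puf3-like motif (TGTAAATA).
model = SpeciesAwareMLM()
ids = torch.tensor([[VOCAB[c] for c in "TGTAAATA"]])
ids[0, 6] = VOCAB["[MASK]"]
logits, embedding = model(ids, species_id=torch.tensor([0]))
probs = logits[0, 6].softmax(-1)  # distribution at the masked site; A, C, G, T entries are of interest
```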