Fig. 4 | Genome Biology

Fig. 4

From: High-throughput deep learning variant effect prediction with Sequence UNET

High-throughput proteome analysis. A Number of proteins and species in the Muller et al. proteomics dataset. B Computation speed comparison between SIFT4G, FoldX, single-pass ESM-1b and Sequence UNET on CPU and GPU. Sequence UNET was also tested running on GPU in batches of 100. The SIFT4G and FoldX computations were performed as part of an independent deep mutational scanning analysis [25]; ESM-1b was run on ProteinNet proteins, and Sequence UNET computations cover this proteomics dataset. C Pearson correlation coefficient between predicted conservation and protein abundance in each species. The error bounds of Pearson’s ρ are calculated with Fisher’s Z transform. Predicted conservation is summarised as the mean number of variants predicted to be deleterious across positions. Results are shown for Sequence UNET frequency predictions across all species, and for SIFT4G in Mycoplasma and in species with data available in Mutfunc [29]. The species’ phylogeny, based on the NCBI Taxonomy Common Tree, is also shown. D Boxplot showing the distribution of correlation coefficients for each domain, split between proteins in SwissProt and TrEMBL. The p-value comes from a two-sample unpaired t-test. E Relationship between Pearson correlation and standard deviation of raw protein abundance across species. F Relationship between Pearson correlation and standard deviation of protein length across species.
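The error bounds in panel C rely on Fisher's Z transform, which can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `fisher_z_interval`, the 95% level (the default `z_crit`) and the example values are assumptions, since the caption does not state the confidence level or sample sizes used.

```python
import math


def fisher_z_interval(r, n, z_crit=1.959963984540054):
    """Approximate confidence interval for a Pearson correlation.

    r      -- sample Pearson correlation, strictly between -1 and 1
    n      -- number of paired observations (must exceed 3)
    z_crit -- standard normal critical value; the default gives a
              two-sided 95% interval (an assumption, not from the caption)
    """
    if n <= 3:
        raise ValueError("Fisher's Z interval needs n > 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")

    # Fisher's Z transform: z = atanh(r) is approximately normal
    # with standard error 1 / sqrt(n - 3).
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)

    # Build the interval on the z scale, then map back to the r scale.
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# Hypothetical example: r = 0.5 measured over 103 proteins.
lower, upper = fisher_z_interval(0.5, 103)
print(f"r = 0.5, interval = ({lower:.3f}, {upper:.3f})")
```

Because the interval is built on the transformed scale and mapped back with `tanh`, it is asymmetric around r and always stays within (-1, 1), which a naive symmetric interval on the raw correlation would not guarantee.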
