Fig. 1 | Genome Biology

Fig. 1

From: WhatsGNU: a tool for identifying proteomic novelty

Workflow and performance of WhatsGNU.

a Workflow for the WhatsGNU tool and its compression technique. The tool first compresses the database of proteins. It then matches each protein from a query genome to an exact match in the compressed database. Finally, it produces a report with a GNU (Gene Novelty Unit) score for each protein.

b Compressed databases available in WhatsGNU.

c A collector's curve showing the number of exact matches (unique alleles) as a function of the number of genomes sequenced. The sizes of the panallelomes of the S. aureus genomes available on GenBank and on Staphopia were compared. Subsets of 1000, 2000, 4000, 8524, 10,350, 20,000, and 30,000 genomes were randomly selected from the 43,914 S. aureus genomes available on Staphopia. The random sampling was repeated three times, independently. Error bars are shown in green.

d Effect of the number of isolates on the wall-clock running time of WhatsGNU and blastp. Both WhatsGNU and blastp were run on a single CPU with 16 GB of RAM. The S. aureus database used for WhatsGNU had been processed in advance and serialized using the Python3 pickle module. The time needed to find exact matches for each of the 2893 proteins of S. aureus NCTC 8325 was recorded for WhatsGNU and blastp. For WhatsGNU, 1, 100, and 1000 copies of the NCTC 8325 genome were used to evaluate the running time. For blastp, to reduce computational costs, the running time for one NCTC 8325 genome was multiplied by 100 and 1000, respectively. Running times will differ on machines with different specifications. The blastp running time can be reduced by using multiple threads if more than one CPU is available.
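The three steps in panel a (compress, exact match, report) can be illustrated with a minimal Python sketch. This is not the WhatsGNU implementation: the function names `compress_database` and `gnu_report`, the toy sequences, and the protein identifiers are all hypothetical, and the sketch assumes that the GNU score of a query protein is the number of exact copies of its sequence in the database. The pickle round trip mirrors the caption's note that the database was serialized ahead of time.

```python
import pickle
from collections import Counter


def compress_database(proteins):
    """Collapse identical protein sequences into {sequence: copy count}.

    Hypothetical helper: assumes the compressed database only needs one
    entry per distinct sequence plus how often that sequence occurs.
    """
    return Counter(proteins)


def gnu_report(query_proteins, compressed_db):
    """Return (protein_id, score) pairs for a query genome.

    Assumption: the score is the number of exact matches in the database,
    so 0 means the sequence was never seen.
    """
    return [(pid, compressed_db.get(seq, 0)) for pid, seq in query_proteins]


# Toy database: sequences pooled from several made-up genomes.
database = ["MKTAYIAK", "MKTAYIAK", "MSLNQVRE", "MKTAYIAK", "MSLNQVRD"]
compressed = compress_database(database)

# Serialize once so that later runs can skip the compression step.
blob = pickle.dumps(dict(compressed))
restored = pickle.loads(blob)

# Made-up query genome: (protein identifier, sequence) pairs.
query = [
    ("prot_1", "MKTAYIAK"),
    ("prot_2", "MSLNQVRD"),
    ("prot_3", "MNQVELAA"),
]
report = gnu_report(query, restored)
for pid, score in report:
    print(pid, score)
```

Because each lookup is a hash-table access on the full sequence, the cost per query protein does not grow with the number of database genomes. That is consistent in spirit with the comparison against blastp in panel d, although the sketch itself says nothing about the real tool's timings.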
