WhatsGNU: a tool for identifying proteomic novelty

To understand diversity in enormous collections of genome sequences, we need computationally scalable tools that can quickly contextualize individual genomes based on their similarities and identify features of each genome that make them unique. We present WhatsGNU, a tool based on exact match proteomic compression that, in seconds, classifies any new genome and provides a detailed report of protein alleles that may have novel functional differences. We use this technique to characterize the total allelic diversity (panallelome) of Salmonella enterica, Mycobacterium tuberculosis, Pseudomonas aeruginosa, and Staphylococcus aureus. It could be extended to others. WhatsGNU is available from https://github.com/ahmedmagds/WhatsGNU. Electronic supplementary material Supplementary information accompanies this paper at 10.1186/s13059-020-01965-w.

If it is needed permanently, the last line can be added to .bashrc or .bash_profile.

WhatsGNU toolbox
We define panallelome as the total set of allelic diversity in a given group of genomes. In this case it is the set of protein alleles that differ by at least one amino acid. The toolbox is composed of four scripts: 1. WhatsGNU_get_GenBank_genomes.py This script downloads genomic fna files or protein faa files from GenBank.

WhatsGNU_database_customizer.py
The protein ids usually have little or no information about the strain so once all proteins are mixed in the database, each strain's information is lost. This script customizes the protein faa files from GenBank, RefSeq, Prokka and RAST by adding a strain name to the start of each protein to later count the top genomes and it also changes the separator to be '|' instead of '_'. This script can also customize the strain names for gff file to be used in Roary for pangenome analysis, if the Ortholog mode is going to be used in WhatsGNU.

WhatsGNU_main.py
In basic mode, this script utilizes the natural variation in public databases to rank protein sequences based on the number of observed exact protein matches (the GNU score) in all known genomes of a particular species. It generates a report for all the proteins in your query in seconds using exact match compression technique. In ortholog mode, the script will additionally link the different alleles of an ortholog group using the clustered proteins output file from Roary or similar pangenome analysis tools. In this mode, WhatsGNU will calculate Ortholog Variant Rarity Index (OVRI) (scale 0-1).
= Sum of all allele GNU scores in the ortholog group ≤ the GNU score of the allele Sum of GNU scores in the ortholog group We developed this metric to take into consideration the distribution of the GNU scores of other alleles in the same ortholog group. This index represents how unusual a given GNU score is within an ortholog group by measuring how many other protein alleles in the ortholog group have that GNU score or lower. For instance, an allele of GNU=8 in an ortholog group that has 6 alleles with this distribution of GNU scores [300, 20,15,8,2,1] will get an OVRI of (8+2+1)/346= 0.03. On the other hand, the allele with GNU=300 will get an OVRI of (300+20+15+8+2+1)/346= 1. An allele with an OVRI of 1 is relatively common regardless of the magnitude of the GNU score, while an allele with OVRI of 0.03 is relatively rare. This index helps distinguish between ortholog groups with high levels of diversity and ortholog groups that are highly conserved.
Note: The GNU score could be affected by the protein size if the length increases the probability of mutation. In this case a longer protein may obtain lower GNU scores or GNU score of zero more frequently compared to shorter proteins. Similarly, genes that contain homopolymer repeats or other regions that are vulnerable to NGS errors might also end up with lower GNU scores. Truncated proteins would also get a GNU score of zero if they do not appear in the database, and therefore open reading frames at the end of assembled contigs should be validated.
It is important to note also that GNU scores are reflective of the database composition and change as the database grows. GNU scores from wellsampled clades will tend to be higher compared to strains from other undersampled lineages. However, the change over time is appropriate for the major goal of the GNU score, which is to measure novelty as we learn more about the diversity of genomes in a species. . The x-axis is the delta average GNU score (Average_GNU_score_case -Average_GNU_score_control) in the ortholog group. Lower average GNU score in cases will have a negative value on the x-axis (red dots) while lower average GNU score in the control group will have positive value on the x-axis (green dots). The y-axis could be drawn as a -log10(P value) from Mann-Whitney-Wilcoxon test. In this case, lower average GNU score in one group (upper left for case or upper right for control) would be of interest as shown by a significant P value (-log10( P value) > 1.3). The y-axis can also be the average OVRI in the case group for negative values on the x-axis or average OVRI in the control group for positive values on the x-axis. In this plot, the regions of interest are: i. Lower left: These proteins have lower average GNU scores in the case group compared to control and low OVRI which means the alleles of these proteins in the case group are rarely seen in the entire database of a species. ii. Upper left: These proteins have lower average GNU scores in the case group compared to control and high OVRI, which means the alleles of the proteins in the case group are frequently seen in the entire database of a species. iii. Lower right: These proteins have lower average GNU scores in the control group compared to case and low OVRI which means the alleles of these proteins in the control group are rarely seen in the entire database of a species. iv. Upper right: These proteins have lower average GNU score in the control group compared to case and high OVRI which means the alleles of these proteins in the control group are frequently seen in the entire database of a species. Note: In its basic mode, the customizer script uses the file name as a locustag. The script can also accept a list.csv (comma-separated) of 3+ columns: file_name, old locustag, new locustag and optionally metadata. In the latter case, if metadata are provided, the script will concatenate the new locustag with metadata using '_' as a separator. The new locustag in this case will be: new_locustag_metadata_. In case of GenBank, RefSeq and RAST, use NA for the old locustag column in the list.csv file. The -c option will provide one concatenated file of all input files, which is recommended for protein faa files while the -i option will return individual files, which is required for gff files.

Output
A prefix_concatenated.faa, individual modified faa files or individual modified gff files. If metadata were provided in the list.csv with the -l option, the script will output a metadata_frequency.csv file that has the metadata frequencies in the database. Note: -dm basic will work for all the previous commands.
The following options work with -dm ortholog: Run blastp on the proteins with GNU score of zero and modify the report with ortholog information.

WhatsGNU_main.py -d compressed_db.pickle -dm ortholog -b query.faa
Note: blastp has been used in concordance with the recent literature describing a common misconception regarding the BLAST parameter -max_target_seqs [5,6].
Get the output report of blastp run (works with -b • Blastp [4].

Input
A folder of query_WhatsGNU_report.txt files.

Command line options
Heatmap Plot a heatmap of GNU scores for these proteins in proteins.faa using this strains' order. Assign a title using -t. Font size and figure size (w,h) are given by -f and -fs, respectively. Annotate the heatmap cells with OVRI rare tag using -r option.

Database format
Using WhatsGNU with -m option compresses the database in two formats for convenience; a database.txt format that will need shorter time for subsequent uses.

Methods and Results of Mash comparison
We compared the WhatsGNU -t option with the Mash sketch/dist functions [17] using the S. aureus genome NCTC8325. The Mash tool [22]  Note: WhatsGNU additionally provides the strain name and metadata (clonal complexes) details.

.1 Panel C (Collector's curve)
A collector's curve was used to compare the size of the panallelome of available genomes of S. aureus on GenBank [17,18] and Staphopia [20] which would reflect any biases in sampling for both databases. A collector's curve in this case expresses the number of unique alleles as a function of the number of genomes sequenced. The genomes available on GenBank [17,18] at two different time points (8524 and 10350) were used with WhatsGNU and the panallelome size was recorded. A script (random_sampler.py in Additional file 4) was then used to randomly select 1000, 2000, 4000, 8524,10350, 20000 and 30000 genomes from the 43914 S. aureus genomes available on Staphopia [20]. The random sampling step was done three independent times with replacement and was run with WhatsGNU. Panallelome size was recorded and plotted in GraphPad Prism v.7. random_sampler.py -n 20000 random_sample_folder_name Sau_staphopia/ Note: -n was changed to randomly select 1000, 2000, 4000, 8524,10350 20000 and 30000 with replacement. The curve showed that genomes in Staphopia [20] had greater sampling breadth compared to GenBank [17,18], which was reflected by the shift in the number of alleles that were obtained at the two data points 8524 and 10350. The names of the different strains used in the collector's curve are listed in Additional file 3.

Panel D (Performance evaluation)
A MacBook Pro desktop computer with a single processor (3.3 GHz Intel Core i7) and 16 GB of RAM was used for the evaluation. The number of exact matches were counted in the blast report and compared with WhatsGNU results. The number of the exact matches reported by blastp and WhatsGNU (GNU score) were identical. The query and subject files and the two output reports NCTC8325_2ptn_blast_report.txt and NCTC8325_2_ptn_WhatsGNU_report.txt are available in Additional file 4.
7 Methods of Figure 2 7.1 Panel B The GNU scores of all the proteins in a SSTI S. aureus clinical-CC8-USA300 isolate "SSTI_179_1" were used to produce a histogram (panel B). It is important to note that genomes from different lineages will have different histograms based on the composition of the database.

WhatsGNU_plotter.py -x -e blue -b 103 -p 100 10000 SSTI_179_1_histogram WhatsGNU_reports_folder/
Note: WhatsGNU_reports_folder/ has SSTI_179_1_WhatsGNU_report.txt Note: There were 37 and 170 proteins with GNU < 100 and > 10,000 representing the novelty and ultra-conserved peaks, respectively (in Additional file 4). The novel peak had some virulence proteins such as Staphostatin B and Clumping factor A, transcriptional regulators and IS elements. On the other hand, the ultra-conserved peak had proteins like GpsB (Cell cycle protein), DivIC (Cell division protein), multiple 30S and 50S ribosomal proteins, SigA (RNA polymerase sigma factor) and ATP synthase subunits. In supplementary figure 2 we annotated the proteins in GNU score bin category with gene ontology terms. This GO-annotated histogram showed that some accessory proteins annotated with biological processes such as adhesion and aggregation are located in the novelty peak with low GNU scores while proteins involved in cellular processes and cellular component organization or biogenesis tend to be more conserved.

Panel C
A SSTI S. aureus clinical-CC8-USA300 isolate "SSTI_179_1" was used to create a report with CC/ST composition using -c option.  The proteome of the SSTI S. aureus clinical-CC8-USA300 isolate "SSTI_179_1" was blasted against NCTC 8325 and 80% identity and 80% coverage were used to call an ortholog from NCTC 8325. The 2475 mapped protein IDs from NCTC 8325 were then used in GOQuick [23,24] to annotate the proteins with biological processes gene ontology terms. A total of 1535 proteins were annotated. The GO annotations were limited to 22 biological processes (metabolic process, cellular process, localization, biological regulation, response to stimulus, cellular component organization or biogenesis, multi-organism process, signaling, detoxification, cell killing, biological adhesion, reproduction, developmental process, nitrogen utilization, cell aggregation, carbon utilization, carbohydrate utilization, growth, locomotion, sulfur utilization, phosphorus utilization and pigmentation). There were 1145 proteins that were not annotated and were reported as unclassified. The GNU scores of all the proteins were used to produce a histogram with a y-axis showing the composition of biological process hits in GOQuick for each of the 103 GNU score bins.
10 References Figure S1. Relationship between sequence assembly quality and GNU score of Zero. More proteins with GNU scores of zero (representing potential errors) are seen as number of contigs increases, and sequence quality decreases