SHOOT: phylogenetic gene search and ortholog inference

Determining the evolutionary relationships between genes is fundamental to comparative biological research. Here, we present SHOOT. SHOOT searches a user query sequence against a database of phylogenetic trees and returns a tree with the query sequence correctly placed within it. We show that SHOOT performs this analysis with comparable speed to a BLAST search. We demonstrate that SHOOT phylogenetic placements are as accurate as conventional tree inference, and it can identify orthologs with high accuracy. In summary, SHOOT is a fast and accurate tool for phylogenetic analyses of novel query sequences. It is available online at www.shoot.bio. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-022-02652-8.

on this group to infer the relationships between those genes. The SHOOT algorithm was designed to make such a phylogenetic analysis feasible as a real-time search using a twostage approach. The first stage comprises the ahead-of-time construction of a SHOOT phylogenetic database, and the second stage implements the SHOOT search for a query sequence (Fig. 1). The database preparation phase includes multiple automated steps including homology group inference, multiple sequence alignment, phylogenetic tree inference, and homology group profiling (see the "Materials and methods" section). Thus, prior to database searching, the phylogenetic relationships between all genes in the database are already established. Subsequent SHOOT searches exploit the fact that the alignments and trees have already been computed to enable the use of accurate phylogenetic methods for the placement of query genes within pre-computed gene trees with little extra computation required.
The mean time for a complete SHOOT search of a database containing 984,137 protein sequences from 78 species was 6.9 s using 16 cores of an Intel Xeon E5-2683 CPU ( Fig. 2A). This compared with 1.9 s for a conventional BLAST search of the same sequence set and 2.1 s for DIAMOND ( Fig. 2A). However, unlike BLAST, DIAMOND, or similar sequence search methods, the output of a SHOOT search is not an ordered list of similar sequences but is instead a maximum likelihood phylogenetic tree with bootstrap support values inferred from a multiple sequence alignment with the query gene embedded within it. SHOOT also computes the orthologs of the query gene using phylogenetic methods. SHOOT is more accurate than BLAST in identifying the closest related gene sequence A leave-one-out analysis was conducted to test SHOOT's ability to find the most closely related gene sequence in a given database. Here, a set of 1000 test cases was randomly sampled from the UniProt Reference Proteomes database. Each test case consisted of a pair of genes sister to each other with at least 95% bootstrap support in a maximum likelihood gene tree. One member of the test pair was arbitrarily designated the "query sequence, " and the other gene was designated "the expected closest gene, " i.e., the gene that should be identified by a search method as the most similar gene in the database. To provide a comparison, BLAST [13] and DIAMOND [17] were also tested on the same dataset. The set of query genes was searched against the database, and each method was scored on whether or not the closest/best scoring gene in each search result was "the expected closest gene. " The tests showed that SHOOT identified "the expected closest gene" as the most closely related gene in 94.2% of cases ( Fig. 2A). For comparison, BLAST correctly identified the "the expected closest gene" as the most similar gene sequence in 88.4% of cases and DIAMOND in 88.3% of cases. To put this in context, there is a 1 in 9 chance that the top hit returned by BLAST is not the most closely related sequence in the database while there is a 1 in 17 chance that the same is true for SHOOT. Thus, SHOOT is better able to identify the closest related gene to a given query gene in a given database and can be used as an alternative to BLAST for this purpose.

SHOOT gives evolutionary context of a query gene's position within its gene family
Although for many users knowledge of the closest related gene as described above may be sufficient, in many instances, there will be more than one gene that is equally closely related to the query gene in a given species. Thus, to generalize the "best hit" analysis above for larger gene sets, the "mean average precision at k" score [18] was calculated, to quantify the precision at which the k closest homologs identified by SHOOT, BLAST, or DIAMOND correspond to the k expected closest homologs in maximum likelihood gene trees. This analysis was conducted for values of k between 1 (equivalent to the "best hit" analysis above) and 50 (Fig. 2B). As k increased, MAP@k for BLAST fell to 71.8%, and for DIAMOND, it fell to 59.2% at k = 50, i.e., there was a 71.8% agreement between the 50 closest homologs identified using BLAST and those identified using phylogenetic methods. In contrast, the use of phylogenetic methods in the database construction stage of SHOOT coupled with the accurate placement of genes within the database trees ( Fig. 2A) resulted in MAP@50 for SHOOT of 90.3%. Thus, both the list of most closely related genes and their rank order of relationship to the query gene is substantially more accurate for SHOOT than for BLAST.

SHOOT has high accuracy in identifying orthologs of the query gene
A frequent goal of sequence similarity searches is to identify orthologs of the query gene in other species. As stated above, local similarity search tools such as BLAST do not do this. Instead, they return a list of genes that should be subject to multiple sequence alignment and phylogenetic inference in order to infer the orthology relationships between genes. The phylogenetic tree returned by SHOOT provides the evolutionary relationships between genes inferred from multiple sequence alignment and maximum likelihood tree inference allowing orthologs and paralogs to be identified. SHOOT also automatically identifies orthologs and colors the genes in the tree according to whether they are orthologs or paralogs (Additional file 1: Fig. S1), as identified using the species overlap method [19,20], which has been shown to be an accurate method for automated orthology inference [21]. The tree viewer also supports a zoom functionality to view a progressively larger or smaller clade of genes around the query gene. An image of the tree can be downloaded, the tree can also be exported in Newick format, and the FASTA file of protein sequences in the tree can be downloaded to support further downstream analyses.
To evaluate the accuracy of ortholog inference, 6 species were chosen at an increasing time since divergence from humans. These query species comprised mouse, chicken, zebrafish, the tunicate Ciona intestinalis, fruit fly, and the yeast Saccharomyces cerevisiae (Fig. 3A). Orthologs between these species and humans were determined from OrthoFinder on the 2020 Quest for Orthologs benchmark dataset [16,21]. For each query species, 100 query genes were selected, creating a test set of 600 genes in total. For these 600 genes, SHOOT was evaluated on its accuracy in identifying the orthologs in humans. For comparison, BLAST best hit (BH) and reciprocal best hit (RBH) were likewise evaluated (Fig. 3B). SHOOT was between 11% (mouse) and 47% (S. cerevisiae) more accurate than either method using BLAST, and the difference was greatest for more diverged species (Fig. 3B). The greatest difference between SHOOT and BLAST was in the percentage of orthologs that were recovered (recall, Fig. 3C). For all species, the ortholog recall for SHOOT was > 79%, whereas the ortholog recall for BLAST RBH was 37% for S. cerevisiae, the most distant species from humans in the analysis (Fig. 3C). The precision of SHOOT orthologs was intermediate between BLAST RBH and BH (Fig. 3D). Thus, SHOOT ortholog assignments are more accurate than performing a "top hit" or "reciprocal best BLAST hit" analysis for the identification of orthologs.

Curated databases place the gene in the context of model species and key events in the gene's evolution
The initial release of SHOOT includes phylogenetic databases for Metazoa, Fungi, Plants, Bacteria, and Archaea, and also the UniProt Quest for Orthologs (QfO) reference proteomes, which cover all domains of cellular life (Additional file 1: Tables S1-S5). To maximize the utility of the gene trees to a wide range of researchers, the species within the databases have been chosen to contain model species, species of economic or scientific importance, and species selected because of their key location within the evolutionary history covered by the database. Each database also contains multiple outgroup species to allow robust rooting of the set of gene trees. As an example, Additional file 1: Fig. S2 shows the phylogeny for the metazoan database, highlighting the taxonomic groups of the included species. Although a number of databases are provided on the SHOOT webserver, the SHOOT command-line tool has been designed so that databases can be compiled from any species set.

Discussion and conclusions
SHOOT is a phylogenetic search engine for the analysis of biological sequences. It has been designed to take a user-provided query sequence and return a phylogenetic analysis of that sequence using a database of reference organisms. We show that SHOOT can perform this search and analysis with comparable speed to a typical sequence similarity search, and thus, SHOOT is provided as a phylogenetically informative alternative to BLAST and as a general-purpose sequence search algorithm for the analysis and retrieval of related biological sequences.
Local similarity or profile-based search methods such as BLAST [13], DIAMOND [17], or MMseqs [22] have a wide range of uses across the biological and biomedical sciences. The near-ubiquitous utility of these methods has led to them being referred to as the Google of biological research. However, one of the most frequent use cases of these searches is to identify orthologs of a given query sequence. Due to the frequent occurrence of gene duplication and loss, orthologs are often indistinguishable from paralogs in the results of local similarity searches. This is because a given query sequence can have none, one, or many orthologs in a related species. Accordingly, the sequences identified by local similarity searching methods will be an unknown mixture of orthologs and paralogs [23]. The problem of distinguishing orthologs from paralogs can be partially mitigated by a reciprocal best hit search, but with low recall [23]. Phylogenetic methods are required to correctly distinguish orthologs from paralogs as they are readily able to distinguish sequence similarity (branch length) and evolutionary relationships (the topology of the tree).
SHOOT was designed to provide the accuracy and information of a phylogenetic analysis with the speed and simplicity of a local sequence similarity search. By pre-computing the within-database sequence relationships, SHOOT can perform an individual search in a comparable time to BLAST. However, instead of returning a list of similar sequences, SHOOT provides a full maximum likelihood phylogenetic tree, enabling immediate phylogenetic interrogation of the sequence search results. A phylogenetic tree provides the best representation available of the evolutionary history of a gene family. A tree allows the identification of speciation and gene duplication events and thus the identification of orthologs and paralogs. SHOOT performs this analysis of the tree automatically, providing a table of orthologs and paralogs of the query sequence. Nevertheless, it remains best practice to examine the tree manually to gain an understanding of how the gene family evolved, using the orthology assignment by SHOOT as a guide.
A standard phylogenetic approach to identifying orthologs of a query gene is to begin a local sequence similarity search or profile search (HMMER [24], MMseqs [22]). Frequently, an e-value cutoff is applied to identify a set of similar sequences for subsequent phylogenetic analysis. Because e-values (and their constituent bit scores) are imperfectly correlated with evolutionary relatedness, the set of similar sequences meeting the search threshold will often be missing some genes as well as often including genes that should not be present. A systematic study using HMMER found that for all n genes from an orthogroup clade to pass an e-value threshold, on average, the threshold would have to be set such that 1.8n genes in total met the threshold [25], i.e., an additional 80% of genes needed to be included, on average, to ensure the orthogroup was complete [25]. Thus, unless a very lenient search is used, genes will be incorrectly absent from the final tree. This can lead to incorrect rooting and subsequent misinterpretation even by phylogenetic experts [25]. Thus, even for bespoke phylogenetic analyses, it is better to use phylogenetic methods to first select the clade of genes of interest. SHOOT supports this by inferring the tree for the entire family of detectable homologs. The use of trees for complete sets of homologs, together with the use of OrthoFinder's robust tree rooting algorithm [16], avoids the problem of mis-rooting and misinterpretation of a tree inferred for a more limited set of genes. Also, by using OrthoFinder clustering approach [16,26], hits missed for a single sequence are also corrected by multiple hits identified for its homologs. This "phylogenetic gene selection workflow" is supported by SHOOT's web interface, which allows a clade of genes to be selected and the protein sequences for just this clade to be downloaded for downstream user analyses.

Conclusions
In summary, SHOOT was designed to be as easy to use as BLAST but to provide phylogenetically resolved results in which the query sequence is correctly placed in a phylogenetic tree. In this way, the phylogenetic history of the query sequence and its orthologs can be immediately visualized, interpreted, and retrieved.

Database preparation
SHOOT consists of a database preparation program and a database search program. The database preparation program takes as input the results of an OrthoFinder [16] analysis of a set of proteomes.
To prepare phylogenetic databases for the SHOOT website, the OrthoFinder version 3.0 option, "-c1, " was used to cluster genes into groups consisting of all homologs, rather than the default behavior which is to split homologous groups at the level of orthogroups. The advantage of creating complete homologous groups is that their gene trees show the fullest evolutionary history of that family. Orthogroups separate into different tree genes that diverged more distantly in time than the last common ancestor of the included species. Gene trees of complete homologous groups include all these genes in a single tree and show the gene duplication at which different orthogroups from the same gene family diverged. This differs from a default OrthoFinder orthogroup analysis, for which the partitioning of genes into taxonomically comparable orthogroup groups is the priority. OrthoFinder-inferred rooted gene trees for these homolog groups are computed using MAFFT [27] and IQ-TREE [28] by using the additional options "-M msa -A mafft -T iqtree -s species_tree.nwk, " where "species_tree.nwk" was the rooted species tree for the included species. For IQ-TREE, the best fitting evolutionary model was tested for using "-m TEST" and bootstrap replicates performed using "-bb 1000. " The OrthoFinder results were converted to a SHOOT database in two steps: splitting of large trees and creation of the DIAMOND profiles database for assigning novel sequences to their correct gene tree. Large trees are split since the time requirements for adding a sequence to an MSA for a homologous group and for adding a sequence to its tree can grow super-linearly in the size of the group, leading to needlessly long runtimes. It was found that DIAMOND could instead be used to assign a gene to its correct subtree and then phylogenetic placement could be applied to assign the gene to its correct position within the subtree.
The script "split_large_tree.py" was used to split any tree larger than 2500 genes into subtrees of no more than 2500 genes each. Each subtree tree also contained an outgroup gene, from outside the clade in the tree for that subtree, which was required for the later sequence search stage. For each tree that was split into subtrees, a super-tree was also created by the script of the phylogenetic relationships linking the subtrees. For each subtree, the script extracted the sub-MSA for later use. This subtree size of 2500 genes was chosen as it is the approximate upper limit tree size for which SHOOT could place a novel query gene in the tree in 15 s. This was judged to be a reasonable wait for users of the website to receive the tree for their query sequence. For the databases provided by the SHOOT website, between 2 (of 9115) and 40 (of 10516) of the largest trees were split into subtrees.
The script "create_shoot_db.py" was used to create a DIAMOND database of "profiles" for each unsplit tree or each subtree. A profile here refers to a set of representative sequences that best describe the sequence variability within a homologous group. These profiles are used to assign a novel query sequence to the correct tree or subtree. The representative sequences for a gene tree are selected using k-means clustering applied to the MSA corresponding to that (sub) tree using the python library Scikit-learn [29]. For each cluster, the sequence closest to the centroid is chosen as a representative. For a homologous group of size N genes, k = N/10 representative sequences are used, with a minimum of min (20, N) representative sequences. This ensures that large and diverse homologous groups have sufficient representative sequences in the assignment database.

Database search
A query sequence is searched against the profiles database using DIAMOND [17] with default sensitivity and an e-value cutoff of 10 −3 . If no hit is found, a second search is performed with the "--ultra-sensitive" setting. The top hitting sequence is used to assign the gene to the correct tree or subtree. If there is a hit to a second tree (or more) with an e-value < 10 10 times the e-value of the best hit, then these assignments are also considered. The query gene is added to the pre-computed alignment for each possible using the MAFFT "--add" option, and a phylogenetic tree is computed from this alignment using the precomputed tree for the reference alignment using EPA-ng [15] and gappa [30]. A tree is returned for each of the possible assignments.
If the gene is added to a subtree, then the tree is rooted on the outgroup sequence for that subtree. The outgroup is then removed from the subtree, and the subtree is grafted back into the original larger tree, using the supertree to determine the overall topology. This method provides the accuracy of phylogenetic analysis to place the gene in its correct position within the subtree while at the same time providing the user with the full gene history for the complete homologous group given by the supertree, which was calculated in full in the earlier database construction phase. All tree manipulations by SHOOT are performed using the ETE Toolkit [31].
The accuracy of this supertree inference method was tested in comparison with the direct placement of the gene in the full tree. The test was performed on the 22 gene trees from the SHOOT UniProt database for which the supertree method was required. Two versions of the SHOOT database were prepared: "Sub-trees", corresponding to the SHOOT database with the trees split into sub-trees (as deployed on the SHOOT webserver), and "Unsplit" corresponding to the database without any of the gene trees split into subtrees. The test was performed 2000 times by randomly sampling a gene from the complete set of genes in these gene trees. For each of these test cases, the gene was removed from the corresponding MSA and gene tree in each database. SHOOT was then used to place the gene using each of the two databases.
Search and placement of the gene was faster with the supertree method (Fig. 4A), taking on average 7.2 s for the Sub-trees database compared to 127.4 s for the Unsplit database. There was also less variability in the runtime: the maximum time for a SHOOT query using the Sub-trees database was 20.6 s compared to 659.3 s using the Unsplit database. With either of the databases, SHOOT returns the gene placed in the same, full gene tree.
The resulting gene placements were then compared to the original placements of the genes by calculating the normalized Robinson-Foulds distance between the original tree and the tree returned by SHOOT (Fig. 4B, C). The accuracy was comparable between the two methods, with an average normalized Robinson-Foulds distance of 0.0018 using the Sub-trees database and 0.00085 using the Unsplit database. SHOOT provides the user with the option to use this supertree method for the largest trees in the database, or to use unsplit trees in all cases.

Curated databases
For the Plants database, the protein sequences derived from primary transcripts were downloaded from Phytozome [32]. The Uniport Reference Proteomes database was constructed using the 2020 Reference Proteomes [21]. For the Fungi and Metazoa databases, the proteomes were downloaded from Ensembl [33], and the longest transcript variant of each gene was selected as a representative of that gene using OrthoFinder's "primary_ transcripts.py" script [16]. The Bacterial and Archaeal database proteomes were downloaded from UniProt [34]. The parallelization of tasks in the preparation of the databases was performed using GNU parallel [35].

Accuracy validation and performance
The UniProt Reference Proteomes database was used for validation of the SHOOT phylogenetic placements using a leave-one-out test. As this database covers the greatest phylogenetic range (covering all domains of life), its homologous groups contain the greatest sequence variability, and it provides the severest test of the accuracy of SHOOT. Test cases were constructed by selecting 1000 "cherries" (pairs of genes sister to one another) with 95% bootstrap support from gene trees with median bootstrap support of at least 95%. The use of cherries allowed BLAST and DIAMOND to be tested alongside SHOOT. This test was possible for the score-based searches BLAST and DIAMOND since they would only have to identify a single closest gene, rather than having to identify a gene as the sister gene to a whole clade of genes (as SHOOT is designed to be able to do). The bootstrap support criteria ensured that the correct result was known with high confidence so that both methods could be assessed accurately. To ensure an even sampling of test cases, at most one test case was extracted from any one gene tree. The BLAST, DIAMOND, and SHOOT databases were completely pruned of the 1000 test cases. BLAST 2.12.0+ was run with the options "-outfmt 6 -evalue 0.001 -num_threads 16. " DIAMOND v2.0.4.142 was run with the options "-e 0.001 -p 16 -k 50. " SHOOT 1.2.0 was run with the option "-n 16". Each of the 1000 test cases was run using 16 cores of an Intel Xeon E5-2683 CPU, and the runtime was recorded (Fig. 2).
To calculate the mean average precision at k score, the expected trees were reinferred using RAxML with the best-fitting model [36] so that a different method was used to that used in the SHOOT database construction. For each test gene, the ordered list of closest homologs was calculated using branch length distance in the SHOOT result trees and e-values (with ties broken by bit score) for the BLAST and DIAMOND results. These ordered homologs were compared to the expected ordered list of closest homologs from the expected RaxML trees to calculate the precision at each value of k from 1 to 50, and these precision scores were averaged over the 1000 test cases.
The ortholog prediction accuracy tests calculated the precision, recall, and F-score for identifying orthologs in Homo sapiens for genes from Mus musculus, Gallus gallus, Danio rerio, Ciona intestinalis, Drosophila melanogaster, and Saccharomyces cerevisiae. For each of these 6 species, 100 genes were sampled at random. The expected orthologs were obtained from OrthoFinder 2020 Quest for Orthologs benchmark results, obtained from the benchmarking server: https:// ortho logy. bench marks ervice. org. For SHOOT, the orthologs were inferred using the species overlap method [19] on the SHOOT result trees.