OrthoFinder workflow
A gene tree is the canonical representation of the evolutionary relationships between the genes in a gene family. Thus, ortholog inference from gene trees is an important goal. However, no automated software tools are available that provide genome-wide ortholog inference from gene trees. A number of challenges had to be addressed to enable this. These included the efficient partitioning of genes into small, non-overlapping sets such that all orthologs of a gene are contained in the same set as the original gene; scalable and accurate inference of gene trees from these gene sets; automatic rooting of these gene trees without a user-provided species tree; and robust ortholog inference in the presence of imperfect gene tree inference. The OrthoFinder workflow was designed to address each of these challenges and is described in detail below.
By default, OrthoFinder infers orthologs from the orthogroup trees (a gene tree for the orthogroup) using the steps shown in Fig. 2. Input proteomes are provided by the user using one FASTA file per species. Each file contains the amino acid sequences for the proteins in that species. Orthogroups are inferred using the original OrthoFinder algorithm [10]; an unrooted gene tree is inferred for each orthogroup using DendroBLAST [24]; the unrooted species tree is inferred from this set of unrooted orthogroup trees using the STAG algorithm [33]; this STAG species tree is then rooted using the STRIDE algorithm by identifying high-confidence gene duplication events in the complete set of unrooted orthogroup trees [22]; the rooted species tree is used to root the orthogroup trees; orthologs and gene duplication events are inferred from the rooted orthogroup trees by a novel hybrid algorithm that combines the “species-overlap” method [31] and the duplication-loss-coalescent model [32] (described below); and comparative statistics are calculated. All major steps of the algorithm are parallelized to allow optimal use of computational resources. Only the orthogroup inference was provided in the original implementation of OrthoFinder [10]; all other subsequent steps are new and described below.
Use of orthogroups for gene tree inference
Orthologs are the set of genes in a species pair descended from a single gene in the last common ancestor of those two species. An orthogroup is the set of genes from multiple species descended from a single gene in the last common ancestor (LCA) of that set of species. Thus, an orthogroup is the natural extension of orthology to multiple species.
For ortholog inference, orthogroups are the optimum partitioning of genes for gene tree inference: An orthogroup is the smallest set of genes such that, for all genes it contains, the orthologs of these genes are also in the same set. Since gene tree inference scales super-linearly with the number of genes, partitioning genes into the smallest possible sets is the most efficient way of constructing a set of gene trees that encompass all orthology relationships. Although partitioning genes into larger sets (e.g., gene families containing gene duplication events prior to the LCA) would decrease the number of gene trees to be inferred, the super-linear scaling of gene tree inference would result in a longer overall runtime for the complete set of trees. The original OrthoFinder orthogroup inference method is still the most accurate method on the independent Orthobench test set [10] and thus is used for this step.
Customizable steps in the OrthoFinder method
There are two customizable steps in the OrthoFinder method: (1) the sequence search method and (2) the orthogroup tree inference method. The default option for step 1 is DIAMOND [5]. The default option for step 2 is DendroBLAST [24]. The default options are recommended by the authors as they are fast and achieve high accuracy on the Quest for Orthologs benchmarks [1] (Fig. 4a–d). However, the user is free to substitute any alternative methods for these steps. Currently, supported methods for step 1 include BLAST [4] and MMseqs2 [6]. Similarly, any combination of multiple sequence alignment and tree inference method can be substituted in for step 2. For illustrative purposes, the default multiple sequence alignment method is MAFFT [35] and the default tree inference method is FastTree [25]; this combination is benchmarked above. It is impossible for the authors to test all possible combinations of multiple sequence alignment and tree inference methods, and the selected methods were chosen because of their speed and scalability characteristics [25, 35]. OrthoFinder provides flexibility for the user to select their preferred method. More accurate multiple sequence alignment and tree inference methods should give more accurate ortholog inference, and many studies exist comparing the accuracy and runtime characteristics of the available methods [36, 37]. A user-editable configuration file is provided in JSON format that allows new sequence search, multiple sequence alignment, and tree inference methods to be added to OrthoFinder. To facilitate the trialing of alternative multiple sequence alignment and tree inference methods, OrthoFinder provides the option to restart an existing analysis after the orthogroup inference stage. This skips the requirement to compute the all-vs-all sequence search and orthogroup inference and thus accelerates testing of different internal steps.
Species tree inference and rooting
The rooted species tree is required in order to identify the correct out-group in each orthogroup tree, as correct gene tree rooting is critical for the orthology assessment from that tree [22]. Since orthogroups can potentially contain any subset of the species in the analysis, it is not sufficient to simply know the out-group for the complete species set. Instead, the complete rooted species tree is required. If the user knows the rooted species tree for the set of species being analyzed, then it is recommended to specify this tree manually at the command line to remove the possibility of species tree inference error. Such a tree can be provided as a Newick format text file. In the event that a species tree is not provided (or not known), then OrthoFinder automatically infers it.
Sets of one-to-one orthologs that are present in all species are often used for species tree inference; however, in real-world large-scale analyses, these can be rare [33]. A new algorithm, Species Tree from All Genes (STAG), was developed to allow species tree inference even for species sets with few or no complete sets of one-to-one orthologs present in all species [33]. Without this algorithm, species tree inference could fail if there were no sets of one-to-one orthologs present in all species. STAG infers the species tree using the most closely related genes within single-copy or multi-copy orthogroups. In benchmark tests, STAG [24] had higher accuracy than other leading methods for species tree inference, including maximum likelihood species tree inference from concatenated alignments of protein sequences, ASTRAL [38] and NJst [39].
The Species Tree Root Inference from Duplication Events (STRIDE) algorithm [22] is used to root the species tree in OrthoFinder. STRIDE was developed to enable the rooting of the species tree using only information available in the set of gene trees. STRIDE does this by identifying the set of well-supported in-group gene duplication events in the complete set of unrooted orthogroup trees, and using these events to infer a probability distribution over an unrooted STAG species tree for the location of its root. Similarly to STAG, STRIDE has been shown to identify the correct root of the species tree in multiple large-scale molecular phylogenetic data sets spanning a wide range of time scales and taxonomic groups [22]. In some cases, it is possible that there could be few duplications within the gene trees, and so STRIDE will not be able to identify the root of the species tree, or will only be able to exclude the root from clades in which gene duplication events are observed. In this case, ortholog inference should still not be significantly impacted since the rooting of the gene tree only affects ortholog inference in cases where gene duplication events are present [22]. This makes the STRIDE approach particularly suited to gene tree rooting for ortholog inference.
Gene tree rooting
Tree inference methods infer unrooted gene trees. A gene tree must be correctly rooted in order for it to show the correct evolutionary history of the gene family and thus to allow correct ortholog inference. The orthogroup trees could contain any subset of the input species. In general, the rooted species tree, inferred as described above, can be used to root the orthogroup trees by identifying the out-group clade in each orthogroup tree and placing the root on the branch separating this out-group from the remaining genes.
However, species tree and gene tree topologies can arise in which this simple approach will not work, and so, a robust generalization of this outgroup rooting method is required in order to be able to root any potential gene tree. Firstly, in the species tree, the out-group could consist of a single species or multiple species. Secondly, in the gene tree, the genes from the out-group could be in a monophyletic clade or there may be no bipartition in the tree that separates all the genes from the out-group from all remaining genes. Thirdly, a gene duplication event could have occurred in the gene tree prior to the divergence of the out-group from the remaining species. Thus, the most ancient bipartition of the gene tree would be a gene duplication event separating the genes into two clades rather than a bipartition separating the out-group from the in-group. Such a gene tree should be rooted on this bipartition. Both of these two descendant clades could then potentially contain genes from both the out-group and in-group species. Thus, there will be no bipartition in such a tree that separates the genes of the out-group species from the genes of the in-group species.
The algorithm used by OrthoFinder searches for the correct bipartition on which to place the root. For each bipartition in the gene tree, it calculates two scores. The first, SAD, quantifies how well the bipartition corresponds to an ancient duplication prior to the divergence of the species. The second, SIO, quantifies how well the bipartition corresponds to the divergence of the out-group species from the in-group species. Both SIO and SAD range between 0 and 1. Let O be the set of species in the out-group and I be the set of species in the in-group. For a bipartition in the unrooted gene tree, let A be the set of species with genes on one side of the bipartition and let B be the set of species with genes on the other side of the bipartition. Then:
$$ {S}_{IO}=\frac{\left|O\cap A\right|}{\left|O\right|}\frac{\left|I\cap B\right|}{\left|I\right|}\left(1-\frac{\left|O\cap B\right|}{\left|O\right|}\right)\left(1-\frac{\left|I\cap A\right|}{\left|I\right|}\right), $$
$$ {S}_{AD}=\frac{\mid O\cap A\mid }{\mid O\mid}\frac{\mid I\cap B\mid }{\mid I\mid}\frac{\mid O\cap B\mid }{\mid O\mid}\frac{\mid I\cap A\mid }{\mid I\mid }. $$
Each of the four terms in these equations quantifies the proportion of in-/out-group species the bipartition correctly includes/excludes from clade A/B of the gene tree (giving the 23 = 8 terms in total across the two equations). The bipartition with the highest score for either SIO or SAD is the optimal root for the gene tree using this measure.
The effectiveness of these scores at identifying the correct root can be seen by considering the following. A bipartition with a value of 1 for SIO implies that it perfectly divides the tree into an in-group and out-group and implies a value of 0 for SAD for all bipartitions in the tree (thus, there are no potential bipartitions corresponding to an ancient duplication). This is the correct bipartition on which to root the tree since it separates the in-group from the out-group genes. Conversely, a bipartition with a value of 1 for SAD implies that the bipartition is a duplication event before the divergence of any of the species, with all species present for both duplicates. It implies a value of 0 for SIO for all bipartitions in the tree (thus, there is no bipartition that corresponds to a first speciation event that splits the genes into an out-group clade and an in-group clade). The highest value for either SIO or SAD across the tree shows that the corresponding bipartition is close to one of these perfect cases and is the best root for the gene tree.
Ortholog inference and identification of gene duplication events from gene trees
A number of methods were considered for distinguishing orthologs from paralogs in gene trees. Duplication and loss reconciliation, e.g., Forester, uses a rooted species tree and rooted gene tree to determine if each node in the gene tree is a speciation or a duplication event. Genes that diverged at a speciation event are orthologs whereas those that diverged at a duplication event are paralogs. DLCpar [32] uses a model for duplication-loss-(deep) coalescent (DLC) that addresses incongruence between the gene and species trees to increase accuracy. It exists in two versions which we label DLCpar (full) and DLCpar (search). DLCpar (full) considers the complete space of possible reconciliations to find the maximum parsimony solution under the DLC model but can have large runtimes even for relatively small gene trees. DLCpar (search) instead employs an iterative search for a locally optimal solution, which can differ from the globally optimal solution. A third approach, here referred to as the species-overlap method, is employed in a number of ortholog databases [20, 31] and was originally described in a method for determining orthologs of human genes [31]. In this method, nodes in the gene tree are identified as duplication nodes if the sets of species below its child nodes overlap; otherwise, the node is a speciation node. Genes that diverged at a speciation node are orthologs, and those that diverged at a duplication node are paralogs.
These methods were tested on the fungal orthogroups (in parallel, using 16 cores) to determine their runtime on sets of typical orthogroup trees derived from sets of between 4 and 128 species. Our implementation of the species-overlap method was the fastest, taking 55 s to analyze the largest datatset (Fig. 5). This dataset consisted of the 18,651 orthogroup trees containing 948,449 genes and corresponded to the complete set of orthogroup trees for the 128 fungal species. Forester was 21 times slower, and DLCpar (search) was over 500 times slower. DLCpar (full) was unable to complete the analysis of the smallest input dataset in 120 h and so was not tested on any of the larger datasets. To put this time in context, all steps in the OrthoFinder algorithm for this dataset collectively take less than 4 min in total (i.e., orthogroup inference, gene tree inference, species tree inference, species tree rooting, gene tree rooting).
To compare the accuracy of the above methods, they were each tested for their precision and recall in identifying gene duplication events on simulated “flies” and “primates” datasets [32] and a simulated “metazoa” dataset [34]. Since for all methods tested a node in a gene tree is either a duplication or speciation event, the identification of all gene duplication events is equivalent (by complementation) to the identification of all speciation events. Thus, the overall accuracy at identifying gene duplication events is equivalent to the overall accuracy at identifying orthologs. The most accurate method on the simulated data was DLCpar (full) with an F-score of 91.8% followed by the species-overlap method with an F-score of 75.5%.
Since DLCpar (full) was the most accurate method on the simulated datasets but was unsuitable for analyzing gene trees with more than four species a novel hybrid algorithm was developed. This aimed to combine the strengths of the highest accuracy DLCpar (full) method with simplifications from the species-overlap method to achieve high accuracy in a reasonable runtime.
In the DLC model, clades of genes containing no duplicates are analyzed to find the most parsimonious reconciliation with the species tree. This is required since the goal for DLCpar is a complete reconciliation of the gene tree with the species tree. However, in the species-overlap method, clades of single-copy genes are identified as orthologs without further analysis of the topology of their relationship. This assumption is reasonable, since trees of single-copy orthologs are frequently topologically distinct from the species tree. For example, in an analysis of 1030 gene trees of one-to-one orthologs from 23 fungi species, all 1030 gene trees were topologically distinct from each other and from the species tree [40]. The analysis of such clades under the DLC model is likely to be computationally costly with no benefit in terms of accuracy of ortholog inference.
On the other hand, when a gene duplication event has occurred, it is important to accurately identify the genes affected by this event since the location of the event determines which genes are orthologs and which are paralogs. In the hybrid algorithm developed for OrthoFinder, these nodes, for which there is evidence of a gene duplication event through overlapping species sets, are analyzed under the DLC model. The DLC model is used to attempt to find the most parsimonious interpretation of this node in terms of which genes diverged at the gene duplication event and which diverged at a speciation event.
As described, this method would still require exploring a large search space for the nodes under consideration, and the reduction in runtime would not be significant. Thus, to accelerate the process, duplication and loss events are inferred directly using the species-overlap method. A duplication event is inferred from an overlap in the species sets below a node and a loss event is inferred by the presence of a gene from a species in one of the descendant clades but not in the other. The analysis can then be accelerated by classifying a node according the species overlaps of its subclades up to a maximum total topological depth of two below the node being analyzed (clades O, Additional file 1: Figure S4A). The possible sub-cases for the overlaps between these clades have been enumerated (Additional file 1: Figure S4B). For each sub-case, the most parsimonious interpretation under the DLC model has been pre-calculated (Additional file 1: Figure S4C) and can thus be corrected without the need for a topology search.
The algorithm implemented in OrthoFinder is as follows. A post-order traversal of the orthogroup tree is performed (a node is not visited until all its descendant nodes have been visited), analyzing each node of the orthogroup tree in turn. A given node is analyzed to identify if the species sets below its child nodes overlap. If there is an overlap, the smallest sub-clade below each child node that contains the complete set of overlapping species is identified up to a maximum total topological depth of two below the node (clades O, Additional file 1: Figure S4A). The node is assigned to the corresponding sub-case (Additional file 1: Figure S4B). If a more parsimonious interpretation of the sub-case is available under the DLC model, then the sub-tree below the node is rearranged to match this interpretation (Additional file 1: Figure S4C). After the node has been analyzed, the next node in the post-order traversal is analyzed. Note, the choice of a post-order traversal allows the traversal to be continued unimpeded despite any such rearrangements below the node being analyzed. The resulting gene trees are referred to as “resolved” gene trees and correspond to the “locus tree” under the DLCpar model [32]. Orthologs and gene duplication events are determined from the resolved gene tree according to the species overlap method.
Although only a single traversal of the tree is employed, rather than the iterative search and rearrangement employed by DLCpar, the post-order traversal enables more parsimonious interpretations of child clades below a node to be identified prior to the analysis of the parent node. Thus, the analysis of sub-trees below a node informs the subsequent analysis of the node itself. In theory, nodes could be categorized to sub-cases based on the overlaps of clades at a greater topological depth than that employed here. This conservative approach was taken since the number of subcases increases exponentially, and a total topological depth of two proved sufficient to achieve a higher accuracy for the method compared to the simple species overlap. The analysis of clades to this depth proved sufficient to increase the F-score from 72% with just the species-overlap method to 80% with the hybrid algorithm (Fig. 5a). The pre-calculated solutions for each sub-case removed the need for costly, iterative search using random (i.e., unguided) tree rearrangement operations thus accelerating the analysis considerably. The hybrid algorithm was able to analyze the complete set of orthogroup trees for the 128 fungi species in 141 s; this was 9 times faster than Forester and 187 times faster than DLCpar (search) (Fig. 5d). The hybrid method also outperformed both methods in terms of accuracy (Fig. 5a). Note that the species tree is not required for the hybrid model used by OrthoFinder. The only use of the species tree is in determining the root for each orthogroup tree. All gene tree processing is performed using the python ETE toolkit [41].
Simulation tests of OrthoFinder gene duplication event inference accuracy
The tests for gene duplication event inference accuracy were performed on the simulated “flies” and “primates” dataset from [32] and a simulated “metazoa” dataset from [34]. To model real data, the flies and primate datasets used known species trees, parameters for divergence times, duplication rates, loss rates, population sizes, and generation times. Trees were simulated with varying effective population sizes and duplication rates so as to model incomplete lineage sorting [32, 34]. The flies dataset consisted of 12,000 trees with 12 species and 12,032 gene duplication events. The primates dataset consisted of 7500 trees with 17 species and 16,066 gene duplication events. The metazoa dataset intended to emulate the complexity of real data by using heterogeneity in rates of duplication and loss, a complex model of sequence evolution, and then inferring trees with a homogenous, simple model [34]. It consisted of 2000 gene trees with 40 species and 4967 gene duplication events. For comparison, Forester [29], DLCpar (full), DLCpar (search) [32], and the overlap algorithm (i.e., without OrthoFinder’s tree resolution) were also tested.
All methods were provided with the input rooted gene tree and, where appropriate, the rooted species tree (Forester and DLCpar). No other parameters required specification for any of the other methods. The rooted gene trees were provided as part of the simulated data for the flies and primates datasets. Multiple sequence alignment (MSA) files were provided for the metazoa dataset. For this dataset, gene trees were inferred from the MSAs using FastTree so as to also include a potential level of tree inference error and were rooted with reconroot [32]. The OrthoFinder rooting algorithm was not used so as to avoid inadvertently biasing the results in favor of OrthoFinder. All methods were provided with the same input rooted gene trees. The complete set of gene duplication events identified by each of the methods was compared against the ground truth gene duplication events. An inferred gene duplication was identified as correct if the two sets of genes observed post-duplication exactly matched the two sets of genes post-duplication from the ground truth data.
The performance testing of the methods for identifying gene duplication events was performed on the orthogroup trees from the 4- to 128-species Fungi datasets as inferred by OrthoFinder with default parameters. The commands for Forester and DLCpar were run in parallel using GNU Parallel [42] using 16 threads on these gene trees. The OrthoFinder method was run via the “scripts/resolve.py” program included as part of the OrthoFinder distribution. To allow testing, the species-overlap method was also implemented in OrthoFinder and was run using the same program with the option “--no_resolve.”
Ortholog benchmarking
Orthogroup inference accuracy of OrthoFinder has already been tested using the independent Orthobench dataset [11]. This showed to be the most accurate method tested in terms of overall F-score (although other methods scored higher in terms of either precision or recall while scoring proportionally worse in the other) [10]. The community developed “Quest for Orthologs” benchmarks [1] were used to assess the accuracy of the newly developed OrthoFinder ortholog inference using the 2011_04 dataset. This dataset had benchmarks for the largest set of methods and so provided the widest comparison with other methods. OrthoFinder was tested using the default method (DIAMOND sequence search and DendroBLAST trees, no additional options). It was also tested with the BLAST replacing DIAMOND (options: “-S blast”) and with both BLAST search and multiple sequence alignment and maximum likelihood tree inference (options: “-S blast -M msa”). In the latter, MAFFT [35] and FastTree [25] were used for multiple sequence alignment and tree inference as described above. For each of these three cases, OrthoFinder was run on the 66 reference proteomes of the Quest for Orthologs test set with a single command (“-f Proteomes/” + options), and the inferred orthologs were submitted to the Quest for Orthologs web server for benchmarking.
The Quest for Orthologs benchmarks are described in detail in [1]. The Species Tree Discordance Test and the generalized version of this test both consider a set of species partitioned into clades with a known species tree topology connecting the clades. The benchmarking consists of a repeated test. For one of the clades of species, a gene is selected at random for each instance of the test. If the orthology inference method under scrutiny predicts an ortholog for that gene for at least one species from each of the remaining clades, then the test is recorded as a “successful ortholog set.” For each successful ortholog set, an MSA is constructed and a gene tree inferred using RAxML [28]. The normalized Robinson-Foulds (RF) distance is calculated between this tree and the known species tree. The result of the benchmark is the fraction of successful ortholog sets and the average RF distance for these successful sets. A higher fraction of success and a lower average RF distance indicates a better ortholog inference method under this test. The benchmarks include two different Species Tree Discordance Tests (STDT) across two different species sets and four Generalized Species Tree Discordance Tests (GSTDT) across four different species sets. In Fig. 4g, h, the total fraction of successful ortholog sets and the average normalized RF distance across these successful ortholog sets across the two/four species sets are reported for the STDT and GSTDT. The individual GSTDT and STDT results for the four individual species sets are given in Additional file 1: Figure S1.
Minor changes have been made to the labeling and orientation of the axes compared to the presentation in the Quest for Orthologs paper [1] to improve the consistency with the SwissTree and TreeFam-A benchmarks. The altered y-axis for the GSTDT and STDT presented here is (1 - normalized RF distance) so that higher y values always correspond to the better agreement with the species tree for all benchmark figures. The number of completed ortholog set successes for the STDT and GSTDT is reported as a fraction rather than the total number. For the SwissTree and TreeFam-A tests, the axes are labeled as “precision” instead of “pos. predictive value rate” and “recall” instead of “true positive rate” as this is more standard terminology for the quantities reported by the tests.
The full set of benchmarks, the input files, and the ortholog inference results can be seen online at http://orthology.benchmarkservice.org/. A comprehensive summary of the benchmarks, as described above, is show in Fig. 4a–l for ortholog prediction software tools. The corresponding comparisons against online databases are shown in Additional file 1: Figure S2 and Additional file 1: S3. The complete datasets are available to download from Zenodo research archive at https://doi.org/10.5281/zenodo.1481147 [43].
Performance testing
We constructed sets of fungal proteomes of increasing size for performance testing. Ensembl Genomes was interrogated on 6 November 2017 using its REST API [44] to identify all available fungal genomes. To achieve an even sampling of species, we selected 1 species per genera and excluded genomes from candidate phyla or phyla with fewer than 3 sequenced genomes. This gave a set of 272 species which were downloaded from the Ensembl FTP site [45]. We created datasets of increasing size by randomly selecting 4, 8, 16, 32, 64, 128, and 256 species such that the last common ancestor was the same for each dataset. Each dataset was analyzed using a single Intel E5-2640v3 Haswell node (16 cores) on the Oxford University ARCUS-B server using 16 parallel threads for OrthoFinder with DIAMOND (arguments: “-S diamond -t 16 -a 16”). The complete datasets for all analyzed species subsets are available for download from Zenodo at https://doi.org/10.5281/zenodo.1481147. All methods submitted to Quest for Orthologs that provided a user-runnable implementation of the method were tested on the same fungi datasets and the same ARCUS-B server nodes and run in parallel using 16 threads (when supported by the method).
Chordata dataset
The data for the OrthoFinder analysis of the ten Chordata species for the illustration of the results of an OrthoFinder analysis (Fig. 2a–h) are provided in the Zenodo archive https://doi.org/10.5281/zenodo.1481147. This includes the input proteomes, the OrthoFinder results, and the script used to generate the figures from the results. OrthoFinder was run with default settings (DIAMOND sequence search and DendroBLAST gene trees).