Data sets
Twelve sequence data sets were used to evaluate AF methods across five research areas (Table 1).
Protein homology
The reference data sets of protein family members sharing a high (≥ 40%) and low (< 40%) sequence identity were constructed based on two sections of the SCOPe database v. 2.07 [68], namely, ASTRAL95 and ASTRAL40 v. 2.07 [86], respectively. The SCOPe database provides a structural classification of proteins at four levels: classes (proteins with similar secondary structure composition, but different sequences and overall tertiary structures), folds (protein domains of similar topology and structure without detectable sequence similarity), superfamilies (proteins with similar structures and weak sequence similarity), and families (proteins with readily detectable sequence similarity). According to previous studies [5, 8], the ASTRAL data sets were subsequently trimmed to exclude sequences with unknown amino acids and families with fewer than 5 proteins and included only the four major classes (i.e., α, β, α/β, and α + β). To minimize the requirements for AF method submission related to performing all-versus-all sequence comparisons and uploading the output to the AFproject server, we further reduced the data sets by randomly selecting only two protein members in each family. As ASTRAL95 also contains protein family members sharing a sequence identity lower than 40%, the Needleman–Wunsch alignment was performed (using needle software in the EMBOSS package [87]) to select proteins with a sequence identity ≥ 40% to acquire a reference data set of proteins with high sequence identity.
Gene trees
Reference trees and corresponding protein sequences of eleven gene families were downloaded from SwissTree release 2017.0 [58, 88]: Popeye domain-containing protein family (49 genes), NOX “ancestral-type” subfamily NADPH oxidases (54 genes), V-type ATPase beta subunit (49 genes), serine incorporator family (115 genes), SUMF family (29 genes), ribosomal protein S10/S20 (60 genes), Bambi family (42 genes), Asterix family (39 genes), cited family (34 genes), Glycosyl hydrolase 14 family (159 genes), and Ant transformer protein (21 genes).
Gene regulatory elements
The data set of CRMs known to regulate expression in the same tissue and/or developmental stage in fly or human was obtained from Kantorovitz et al. [6]. The data set was specifically selected to test the capacity of AF measures to identify functional relationships among regulatory sequences (e.g., enhancers or promoters). The data set contains 185 CRM sequences taken from D. melanogaster—blastoderm-stage embryo (n = 82), eye (n = 17), peripheral nervous system (n = 23), and tracheal system (n = 9)—and Homo sapiens—HBB complex (n = 17), liver (n = 9), and muscle (n = 28).
Genome-based phylogeny
The sequences of 25 whole mitochondrial genomes of fish species from the suborder Labroidei and the species tree were taken from Fischer et al. [59]. The set of 29 E. coli genome sequences was originally compiled by Yin and Jin [23] and has been used in the past by other groups to evaluate AF programs [24, 25, 89]. Finally, the set of 14 plant genomes is from Hatje et al. [90]. This set was also used in the past to evaluate AF methods. To simulate unassembled reads from these data sets, we used the program ART [91].
Horizontal gene transfer
The 27 E. coli and Shigella genomes, and the 8 Yersinia genomes, were taken from Bernard et al. [62]. We used EvolSimulator [92] to simulate HGT in microbial genomes, adopting an approach similar to that described in Bernard et al. [62]. The HGT events were simulated to occur at random, i.e., anywhere along a genomic sequence and between any pair of genomes in a set. Each set of genomes was simulated under a birth-and-death model at speciation rate = extinction rate = 0.5. The number of genomes in each set was allowed to vary from 25 to 35, with each containing 2000–3000 genes 240–1500 nucleotides long. HGT receptivity was set at a minimum of 0.2, mean of 0.5, and maximum of 0.8, with a mutation rate m = 0.4–0.6 and a number of generations i = 5000. The varying extent of HGT was simulated using the mean number of HGT events attempted per iteration l = 0, 250, 500, 750, and 1000, and divergence factor d = 2000 (transferred genes that are of high sequence divergence, i.e., > 2000 iterations apart, will not be successful). All other parameters in this simulation followed Beiko et al. [92].
Alignment-free tools
AAF [38] reconstructs a phylogeny directly from unassembled next-generation sequencing reads. Specifically, AAF calculates the Jaccard distance between sets of k-mers of two samples of short sequence reads. This distance between samples or species is based on the estimate of the rate parameter from a Poisson process for a mutation occurring at a single nucleotide. The phylogeny is constructed using weighted least squares with weights proportional to the expected variance of the estimated distances. AAF provides features for correcting tip branches and bootstrapping of the obtained phylogenetic trees, directly addressing the problems of sequencing error and incomplete coverage.
AFKS [34] is a package for calculating 33 k-mer-based dissimilarity/distance measures between nucleotide or protein sequences. AFKS categorizes the measures into nine families: Minkowski (e.g., Euclidean), Mismatch (e.g., Jaccard), Intersection (e.g., Kulczynski), D2 (e.g., D2s), Squared Chord (e.g., Hellinger), Inner Product (e.g., normalized vectors), Markov (e.g., SimMM), Divergence (e.g., KL Conditional), and Others (e.g., length difference). The tool determines the optimal k-mer size for given input sequences and calculates dissimilarity/distance measures between k-mer counts that include pseudocounts (adding 1 to each k-mer count). The obtained distance is standardized to between 0 and 1.
alfpy [5] provides 38 AF dissimilarity measures with which to calculate distances among given nucleotide or protein sequences. The tool includes 25 k-mer-based measures (e.g., Euclidean, Minkowski, Jaccard, and Hamming), eight information-theoretic measures (e.g., Lempel–Ziv complexity and normalized compression distance), three graph-based measures, and two hybrid measures (e.g., Kullback–Leibler divergence and W-metric). alfpy is also available as a web application and Python package. In this study, the results based on 14 dissimilarity measures are evaluated.
ALFRED-G [45] uses an efficient algorithm to calculate the length of maximal k-mismatch common substrings between two sequences. Specifically, to measure the degree of dissimilarity between two nucleic acid or protein sequences, the program calculates the length of maximal word pairs—one word from each of the sequences—with up to k mismatches.
andi [24] estimates phylogenetic distances between genomes of closely related species by identifying pairs of maximal unique word matches a certain distance from each other and on the same diagonal in the comparison matrix of two sequences. Such word matches can be efficiently found using enhanced suffix arrays. The tool then uses these gap-free alignments to estimate the number of substitutions per position.
CAFE [36] is a package for efficient calculation of 28 AF dissimilarity measures, including 10 conventional measures based on k-mer counts, such as Chebyshev, Euclidean, Manhattan, uncentered correlation distance, and Jensen–Shannon divergence. It also offers 15 measures based on the presence/absence of k-mers, such as Jaccard and Hamming distances. Most importantly, it provides a fast calculation of background-adjusted dissimilarity measures including CVTree, d2star, and d2shepp. CAFE allows for both assembled genome sequences and unassembled next-generation sequencing shotgun reads as inputs. However, it does not deal with amino acid sequences. In this study, the results based on CVTree, d2star, and d2shepp are evaluated.
co-phylog [23] estimates evolutionary distances among assembled or unassembled genomic sequences of closely related microbial organisms. The tool finds short, gap-free alignments of a fixed length and consisting of matching nucleotide pairs only, except for the middle position in each alignment, where mismatches are allowed. Phylogenetic distances are estimated from the fraction of such alignments for which the middle position is a mismatch.
EP-sim [53] computes an AF distance between nucleotide or amino acid sequences based on entropic profiles [93, 94]. The entropic profile is a function of the genomic location that captures the importance of that region with respect to the whole genome. For each position, it computes a score based on the Shannon entropies of the word distribution and variable-length word counts. EP-sim estimates a phylogenetic distance, similar to D2, by summing the entropic profile scores over all positions, or similar to \( {D}_2^{\ast } \), with the sum of normalized entropic profile scores.
FFP [35, 39] estimates the distances among nucleotide or amino acid sequences. The tool calculates the count of each k-mer and then divides the count by the total count of all k-mers to normalize the counts into frequencies of a given sequence. This process leads to the conversion of each sequence into its feature frequency profile (FFP). The pairwise distance between two sequences is then calculated by the Jensen–Shannon divergence between their respective FFPs.
FSWM [26] estimates the phylogenetic distance between two DNA sequences. The program first defines a fixed binary pattern P of length l representing “match positions” and “don’t care positions.” Then, it identifies all “Spaced-word Matches” (SpaM) w.r.t. P, i.e., gap-free local alignments of the input sequences of length l, with matching nucleotides at the “match positions” of P and possible mismatches at the “do not care” positions. To estimate the distance between two DNA sequences, SpaMs with low overall similarity are discarded, and the remaining SpaMs are used to estimate the distance between the sequences, based on the mismatch ratio at the “do not care” positions. There is a version of FSWM that can compare sets of unassembled sequencing reads to each other called Read-SpaM [48].
jD2Stat [37] utilizes a series of D2 statistics [17, 18] to extract k-mers from a set of biological sequences and generate pairwise distances for each possible pair as a matrix. For each sequence set, we generated distance matrices (at the defined k; Additional file 1: Table S1), each using \( {D}_2^S \) (D2S; exact k-mer counts normalized based on the probability of occurrence of specific k-mers), \( {D}_2^{\ast } \) (d2St; similar to \( {D}_2^S \) but normalized based on means and variance), and \( {D}_2^n \) (d2n; extension of D2 that expands each word w recovered in the sequences to its neighborhood n, i.e., all possible k-mers with n number of wildcard residues, relative to w).
kmacs [20] compares two DNA or protein sequences by searching for the longest common substrings with up to k mismatches. More precisely, for each position i in one sequence, the program identifies the longest pair of substrings with up to k mismatches, starting at i in the first sequence and somewhere in the second sequence. The average length of these substring pairs is then used to define the distance between the sequences.
kr [46] estimates the evolutionary distance between genomes by calculating the number of substitutions per site. The estimator for the rate of substitutions between two unaligned sequences depends on a mathematical model of DNA sequence evolution and average shortest unique substring (shustring) length.
kSNP3 [52] identifies single nucleotide polymorphisms (SNPs) in a set of genome sequences without the need for genome alignment or a reference genome. The tool defines a SNP locus as the k-mers surrounding a central SNP allele. kSNP3 can analyze complete genomes, draft genomes at the assembly stage, genomes at the raw reads stage, or any combination of these stages. Based on the identified SNPs, kSNP3.0 estimates phylogenetic trees by parsimony, neighbor-joining, and maximum-likelihood methods and reports a consensus tree with the number of SNPs unique to each node.
kWIP [44] estimates genetic dissimilarity between samples directly from next-generation sequencing data without the need for a reference genome. The tool uses the weighted inner product (WIP) metric, which aims at reducing the effect of technical and biological noise and elevating the relevant genetic signal by weighting k-mer counts by their informational entropy across the analysis set. This procedure downweights k-mers that are typically uninformative (highly abundant or present in very few samples).
LZW-Kernel [40] classifies protein sequences and identifies remote protein homology via a convolutional kernel function. LZW-Kernel exploits code blocks detected by the universal Lempel–Ziv–Welch (LZW) text compressors and then builds a kernel function out of them. LZW-Kernel provides a similarity score between sequences from 0 to 1, which can be directly used with support vector machines (SVMs) in classification problems. LZW-Kernel can also estimate the distance between protein sequences using normalized compression distances (LZW-NCD).
mash [11] estimates the evolutionary distance between nucleotide or amino acid sequences. The tool uses the MinHash algorithm to reduce the input sequences to small “sketches,” which allow fast distance estimations with low storage and memory requirements. To create a “sketch,” each k-mer in a sequence is hashed, which creates a pseudorandom identifier (hash). By sorting these hashes, a small subset from the top of the sorted list can represent the entire sequence (min-hashes). Two sketches are compared to provide an estimate of the Jaccard index (i.e., the fraction of shared hashes) and the Mash distance, which estimates the rate of sequence mutation under an evolutionary model.
Multi-SpaM [25], similar to FSWM, starts with a binary pattern P of length l representing “match positions” and “don’t care positions.” It then searches for four-way Spaced-word Matches (SpaMs) w.r.t. P, i.e., local gap-free alignments of length l involving four sequences each and with identical nucleotides at the “match positions” and possible mismatches at the “do not care positions.” Up to 1,000,000 such multiple SpaMs with a score above some threshold are randomly sampled, and a quartet tree is calculated for each of them with RAxML [95]. The program Quartet Max-Cut [96] is used to calculate a final tree of all input sequences from the obtained quartet trees.
phylonium [49] estimates phylogenetic distances among closely related genomes. The tool selects one reference from a given set of sequences and finds matching sequence segments of all other sequences against this reference. These long and unique matching segments (anchors) are calculated using an enhanced suffix array. Two equidistant anchors constitute homologous region, in which SNPs are counted. With the analysis of SNPs, phylonium estimates the evolutionary distances between the sequences.
RTD-Phylogeny [51] computes phylogenetic distances among nucleotide or protein sequences based on the time required for the reappearance of k-mers. The time refers to the number of residues in successive appearance of particular k-mers. Thus, the occurrence of each k-mer in a sequence is calculated in the form of a return time distribution (RTD), which is then summarized using the mean (μ) and standard deviation (σ). As a result, each sequence is represented in the form of a numeric vector of size 2·4k containing the μ and σ of 4k RTDs. The pairwise distance between sequences is calculated using Euclidean distance.
Skmer [50] estimates phylogenetic distances between samples of raw sequencing reads. Skmer runs mash [11] internally to compute the k-mer profile of genome skims and their intersection and estimates the genomic distances by correcting for the effect of low coverage and sequencing error. The tool can estimate distances between samples with high accuracy from low-coverage and mixed-coverage genome skims with no prior knowledge of the coverage or the sequencing error.
Slope-SpaM [97] estimates the phylogenetic distance between two DNA sequences by calculating the number Nk of k-mer matches for a range of values of k. The distance between the sequences can then be accurately estimated from the slope of a certain function that depends on Nk. Instead of exact word matches, the program can also use SpaMs w.r.t. a predefined binary pattern of “match positions” and “don’t care positions.”
spaced [41,42,43] is similar to previous methods that compare the k-mer composition of DNA or protein sequences. However, the program uses the so-called spaced words instead of k-mers. For a given binary pattern P of length l representing “match positions” and “don’t care positions,” a spaced word w.r.t. P is a word of length l with nucleotide or amino acid symbols at the “match positions” and “wildcard characters” at the “do not care positions.” The advantage of using spaced words instead of exact k-mers is that the obtained results are statistically more stable. This idea has been previously proposed for database searching [98, 99]. The original version of Spaced [41] used the Euclidean or Jensen–Shannon [100] distance to compare the spaced-word composition of genomic sequences. By default, the program now uses a distance measure introduced by Morgenstern et al. [43] that estimates the number of substitutions per sequence position.
Underlying Approach [47] estimates phylogenetic distances between whole genomes using matching statistics of common words between two sequences. The matching statistics are derived from a small set of independent subwords with variable lengths (termed irredundant common subwords). The dissimilarity between sequences is calculated based on the length of the longest common subwords, such that each region of genomes contributes only once, thus avoiding counting shared subwords multiple times (i.e., subwords occurring in genomic regions covered by other more significant subwords are discarded).
Benchmarks
Evaluation of structural and evolutionary relationships among proteins
To test the capacity of AF distance measures to recognize SCOPe relationships (i.e., family, superfamily, fold, and class), we used a benchmarking protocol from previous studies [5, 8]. Accordingly, the benchmarking procedure takes the distances between all sequence pairs present in the data set file. The distances between all protein pairs are subsequently sorted from minimum to maximum (i.e., from the maximum to minimum similarity). The comparative test procedure is based on a binary classification of each protein pair, where 1 corresponds to the two proteins sharing the same group in the SCOPe database and 0 corresponds to other outcomes. The group can be defined at one of the four different levels of the database (family, superfamily, fold, and class), exploring the hierarchical organization of the proteins in that structure. Therefore, each protein pair is associated with four binary classifications, one for each level. At each SCOPe level, ROC curves and AUC values computed in scikit-learn [101] are obtained to give a unique number of the relative accuracy of each metric and level according to the SCOP classification scheme. The overall assessment of method accuracy is an average of AUC values across all four SCOPe levels.
Evaluation of functionally related regulatory sequences
To test how well AF methods can capture the similarity between sequences with similar functional roles, we used the original benchmarking protocol introduced by Kantorovitz et al. [6]. Briefly, a set of CRMs known to regulate expression in the same tissue and/or developmental stage is taken as the “positive” set. An equally sized set of randomly chosen noncoding sequences with lengths matching the CRMs is taken as the “negative” set. Each pair of sequences in the positive set is compared, as is each pair in the negative set. The test evaluates if functionally related CRM sequence pairs (from the positive half) are better scored by a given AF tool (i.e., have lower distance/dissimilarity values) than unrelated pairs of sequences (from the negative half). This procedure is done by sorting all pairs, whether they are from the positive set or the negative set, in one combined list and then counting how many of the pairs in the top half of this list are from the positive set. The overall assessment of method accuracy is the weighted average of the positive pairs across all seven subsets.
Evaluation of phylogenetic inference
The accuracy of AF methods for data sets from three categories—gene tree inference, genome-based phylogeny, and horizontal gene transfer—was evaluated by a comparison of topology between the method’s tree and the reference tree. The pairwise sequence distances obtained by the AF method were used as input for the neighbor-joining algorithm (fneighbor in the EMBOSS package [87], version: EMBOSS:6.6.0.0 PHYLIPNEW:3.69.650) to generate the corresponding method tree. To assess the degree of topological (dis) agreement between the inferred and reference trees, we calculated the normalized Robinson–Foulds (nRF) distance [63] using the Tree.compare function in the ETE3 [102] toolkit for phylogenetic trees with the option unrooted = True. The Robinson–Foulds (RF) distance is a measure for the dissimilarity between two tree topologies with the same number of leaves and the same labels (species) at the leaves, i.e., it measures the dissimilarity of branching patterns and ignores branch lengths. More specifically, the RF distance between two trees is defined as the number of certain edit operations that are necessary to transform the first topology into the second topology (or vice versa). Equivalently, one can define the RF distance between two topologies by considering bipartitions of the leaves (species) of the trees, obtained by removing edges from the trees. The RF distance is then the number of bipartitions that can be obtained only from one tree but not from the respective other tree. The nRF measure normalizes the RF distance such that the maximal possible nRF distance for the given number of leaves is set to 1. Thus, the nRF distance has values between 0 and 1 with 0 for identical tree topologies and 1 for maximally dissimilar topologies, where no bipartition in the reference is recovered. Given certain shortcomings of nRF distance such as rapid saturation (i.e., relatively minor differences between trees can result in the maximum distance value) [103] and imprecise values (i.e., the number of unique values that the metric can take is two fewer than the number of taxa) [104], we supplemented the AFproject service with additional measure for topological disagreement, normalized Quartet Distance (nQD) [105], which is the fraction of subsets of four leaves that are not related by the same topology in both trees.
Performance summary criteria
Figure 2 shows the color-coded performance of the evaluated AF methods across 12 reference data sets.
Performance score
For our benchmarking data sets, we use different measures to assess the performance of each method for a given data set, for example, nRF or AUC. To make our benchmarking results from different data sets comparable, we converted these measures to a performance score with values between 0 and 100. For the protein sequence classification data sets, this score is defined as AUC × 100; for data sets from gene trees, genome-based phylogeny, and horizontal gene transfer categories, we define the performance score as (1 − nRF) × 100. For the regulatory element data set, the performance score is already a number between 0 and 100, namely, the weighted average performance across seven data subsets.
Moreover, we define an overall performance score (Additional file 1: Table S14) that assesses each method across the data sets and that also takes values between 0 and 100. For a given method, we calculate revised scores for each data set, on which the method was tested as (S − min_score)/(max_score − min_score) × 100, where S is the performance score obtained by the method and min_score and max_score are the minimum and maximum scores obtained with all methods for a given data set, respectively. This way, the best-performing method in a given data set receives a score of 100, and the worst performer receives a score of 0. The overall performance is an average of the revised scores across the data sets on which the given method was tested.