VirStrain: a strain identification tool for RNA viruses

Viruses change constantly during replication, leading to high intra-species diversity. Although many changes are neutral or deleterious, some can confer on the virus different biological properties such as better adaptability. In addition, viral genotypes often have associated metadata, such as host residence, which can help with inferring viral transmission during pandemics. Thus, subspecies analysis can provide important insights into virus characterization. Here, we present VirStrain, a tool taking short reads as input with viral strain composition as output. We rigorously test VirStrain on multiple simulated and real virus sequencing datasets. VirStrain outperforms the state-of-the-art tools in both sensitivity and accuracy. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-022-02609-x).


The impact of the database size on the performance of VirStrain
To exam whether the number of reference genomes affects the performance of VirStrain, we build two databases with different numbers of reference genomes and repeat the experiment mentioned in Section 2.3 of the main article. These two databases contain 6880 and 3447 SARS-CoV-2 strains, respectively. The strains used to build the database with 6880 strains are divided into two parts. One part is the same strains from NCBI as the main article, and the other part is newly added strains from GISAID [1]. The newly added SARS-CoV-2 strains are downloaded according to their GISAID accession IDs in nextstrain's [2] SARS-CoV-2 global subsampling page. The database with 3447 strains is constructed using strains from GISAID. The result is shown in Supplementary Table S1. VirStrain has a slight improvement at lower coverage when there are fewer SARS-CoV-2 strains in the database. The number of multiple hits may increase when the number of reference genome increases and the sequencing depth decreases. (0, 1) (0, 1) (0, 1) (0, 1) 50x (0, 1) (0, 1) (0, 1) (0, 1) 100x (0, 1) (0, 1) (0, 1) (0, 1) Supplementary Table S1. The number of tie cases and the median number of best matches for the tie cases in VirStrain. There are 3 databases containing different numbers of reference genomes. Each cell contains a tuple with the first number being the number of tie cases and the second number being the median number of strains in the top-ranking group.

Benchmark experiment on all strains
In order to test the performance of VirStrain on all strains, we carried out a benchmark experiment with VirStrain, KrakenUniq, and Centrifuge. As Sigma and Pathoscope2 are more computationally expensive than other tools, they are not compared in this experiment. Kraken2 is also not compared due to its low accuracy on the single-strain dataset.
In this experiment, one of the SARS-CoV-2 strains in the reference database was randomly selected as the test strain. We simulated reads from it and used the reads as input to the three tools. This process is repeated for all other strains. The accuracy is computed for all the strains and shown in Supplementary Figure S3. VirStrain is able to identify all strains in the database while other tools have lower accuracy.
Supplementary Figure S1. The accuracy comparison of 3 tools on all strains in the database. These three tools have fast speed and also reasonable accuracy in the small scale experiment on 200 reference genomes.

Benchmark experiments on low coverage data
In order to test the performance of other tools at low sequencing depth, we evaluate other methods (Sigma, Patho-scope2, Centrifuge, Krakenuniq) on the same datasets used in Table 1 of the main article. Due to the long running time of Pathoscope2 and Sigma, we only utilized 200 strains as their reference database. The result is summarized in Supplementary Table S2, which shows the major problem of most tested tools is the low accuracy. When the sequencing depth decreases, the accuracy of Centrifuge, Krakenuniq, and Pathoscope2 decreases as well. Besides, the two computationally efficient tools, Centrifuge and Krakenuniq, also have many multiple hits when the sequencing depth is low. Though Sigma has a good performance in this evaluation, it took more than two weeks to analyze all datasets given 8 threads on an HPCC CentOS 6.8 node with 2.4Ghz 14-core Intel Xeon E5-2680v4 CPUs and 128 GB memory. On top of that, it has low accuracy in the mix-strain identification experiments ( Figure 3A).

In-depth analysis of the 6 haplotype reconstruction tools
All haplotypes investigated in this experiment refer to the haplotypes falsely constructed by 6 haplotype reconstruction tools in Section 2.5 of the main article. For each haplotype, we used MegaBLAST [3] to identify the closest reference strain (defined as s) in the database. Then, we identified the reference strains that have higher similarity to the ground truth than s. The number of these reference strains is shown in Supplementary Table S3. For "missed strains" in Table S3, there are four cases: 1) the dataset contains two strains but the tool only outputs one; 2) a tool outputs two haplotypes but their s is the same strain; 3) the haplotype cannot be aligned to any reference strain; 4) the tool outputs nothing. Because of the high divergence of the HIV strains, all tools perform better on HIV. ShoRAH and PredicHaplo tend to miss more strains than other tools. In particular, PredictHaplo failed to output any haplotype for a large number of samples. Many outputs of ShoRAH cannot be aligned to any reference genome. aBayesQR and TenSQR can output two haplotypes in more datasets than other tools. TenSQR outputs haplotypes that are closer to Supplementary Table S3. The statistics of falsely constructed haplotypes. Each value in the cell refers to the number of haplotypes in that category. For example, in the first row, "50" means there are 50 SARS-CoV-2 haplotypes missed by aBayesQR. "11" means there are 11 SARS-CoV-2 haplotypes that can be aligned to 1-10 strains in the reference database with higher similar to the ground truth. For each virus and category, the bold number represents the best metric value.

VirStrain identifies a recombinant strain from 3-strain simulated data
To test whether VirStrain can identify the recombinant strains from mixed strain infections, two mixed strain simulation datasets are generated. These two datasets contain three HBV and three HIV strains, respectively. Considering we don't know the exact parent strains of the recombinant strains and thus we constructed the dataset using strains from the parent strains' genotypes/subtypes. As a result, these selected strains contain both the recombinant strain and the strains in its parent genotypes/subtypes [4,5]. The sequencing data are simulated according to a sequencing depth of 10X:10X:10X. As shown in Supplementary Table S4, VirStrain can identify both the recombinant strain and the strains in its parent genotypes/subtypes. The databases for HBV and HIV used here is the same as the one used in the main article.

A large-scale experiment of novel strain detection
To further test the robustness of VirStrain, we applied VirStrain to detect the closest relative in a larger simulated dataset. Sigma was not included in this experiment because of its long running time. We randomly selected 100 strains from all the reference genomes. Then we generated 500 mutant strains from these 100 strains with 5, 7, 9, 11, 13 random point mutations. With ART, we simulated 600 datasets (for 500 mutant strains and 100 reference strains) and used them as input to VirStrain. VirStrain can still identify correct closest relatives in all the 600 simulated datasets (Supplementary Figure S5).
Supplementary Figure S2. The similarity between the falsely constructed haplotypes and the ground truth. The similarity is calculated by MegaBLAST. Supplementary Table S4. The performance of VirStrain on 2 mix-strain simulated datasets of HBV and HIV. "Rank" represents the ranking of each strain in the output of VirStrain.

The limitation of assembly tools
In this experiment, we apply 3 popular assembly tools to assemble viral genomes from 4 real SARS-CoV-2 sequencing data sets. As shown in Supplementary Table S5, these assembly tools are not able to generate the complete genomes from the input sequencing data, indicating that the whole virus genome may not always be available for downstream analysis. However, VirStrain can still be applied to these datasets to identify the most possible strain present in these samples. This experiment shows that VirStrain can be used for strain composition analysis when it is hard to obtain the complete genomes from the sequencing data.
Supplementary Figure S3.  Table S5. The performance of 3 assembly tools on 4 real sequencing data sets. "-" indicates that IVA is not tested on those datasets because it only accepts sequencing data of PE. The cell in grey means these datasets are metagenomic sequencing data.

Test four haplotype reconstruction tools using 2 HBV sequencing data sets with known composition
In this experiment, we applied 4 haplotype reconstruction tools to reconstruct strain sequences from 2 HBV mix-strain samples with known composition (MK720631.1 and MK720628.1). By using MegaBLAST [3], the reconstructed strain sequences are aligned to all strains in the reference database to find the most similar strain. As shown in Supplementary Table S6, the strains reconstructed by these tools are all most similar to the ground truth but are not 100% identical to the ground truth even though the similarity between the two strains is relatively low (89.97%).  ERR3253399). Each cell contains a tuple with the first number being the similarity between the reconstructed strains and the ground truth and the second number being the relative abundance of the reconstructed strains. The reconstructed strains with the highest similarity to the ground truth are highlighted in bold.

VirStrain detects possible minor strains from Washington State datasets
For RNA viruses with fast replication rates, it is not rare to see multiple strains infecting the same host. For COVID-19, is it possible to see more than one strain (or haplotype) in the same host? Reconstructing the complete haplotype from short reads is an active research area itself and is beyond the scope of this work. But we can examine whether a different local haplotype segment exist in a sample. If we find one such local haplotype segment, we can add it to our reference sequence database and search for it in other samples using VirStrain. Being able to detect the same local haplotype in multiple samples provides strong evidence behind the presence of a new haplotype. Towards this goal, we first assembled reads from sequenced SARS-CoV-2 samples at NCBI SRA using Megahit [6] and then mapped reads against the assembled genome. We found some relatively high-frequency mutations in a Washington State sample (SAMN15678406) from USA (Supplementary Figure S5). Based on the results of the reads mapping and a statistical model that considers the base quality, sequencing errors, coverage, and the number of reads covering the same SNV events, we obtained a possible local region of a minor strain, which we named as k59_23. We used nextstrain [2] to annotate k59_23 and found that there is a mutation G25358C that will cause the amino acid mutation D1259H of the surface protein. The statistical model used to find the possible local haplotype is described below.
There are four observed single nucleotide variants (SNVs) with frequencies much higher than sequencing errors at different loci of reference genome based on the read mapping results: 25345-A, 25355-C, 25357-C, 25358-C. We regard other types of SNVs at these four loci as sequencing errors because their frequencies are very low. We are not aiming to reconstruct the complete (genome-scale) haplotype, which is an active research field itself and is beyond the scope of this work. Instead, we will construct a local haplotype region containing SNV sites that are close enough to be covered by the same reads.
In the reference genome, the bases at the four loci are 25345-C, 25355-G, 25357-A, 25358-G. Thus, of the 2 4 −1 = 15 combinations of bases (excluding the reference one), our goal is to determine the most possible combination, which can define a local haplotype. For the sake of convenience, we use k = 1, 2, 3, 4 to denote these four loci. And we define the following notations: a) q (j) k is the Phred quality score of the base in read j when this base mismatches the base of reference genome at locus k in the alignment between the read j and the reference genome. If the base in read j matches the base of reference genome at locus k in the alignment, or read j does not cover locus k in the alignment, the value of q (j) k will be set to 0. b) c (j) k ∈ {0, 1} is an indication function. c (j) k = 1 if the base in read j is not a sequencing error at locus k in the alignment. c (j) k = 0 if the base in read j is a sequencing error at locus k in the alignment, or read j does not cover locus k in the alignmen. c) m k is the number of reads whose bases match the SNV at locus k in the alignments between the reads and the reference genome. d) d k is the number of aligned reads that are aligned to locus k. e) p (j) k is the probability that read j supports the SNV at locus k. And We use the equations(1) to calculate p There are four cases in Equation (1), also as shown in Figure S1. In case 1, the base in read j aligned to locus k of the reference genome matches the SNV at locus k. In this case we can use the Phred quality score to calculate the SNV probability directly. In case 2, the base in read j aligned to locus k of the reference genome matches the base in the reference genome and thus this read does not support the SNV event. So it is reasonable to assume that the probability of read j containing SNV at locus k is 0. In case 3, the base in read j aligned to locus k of the reference genome is a sequencing error. The true base can be the reference one or the SNV. Thus, we can use the error rate and the nucleotide frequencies observed at the locus k, to estimate the probability of read j supporting this SNV event. In case 4, if the alignment between read j and the reference genome does not cover locus k (missing information), we use the nucleotide frequencies observed at the locus k to estimate the probability of read j supporting this SNV event.
Similarly, we define r (j) k as the "reference probability" that read j supports the reference base at locus k. And we can use the equation (2) to calculate r Based on the SNV probability equation (1) and reference probability equation (2), we define Equation (3) to calculate the probability that read j is consistent with haplotype i, where i is one of the 15 SNV combinations.
1} is an indication function such that h (i) k = 1 if haplotype i contains the SNV at locus k and h (i) k = 0 if the haplotype i contains the reference base at locus k (e.g. the indication function for haplotype AGCG will be {1, 0, 1, 0}). To find out the most possible haplotype that is different from the reference genome, we use Equation (4) to calculate the normalized weight of each haplotype based on the read mapping quality of a set of given reads. This weight denotes the consistency between each haplotype and the set of reads.
map is the mapping quality of read j obtained from bowtie2 [7]. (1 − 10 − Q (j) m ap 10 ) denotes the estimated probability that the alignment of read j corresponds to the true region. R is a set of reads that cover and match at least one of the four loci of the reference genome we mentioned above. Table S7 shows the weights of different haplotypes.
We can see that the haplotype ACCC at four loci is the most possible one of the fifteen possible haplotypes. Its weight is about 0.2235, which is about the twice of the second-highest weight. Thus we add this possibly novel haplotype into the reference database and used VirStrain to search for this haplotype in other Washington State samples.
In order to validate this finding, we checked whether other samples also contain k59_23. Towards this goal, we added this strain to the VirStrain database and then used VirStrain to detect it in other SRA datasets from Washington State. The output shows that k59_23 was also detected in three other Washington State samples.

VirStrain predicts HIV subtype and H1N1 clade
The identification result generated by VirStrain can also be used to extract the subtype or clade information of identified strains. Thus, users can obtain the strain subtype or clade information from reads directly without assembly, which is useful for datasets that cannot form quality assemblies. In this experiment, we applied VirStrain to four real datasets and output the subtype or clade information of identified strains using a bottom-to-up method. To know the subtype or clade type of the input dataset, we can check the metadata of "the most possible strain" output by VirStrain. Then, we use the subtype or clade label of the identified strain as the final output.
Supplementary Table S9 shows that the subtype or clade identified by VirStrain is the same as the truth. Besides, the region information is also consistent with the source region of these samples.

Limitation posed by iterative search
As many reference strains share high sequence similarity, after finding the best match in the reference genomes, VirStrain will remove the matched SNV sites before the next iteration so that other possible strains with different SNV sites can achieve better V score. Without removing the matched SNV sites, strains that share high similarity with the identified strain will naturally returned as other possible strains, incurring false positive identifications. However, there is a tradeoff between the false positive identification and sensitivity. When the SNV sites of the best matched strain are removed, the strains that are highly similar to the best matched one may be missed by VirStrain. To investigate this, we constructed a difficult case where three strains with different similarities co-exist in one sample and tested the performance of VirStrain. Thus, the iterative search procedure poses a constraint on the number of different SNVs between strains in the same sample. Strains with too few differences will be missed by VirStrain. In order to provide the guidance on the number of expected different SNVs, we tested a hard case for our method. The input data contain reads from three strains, with one being the major one (100x) and other two being minor ones (10x). In addition, two of the three strains are highly similar with only 1 different SNV while the rest one has more than 3 different SNVs (strains with 3, 5 and at least 10 different SNVs were used and considered as 3 groups in this experiment). In total, there are 30 simulated datasets and 10 simulated datasets in each group. Supplementary Table S10 shows VirStrain is always able to identify strains with high abundance and more than 3 different SNVs but misses the strain with 1 different SNV. In addition, we found the median number of best matches decrease with the increase of the number of different SNVs. Based on these results, we recommend users to consider 10 as a minimum number to identify other possible strains in multi-strain infection cases. It should be noted this value may increase for viruses with large genome sizes. For example, we take 150 as the threshold for HCMV. For low abundance strains identified in all tie cases, VirStrain will output multiple best matches in descending order of V score of the first iteration. We found the correct strain was always the top 1 strain under this sorting rule, which could be a useful information for users to identify the correct strain among multiple best matches. Supplementary Table S10. The performance of VirStrain on 30 3-strain simulted datasets with different settings (10 simulated datasets in each group). "H" in "Group" column represents the high abundance strain, and other two values in the tuple represent the strain with x SNVs (x refers to the value in the tuple). For "Identification result", "Tie cases", "Median number of best matches" columns, each value in the tuple represents the count of these attributes for correspond strain among all simulated datasets of that group. For example, "(10, 0, 10)" in the first row, "Identification result" column represents high abundance strain and the strain with 3 different SNVs can be identified in all 10 datasets while the strain with 1 different SNV is lost in all tested datasets.

k-mer extraction from SNV sites
We will extract k-mers from these SNV sites, with the center base of each k-mer coming from this site. The complete k-mer is then constructed by including k−1 2 bases to the left and right side of a variation site. Supplementary Figure  S1 shows an example of k-mer extraction.
Supplementary Figure S7. The process of k-mer extraction. Given the chosen site, k-mers are extracted from the centering base and neighbouring sequences.
Supplementary Figure S8. The relationship between k-mer size and the number of redundant k-mers. When k ≥ 20, the only repetitive k-mer is a repeat of "A", which is not used in our method.
If k-mers repeat at different sites in the reference genomes, false positive strain identification will happen. Intuitively, bigger k-mers tend to appear less frequently than small k-mers. We thus investigate how many times we will observe the same k-mer at different sites in an MSA. As the reference genomes SARS-CoV-2 have high sequence similarity and larger genome sizes than other known RNA viruses, we conducted this experiment on SARS-CoV-2. The results are summarized in Supplementary Figure S2. As the figure shows, with the increase of k, the repeat numbers of the k-mer at different sites reduce quickly. Thus, either large k can be chosen or we can directly exclude those columns that contain repetitive k-mers in order to fulfill the assumption.

Supplementary Figures
Supplementary Figure S5. A. The VirStrain text report of one SARS-CoV-2 simulated dataset (Truth: MT451123.1). The detailed explanation of the columns can be found at the Github website of VirStrain. B. The VirStrain html report of one SARS-CoV-2 simulated dataset (Truth: MT451123.1). X-axis refers to all selected sites by the greedy covering algorithm. The top Y-axis represents the predicted sequencing depth of each site. The bottom Y-axis represents the number of strains sharing this site, which reflects the site uniqueness.
Supplementary Figure S6. An example showing the problem caused by dash.