Bystro: rapid online variant annotation and natural-language filtering at whole-genome scale

Table 1 Bystro, VEP, and ANNOVAR offline command-line performance

Software	Dataset	Samples	Variants	Variants/s	Bystro vs (fold faster)
Bystro	1000G Phase 3 chr1	2504	1 × 10⁶	8156 ± 195	–
	1000G Phase 3 chr1	2504	2 × 10⁶	8484 ± 67.9	–
	1000G Phase 3 chr1	2504	4 × 10⁶	8516 ± 57.2	–
	1000G Phase 3 chr1	2504	6.5 × 10⁶	7779 ± 21.8	–
	1000G Phase 1	1092	3.9 × 10⁷	5417 ± 76.8	–
	1000G Phase 3	2504	8.5 × 10⁷	7904 ± 15.9	–
VEP	1000G Phase 1	1092	3.9 × 10⁷	18.67 ± 0.58	290×
	1000G Phase 3	2504	8.5 × 10⁷	10.00 ± 0.00	790×
ANNOVAR	1000G Phase 3 chr1	2504	1 × 10⁶	74.67 ± 0.21	109×
	1000G Phase 3 chr1	2504	2 × 10⁶	75.32 ± 0.06	113×
	1000G Phase 3 chr1	2504	4 × 10⁶	75.15 ± 0.39	113×
	1000G Phase 3 chr1	2504	6.5 × 10⁶	NA	NA
	1000G Phase 1	1092	3.9 × 10⁷	NA	NA
	1000G Phase 3	2504	8.5 × 10⁷	NA	NA

Bystro, VEP, and ANNOVAR were similarly configured with eight threads on Amazon i3.2xlarge servers. “Dataset” refers to the VCF file used. “Variants/s” is the average of three trials. Because of its long run time, VEP performance was recorded after 2 × 10⁵ sites; in runs of 1 × 10⁶ or more annotated sites, VEP performance did not deviate from the 2 × 10⁵ value. ANNOVAR could not complete the full Phase 1, Phase 3, or Phase 3 chromosome 1 datasets because of memory limitations (entries marked “NA”), so it was compared to Bystro on subsets of 1000 Genomes Phase 3 chromosome 1. Bystro run times include the time taken to compress outputs. 1000 Genomes Phase 1 performance reflects IO limitations.
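
The last column of the table can be checked with a few lines of arithmetic. The sketch below assumes that each fold value is simply Bystro's mean throughput divided by the other tool's mean throughput on the same dataset, rounded to the nearest integer; the table does not state this rule explicitly, but it reproduces the reported values. The names `BYSTRO`, `OTHERS`, `fold_speedup`, and `speedup_table` are illustrative only and are not part of any of the three tools.

```python
# Mean throughput in variants/s, copied from Table 1.
# Keys are (dataset, number of variants) as written in the table.
BYSTRO = {
    ("1000G Phase 3 chr1", "1e6"): 8156.0,
    ("1000G Phase 3 chr1", "2e6"): 8484.0,
    ("1000G Phase 3 chr1", "4e6"): 8516.0,
    ("1000G Phase 1", "3.9e7"): 5417.0,
    ("1000G Phase 3", "8.5e7"): 7904.0,
}

# (tool, dataset, variants, mean variants/s) for the runs that completed.
OTHERS = [
    ("VEP", "1000G Phase 1", "3.9e7", 18.67),
    ("VEP", "1000G Phase 3", "8.5e7", 10.00),
    ("ANNOVAR", "1000G Phase 3 chr1", "1e6", 74.67),
    ("ANNOVAR", "1000G Phase 3 chr1", "2e6", 75.32),
    ("ANNOVAR", "1000G Phase 3 chr1", "4e6", 75.15),
]


def fold_speedup(bystro_rate: float, other_rate: float) -> float:
    """Assumed definition: ratio of mean throughputs on the same dataset."""
    if other_rate <= 0:
        raise ValueError("throughput must be positive")
    return bystro_rate / other_rate


def speedup_table():
    """Return (tool, dataset, variants, rounded fold) for each comparison."""
    rows = []
    for tool, dataset, variants, rate in OTHERS:
        fold = fold_speedup(BYSTRO[(dataset, variants)], rate)
        rows.append((tool, dataset, variants, round(fold)))
    return rows


if __name__ == "__main__":
    for tool, dataset, variants, fold in speedup_table():
        print(f"{tool:8s} {dataset:20s} {variants:>6s}  {fold}x")
```

Under this assumption the computed ratios round to 290 and 790 for VEP and to 109, 113, and 113 for ANNOVAR, matching the table. The trial-to-trial spread (the ± values) is ignored here, so the sketch says nothing about the uncertainty of each fold value.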

ISSN: 1474-760X