Bystro: rapid online variant annotation and natural-language filtering at whole-genome scale

Table 2 Online comparison of Bystro and recent programs in filtering 8.49 × 10⁷ variants from 1000 Genomes

Group	Search query	Time (s)	Variants	Tr:Tv
1	Exonic	0.030 ± 0.030	993,343	2.96
2 (a)	cadd > 20 maf < .001 pathogenic expert review missense	0.029 ± 0.009	65	1.71
2 (b)	cadd > 20 maf < .001 pathogenic expert’s review non-synonymous	0.036 ± 0.019	65	1.71
2 (c)	cadd > 20 maf < .001 pathogen expert-reviewed nonsynonymous	0.044 ± 0.025	65	1.71
3 (a)	Early onset breast cancer	0.046 ± 0.029	4335	2.51
3 (b)	Early-onset breast cancer	0.037 ± 0.020	4335	2.51
3 (c)	Early onset breast cancers	0.033 ± 0.015	4335	2.51
4 (a)	Pathogenic nonsense Ehlers-Danlos	0.038 ± 0.027	1	NA
4 (b)	Pathogenic nonsense E.D.S	0.078 ± 0.087	1	NA
4 (c)	Pathogenic stopgain eds	0.040 ± 0.022	1	NA

The full 1000 Genomes Phase 3 VCF file (853 GB, 8.49 × 10⁷ variants, 2504 samples) was filtered in the publicly available Bystro web application using the Bystro natural-language search engine. VEP, GEMINI, and wANNOVAR (not shown) were also tested but were unable to annotate or filter this dataset. Bystro’s search engine uses a natural-language parser that accepts unstructured queries: the queries in groups 2, 3, and 4 show phrasing variations that did not affect the results returned, as would be expected of a search engine that handles normal language variation. “Tr:Tv” is the transition-to-transversion ratio, calculated automatically for each query by the search engine. The ratio of 2.96 for the “exonic” query is close to the ~2.8–3.0 expected in coding regions, suggesting that the search engine accurately identified exonic (coding) variants.
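
For readers unfamiliar with the Tr:Tv statistic, the following minimal Python sketch shows how such a ratio can be computed from a list of single-nucleotide substitutions. It is an illustration only and is not Bystro's implementation: the function name `tr_tv_ratio` and the `(ref, alt)` tuple input are assumptions made for this example. Transitions are purine-to-purine or pyrimidine-to-pyrimidine changes (A↔G, C↔T); all other single-base substitutions are counted as transversions.

```python
# Illustrative sketch only; not the method used by Bystro.
# Transitions: A<->G (purines) and C<->T (pyrimidines).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = "ACGT"


def tr_tv_ratio(variants):
    """Return the transition-to-transversion ratio for (ref, alt) pairs.

    Pairs that are not single-base substitutions (indels, identical
    alleles, non-ACGT symbols) are skipped. Returns None when no
    transversions are present, because the ratio is then undefined.
    """
    transitions = 0
    transversions = 0
    for ref, alt in variants:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1:
            continue  # not a single-nucleotide substitution
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # ambiguous symbol or no change
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return None
    return transitions / transversions


# Hypothetical input: two transitions, one transversion, one skipped indel.
example = [("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]
print(tr_tv_ratio(example))
```

Returning `None` for an undefined ratio is a design choice of this sketch; a caller could equally report such cases as "NA".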

ISSN: 1474-760X