Vcfanno: fast, flexible annotation of genetic variants
© Pedersen et al. 2016
Received: 1 March 2016
Accepted: 3 May 2016
Published: 1 June 2016
The integration of genome annotations is critical to the identification of genetic variants that are relevant to studies of disease or other traits. However, comprehensive variant annotation with diverse file formats is difficult with existing methods. Here we describe vcfanno, which flexibly extracts and summarizes attributes from multiple annotation files and integrates the annotations within the INFO column of the original VCF file. By leveraging a parallel “chromosome sweeping” algorithm, we demonstrate substantial performance gains by annotating ~85,000 variants per second with 50 attributes from 17 commonly used genome annotation resources. Vcfanno is available at https://github.com/brentp/vcfanno under the MIT license.
KeywordsGenetic variation SNP Annotation VCF Variant Variant prioritization Genome analysis
The VCF files  produced by software such as GATK  and FreeBayes  report the polymorphic loci observed among a cohort of individuals. Aside from the chromosomal location and observed alleles, these loci are essentially anonymous. Until they are embellished with genome annotations, it is nearly impossible to answer basic questions such as “was this variant seen in ClinVar,” or “what is the alternative allele frequency observed in the 1000 Genomes Project?” An extensive and growing number of publicly available annotation resources (Ensembl, UCSC) and reference databases of genetic variation (e.g., ClinVar, Exome Aggregation Consortium (ExAC), 1000 Genomes) provide context that is crucial to variant interpretation. It is also common for individual labs and research consortia to curate custom databases that are used, for example, to exclude variants arising in genes or exons that are systematic sources of false positives in exome or genome resequencing studies. Other annotations, such as low-complexity regions , transcription factor binding sites, regulatory regions, or replication timing , can further inform the prioritization of genetic variants related to a phenotype. The integration of such annotations is complementary to the gene-based approaches provided by snpEff , Annovar , and VEP . Each of these tools can provide additional, region-based annotation, yet they are limited to the genome annotation sets provided by the software. While extensive variant annotation is fundamental to nearly every modern study of genetic variation, no existing software can flexibly and simply annotate VCF files with so many diverse data sets.
We have therefore developed vcfanno as a fast and general solution for variant annotation that allows variants to be “decorated” with any annotation dataset in common formats. In addition to providing the first method that is capable of annotating with multiple annotation sets at a time, vcfanno also avoids common issues such as inconsistent chromosome labeling (“chr1” versus “1”) and ordering (1,2,…10…, or 1,10,11…) among the VCF and annotation files. To maximize performance with dozens of annotation files comprised of millions of genome intervals, we introduce a parallel sweeping algorithm with high scalability. In an effort to make vcfanno’s annotation functionality as flexible as possible, we have also embedded a lua (http://www.lua.org/) scripting engine that allows users to write custom operations.
Overview of the vcfanno functionality
Vcfanno annotates variants in a VCF file (the “query” intervals) with information aggregated from the set of intersecting intervals among many different annotation files (the “database” intervals) stored in common genomic formats such as BED, GFF, GTF, VCF, and BAM. It utilizes a “streaming” intersection algorithm that leverages sorted input files to greatly reduce memory consumption and improve speed. As the streaming intersection is performed (details below), database intervals are associated with a query interval if there is an interval intersection. Once all intersections for a particular query interval are known, the annotation proceeds according to user-defined operations that are applied to the attributes (e.g., the “score” column in a BED annotation file or an attribute in the INFO field of a VCF annotation file) data within the database intervals. As a simple example, consider a query VCF of single nucleotide variants (SNVs) that was annotated by SNVs from an annotation database such as a VCF file of the dbSNP resource. In this case, the query and database variants are matched on position, REF, and ALT fields when available and a value from the overlapping database interval (e.g., minor allele frequency) is carried forward to become the annotation stored in the INFO field of the query VCF. In a more complex scenario where a query structural variant intersects multiple annotation intervals from each database, the information from those intervals must be aggregated. One may wish to report each of the attributes as a comma-separated list via the “concat” operation. Alternatively, one could select the maximum allele frequency via the “max” operation. For cases where only a single database interval is associated with the query, the choice of operation will not affect the summarized value.
Overview of the chrom-sweep algorithm
The chromosome sweeping algorithm (“chrom-sweep”) is an adaptation of the streaming, sort-merge join algorithm, and is capable of efficiently detecting interval intersections among multiple interval files, as long as they are sorted by both chromosome and interval start position. Utilized by both BEDTOOLS [10, 11] and BEDOPS , chrom-sweep finds intersections in a single pass by advancing pointers in each file that are synchronized by genomic position. At each step in the sweep, these pointers maintain the set of intervals that intersect a particular position and, in turn, intersect each other. This strategy is advantageous for large datasets because it avoids the use of data structures such as interval trees or hierarchical bins (e.g., the UCSC binning algorithm ). While these tree and binning techniques do not require sorted input, the memory footprint of these methods scales poorly, especially when compared with streaming algorithms, which typically exhibit low, average-case memory demands.
Limitations of the chrom-sweep algorithm
Owing to the fact that annotation sets are not loaded into memory-intensive data structures, the chrom-sweep algorithm easily scales to large datasets. However, it does have some important limitations. First, it requires that all intervals from all annotation files adhere to the same chromosome order. While conceptually simple, this is especially onerous since VCFs produced by variant callers such as GATK impose a different chromosome order (1, 2, …21, X, Y, MT) than most other numerically sorted annotation files, which would put MT before X and Y. Of course, sorting the numeric chromosomes as characters or integers also results in different sort orders. Discrepancies in chromosome ordering among files are often not detected until substantial computation has already been performed. A related problem is when one file contains intervals from a given chromosome that the other does not, it’s not possible to distinguish whether the chromosome order is different or if that chromosome is simply not present in one of the files until all intervals are parsed.
Second, the standard chrom-sweep implementation is suboptimal because it is often forced to consider (and parse) many annotation intervals that will never intersect the query intervals, resulting in unnecessary work . For example, given a VCF file of variants that are sparsely distributed throughout the genome (e.g., a VCF from a single exome study) and dense data sets of whole-genome annotations, chrom-sweep must parse and test each interval of the whole-genome annotations for intersection with a query interval, even though the areas of interest comprise less than 1 % of the regions in the file. In other words, sparse queries with dense annotation files represent a worst-case scenario for the performance of chrom-sweep because a high proportion of the intervals in the data sets will never intersect.
A third limitation of the chrom-sweep algorithm is that, due to the inherently serial nature of the algorithm, it is difficult to parallelize the detection of interval intersections and the single CPU performance is limited by the speed at which intervals can be parsed. Since the intervals arrive in sorted order, skipping ahead to process a new region from each file in a different processing thread is difficult without a pre-computed spatial index of the intervals and reporting the intervals in sorted order after intersection requires additional bookkeeping.
A parallel chrom-sweep algorithm
To address these shortcomings, we developed a parallel algorithm that concurrently chrom-sweeps “chunks” of query and database intervals. Unlike previous in-memory parallel sweeping methods that uniformly partition the input , we define (without the need for preprocessing ) chunks by consecutive query intervals that meet one of two criteria: either the set reaches the “chunk size” threshold or the genomic distance to the next interval exceeds the “gap size” threshold. Restricting the chunk size creates reasonably even work among the threads to support efficient load balancing (i.e., to avoid task divergence). The gap size cutoff is designed to avoid processing an excessive number of unrelated database intervals that reside between distant query intervals.
Vcfanno is written in Go (https://golang.org), which provides a number of advantages. First, Go supports cross-compilation for 32- and 64-bit systems for Mac, Linux, and Windows. Go’s performance means that vcfanno can process large data sets relatively quickly. Go also offers a simple concurrency model, allowing vcfanno to perform intersections in parallel while minimizing the possibility of race conditions and load balancing problems that often plague parallel implementations. Moreover, as we demonstrate in the Results section, vcfanno’s parallel implementation of the chrom-sweep algorithm affords speed and scalability. Lastly, it is a very flexible tool because of its support for annotations provided in many common formats such as BED, VCF, GFF, BAM, and GTF.
Scalability of VCF annotation
The impact of interval distribution on performance
Comparison to other methods
Speed comparison with other methods
Number of processors
Vcfanno includes additional features that provide unique functionality with respect to existing tools. Annotating structural variants (SV) is complicated by the fact that, owing to the alignment signals used for SV discovery, there is often uncertainty regarding the precise location of SV breakpoints . Vcfanno accounts for this uncertainty by taking into account the confidence intervals (defined by the CIPOS and CIEND attributes in the VCF specification) associated with SV breakpoints when considering annotation intersections. The confidence intervals define a genomic range in which the breakpoints are most likely to exist; therefore, it is crucial for vcfanno to take these intervals into consideration when it considers annotations associated with SV breakpoints. Moreover, since SVs frequently affect hundreds to thousands of nucleotides, they will often intersect multiple intervals per annotation file. In such cases, the summary operations described above can be used to distill the multiple annotation intersections into a single descriptive measure.
Use case demonstration
Genetic studies of rare familial disease typically annotate the resulting VCF file with predictions of the consequence of each identified genetic variant on transcript function using VEP, snpEff, or VEP. However, further annotation with other reference databases is required to isolate the handful of variants that could plausibly underlie the phenotype. As an example, vcfanno could be used to further annotate the clinical significance of each variant, as well as any diseases known to be associated with the variant using the VCF files provided by ClinVar. In addition, the allele frequency and number of heterozygous and homozygous alternative genotypes observed in ExAC could be added to prevent consideration of alleles that are either too common or have too many homozygous alternative genotypes to be plausible for a rare, recessive disorder. We have provided a vcfanno configuration file that demonstrates this variant prioritization scenario at Github (https://github.com/brentp/vcfanno/blob/master/scripts/paper/example.conf).
We have introduced vcfanno as a fast and flexible new software resource that facilitates the annotation of genetic variation in any species. We anticipate that vcfanno will be useful both as a standalone annotation tool and also in conjunction with downstream VCF filtering and manipulation software such as snpEff , BCFTools , BGT , and GQT . There are, however, caveats to the proper use of vcfanno and it exhibits poorer performance in certain scenarios. Performance will vary depending upon the number of sample genotypes that are present in the input VCF file, as more samples yield a larger VCF file that requires more processing time. For example, the performance described in Figs. 4 and 5 reflect VCF files that lack sample genotypes (i.e., solely sites of genetic variation). When annotating the 1000 Genomes VCF that includes 2504 sample genotypes, vcfanno requires 42 minutes using 16 cores, versus 17 minutes without genotypes. Secondly, vcfanno’s relative performance is, not surprisingly, less impressive on very sparse datasets (e.g., one or two variants every 20 kb), such as a VCF resulting from the exome sequencing of one individual. While annotating these files is still quite fast (typically between 3 and 5 minutes), the sparsity of data exposes the overhead associated with using Tabix to create streams of database intervals that are germane to the current chunk. Tabix must decompress an entire BGZF block from each annotation file even if the query chunk merely includes a single variant because Tabix’s smallest block represents a genomic range of 16 kb. Therefore, when the query VCF is very sparse, an entire block from each annotation is frequently (and wastefully) decompressed for each query variant. In future versions of vcfanno, we will explore alternative approaches in order to avoid this limitation, thereby maximizing performance in all usage scenarios. Lastly, when annotating with other VCF files, it is recommended that both the variants in the query VCF and each database VCF are normalized and decomposed in order to ensure that both variant sites and alleles are properly matched when extracting attributes from the database VCF files .
Vcfanno is an extremely efficient and flexible software package for annotating genetic variants in VCF format in any species. It represents a substantial improvement over existing methods, enabling rapid annotation of whole-genome and whole-exome datasets and provides substantial analytical power to studies of disease, population genetics, and evolution.
Availability of data and materials
Project name: vcfanno
Source code: https://github.com/brentp/vcfanno
Pre-compiled binaries: https://github.com/brentp/vcfanno/releases
Scripts used in the manuscript: https://github.com/brentp/vcfanno/tree/master/scripts/paper
Archived version: DOI: http://dx.doi.org/10.5281/zenodo.49500
Operating system(s): Linux, OSX, Windows
Programming language: Go
Any restrictions to use by non-academics: None
We acknowledge Liron Ganel for helpful suggestions in developing support for annotating structural variants.
This research was supported by a US National Human Genome Research Institute award to ARQ (NIH R01HG006693).
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–8.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.
- Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv [q-bio.GN]. 2012.
- Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics. 2014;30:2843–51.View ArticlePubMedPubMed CentralGoogle Scholar
- Koren A, Polak P, Nemesh J, Michaelson JJ, Sebat J, Sunyaev SR, McCarroll SA. Differential relationship of DNA replication timing to different forms of human mutation and variation. Am J Hum Genet. 2012;91:1033–40.
- Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012;6:80–92.
- Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38, e164.View ArticlePubMedPubMed CentralGoogle Scholar
- McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics. 2010;26:2069–70.View ArticlePubMedPubMed CentralGoogle Scholar
- Exome Aggregation Consortium, Lek M, Karczewski K, Minikel E, Samocha K, Banks E, Fennell T, O‟Donnell-Luria A, Ware J, Hill A, Cummings B, Tukiainen T, Birnbaum D, Kosmicki J, Duncan L, Estrada K, Zhao F, Zou J, Pierce-Hoffman E, Cooper D, DePristo M, Do R, Flannick J, Fromer M, Gauthier L, Goldstein J, Gupta N, Howrigan D, Kiezun A, Kurki M, et al. Analysis of protein-coding genetic variation in 60,706 humans. bioRxiv. 2015:030338
- Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.View ArticlePubMedPubMed CentralGoogle Scholar
- Quinlan AR. BEDTools: the Swiss-Army tool for genome feature analysis. Curr Protoc Bioinformatics. 2014;47:11.12.1–11.12.34.View ArticleGoogle Scholar
- Neph S, Kuehn MS, Reynolds AP, Haugen E, Thurman RE, Johnson AK, Rynes E, Maurano MT, Vierstra J, Thomas S, Sandstrom R, Humbert R, Stamatoyannopoulos JA. BEDOPS: high-performance genomic feature operations. Bioinformatics. 2012;28:1919–20.
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002;12:996–1006.
- Layer RM, Quinlan AR. A parallel algorithm for $N$-way interval set intersection. Proc IEEE. 2015;99:1–10.
- McKenney M, McGuire T. A parallel plane sweep algorithm for multi-core systems. In: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM; 2009. pp. 392–395. http://dl.acm.org/citation.cfm?id=1653827.
- Khlopotine AB, Jandhyala V, Kirkpatrick D. A variant of parallel plane sweep algorithm for multicore systems. IEEE Trans Comput Aided Des Integr Circuits Syst. 2013;32:966–70.View ArticleGoogle Scholar
- Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011;27:718–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Daniel Kortschak R, Adelson DL. bíogo: a simple high-performance bioinformatics toolkit for the Go language. bioRxiv. 2014:005033.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9.
- Quinlan AR, Hall IM. Characterizing complex structural variation in germline and somatic genomes. Trends Genet. 2012;28:43–53.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H. BGT: efficient and flexible genotype query across many samples. Bioinformatics. 2016;32:590–2.View ArticlePubMedGoogle Scholar
- Layer RM, Kindlon N, Karczewski KJ, Exome Aggregation Consortium, Quinlan AR. Efficient genotype compression and analysis of large genetic-variation data sets. Nat Methods. 2015
- Tan A, Abecasis GR, Kang HM. Unified representation of genetic variants. Bioinformatics. 2015;31:2202–4.View ArticlePubMedGoogle Scholar