
Table 1 Requirements for a summary statistics storage format and solutions offered by the VCF

From: The variant call format provides efficient and robust storage of GWAS summary statistics

Requirement

Solution using the variant call format

Human readable and easy to parse

Readable with any text viewer. Mature open-source parsing libraries (HTSLIB [17] and HTSJDK [17]) are available and are implemented in most modern programming languages; for example, the VariantAnnotation [18] R package is available from Bioconductor [19,20,21], and the Python package pysam [17, 22] is also available. Bcftools [23], GATK [24], bedtools [25] and others provide user-friendly functionality from the command line.

Unambiguous interpretation of the data

Data field descriptions, value types and number of values are required and defined in the file header. File validity is enforced during each read/write.
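As a rough illustration of how the header makes each field self-describing, the sketch below parses structured `##INFO`/`##FORMAT` header lines into a lookup table. It is a simplified stand-in for a real parser such as HTSLIB; the function name `parse_field_definitions` is hypothetical, and the field IDs `ES` and `SE` with their descriptions are assumed examples rather than values taken from this table.

```python
import re

# Structured header lines look like: ##FORMAT=<ID=...,Number=...,Type=...,Description="...">
_META = re.compile(r'^##(?P<kind>INFO|FORMAT)=<(?P<body>.*)>$')
# key=value pairs, where the value may be a quoted string containing commas
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')

def parse_field_definitions(header_lines):
    """Return {(kind, ID): {"ID": ..., "Number": ..., "Type": ..., "Description": ...}}."""
    fields = {}
    for line in header_lines:
        match = _META.match(line.strip())
        if not match:
            continue  # not an INFO/FORMAT definition line
        attrs = {k: v.strip('"') for k, v in _PAIR.findall(match["body"])}
        fields[(match["kind"], attrs["ID"])] = attrs
    return fields

# Assumed example header; the IDs and descriptions are illustrative only.
header = [
    '##fileformat=VCFv4.2',
    '##FORMAT=<ID=ES,Number=A,Type=Float,Description="Effect size estimate relative to the alternative allele">',
    '##FORMAT=<ID=SE,Number=A,Type=Float,Description="Standard error of effect size estimate">',
]
definitions = parse_field_definitions(header)
print(definitions[("FORMAT", "ES")]["Type"])
```

A reader that knows the declared type and number of values for every field can reject a malformed record instead of guessing at its meaning.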

Unambiguous representation of bi-allelic, multiallelic and insertion-deletion variants

Every variant substitution is represented by reference and alternative allele haplotypes defining the exact base change on the forward strand. The reference allele is required to match genome sequences defined in the file header. The alternative allele is always the effect allele allowing consistency between studies for ease of comparison.
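The convention that the alternative allele is always the effect allele implies a harmonisation step when converting existing summary statistics. A minimal sketch of that step is given below, assuming bi-allelic variants already reported on the forward strand; the function name `harmonise` and its arguments are hypothetical, and strand flips and multiallelic sites are deliberately not handled.

```python
def harmonise(ref_base, effect_allele, other_allele, beta, eaf):
    """Orient one association so that ALT is the effect allele.

    ref_base is the forward-strand reference allele at this position.
    Returns (ref, alt, beta, eaf), or None if neither allele matches the reference.
    """
    if other_allele == ref_base:
        # Already oriented: the effect allele is the alternative allele.
        return ref_base, effect_allele, beta, eaf
    if effect_allele == ref_base:
        # The effect was reported for the reference allele: swap the alleles,
        # negate the effect size and complement the effect allele frequency.
        return ref_base, other_allele, -beta, 1.0 - eaf
    # Neither allele matches the reference (e.g. strand issue); leave for review.
    return None

print(harmonise("A", "G", "A", 0.12, 0.25))  # already ALT-oriented
print(harmonise("A", "A", "G", 0.12, 0.25))  # needs flipping
```

Once every study is oriented this way, effect sizes from different sources can be compared without checking per-file allele conventions.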

Genomic information can be validated

The file header contains information about reference genome assembly and contigs. Reference alleles must match the sequence in the referenced genome build (in FASTA format). GATK [24] ValidateVariants can be used to verify file format validity and compare reference allele information against the corresponding genome reference sequence.
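The reference allele check can be pictured as a simple substring comparison. The sketch below is not how GATK ValidateVariants is implemented; it only illustrates the idea on a toy in-memory sequence, with a hypothetical helper name `ref_matches`.

```python
def ref_matches(sequence, pos, ref):
    """Check a REF allele against a contig sequence; pos is 1-based, as in VCF."""
    start = pos - 1  # convert to a 0-based string offset
    return sequence[start:start + len(ref)].upper() == ref.upper()

contig = "ACGTTGCA"  # toy sequence standing in for a contig loaded from FASTA
print(ref_matches(contig, 4, "TT"))  # True: bases 4-5 are "TT"
print(ref_matches(contig, 4, "G"))   # False: base 4 is "T"
```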

Flexibility on which GWAS fields are recorded and enforcement of essential fields

All fields are defined in the file header and can be set optional or required as desired. The specification contains essential fields and their reserved names.

Capacity to store metadata about the study and traits

The file header contains information about the source and date of summary statistics, study IDs (e.g. PMID/DOI of publication describing the study, or accession number and repository of individual-level data), description of the traits studied (e.g. type, association test used, and measurement unit) as well as the source and version of trait IDs (e.g. IEU OpenGWAS database [26], Experimental Factor Ontology [27], Human Phenotyping Ontology [28], Medical Subject Headings [29], IDs for clinical and other traits, Ensembl Gene IDs for eQTL datasets or any other ontology to describe the data).

Allows multiple traits to be stored together

The SAMPLE column was chosen to store variant-trait association data to allow for storage of multiple traits in a single VCF file or as individual files if desired.
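To make the layout concrete, the sketch below splits one data line into its site-level fields and one dictionary of values per trait. The trait names, the FORMAT keys (`ES`, `SE`, `LP`) and the record itself are assumed for illustration, and only numeric per-trait fields are handled.

```python
def parse_record(line, sample_names):
    """Split one VCF data line into site fields and per-trait values."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref, alt = cols[:5]
    keys = cols[8].split(":")  # the FORMAT column names the per-trait keys
    traits = {}
    for name, column in zip(sample_names, cols[9:]):
        values = column.split(":")
        # "." marks a missing value in VCF
        traits[name] = {k: (None if v == "." else float(v)) for k, v in zip(keys, values)}
    return {"chrom": chrom, "pos": int(pos), "id": rsid, "ref": ref, "alt": alt, "traits": traits}

# Hypothetical record with two traits stored as two sample columns.
line = "1\t1000\trs123\tA\tG\t.\tPASS\t.\tES:SE:LP\t0.05:0.01:6.3\t-0.02:0.01:.\n"
record = parse_record(line, ["trait_A", "trait_B"])
print(record["traits"]["trait_A"]["ES"], record["traits"]["trait_B"]["LP"])
```

Adding another trait then amounts to adding another column, while the variant description in the first columns is stored only once.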

Rapid querying by variant identifier, genomic position interval or GWAS summary statistics value (range or exact value)

The file is sorted karyotypically and indexed by chromosome position using tabix [30] to enable fast queries by genomic position. Secondary indexing on dbSNP [31] identifier is also provided using rsidx [32]. Refer to performance comparisons of indexed VCF files and standard UNIX tools.
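The idea of a secondary index on variant identifier can be sketched with SQLite from the Python standard library: the identifier is mapped to a chromosome and position, which can then be passed to a positional index such as tabix. The table name, schema and helper functions below are assumptions for illustration and do not reproduce the rsidx API or its on-disk layout.

```python
import sqlite3

def build_rsid_index(records):
    """Build an in-memory index from (rsid, chrom, pos) tuples taken from a VCF."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE rsid_to_coord (rsid TEXT PRIMARY KEY, chrom TEXT, pos INTEGER)")
    db.executemany("INSERT INTO rsid_to_coord VALUES (?, ?, ?)", records)
    db.commit()
    return db

def lookup(db, rsid):
    """Return (chrom, pos) for an rsID, or None if it is not indexed."""
    return db.execute(
        "SELECT chrom, pos FROM rsid_to_coord WHERE rsid = ?", (rsid,)
    ).fetchone()

# Hypothetical identifiers and coordinates.
db = build_rsid_index([("rs123", "1", 1000), ("rs456", "2", 5000)])
print(lookup(db, "rs456"))
```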

File compression

VCF files may be compressed with block GZIP [23] or converted to the binary call format (BCF), the binary companion format to VCF [23].
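Block compression is what allows a compressed file to remain randomly accessible. The sketch below imitates the principle with the standard `gzip` module by compressing fixed-size chunks as independent gzip members; it is a simplification and not the actual block GZIP layout, which limits blocks to 64 KiB and records each block's size in an extra header field. The helper name and the tiny block size are chosen for illustration.

```python
import gzip

def compress_in_blocks(data, block_size=64):
    """Compress each chunk as its own gzip member; return (blob, member_offsets)."""
    blob, offsets = b"", []
    for i in range(0, len(data), block_size):
        offsets.append(len(blob))  # byte offset where this member starts
        blob += gzip.compress(data[i:i + block_size])
    return blob, offsets

# Toy tab-separated lines standing in for VCF records.
text = b"".join(b"1\t%d\trs%d\tA\tG\n" % (i, i) for i in range(1, 21))
blob, offsets = compress_in_blocks(text)

# The concatenation is still a valid gzip stream...
assert gzip.decompress(blob) == text
# ...and one block can be decompressed without reading the others.
second = gzip.decompress(blob[offsets[1]:offsets[2]])
assert second == text[64:128]
```

An index that maps genomic positions to block offsets can therefore jump straight to the relevant block, which is the property that positional indexing relies on.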

Readable by existing open-source tools

A large number of tools support VCF files including GATK [24], Picard [33], bcftools [23], bedtools [25], vcftools [16] and plink [7]. Bcftools [23] can also provide a tabular extract for use with non-compatible tools.

Amenable to cloud-based streaming and database storage

Genomic intervals may be extracted over a network using a range request which extracts file segments without transferring the whole file. This enables rapid streaming of queries over the Internet. For high-throughput and distributed storage and querying, VCF files can be easily imported into GenomicsDB [34].
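A range request is an ordinary HTTP request carrying a `Range` header. The sketch below only builds such a request with the standard library; the URL and byte offsets are hypothetical, and in practice the offsets would be derived from the file's index rather than chosen by hand.

```python
from urllib.request import Request

def range_request(url, start, end):
    """Build an HTTP request for bytes start..end (inclusive) of a remote file."""
    request = Request(url)
    request.add_header("Range", f"bytes={start}-{end}")
    return request

# Hypothetical URL and offsets, for illustration only.
req = range_request("https://example.org/study.vcf.gz", 1024, 4095)
print(req.get_header("Range"))
```

Passing the request to `urllib.request.urlopen` would fetch only that segment, provided the server supports range requests and answers with status 206 (Partial Content).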

GWAS: genome-wide association study; dbSNP: database of single-nucleotide polymorphisms; HTSLIB: high-throughput sequencing data library; HTSJDK: high-throughput sequencing data Java development kit; GATK: genome-analysis toolkit; eQTL: expression quantitative trait loci