High-performance web services for querying gene and variant annotation

Efficient tools for data management and integration are essential for many aspects of high-throughput biology. In particular, annotations of genes and human genetic variants are commonly used but highly fragmented across many resources. Here, we describe MyGene.info and MyVariant.info, high-performance web services for querying gene and variant annotation information. These web services are currently accessed more than three million times per month. They also demonstrate a generalizable cloud-based model for organizing and querying biological annotation information. MyGene.info and MyVariant.info are accessible at http://mygene.info and http://myvariant.info. Both are offered free of charge to the research community. Electronic supplementary material: The online version of this article (doi:10.1186/s13059-016-0953-9) contains supplementary material, which is available to authorized users.

Gene-specific annotation fields available from MyGene.info web services. The first column is the field name; the second column indicates whether the field is indexed (and can therefore be queried); the third column is the data type of the field (such as string, integer, or object).
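As a rough illustration of how an indexed field can be queried, the Python sketch below builds a query URL that searches one field and requests a subset of fields in the response. The `/v3/query` path, the `q` and `fields` parameter names, and the helper `build_gene_query` are assumptions made for this example and are not given in the text; no network request is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the query endpoint path and version are not stated in the text.
MYGENE_QUERY_URL = "http://mygene.info/v3/query"


def build_gene_query(field, value, return_fields):
    """Build a URL that searches one indexed field and requests selected fields.

    Hypothetical helper for illustration; it only assembles the URL.
    """
    params = {
        "q": f"{field}:{value}",             # fielded query, e.g. symbol:CDK2
        "fields": ",".join(return_fields),   # comma-separated fields to return
    }
    return f"{MYGENE_QUERY_URL}?{urlencode(params)}"


url = build_gene_query("symbol", "CDK2", ["symbol", "name", "entrezgene"])
print(url)
```

Only fields marked as indexed in the table would be meaningful as the `field` argument; any available field could in principle appear in `return_fields`.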

MyVariant.info and MyGene.info Use Case
The following R script demonstrates how the MyVariant.info and MyGene.info R clients can be used to annotate variants and prioritize candidate genes in patients with rare Mendelian diseases. The analysis uses data obtained from the Database of Genotypes and Phenotypes (dbGaP). FASTQ files generated by Ng et al. for the Miller syndrome study (http://www.ncbi.nlm.nih.gov/pubmed/19915526) were processed according to the Broad Institute's best practices. Individual samples were aligned to the hg19 reference genome using BWA-MEM 0.7.10. Variants were called using GATK 3.3-0 HaplotypeCaller, and quality scores were recalibrated using GATK VariantRecalibrator.

Initial Library Imports and Data Loading
The file mendelian.R defines helper functions that are used in the analysis following annotation retrieval: replaceWith0 replaces all NAs in a data.frame with 0.
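The behavior described for replaceWith0 can be sketched in Python as follows. This is not the R helper itself; it assumes the table is represented as a list of dictionaries and treats both `None` and NaN as missing values.

```python
import math


def replace_with_0(rows):
    """Return a copy of `rows` (a list of dicts) with missing values set to 0.

    Missing means None or a float NaN; all other values are left unchanged.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return [
        {key: (0 if is_missing(value) else value) for key, value in row.items()}
        for row in rows
    ]


# Illustrative rows; the column names are hypothetical.
rows = [
    {"variant": "v1", "af": None},
    {"variant": "v2", "af": float("nan")},
    {"variant": "v3", "af": 0.2},
]
cleaned = replace_with_0(rows)
print(cleaned)
```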

Annotating variants with MyVariant.info
The following function reads in each output VCF file using the VariantAnnotation package available from Bioconductor. Install with biocLite("VariantAnnotation"). formatHgvs (from the myvariant Bioconductor package) is a function that reads the genomic location and variant information from the VCF to create HGVS IDs which serve as a primary key for each variant. The function getVariants makes the queries to MyVariant.info to retrieve annotations.
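To make the idea of an HGVS ID as a primary key concrete, here is a minimal Python sketch that builds genomic HGVS IDs for single-nucleotide substitutions from the chromosome, position, reference allele, and alternate allele. It is not the formatHgvs function from the Bioconductor package: the name `format_snv_hgvs` is hypothetical, and insertions, deletions, and other variant types are deliberately left out because they need additional formatting rules.

```python
def format_snv_hgvs(chrom, pos, ref, alt):
    """Build a genomic HGVS ID such as 'chr1:g.35367G>A' for an SNV.

    Hypothetical helper: only single-base substitutions are handled.
    """
    chrom = str(chrom)
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("this sketch only handles single-nucleotide substitutions")
    return f"{chrom}:g.{pos}{ref}>{alt}"


# Illustrative (chrom, pos, ref, alt) tuples as they might be read from a VCF.
records = [("1", 35367, "G", "A"), ("chr7", 140453136, "A", "T")]
hgvs_ids = [format_snv_hgvs(*record) for record in records]
print(hgvs_ids)
```

The resulting IDs could then be passed in a batch to an annotation lookup, which is the role getVariants plays in the R workflow.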

-Filtering for Nonsynonymous and Splice Site Variants
Mendelian diseases are most likely to be caused by nonsynonymous mutations. The CADD database annotates the mutation type in the field "cadd.consequence", so this filter uses that field to keep nonsynonymous and splice site variants.
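A filter of this kind might look like the Python sketch below, which works on annotation records held in memory as nested dictionaries. The consequence labels in `KEEP` and the sample records are assumptions for illustration; the sketch also assumes the field may hold either a single string or a list of strings.

```python
# Assumed consequence labels; check them against the actual annotation values.
KEEP = {"NON_SYNONYMOUS", "SPLICE_SITE", "CANONICAL_SPLICE"}


def consequences(annotation):
    """Return the set of consequence labels stored under cadd.consequence."""
    value = annotation.get("cadd", {}).get("consequence")
    if value is None:
        return set()
    if isinstance(value, str):
        return {value}
    return set(value)


def keep_by_consequence(annotations, keep=KEEP):
    """Keep records with at least one consequence label in `keep`."""
    return [a for a in annotations if consequences(a) & keep]


# Hypothetical annotation records keyed by an "_id" field.
annotations = [
    {"_id": "var1", "cadd": {"consequence": "NON_SYNONYMOUS"}},
    {"_id": "var2", "cadd": {"consequence": "SYNONYMOUS"}},
    {"_id": "var3", "cadd": {"consequence": ["INTRONIC", "SPLICE_SITE"]}},
    {"_id": "var4"},  # no CADD annotation available
]
kept = keep_by_consequence(annotations)
print([a["_id"] for a in kept])
```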

-Filtering for Allele Frequency Annotated by ExAC
The third filter keeps rare variants according to the ExAC data set with allele frequency < 0.01. Rare diseases are likely caused by mutations that have not been documented yet.

-Filtering for Allele Frequency Annotated by 1000 Genomes Project
The fourth filter keeps rare variants according to the 1000 Genomes Project with allele frequency < 0.01.
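The two allele-frequency filters share the same logic, sketched here in Python for records held as flat dictionaries. The field names `exac_af` and `kg_af` are hypothetical stand-ins for the ExAC and 1000 Genomes Project frequencies. Treating a missing frequency as 0, so that undocumented variants pass the filter, is an assumption in line with the replaceWith0 helper described earlier.

```python
def is_rare(record, freq_fields, threshold=0.01):
    """Return True if every listed allele frequency is below `threshold`.

    Assumption: a missing frequency is treated as 0 (variant not observed).
    """
    for field in freq_fields:
        af = record.get(field)
        if af is None:
            af = 0.0
        if af >= threshold:
            return False
    return True


# Hypothetical records; the field names are placeholders.
variants = [
    {"_id": "var1", "exac_af": 0.0002, "kg_af": 0.001},
    {"_id": "var2", "exac_af": 0.15, "kg_af": 0.12},
    {"_id": "var3", "exac_af": None, "kg_af": None},
    {"_id": "var4", "exac_af": 0.005, "kg_af": 0.02},
]
rare = [v for v in variants if is_rare(v, ["exac_af", "kg_af"])]
print([v["_id"] for v in rare])
```

Applying the two sources one after the other, as the text describes, gives the same result as checking both fields in a single pass.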

-Filtering by GO Biological Process Annotation using MyGene.info
Since Miller syndrome is known to be an inborn error of metabolism, this filter keeps only genes involved in metabolic processes according to their GO biological process annotation. To accomplish this, GO biological process annotations are retrieved for each remaining gene using the MyGene.info R client, which can be installed from Bioconductor (biocLite("mygene")). Here, the queryMany function is used, and the necessary annotations are requested through the fields parameter.
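Once the GO annotations have been retrieved, the filtering step itself can be sketched in Python as below. The record layout (a `go` object with a `BP` entry holding one term or a list of terms, each with a `term` string), the keyword match on "metabolic process", and the gene records are all assumptions for illustration and contain no real annotations.

```python
def bp_terms(gene):
    """Return the GO biological process term names for one gene record."""
    bp = gene.get("go", {}).get("BP", [])
    if isinstance(bp, dict):  # assumed: a single term may come back as one object
        bp = [bp]
    return [entry.get("term", "") for entry in bp]


def is_metabolic(gene, keyword="metabolic process"):
    """True if any biological process term contains `keyword`."""
    return any(keyword in term.lower() for term in bp_terms(gene))


# Hypothetical gene records with illustrative term assignments.
genes = [
    {"symbol": "GENE_A", "go": {"BP": [{"term": "lipid metabolic process"}]}},
    {"symbol": "GENE_B", "go": {"BP": {"term": "cell adhesion"}}},
    {"symbol": "GENE_C"},  # no GO annotation returned
]
candidates = [g["symbol"] for g in genes if is_metabolic(g)]
print(candidates)
```

A plain substring match is a simplification: it does not follow the GO hierarchy, so child terms whose names lack the keyword would be missed.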