Scaling computational genomics to millions of individuals with GPUs

Current genomics methods are designed to handle tens to thousands of samples but will need to scale to millions to match the pace of data and hypothesis generation in biomedical science. Here, we show that high efficiency at low cost can be achieved by leveraging general-purpose libraries for computing using graphics processing units (GPUs), such as PyTorch and TensorFlow. We demonstrate > 200-fold decreases in runtime and ~ 5–10-fold reductions in cost relative to CPUs. We anticipate that the accessibility of these libraries will lead to a widespread adoption of GPUs in computational genomics.


Background
Current methodologies for analyzing genomic data were designed for datasets with tens to thousands of samples, but due to the continuing decrease in sequencing costs and growth of large-scale genomic projects, datasets are reaching sizes of millions of samples or single cells. The need for increased computational resources, most notably runtime, to process these growing datasets will become prohibitive without improving the computational efficiency and scalability of methods. For example, methods in population genetics, such as genome-wide association studies (GWAS) or mapping of quantitative trait loci (QTL), involve billions of regressions between genotypes and phenotypes. Currently, the state-of-theart infrastructures for performing these tasks are largescale clusters of central processing units (CPUs), often with thousands of cores, resulting in significant costs [1] (960 cores on a standard Google Cloud machine currently costs $7660.80 per day of compute). In contrast to CPUs, a single graphics processing unit (GPU) contains thousands of cores at a much lower price per core (Nvidia's P100 has 3584 cores and currently costs $35.04 per day of compute).
However, these implementations were often complex and based on specialized libraries, limiting their extensibility and adoption. In contrast, recent open-source libraries such as TensorFlow [7] or PyTorch [8], which were developed for machine learning applications but implement general-purpose mathematical primitives and methods (e.g., matrix multiplication), make the development of GPU-compatible tools widely accessible to the research community. These libraries offer several major advantages: (i) they implement most of the functionalities of CPU-based scientific computing libraries such as NumPy, and thus are easy to use for implementing various algorithms; (ii) they easily handle data transfer from the computer's memory to the GPU's internal memory, including in batches, and thus greatly facilitate computations on large datasets (e.g., large genotype matrices) that do not fit into the GPU's memory; (iii) they are trivial to install and run, enabling easy sharing of methods; and (iv) they can run seamlessly on both CPUs and GPUs, permitting users without access to GPUs to test and use them, without loss of performance compared with other CPU-based implementations (Additional file 1: Figure S1). Moreover, users do not need to explicitly specify how to parallelize algorithms across the GPU cores. We hypothesized that the use of these libraries would result in significant improvements in computational efficiency and enable scaling computational genomics methods to millions of samples.

Results and discussion
To study the efficiency and benchmark the use of Ten-sorFlow and PyTorch for large-scale genomic analyses on GPUs, we re-implemented methods for two commonly performed computational genomics tasks: (i) QTL mapping [9,10] (which we call tensorQTL [11]) and Bayesian non-negative matrix factorization (NMF) [12] (named SignatureAnalyzer-GPU [13]). We executed the same scripts in identical environments (configured with and without a GPU) and also compared them to previous CPU-based implementations. As a baseline, we also benchmarked the performance of individual mathematical operations such as matrix multiplication, for which we observed up to~1000-fold faster runtimes on a GPU vs. a single CPU core (Additional file 1: Figure S1 and Additional file 2). For SignatureAnalyzer-GPU (SA-GPU) [13], we used the mutation counts matrix generated by the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which contains 2624 tumors represented by 1697 mutational features of somatic single-nucleotide variants as well as short insertions and deletions (defined based on their sequence contexts) [14]. Our PyTorch implementation ran approximately 200 times faster on a GPU than the current implementation of SignatureAnalyzer (SA) in R (run on a single CPU core), with mean times for 10,000 iterations of 1.09 min using SA-GPU vs. 194.8 min using SA (Fig. 1a). Using simulated data, we showed that SA-GPU scales linearly with the number of samples (Additional file 1: Figure S2A). When applied to previously published mutational signatures generated by SA [15], we found the results of the 2 methods were essentially identical, taking into account the stochastic nature of the underlying algorithm (mean R 2 = 0.994, min R 2 = 0.960; Fig. 1b). Additionally, we tested the performance of SA-GPU on multiple GPUs, a task that is easily achieved in PyTorch and enables, for example, faster hyperparameter optimization. For 20 decompositions using the same data as above, we found that performance scaled linearly with the number of GPUs and yielded equivalent results (Additional file 1: Figure S2B-C).
To further demonstrate the scalability of the Bayesian NMF to millions of data points, we used SA-GPU to identify the cell types and their associated transcriptional Fig. 1 Performance of GPU implementations for QTL mapping and signature analysis. a Average runtime to compute 10,000 iterations of Bayesian NMF using SignatureAnalyzer (SA) in R (gold) and SignatureAnalyzer-GPU (SA-GPU; purple). b Correlation heat map of mutation signatures derived from the R and GPU implementations of SignatureAnalyzer using the same input mutation counts matrix. c t-distributed stochastic neighbor embedding (t-SNE) of 1 million embryonic mouse brain cells. Colors indicate clustering based on SA-GPU decomposition performed in~15 min. d Comparison of runtimes for cis-QTL (FastQTL on CPU (gold) and tensorQTL on GPU (purple)) and trans-QTL (tensorQTL on CPU and GPU). e GPU runtime of tensorQTL for the indicated numbers of samples and phenotypes. f Empirical cis-eQTL p values from the V7 GTEx release replicated using tensorQTL. Error bars indicate standard deviation of the mean programs from single-cell RNA sequencing of 1 million mouse brain cells (SRA: SRP096558, Fig. 1c). The average time per SA-GPU run was 14.5 min (using a V100 Nvidia GPU; average over 10 runs) corresponding to an average of 6853 matrix updates per run. A similar analysis on a CPU would require > 2 days per run. Our analysis was able to identify 32 distinct transcriptional programs.
For tensorQTL [11] benchmarking, we generated random data representing up to 50,000 people, each with 10 7 genotypes representing common variants. For each individual, we also simulated up to 50,000 phenotypes, resulting in 500 × 10 9 all-against-all association tests (each calculated for up to 50,000 individuals). Our implementation of cis-QTL mapping with permutations to estimate the empirical false discovery rate was > 250 times faster than the current state-of-the-art implementation (FastQTL [10]; Fig. 1d). Likewise, trans-QTL mapping (i.e., 500 billion regressions) took less than 10 min, a~200× increase in speed compared to running on a CPU ( Fig. 1d and Additional file 1: Figure S3A). Our current implementation does not scale linearly as a function of samples (Additional file 1: Figure S3B) due to limitations in data transfer from the memory of the CPU to the GPU, rather than computational capacity; we leave this additional optimization for future work (Fig. 1e, Additional file 1: Figure S3B). We used data from the V6p and V7 releases of GTEx [16] generated using Matrix eQTL [9] and FastQTL [10], respectively, to demonstrate the reproducibility of our implementation ( Fig. 1f and Additional file 1: Figure S3C).
In addition to the savings in computation time, implementation in TensorFlow or PyTorch also results in significant cost savings-at the time of writing, GPU compute time cost~$0.50-0.75/h on multiple cloud platforms compared to~$0.01-0.05/h for a CPU core. Thus, the same analyses were~5-10-fold cheaper on GPUs.

Conclusions
In summary, the implementation of many commonly used methods in genomics based on new GPU-compatible libraries can vastly increase runtime and reduce costs compared to CPU-based approaches. Indeed, by simply reimplementing current methods, we were able to achieve an order-of-magnitude higher increase in speed than may be achieved through sophisticated approximations for optimizing runtimes on CPUs [17,18]. Our findings indicate that the scale of computations made possible with GPUs will enable investigation of previously unanswerable hypotheses involving more complex models, larger datasets, and more accurate empirical measurements. For example, our GPU implementation enables the computation of empirical p values for trans-QTL, which is cost-prohibitive on CPUs. Similarly, our results show that GPU-based approaches will enable scaling of single-cell analysis methods to millions of cells. Given the availability of libraries that obviate the need for specialized GPU programming, we anticipate a transition to GPU-based computing for a wide range of computational genomics methods.
The following QTL mapping modalities are implemented: -Cis-QTL: nominal associations between all variantphenotype pairs within a specified window (default ± 1 Mb) around the phenotype (transcription start site for genes), as implemented in FastQTL.

Benchmarking
To benchmark tensorQTL, we compared its trans-QTL mapping performance on a machine with and without an attached GPU, and cis-QTL mapping relative to the CPUbased FastQTL [10] (an optimized QTL mapper written in C++). For FastQTL, we computed the runtime per gene by specifying the gene and cis-window using the --include-phenotypes and --region options, respectively. The cis-mapping comparisons were performed using skeletal muscle data from the V6p release of GTEx [16]. To facilitate the comparison of GPU vs. CPU performance when mapping trans-QTLs across a wide range of sample sizes, we used randomly generated genotype, phenotype, and covariate matrices. All tensorQTL benchmarks were conducted on a virtual machine on Google Cloud Platform with 8 Intel Xeon CPU cores (2.30 GHz), 52 GB of memory, and an Nvidia Tesla P100 GPU. For CPU-based comparisons, computations were limited to a single core.

SignatureAnalyzer-GPU
SA-GPU is a PyTorch reimplementation of Signature-Analyzer [21], a method for the identification of somatic mutational signatures using Bayesian NMF [22]. Signa-tureAnalyzer was originally developed in R and is available for download at https://software.broadinstitute.org/ cancer/cga/. Currently, SA-GPU requires the input data matrix and decomposition matrices (W and H) to fit into the GPU memory; however, since high-memory GPUs are readily available (e.g., Nvidia Tesla v100 has 16GB), we do not foresee this limiting its practical use.
In case data sizes were to exceed this limit, the method is easily extensible to multiple GPUs using shared memory with built-in PyTorch methods. SA-GPU can run a single Bayesian NMF or an array of decompositions in parallel, leveraging multiple GPUs. Users should specify a data likelihood function (Poisson or Gaussian) and either exponential or half-normal prior distributions on the elements of W and H, corresponding to L1 or L2 regularization, respectively.

Benchmarking
To benchmark the performance of SA-GPU, we compared SA-GPU with the previous implementation in R. We ran the R implementation using R 3.2.3 with the "Matrix" package for efficient matrix operations. All SA-GPU benchmarks were conducted on a virtual machine on Google Cloud Platform with 12 Intel Xeon CPU cores (2.30GHz), 20 GB of memory, and a Nvidia Tesla V100 GPU. For CPU-based comparisons, a single core was used.
Additional file 1: Figure S1. Performance of matrix multiplication on a single CPU core (2.30GHz Intel Xeon) vs. a GPU (Nvidia Tesla P100), using NumPy (compiled with OpenBLAS) and PyTorch. Runtimes were measured for multiplication of two random (uniform~U[0,1]) square matrices (in 32-bit floating point) with the indicated dimensions. For the 'PyTorch GPU' runtimes, only the matrix multiplication itself was timed. For the 'PyTorch GPU w/ copy' runtimes, the copy of the two input matrices from CPU to GPU memory was included in the timing. Runtimes are shown as the median and median absolute deviation of 15 iterations each. Figure S2. Performance scaling of SignatureAnalyzer-GPU.  Figure S3. GPU performance scaling of tensorQTL. (a) GPU-to-CPU runtime ratio for tensorQTL, across the indicated phenotype and sample sizes, for 10 7 common variants. The ratio is non-constant due to data load and CPU-to-GPU memory input/output times ("i/o") that are more limiting for large sample sizes (number of individuals). (b) CPU runtime of tensorQTL for the indicated range of sample and phenotype sizes (left panel). CPU runtimes scale linearly, demonstrated by the collapse of the compute time when measured as a function of number of operations (approximated as phenotypes × samples × variants; middle panel), whereas GPU runtimes do not show this collapse for large sample sizes due to i/o limitations (right panel). (c) Nominal significant trans-eQTL p values from the V6p GTEx release replicated using tensorQTL.
Additional file 2. Benchmarking code from Additional file 1: Figure S1 Additional file 3. Review history.

Review history
The review history is available as Additional file 3.
Authors' contributions ATW and FA outlined and planned the development. ATW, FA, GG, and EMV prepared and reviewed the manuscript and figures. FA and ATW developed the tensorQTL. ATW, NH, JK, and SG developed the SignatureAnalyzer-GPU. FA performed and planned the benchmarking analyses for tensorQTL. ATW, NH, SA, and JK performed and planned the benchmarking analyses for SignatureAnalyzer-GPU. All authors reviewed and edited the manuscript. All authors read and approved the final manuscript.

Availability of data and materials
All software is available on GitHub and implemented in Python using opensource libraries. tensorQTL is released under the open-source BSD 3-Clause License and is available at https://github.com/broadinstitute/tensorQTL [11]. SignatureAnalyzer-GPU is released under the open-source BSD 3-Clause License and is available at https://github.com/broadinstitute/SignatureAnaly zer-GPU [13].
Ethics approval and consent to participate Not applicable.
Competing interests ATW holds stock in Nvidia and Google (the developers of TensorFlow). ATW, FA, and GG have filed a patent application related to the methods described in this work. GG receives research funds from IBM and Pharmacyclics. GG is an inventor on multiple bioinformatics-related patent applications. EMV is a consultant for Tango Therapeutics, Genome Medical, Invitae, Foresite Capital, and Illumina. EMV received research support from Novartis and BMS, as well as travel support from Roche/Genentech. EMV is an equity holder of Syapse, Tango Therapeutics, and Genome Medical. EMV holds stock in Microsoft. Broad permits non-profit institutions and government agencies to operate under Broad patents to conduct internal research, including sponsored research to the extent such research does not include the production or manufacture of products for sale or offer for sale or performance of commercial services for a fee.