Churchill parallelized workflow
Each of the required data processing steps was carefully examined (Figure S1 in Additional file 1) and optimal approaches for parallelized processing were determined. Alignment of individual reads to a reference genome is considered an embarrassingly parallel process, as the approximately 1 billion raw reads generated in sequencing a human genome can in theory be mapped to the reference genome independently of one another; the only constraint for paired-end reads is that both reads in a pair must be correctly oriented and within the expected distance of one another. The remaining steps in the analysis workflow are not embarrassingly parallel by nature and, as such, required the development of novel approaches to achieve parallelization. One approach to enable a modest level of parallelization of the subsequent steps is to divide the analysis by individual chromosomes (22 autosomes (chromosomes 1 to 22) and two sex chromosomes (chromosomes X and Y)). However, doing so results in a substantial load imbalance, as chromosome sizes vary considerably, with chromosome 1 being approximately 5 times larger than chromosome 21 (Figure S3A in Additional file 1). In addition, limiting parallelization to the chromosome level restricts the number of usable processors to 24, such that utilization of more than 24 CPU cores cannot improve performance.
To overcome this limitation of parallelization by chromosome, Churchill utilizes an approach that evenly subdivides the whole human genome into multiple regions with fixed boundaries (subregions), enabling a load balanced and independent execution of the local realignment, deduplication, recalibration and genotyping steps (Figure S3B in Additional file 1). However, there are four issues that arise with this strategy:
1. Dependencies. There are several points at which the results of processes run on individual segments of the genome are not independent. First, duplicate read removal requires the entire set of reads in sorted order so that any number of read pairs with identical mappings can be reduced to a single pair; if the data are separated, read pairs must be kept together. Second, segments depend on one another during base quality score recalibration, as best practices suggest that a true baseline of base qualities requires examination of covariates across the entire sample.
2. Parallelization. Assuming these dependencies have been addressed, the issue then becomes how to parallelize these independent processes. One drawback of the computational techniques used in genome resequencing and variant calling is their large memory requirements; there may not be enough memory available to process as many segments as there are cores on the server. Load balancing is also a concern.
3. Determinism. Ideally, introduction of a parallelization strategy should not produce different results depending on how the parallelization was implemented. If determinism is not maintained, results could vary with the resources available at the time of analysis, an unacceptable situation for clinical applications where reproducibility and determinism are essential.
4. Interchromosomal reads. Most read pair distances will be normally distributed around a given insert size, which can vary between sequencing runs, but there will inherently be outliers. These outliers could be sequencing artifacts or improper mappings. In many cases, however, read pairs with large insert sizes, and those with each read of the pair on a different chromosome, could indicate a structural variant, and it is vitally important that they are not disregarded. Shortcuts taken on the dependencies described above could result in lost information regarding interchromosomal reads.
In theory, the extremely large size of the human genome (approximately 3 billion base pairs) makes near-embarrassingly parallel execution of these steps achievable. For example, dividing the genome into 3,000,000 base pair chromosomal subregions would enable execution of these steps in 1,000 parallel processes. The number of subregions created by Churchill can be specified by the user; in practice, increasing it to roughly twice the number of cores available for processing improves load balancing. To ensure proper processing of regional boundaries, each region includes a 3 kilobase overlap with each adjacent region. This overlap acts as a ‘buffer zone’ to ensure appropriate detection of variants near or spanning region boundaries, as is possible in the case of indels. The resulting region and overlap boundary information is saved in the GATK intervals file format.
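As a concrete illustration of this subdivision scheme, the following minimal Python sketch generates fixed-boundary subregions with a 3 kb buffer on each side and writes them in GATK intervals format; the chromosome lengths, region size and output file name are illustrative assumptions, not Churchill's actual code.

```python
# Sketch: divide chromosomes into fixed-size subregions with a 3 kb buffer
# on each side, written in GATK intervals format (1-based, inclusive).
# Chromosome lengths and the output path are illustrative assumptions.

CHROM_LENGTHS = {"chr20": 63_025_520, "chr21": 48_129_895}  # example values (GRCh37)
REGION_SIZE = 3_000_000   # 3 Mb core regions
OVERLAP = 3_000           # 3 kb buffer zone on each side

def make_subregions(chrom_lengths, region_size, overlap):
    """Yield (chrom, start, end) tuples covering each chromosome, where each
    region is padded by `overlap` bases on both sides but clipped to the
    chromosome boundaries."""
    for chrom, length in chrom_lengths.items():
        pos = 1
        while pos <= length:
            core_end = min(pos + region_size - 1, length)
            start = max(1, pos - overlap)
            end = min(length, core_end + overlap)
            yield chrom, start, end
            pos = core_end + 1

with open("subregions.intervals", "w") as out:
    for chrom, start, end in make_subregions(CHROM_LENGTHS, REGION_SIZE, OVERLAP):
        out.write(f"{chrom}:{start}-{end}\n")
```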
However, the post-alignment steps of the analysis process (local realignment, duplicate read removal, base quality score recalibration, genotyping and variant quality score recalibration) could not simply be performed on these subregions; each step required significant refinement to achieve high levels of parallelization without sacrificing data integrity and quality. The five steps of the Churchill workflow and the optimization that was performed are detailed below.
Step 1: parallelized alignment to a reference sequence
For the initial alignment step, BWA is utilized to perform reference genome alignment with the reads contained in paired FASTQ files. The speed of this process can be increased by running the aligner in its built-in multithreading mode (for example, using the bwa aln -t option to specify the number of threads). However, Churchill instead splits the total raw input sequencing data (typically 400 to 800 million paired reads) into multiple smaller FASTQ files and aligns them using multiple parallel single-threaded instances of the alignment algorithm. The number of paired-end FASTQ files generated during the sequencing run is controlled by the --fastq-cluster-count parameter of Illumina’s BCL-conversion process (CASAVA 1.8.2), which specifies the maximum number of reads per output FASTQ file. The default value of 4,000,000 works well with Churchill; however, decreasing the number of reads per FASTQ file to 1,000,000 results in increased alignment speed due to better load balancing.
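The sketch below illustrates this split-and-align pattern: each pair of FASTQ chunks is aligned by an independent single-threaded bwa aln/sampe instance, and the chunks are processed in parallel by a process pool. File names, the reference path and the worker/chunk counts are assumptions for illustration; Churchill's actual alignment scripts differ.

```python
# Sketch: align pre-split paired FASTQ chunks with independent,
# single-threaded bwa instances run in parallel. File naming and the number
# of workers are illustrative assumptions.
import subprocess
from concurrent.futures import ProcessPoolExecutor

REF = "human_g1k_v37.fasta"   # assumed reference path

def align_chunk(chunk_id):
    r1 = f"sample_R1_{chunk_id:03d}.fastq.gz"
    r2 = f"sample_R2_{chunk_id:03d}.fastq.gz"
    sai1, sai2 = f"chunk_{chunk_id:03d}_1.sai", f"chunk_{chunk_id:03d}_2.sai"
    bam = f"chunk_{chunk_id:03d}.bam"
    # Single-threaded alignment of each mate.
    subprocess.run(f"bwa aln {REF} {r1} > {sai1}", shell=True, check=True)
    subprocess.run(f"bwa aln {REF} {r2} > {sai2}", shell=True, check=True)
    # Pair the alignments and convert to BAM.
    subprocess.run(
        f"bwa sampe {REF} {sai1} {sai2} {r1} {r2} | samtools view -bS - > {bam}",
        shell=True, check=True,
    )
    return bam

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=16) as pool:   # one worker per core
        bams = list(pool.map(align_chunk, range(32)))   # 32 FASTQ chunk pairs
```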
Step 2: parallelized generation and deduplication of subregional BAMs
At the heart of the Churchill pipeline is the novel algorithm we developed to convert the raw BAM files produced during alignment into subregions, enabling the parallel implementation of all of the subsequent analysis steps (Figure S4 in Additional file 1). This approach consists of five sequential steps:
1. Split raw BAM by region. The genome is split into M chromosomal subregions, where the value of M is defined by the desired level of parallelization. Utilization of the Churchill parallelized alignment approach generates N raw BAM files (derived from alignment of N pairs of input FASTQ files to the entire genome). These BAM files are split according to the coordinates of the subregions, yielding M × N split BAM files. Read pairs in which mates map to different subregions (both interchromosomal and intrachromosomal) are temporarily transferred to separate split BAM files, one for each of the N input BAM files, identified as chrI.bam (‘I’ is short for inter/intrachromosomal mapping).
2. Merge split BAMs by subregion. For each of the genomic subregions, the N split BAM files corresponding to a given subregion are merged into M subregional BAM files, each containing all of the read pairs mapped within the boundaries of that subregion.
3. Merge split chrI BAMs. The N chrI BAM files are merged into a single genome-wide interchromosomal BAM file.
4. Parallelized deduplication. Duplicate reads are identified and removed from the subregional and interchromosomal BAM files. Reads containing amplification errors may be represented in artificially high numbers, and failure to remove them from the data set would have a significant negative effect on the final result by introducing variants that reflect these errors rather than true biological polymorphisms. The deduplication process identifies read pairs with identical external coordinates and reduces the data set to a single copy of each duplicate sequence, the one with the highest mapping quality. Picard Tools MarkDuplicates is the tool most commonly utilized to identify duplicate reads both within and between chromosomes. Current best practices require that deduplication be performed on a single BAM file containing all of the reads from the sequencing run, which is the approach utilized by the GATK-Queue analysis pipeline. However, in addition to this prolonged serial deduplication, the process of merging the BAM files into a single file cannot be parallelized; together, these lengthy single-threaded computations substantially increase analysis run time. The Churchill algorithm overcomes this limitation by keeping interchromosomal reads together and deduplicating them first, before the individual reads of each pair are merged by coordinates into the appropriate subregional BAMs. This approach ensures proper deduplication of interchromosomal reads and enables safe parallelization of the remainder of the deduplication process across both chromosomes and chromosomal subregions, achieving high levels of parallelization of duplicate marking and removal without compromising data integrity. The Churchill deduplicated BAM is indistinguishable from that obtained through the lengthy post-alignment processing of a single merged genome-wide BAM file.
5. Merge chrI reads with subregional BAMs. The deduplicated interchromosomal paired reads are split according to subregion, and the individual reads are merged back into the appropriate subregional BAM according to the read coordinates. The resulting alignment files contain both appropriately deduplicated interchromosomal and regular reads. In addition, a copy of the chrI.bam file is kept as it may aid in detection and analysis of structural variants.
The final output of this step is multiple BAM files, one for every genomic subregion, which include appropriately mapped and deduplicated reads, thereby enabling parallelization of the subsequent steps.
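The following minimal sketch, using PySam (one of Churchill's required components), illustrates the splitting idea from step 1 above: reads whose mates map to a different subregion are diverted to a per-input chrI BAM so that the pair can later be deduplicated together, while all other reads are routed to their subregional split BAM. The region lookup and file names are simplified assumptions rather than Churchill's actual implementation.

```python
# Sketch: route reads from one aligned chunk BAM either to per-subregion
# split BAMs or, when the mate maps to a different subregion, to a chrI BAM.
# Region lookup and file names are simplified assumptions.
import pysam

REGION_SIZE = 3_000_000

def region_of(chrom, pos):
    """Map a (chromosome, position) pair to a fixed-boundary subregion label."""
    return f"{chrom}_{pos // REGION_SIZE}"

with pysam.AlignmentFile("aligned_chunk.bam", "rb") as inp:
    writers = {}
    chri = pysam.AlignmentFile("chrI.bam", "wb", template=inp)
    for read in inp:
        if read.is_unmapped or read.mate_is_unmapped:
            continue  # unmapped reads are handled separately in practice
        here = region_of(read.reference_name, read.reference_start)
        mate = region_of(read.next_reference_name, read.next_reference_start)
        if here != mate:
            chri.write(read)          # mate maps to a different subregion
            continue
        if here not in writers:
            writers[here] = pysam.AlignmentFile(f"split_{here}.bam", "wb", template=inp)
        writers[here].write(read)
    for w in writers.values():
        w.close()
    chri.close()
```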
Step 3: parallelized local realignment around indels
In the second post-alignment processing step, local read realignment is performed to correct for potential alignment errors around indels. Mapping of reads around the edges of indels often results in misaligned bases, creating false-positive SNP calls. Local realignment uses these mismatching bases to determine if a site should be realigned, and applies a computationally intensive Smith-Waterman algorithm to determine the most consistent placement of the reads with respect to the indel and remove misalignment artifacts [34]. The major advantage of Churchill in parallelizing local realignment is that all reads from a given sample are used to perform the local realignment, ensuring the greatest possible accuracy and improving novel indel detection. Moreover, applying sample-level local realignment across Churchill’s subregions results in significant improvements in speed.
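As an illustration of how realignment can be confined to a subregion, the sketch below builds GATK 3-style RealignerTargetCreator and IndelRealigner commands restricted to each subregion's intervals file via -L and runs them in parallel; the jar path, reference and file naming are assumptions, not Churchill's scripts.

```python
# Sketch: per-subregion local realignment with GATK 3-style commands, each
# instance restricted to its intervals file. Paths and names are assumptions.
import subprocess
from concurrent.futures import ProcessPoolExecutor

GATK = "java -Xmx4g -jar GenomeAnalysisTK.jar"
REF = "human_g1k_v37.fasta"

def realign_region(region):
    bam = f"dedup_{region}.bam"
    targets = f"{region}.realign.intervals"
    out = f"realigned_{region}.bam"
    # Identify candidate realignment targets within this subregion only.
    subprocess.run(
        f"{GATK} -T RealignerTargetCreator -R {REF} -I {bam} "
        f"-L {region}.intervals -o {targets}",
        shell=True, check=True,
    )
    # Perform the Smith-Waterman realignment around those targets.
    subprocess.run(
        f"{GATK} -T IndelRealigner -R {REF} -I {bam} "
        f"-L {region}.intervals -targetIntervals {targets} -o {out}",
        shell=True, check=True,
    )
    return out

if __name__ == "__main__":
    regions = [f"region_{i:04d}" for i in range(1000)]
    with ProcessPoolExecutor(max_workers=16) as pool:
        realigned = list(pool.map(realign_region, regions))
```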
Step 4: parallelization of base quality score recalibration
Each base of each read has an associated quality score, corresponding to the probability of a sequencing error. The reported quality scores are known to be inaccurate and as such must be recalibrated prior to genotyping, where they are used in the Bayesian genotype likelihood model employed by GATK’s UnifiedGenotyper [5]. After recalibration, the quality scores in the output BAM more closely correspond to the probability of a sequencing error. Moreover, the recalibration tool corrects for variation in quality with respect to machine cycle and sequence context, producing quality scores that are both more accurate and more widely dispersed. Churchill uses GATK’s base quality score recalibration (BQSR) algorithm, which analyzes covariation among several features of a base, including the reported quality score, the position within the read, and the preceding and current nucleotide (sequencing chemistry effect). These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a given BAM file. However, according to the Broad’s best practices, BQSR requires a pool of all covariates from across the genome for proper calculation. Therefore, to ensure the integrity of the recalibration process, Churchill merges the covariate results from all subregions so that each parallel recalibration instance uses input data from the entire genome rather than just its own region. The benefit of this approach is that it enables Churchill to use the entire dataset for recalibration purposes, improving accuracy and avoiding downsampling, which would lead to non-determinism.
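A minimal sketch of this recalibration pattern is shown below: per-subregion covariate tables are computed (in practice in parallel), combined into a genome-wide table (represented here only by a placeholder, since the merging is performed by Churchill's own code), and the merged table is then applied to every subregion with GATK 3-style commands. Paths, file names and the dbSNP resource are assumptions.

```python
# Sketch of the BQSR pattern described above (GATK 3-style commands).
# Paths, file names and the dbSNP resource are assumptions.
import subprocess

GATK = "java -Xmx4g -jar GenomeAnalysisTK.jar"
REF = "human_g1k_v37.fasta"
DBSNP = "dbsnp_138.b37.vcf"
regions = [f"region_{i:04d}" for i in range(1000)]

def merge_recal_tables(tables, merged_path):
    """Placeholder for Churchill's internal step that combines per-region
    covariate tables into a single genome-wide recalibration table."""
    raise NotImplementedError

# 1. Count covariates within each subregion (run in parallel in practice).
for region in regions:
    subprocess.run(
        f"{GATK} -T BaseRecalibrator -R {REF} -I realigned_{region}.bam "
        f"-L {region}.intervals -knownSites {DBSNP} -o recal_{region}.grp",
        shell=True, check=True,
    )

# 2. Merge the per-region covariate tables into one genome-wide table so that
#    every recalibration instance sees data from the whole genome.
merge_recal_tables([f"recal_{r}.grp" for r in regions], "recal_genome.grp")

# 3. Apply the genome-wide table to each subregion (also parallelizable).
for region in regions:
    subprocess.run(
        f"{GATK} -T PrintReads -R {REF} -I realigned_{region}.bam "
        f"-L {region}.intervals -BQSR recal_genome.grp -o recal_{region}.bam",
        shell=True, check=True,
    )
```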
Step 5: parallelization of variant calling
In the penultimate step of the Churchill analysis process, variant calls can be generated with GATK UnifiedGenotyper [5], GATK HaplotypeCaller [15] or FreeBayes [23] using the analysis-ready reads generated during recalibration. Churchill is capable of implementing these algorithms on both single-sample and multi-sample data, where variant information from all samples in a given experiment is utilized to improve genotyping accuracy. Due to the overlapping buffer zones at the ends of each region, it is possible that a variant occurring in one of these zones may be called twice: once in the region to which it belongs and once in the overlap zone of the adjacent region (Figure S5 in Additional file 1). This is corrected by assigning the variant to the appropriate subregion and removing its buffer-zone duplicate from the final merged raw variants file, a determination made solely on the basis of the variant call’s position relative to the fixed subregion boundaries. The raw genotype calls from each subregion are concatenated into genome-wide VCF files for both SNPs and indels, ready for downstream analysis and interpretation.
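The buffer-zone duplicate resolution can be sketched as a simple positional filter: a subregion's VCF keeps only the variants whose positions fall within that subregion's fixed boundaries, so the copy called in a neighbouring region's buffer zone is discarded. The region size, region indexing and file names below are illustrative assumptions.

```python
# Sketch: resolve buffer-zone duplicates by keeping a variant only in the
# subregion that owns its position. REGION_SIZE, region indexing and file
# names are illustrative assumptions.

REGION_SIZE = 3_000_000  # fixed subregion size used for the boundaries

def owns(region_index, pos):
    """True if 1-based position `pos` lies within the fixed (non-overlap)
    boundaries of the subregion with this index on its chromosome."""
    start = region_index * REGION_SIZE + 1
    return start <= pos < start + REGION_SIZE

def filter_region_vcf(vcf_in, vcf_out, region_index):
    """Drop calls that belong to a neighbouring subregion's core and were
    only seen here because of the 3 kb buffer zone."""
    with open(vcf_in) as inp, open(vcf_out, "w") as out:
        for line in inp:
            if line.startswith("#"):
                out.write(line)                 # keep header lines
                continue
            pos = int(line.split("\t")[1])      # POS column (1-based)
            if owns(region_index, pos):
                out.write(line)

# Example: keep only the variants belonging to subregion index 7.
filter_region_vcf("chr1_region_0007.raw.vcf", "chr1_region_0007.dedup.vcf", 7)
```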
Churchill implementation
A schematic representation of the entire Churchill process is shown in Figure S2 in Additional file 1. Compared with alternative analysis pipelines, implementation of Churchill is simpler, faster, and more widely applicable to various shared memory/distributed high performance computing (HPC) clusters. Churchill requires only a small number of pre-installed components: Python (with PySam), BWA, SAMtools, Picard Tools, and GATK are the only required software not included with Churchill. Setup and execution are performed with a single command. Churchill is implemented as a mixture of Bash and Python scripts, linking and preparing for parallelization the inputs and outputs of BWA, Picard, SAMtools, and GATK. A single configuration file defines the paths to raw data, installed software, required database files, and delivery directories. Churchill is initialized and executed using a single Python script, ‘churchill.py’, and a Cython-compiled C library. Churchill begins by creating the scripts required to run the entire pipeline and then proceeds to execute (or submit to the job scheduler) the scripts using the parallelization method (shared memory, GNU make, Sun Grid Engine (SGE), or Portable Batch System (PBS)) specified by the user in an argument to ‘churchill.py’. To ensure that Churchill would be of utility to the widest number of researchers, the pipeline was developed such that it could be executed in three of the most commonly utilized environments for distributed computing: a shared memory machine or server with explicit task creation, a shared memory machine or server with task creation by GNU Make, and HPC clusters and cloud implementations that support distributed Make, such as PBS and SGE (Table S1 in Additional file 1). As such, Churchill is compatible with a wide range of Linux systems, including high-performance workstations, small single servers, moderate in-house clusters with shared or non-shared memory servers, large HPC systems housed at supercomputing centers (including the Ohio Supercomputing Center and the Pittsburgh Supercomputing Center) and in the cloud. The software is available for download at [35].
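As a rough illustration of these execution modes (not Churchill's actual dispatch code), the sketch below shows how the same per-region scripts could be run either as local parallel processes on a shared memory server or submitted to an SGE scheduler with qsub; script names, modes and worker counts are assumptions.

```python
# Sketch: run the same per-region shell scripts either as local parallel
# processes (shared memory) or as jobs submitted to Sun Grid Engine.
# Script names, worker count and mode labels are assumptions.
import subprocess
from concurrent.futures import ProcessPoolExecutor

scripts = [f"scripts/process_region_{i:04d}.sh" for i in range(1000)]

def run_local(script):
    subprocess.run(["bash", script], check=True)

def dispatch(mode):
    if mode == "shared_memory":
        # Explicit task creation on a single shared memory server.
        with ProcessPoolExecutor(max_workers=16) as pool:
            list(pool.map(run_local, scripts))
    elif mode == "sge":
        # Submit each region's script to the cluster scheduler.
        for script in scripts:
            subprocess.run(["qsub", script], check=True)
    else:
        raise ValueError(f"unsupported mode: {mode}")

if __name__ == "__main__":
    dispatch("shared_memory")
```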
Data availability
Sequence data were generated in the Biomedical Genomics Core, The Research Institute at The Nationwide Children’s Hospital. Initial development and testing of Churchill was conducted using Illumina HiSeq 2000 100 bp paired-end whole genome sequencing data sets with 30× average coverage. Informed consent was obtained from study subjects or parents of subjects less than 18 years of age (assent was obtained from subjects 9 to 17 years of age) under protocols approved by the Institutional Review Board (IRB) at Nationwide Children’s Hospital (protocol number IRB11-00215). Analysis timing metrics and validation were performed using NA12878 sequence data deposited in the Sequence Read Archive (accession ERP001229), down-sampled to 30× coverage as described below. The phase 1 1KG data are available through the consortium’s FTP sites (see [36] for details). The NIST benchmark SNP and indel genotype calls generated by the GIAB Consortium can be downloaded from [37]. All experiments have been conducted according to the principles expressed in the Declaration of Helsinki.
Validation
Validation was performed using FASTQ files from the Sequence Read Archive study ERP001229 for whole human genome sequencing of the 1KG CEU female NA12878 (Illumina HiSeq 2000 paired-end 100 bp reads, split into 431 pairs of FASTQ files, each containing 2,000,000 reads). The VCF files produced from these data by Churchill, GATK-Queue and HugeSeq were compared with NIST benchmark SNP and indel genotype calls generated by the GIAB Consortium [13]. First, VCFs were filtered to remove low quality variant calls as indicated by a ‘LowQual’ flag generated by the given pipeline. Second, the VCF was filtered to the GIAB callable regions using the vcflib [38] tool vcfintersect with the BED file provided by GIAB. Third, complex variants were decomposed into a canonical SNP and indel representation using the vcflib tool vcfallelicprimitives. Finally, VCF files were converted to tables and compared with the GIAB validation dataset (version 2.18) using custom scripts in R.
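A sketch of the first three preprocessing steps is given below, assuming illustrative file names and using the vcflib tools named above; the final conversion to tables and comparison against the GIAB calls were performed with custom R scripts not reproduced here.

```python
# Sketch of the VCF preprocessing steps, assuming illustrative file names.
import subprocess

cmd = (
    # 1. Remove calls flagged as LowQual (keep header lines).
    "awk '/^#/ || $7 != \"LowQual\"' churchill_NA12878.vcf "
    # 2. Restrict to the GIAB callable regions with vcflib's vcfintersect.
    "| vcfintersect -b giab_callable.bed "
    # 3. Decompose complex variants into canonical SNPs and indels.
    "| vcfallelicprimitives "
    "> churchill_NA12878.comparable.vcf"
)
subprocess.run(cmd, shell=True, check=True)
```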
Profiling and benchmarking
In addition to measuring the running time, we recorded the CPU utilization profile using the collectl utility [39], a comprehensive tool for measuring the performance of a Linux system. CPU, memory, disk, and network usage were measured at 10 s intervals. The output was then parsed and plotted using scripts customized for this purpose.
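One plausible way to collect such a profile is sketched below: collectl is started alongside the pipeline to sample CPU, memory, disk and network usage every 10 seconds in a plottable format; the output directory is an assumption.

```python
# Sketch: sample system usage every 10 seconds with collectl while the
# pipeline runs. The output directory is an assumption.
import subprocess

monitor = subprocess.Popen([
    "collectl",
    "-scmdn",          # subsystems: cpu, memory, disk, network
    "-i", "10",        # 10-second sampling interval
    "-P",              # write in plottable (space-separated) format
    "-f", "collectl_logs/",
])
# ... run the analysis pipeline here ...
monitor.terminate()
```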
Analysis with the bcbio-nextgen pipeline
The bcbio-nextgen run was performed using version 0.7.9 of the software that was installed using the provided installer script (bcbio_nextgen_install.py). After installation, the GATK software was upgraded using the provided upgrade script (bcbio_nextgen.py upgrade) to version 3.2-2 so that GATK’s HaplotypeCaller could be used. The run was performed on a single r3.8xlarge AWS EC2 instance. The run requested 32 cores to be used (-n 32) since 32 cores were available on the r3.8xlarge instance. This resulted in BWA-MEM being assigned 16 cores (-t 16) and sambamba being assigned 16 cores (-t 16).
Churchill processing of 1000 Genomes Project data
To process each sample, the input FASTQ files for the sample were first copied from the 1000genomes S3 bucket to local storage on an EC2 instance. These input files were then processed by the Churchill pipeline to produce a set of realigned and recalibrated BAM files, one for each Churchill region. Finally, GATK’s UnifiedGenotyper was run over the realigned and recalibrated BAM files from each sample to produce a single multi-sample VCF. A hard-filtering strategy similar to that used in the 1KG group’s original analysis of these data was employed [28]. The single multi-sample VCF was filtered to remove indels with an unfiltered depth over all samples (DP) <2,566, DP >16,320, inbreeding coefficient <-0.8, quality by depth (QD) <1.5, or strand bias estimated using Fisher’s exact test (FS) >200. SNPs were removed with DP <2,566, DP >16,320, QD <1.5, root mean square of the mapping quality of reads across all samples (MQ) <30, FS >80, or consistency of the site with strictly two segregating haplotypes (HaplotypeScore) >13. The VCF file is available to download at [35].
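The hard filtering described above could be expressed, for example, as GATK 3-style VariantFiltration JEXL expressions applied separately to SNPs and indels. The thresholds follow the text; the jar path, reference, file names and the SelectVariants split by variant type are assumptions about how such filtering might be scripted, not the exact commands used.

```python
# Sketch of the hard-filtering step, expressed as GATK 3-style VariantFiltration
# JEXL expressions for SNPs and indels. Paths and file names are assumptions.
import subprocess

GATK = "java -Xmx8g -jar GenomeAnalysisTK.jar"
REF = "human_g1k_v37.fasta"

snp_filter = (
    "DP < 2566 || DP > 16320 || QD < 1.5 || MQ < 30.0 "
    "|| FS > 80.0 || HaplotypeScore > 13.0"
)
indel_filter = (
    "DP < 2566 || DP > 16320 || QD < 1.5 "
    "|| InbreedingCoeff < -0.8 || FS > 200.0"
)

for vtype, expr in (("SNP", snp_filter), ("INDEL", indel_filter)):
    # Extract one variant type from the multi-sample VCF.
    subprocess.run(
        f"{GATK} -T SelectVariants -R {REF} -V multisample.vcf "
        f"-selectType {vtype} -o {vtype}.vcf",
        shell=True, check=True,
    )
    # Mark sites failing the hard filter; flagged records can then be dropped.
    subprocess.run(
        f"{GATK} -T VariantFiltration -R {REF} -V {vtype}.vcf "
        f"--filterExpression '{expr}' --filterName 'hardFilter' "
        f"-o {vtype}.filtered.vcf",
        shell=True, check=True,
    )
```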