Sequencing of HapMap trio by Illumina HiSeq 2000
The HapMap trio DNA samples (NA10835, NA12248, and NA12249) were purchased from Coriell Cell Repositories (Camden, NJ). DNA concentration and quality were assessed using a NanoDrop 2000c spectrophotometer; OD260/280 ratios ranged from 1.8 to 1.89. The original DNA samples were diluted to 50 ng/μL, and 100 μL of each diluted sample was used for Illumina sequencing.
Genomic DNA was quantified prior to library construction using PicoGreen (Quant-iT™ PicoGreen® dsDNA Reagent, Invitrogen, catalog number P11496). Quantitation was read on a SpectraMax Gemini XPS microplate reader (Molecular Devices).
Paired-end libraries were manually generated from 500 ng to 1 μg of gDNA using the Illumina TruSeq DNA Sample Preparation Kit (catalog number FC-121-2001), based on the protocol in the TruSeq DNA PCR-Free Sample Preparation Guide. Pre-fragmentation gDNA cleanup was performed using paramagnetic sample purification beads (Agencourt® AMPure® XP reagents, Beckman Coulter). Samples were fragmented, and libraries were size-selected after fragmentation and end repair using the same paramagnetic beads, targeting 300-bp inserts. Final libraries were quality controlled for size using a gel electrophoretic separation system and were quantified.
Following library quantitation, DNA libraries were denatured, diluted, and clustered onto v3 flow cells using the Illumina cBot™ system. cBot runs were performed based on the cBot User Guide, using the reagents provided in the Illumina TruSeq Cluster Kit v3.
Clustered v3 flow cells were loaded onto HiSeq 2000 instruments and sequenced with 100-bp paired-end, non-indexed runs. All samples were sequenced in independent lanes. Sequencing runs were performed based on the HiSeq 2000 User Guide, using Illumina TruSeq SBS v3 reagents. Illumina HiSeq Control Software (HCS) and Real-Time Analysis (RTA) were used on HiSeq 2000 sequencing runs for real-time image analysis and base calling.
Sequencing of Chinese quartet and NA12878 by Illumina HiSeq X Ten
Chinese Quartet reference materials were derived from four immortalized lymphoblastoid cell lines (LCLs) of a “Chinese Quartet” family comprising father, mother, and two monozygotic twin daughters. The volunteers were from the Fudan Taizhou cohort and represent a typical Chinese genetic background. The LCLs were immortalized from blood B cells using Epstein-Barr virus (EBV) transformation. This study was approved by the independent ethics committee at the School of Life Sciences of Fudan University, and all volunteers provided written informed consent to participate in the study.
The LCLs were cultured in RPMI 1640 (Gibco, catalog number 31870-082) supplemented with 10% (v/v) fetal bovine serum (Gibco 10091-148). Cells were maintained at 37 °C with 5% CO2 and were sub-cultured every 3 to 4 days. After six passages, a total of about 2 × 10⁹ cells were used for DNA extraction. The collected cells were washed twice with PBS before DNA extraction with the Blood & Cell Culture DNA Maxi Kit (QIAGEN 13362). The extracted DNA samples were stored in TE buffer (10 mM Tris, 1 mM EDTA, pH 8.0).
The NA12878 DNA reference material (RM8398) was purchased from the National Institute of Standards and Technology (NIST).
WGS data for the Chinese Quartet and RM8398 reference materials were generated by three sequencing labs (ARD: Annoroad; WUX: WuXi NextCODE; NVG: Novogene) on the Illumina HiSeq X Ten platform.
Libraries were prepared for whole-genome sequencing using the TruSeq DNA Nano kit (Illumina, catalog number 15041110) according to the manufacturer’s instructions, with 200 ng DNA used as input for each library preparation. All labs used unified in-house Covaris fragmentation conditions with a target size of 350 bp. All reference materials were prepared in three replicates in a single batch. Library concentrations were measured on a Qubit 3.0 fluorometer with the Quant-iT dsDNA HS Assay Kit (Thermo Fisher Scientific, catalog number Q32854). The quality of all libraries was assessed on an Agilent 2100 Bioanalyzer or TapeStation instrument (Agilent).
These whole-genome libraries were sequenced on the HiSeq X Ten (Illumina) with 150-bp paired-end reads using sequencing-by-synthesis (SBS) chemistry, following the manufacturer’s instructions.
Sequencing of HapMap trio and Chinese quartet by Illumina NovaSeq with the Nextera library preparation kit
Samples were quantified for dsDNA content with the Qubit dsDNA HS Assay Kit. Of 21 samples, two contained less than 100 ng DNA. For samples with sufficient DNA, 100 ng was used as input for the Illumina Nextera DNA Flex library preparation kit (Illumina, catalog number 20018704). Libraries were prepared according to the manufacturer’s instructions (Illumina, Nextera DNA Flex Library Prep Reference Guide), with five PCR cycles used for amplification. Of the two lower-input samples, one had 62 ng DNA and was prepared the same way as the 100-ng samples; the other had 17 ng DNA, so the number of PCR cycles was increased to nine, which resulted in shorter insert sizes in its final library.
Library yield and fragment size were assessed using the Qubit dsDNA HS Assay Kit and an Agilent 2100 Bioanalyzer HS DNA chip, respectively. Libraries were loaded onto two NovaSeq S4 flow cells and clustered according to the manufacturer’s instructions. Run data sets were uploaded to BaseSpace, and FASTQ files were generated.
Sequencing of HapMap trio and Chinese quartet by Illumina NovaSeq with the TruSeq library preparation kit
Libraries were prepared using the TruSeq DNA PCR-Free Library Prep Kit (Illumina, catalog number 20015962) with a modified protocol targeting a 450-bp insert from 600 ng input. Shearing was performed on a Covaris LE220 (18% duty factor, 450 W peak incident power (PIP), 200 cycles/burst, 60 s, 4 to 8.5 °C); an SPRI dilution (88 μL SPB + 72 μL water) was used to remove large DNA fragments; and IDT for Illumina Unique Dual Indexes (Illumina, catalog number 20020178) were used. Sequencing was performed on the Illumina NovaSeq 6000 Sequencing System with Xp loading on an S4 flow cell and 151 × 8 × 8 × 151 cycles. Raw run data were streamed onto the BaseSpace Sequence Hub from the sequencer. FASTQ files were generated using bcl2fastq on the BaseSpace Sequence Hub with default parameters. Adapters were trimmed during FASTQ generation using AGATCGGAAGAGCACACGTCTGAACTCCAGTCA as the read 1 adapter and AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT as the read 2 adapter. To confirm sample quality and coverage, FASTQ files were processed through the Whole Genome Sequencing v8.0.1 BaseSpace app, with alignment against the GRCh38Decoy reference genome using default parameters. Downsampling to 100× coverage was enabled in the app for any sample with coverage beyond 100×. Two samples in the original NovaSeq 6000 S4 run did not reach 60× average autosomal coverage and were re-sequenced on an S1 flow cell. Re-sequenced samples were analyzed in the same manner, and we confirmed greater than 60× coverage.
Quality assessment of sequencing data
All FASTQ files were evaluated with FastQC [51] (v0.11.5) with default settings to assess base quality, adapter content, and other metrics. Per-base sequence quality was extracted with a shell script from the “fastqc_data.txt” file reported by FastQC to check whether the data quality passed.
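As a minimal sketch of this step (directory and file names are placeholders), FastQC can be run on each FASTQ file and the status of the per-base sequence quality module pulled from “fastqc_data.txt”:

```bash
# Run FastQC with default settings on one FASTQ file
mkdir -p qc_out
fastqc -o qc_out sample_R1.fastq.gz

# FastQC writes one zip archive per input; the per-module pass/warn/fail
# status lines live in fastqc_data.txt inside the archive
unzip -p qc_out/sample_R1_fastqc.zip sample_R1_fastqc/fastqc_data.txt \
  | grep '>>Per base sequence quality'    # prints pass, warn, or fail
```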
Sequence reads alignment
The short reads were first aligned to the latest human reference genome [43] (GRCh38 with decoy sequences, downloaded from the Genomic Data Commons of the National Cancer Institute) using four aligners: Bowtie2 (v2.2.9) [52], BWA [53] (v0.7.15), ISAAC [54] (v1.0.7), and Stampy [55] (v1.0.29). Default settings were applied for Bowtie2 and BWA. BWA was used as a pre-aligner for Stampy, as suggested by Stampy’s developer for efficient alignment; Stampy’s default settings were used except for the “--bamkeepgoodreads” option. The settings “--base-calls-format --stop-at Bam --keep-unaligned back --realign-gaps yes” were used in ISAAC alignment to obtain sorted BAM files. The resulting SAM files from the other three aligners were sorted and converted to BAM files with the SortSam module of Picard [56] (v2.7.1). Duplicates in the sorted BAM files were marked with the MarkDuplicates module, and read groups were assigned with the AddOrReplaceReadGroups module of Picard (v2.7.1).
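For the BWA branch of this workflow, the steps might look like the following sketch (file names are placeholders, and bwa mem is assumed as the BWA algorithm; the other aligners feed into the same Picard sort and mark-duplicates steps):

```bash
REF=GRCh38.d1.vd1.fa   # GRCh38 with decoy sequences

# Align paired-end reads with BWA under default settings
bwa mem $REF sample_R1.fastq.gz sample_R2.fastq.gz > sample.sam

# Sort and convert SAM to BAM with Picard SortSam
java -jar picard.jar SortSam I=sample.sam O=sample.sorted.bam SORT_ORDER=coordinate

# Mark duplicates, then assign a read group (read-group values are placeholders)
java -jar picard.jar MarkDuplicates I=sample.sorted.bam O=sample.dedup.bam M=dup_metrics.txt
java -jar picard.jar AddOrReplaceReadGroups I=sample.dedup.bam O=sample.rg.bam \
  RGID=rep1 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=sample
```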
GATK realignment
All BAM files obtained from alignment in the original study were processed with GATK [35] realignment following the best practices recommended by the Broad Institute (note that the realignment recommendation was removed by the Broad Institute beginning with GATK v4.0). Each BAM file underwent local realignment with the GATK modules RealignerTargetCreator and IndelRealigner, and base-quality recalibration with the GATK modules BaseRecalibrator and PrintReads. The known SNPs and indels for GRCh38 in dbSNP146 (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz) and two indel files (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources/Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf.gz and ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources/ALL.wgs.1000G_phase3.GRCh38.ncbi_remapper.20150424.shapeit2_indels.vcf.gz) were used as references in the realignment and recalibration process.
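A sketch of the four GATK v3 steps, assuming placeholder file names and using the known-sites files listed above (only the Mills indel file is shown for brevity):

```bash
REF=GRCh38.d1.vd1.fa
GATK="java -jar GenomeAnalysisTK.jar"
INDELS=Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf.gz

# Local realignment around indels
$GATK -T RealignerTargetCreator -R $REF -I sample.rg.bam -known $INDELS -o realign.intervals
$GATK -T IndelRealigner -R $REF -I sample.rg.bam \
  -targetIntervals realign.intervals -known $INDELS -o sample.realigned.bam

# Base quality score recalibration
$GATK -T BaseRecalibrator -R $REF -I sample.realigned.bam \
  -knownSites dbsnp_146.hg38.vcf.gz -o recal.table
$GATK -T PrintReads -R $REF -I sample.realigned.bam -BQSR recal.table -o sample.recal.bam
```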
Variant calling
BAM files with and without GATK realignment were used for variant calling with six different callers: FreeBayes [57] (v1.1.0), GATK-HaplotypeCaller [35] (v3.7), ISAAC [54] (v1.0.7), Samtools [58] (v1.3.1), SNVer [59] (v0.5.3), and VarScan [60] (v2.3.9). The following options were used: “-X -0 -u -v” in FreeBayes; “-rf BadCigar --dbsnp dbsnp_146.hg38.vcf --stand_call_conf 30” in GATK-HaplotypeCaller; “minMapq = 20; minGQX = 30” in ISAAC; “-ugf” in Samtools and “-vmO” in bcftools (v1.3.1); “-p 0.05” in SNVer; and “-p 0.05 --min-coverage 8 --min-reads2 2 --p-value 0.05” in VarScan. Variant calling results were stored in VCF format.
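For two of the six callers, the invocations implied by the options above might look like this sketch (file names are placeholders):

```bash
REF=GRCh38.d1.vd1.fa

# GATK-HaplotypeCaller (v3.7) with the options quoted above
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $REF -I sample.recal.bam \
  -rf BadCigar --dbsnp dbsnp_146.hg38.vcf --stand_call_conf 30 -o sample.gatk.vcf

# Samtools (v1.3.1) mpileup with "-ugf", piped into bcftools call with "-vmO"
samtools mpileup -ugf $REF sample.recal.bam \
  | bcftools call -vmO z -o sample.samtools.vcf.gz
```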
Variant calling by Sentieon
The decoy version of the GRCh38 human reference genome (GRCh38.d1.vd1.fa; https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files) from the Genomic Data Commons (GDC) was used in variant calling by Sentieon (v201711.02). The Sentieon DNAseq pipeline calls variants from raw FASTQ files and comprises read mapping with BWA, duplicate removal, indel realignment, base quality score recalibration, and variant calling with Haplotyper. Sentieon [61] (v201711.03) provides a complete rewrite of the mathematical models of the GATK Best Practices with a focus on computational efficiency, accuracy, and consistency. In variant calling by Sentieon, sequence reads were first aligned to GRCh38.d1.vd1.fa with Sentieon BWA, followed by sorting and indexing with the Sentieon utility. Subsequently, duplicate reads were removed and base qualities were recalibrated with the Sentieon driver program. Variant calling was then performed with the Sentieon Haplotyper algorithm.
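A hedged sketch of these steps, following the vendor’s published DNAseq usage (thread counts, read-group strings, and file names are placeholders):

```bash
REF=GRCh38.d1.vd1.fa
DBSNP=dbsnp_146.hg38.vcf.gz

# Alignment with Sentieon BWA, piped into coordinate sorting with the Sentieon utility
sentieon bwa mem -R '@RG\tID:rep1\tSM:sample\tPL:ILLUMINA' -t 16 $REF \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  | sentieon util sort -r $REF -t 16 --sam2bam -i - -o sorted.bam

# Duplicate removal with the driver program
sentieon driver -t 16 -i sorted.bam --algo LocusCollector --fun score_info score.txt
sentieon driver -t 16 -i sorted.bam --algo Dedup --rmdup --score_info score.txt \
  --metrics dedup_metrics.txt dedup.bam

# Base quality recalibration, then variant calling with Haplotyper
sentieon driver -t 16 -r $REF -i dedup.bam --algo QualCal -k $DBSNP recal.table
sentieon driver -t 16 -r $REF -i dedup.bam -q recal.table \
  --algo Haplotyper -d $DBSNP sample.sentieon.vcf.gz
```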
Variant calling by Dragen
Reads were aligned to the human reference genome GRCh38 in BaseSpace using the DRAGEN Germline Pipeline version 3.2.8 in Whole Genome Sequencing v7.7.0 (WGSv7). Although the DRAGEN aligner can use all reads for alignment, WGSv7 requires fewer than 1 billion paired-end reads for analysis; therefore, FASTQ files were down-sampled to 990 million paired-end reads in BaseSpace using FASTQ Toolkit v2.2.0 to enable WGSv7 analysis. Variant calling was performed in BaseSpace using the DRAGEN Germline Pipeline version 3.2.8, run with default settings, with “Map/Align + Variant Caller” selected and CNV calling, SV calling, and duplicate marking enabled.
Variant calling by RTG
The following describes the processing used to align reads and call variants using RTG [62] alignment and variant calling algorithms.
All FASTQ files underwent quality-based filtering to trim off poor-quality read ends (using “rtg fastqtrim --end-quality-threshold 15”) and were formatted to the RTG SDF format (which allows random access to arbitrary chunks of reads during mapping) using “rtg format”. FASTQ file pairs for each replicate were merged into a single per-replicate SDF and assigned a unique read group.
Reads for each sample were aligned to the reference genome GRCh38 with “rtg map,” processing reads from the input SDF file in chunks (typically 40 million read pairs) to permit partitioning of the alignment across multiple nodes. During alignment, an appropriate pedigree file was supplied to the mapping command so that the aligner could look up the sex of the sample. After primary alignment, an additional mate-pair rescue tool (currently in development) was executed on any reads that were unmapped but whose mate was uniquely mapped, and any rescued alignments were included in subsequent variant calling.
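A sketch of mapping one chunk of one replicate (SDF names and chunk boundaries are placeholders; the --start-read/--end-read chunking flags are an assumption, and the unreleased mate-pair rescue step is omitted):

```bash
# Map a 40M read-pair chunk of a replicate SDF against the reference SDF;
# the pedigree file lets the aligner look up the sex of the sample
rtg map -t GRCh38.d1.vd1.sdf -i sample_rep1.sdf \
  --start-read 0 --end-read 40000000 \
  --pedigree pedigree.ped \
  -o map_sample_rep1_chunk0
```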
Across the various samples and families in the SEQC2 project, several variant calling modes were employed. When calling a single sample in isolation, the “rtg snp” command was used, for example:

  rtg snp -t GRCh38.d1.vd1.sdf \
    -T 8 --pedigree pedigree.ped \
    --enable-allelic-fraction --XXcom.rtg.variant.mask-homopolymer=true \
    -o snp_HG001-r1-H3WNJDSXX_S8 \
    map_HG001-r1-H3WNJDSXX_S8.sdf_*/alignments.bam

The final argument supplies all the alignment BAMs corresponding to the particular sample.
Mendelian inheritance errors were analyzed using “rtg mendelian.” Overall variant statistics for each sample were computed using “rtg vcfstats.”
Variance analysis
Variant concordance was calculated as the average Jaccard index over the three replicates of each sample:

$$ \left(\frac{\left|A\cap B\right|}{\left|A\cup B\right|}+\frac{\left|A\cap C\right|}{\left|A\cup C\right|}+\frac{\left|B\cap C\right|}{\left|B\cup C\right|}\right)/3, $$

where A, B, and C are the variant sets of the three replicates. The contribution of four factors (caller, aligner, platform, and sample) to the variation in concordance was estimated with a non-linear gradient boosted tree model (JMP Pro v14.3). The importance of each factor was estimated by how often it was used to make key decisions within the decision trees. Boosting builds a large, additive decision tree by fitting a sequence of smaller decision trees, called layers; the tree at each layer consists of a small number of splits and is fit to the residuals of the previous layers, which allows each layer to correct the fit for poorly fitting data from the previous layers. The final prediction for an observation is the sum of the predictions for that observation over all layers. Factor contributions were estimated during model fitting based on the total number of instances over all trees in which the specific factor was used to split the data. The proportion of the contribution of each factor was calculated as the sum of squares attributed to the factor divided by the total sum of squares.
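As an illustrative sketch (not necessarily the tooling used in the study), each pairwise Jaccard term can be computed from bgzipped, indexed replicate VCFs with bcftools isec:

```bash
# Intersection: sites called in both replicates; union: sites called in at least one
i=$(bcftools isec -n=2 rep1.vcf.gz rep2.vcf.gz | wc -l)
u=$(bcftools isec -n+1 rep1.vcf.gz rep2.vcf.gz | wc -l)
echo "scale=4; $i / $u" | bc    # Jaccard index for the rep1-rep2 pair
```

Averaging the three pairwise values (rep1-rep2, rep1-rep3, rep2-rep3) gives the concordance of the sample.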
In addition to the non-linear method, we also estimated the contribution of all possible 2-way interactions of the factors in a Variance Components Analysis (JMP Pro v14.3). The variance components were parameterized using an unrestricted method [63] in a mixed model fitted with restricted maximum likelihood (REML). Student’s t test was used to assess the contribution difference between factors. Variance components were estimated through fitting a random effect model as follows:
$$ Y= Z\gamma +\varepsilon, $$
$$ \gamma \sim N\left(0,G\right) $$
$$ \varepsilon \sim N\left(0,{\sigma}^2{I}_n\right) $$
where Y denotes an n × 1 vector of responses (Jaccard index), Z is the design matrix for the random effects, γ is the vector of unknown random effects, and ε is an n × 1 vector of residual errors. Both γ and ε are assumed to follow normal distributions with mean 0. G and σ² are the variance components to be estimated. The ratio of the contribution of each factor to the overall variability was calculated as the variance component of each factor divided by the total.
Generation of highly reproducible variants (HRV) and highly reproducible regions (HRR)
We generated HRVs and defined HRRs to assess the upper bound of reproducibility, based on the suggestion from GIAB [4]. We used the workflow shown in Additional file 8: Fig. S1 to generate the HRRs. Detailed procedures for defining the HRRs are given below.
Determining mappable regions
Each of the BAM files from all aligners (with and without GATK realignment), replicates, and labs was used to generate a region file of callable loci using GATK CallableLoci, with cutoffs of minimum depth 6, maximum depth 160, minimum mapping quality 10, and minimum base quality 20. We created 81 region files for each Chinese Quartet sample and 27 for each HapMap sample (including NA12878). The mappable regions for each sample were determined as the genome regions covered by any of the sample’s region files; that is, the mappable regions of a sample are the union of its region files.
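For a single BAM file, region-file generation might look like this sketch (GATK v3 syntax; file names are placeholders):

```bash
# Emit per-interval callability status under the cutoffs described above
java -jar GenomeAnalysisTK.jar -T CallableLoci -R GRCh38.d1.vd1.fa -I sample.recal.bam \
  --minDepth 6 --maxDepth 160 --minMappingQuality 10 --minBaseQuality 20 \
  --summary sample.callable_summary.txt -o sample.callable_status.bed

# Keep only intervals flagged CALLABLE; the per-sample mappable regions are
# then the union (sort + merge) of all such region files for that sample
awk '$4 == "CALLABLE"' sample.callable_status.bed > sample.callable.bed
cat *.callable.bed | sort -k1,1 -k2,2n | bedtools merge -i stdin > mappable.bed
```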
Identifying consensus mappable regions
Different parts of a sample’s mappable regions were covered by different numbers of region files. To identify the consensus mappable regions of a sample, its mappable regions were ranked by the number of region files covering them, and the top 99% were selected as the consensus mappable regions. The resulting consensus mappable regions are the regions covered by ≥ 10 region files for CQ-5, CQ-6, and CQ-7; by ≥ 11 region files for CQ-8; by ≥ 9 for NA12878; and by ≥ 3 for all three HapMap samples.
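One way to count supporting files per region is bedtools multiinter, whose fourth output column is the number of input files covering each interval; a sketch using the CQ-5 threshold from above (file names are placeholders):

```bash
# Keep intervals covered by >= 10 region files, then merge adjacent intervals
bedtools multiinter -i callable_file*.bed \
  | awk '$4 >= 10' \
  | bedtools merge -i stdin > consensus_mappable.bed
```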
Determining callable regions
Some genomic regions present variant calling difficulties and were removed from the identified consensus mappable regions. Specifically, simple repeats (including homopolymer regions) and segmental duplications, as defined in “SimpleRepeat_imperfecthomopolgt10_slop5.bed” and “remapped_superdupsmerged_all_sort.bed” by GIAB and GA4GH, were removed using the subtract command of bedtools. The remaining consensus mappable regions were taken as the callable regions.
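A sketch of the removal step (the two BED files are the GIAB/GA4GH files named above; the output name is a placeholder):

```bash
# Subtract difficult regions from the consensus mappable regions
bedtools subtract -a consensus_mappable.bed -b SimpleRepeat_imperfecthomopolgt10_slop5.bed \
  | bedtools subtract -a stdin -b remapped_superdupsmerged_all_sort.bed \
  > callable_regions.bed
```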
Defining HRVs
First, the variants called by the same pipeline for the three replicates of a sample were compared, and variants called in only one replicate were filtered out as discordant; the remaining variants were retained as replicate-concordant variants for the sample. For the Chinese Quartet samples, comparing the replicate-concordant variants across the three labs further filtered out, as discordant, variants found in the replicate-concordant set of only one lab. The replicate-concordant variants of the HapMap samples and NA12878, as well as the post-filter replicate-concordant variants of the Chinese Quartet samples, were then compared among aligners to determine aligner-concordant variants by filtering out discordant variants identified in the replicate-concordant variants of only one aligner. The aligner-concordant variants were further compared among callers, filtering out as discordant any variants shared by six or fewer callers; the remaining variants were taken as caller-concordant. The caller-concordant variants for NA12878 were defined as HRVs. The caller-concordant variants for the twins of the Chinese Quartet were compared to filter out variants discordant between the twins, and the remaining caller-concordant variants were used for Mendelian-rule compliance checking together with the caller-concordant variants of the Chinese Quartet parent samples and the HapMap trio samples. Variants violating the Mendelian rule were filtered out as discordant, and the remaining Mendelian-rule-compliant variants were defined as HRVs.
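For the first filtering step, sites present in at least two of the three replicate VCFs can be kept with bcftools isec, as in this hedged sketch (file names are placeholders; the subsequent lab, aligner, caller, and Mendelian filters follow the same intersect-and-filter pattern):

```bash
# Write, into the output directory, the subset of each replicate's records
# whose sites are shared by at least two of the three replicates
bcftools isec -n+2 -p replicate_concordant rep1.vcf.gz rep2.vcf.gz rep3.vcf.gz
```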
Defining HRRs
For each sample, its HRVs and all discordant variants were used to filter the callable regions. For each discordant variant, the genomic region spanning 50 bp on either side of the variant was compared with the HRVs. If no HRV was located in this region, the region was removed from the callable regions. If the region contained HRVs, half of the region between the discordant variant and the nearest HRV was removed. After removal of such regions for all discordant variants, the remaining callable regions were defined as the HRR of the sample.
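The first removal rule (a discordant variant with no HRV within 50 bp on either side) can be sketched with bedtools; the half-distance rule for discordant variants near an HRV requires per-variant arithmetic and is omitted here. File names are placeholders:

```bash
# Discordant variants with no HRV within 50 bp on either side
bedtools window -a discordant.bed -b hrv.bed -w 50 -v > isolated_discordant.bed

# Expand each such variant by 50 bp in both directions and
# remove the expanded windows from the callable regions
bedtools slop -i isolated_discordant.bed -g genome.sizes -b 50 \
  | bedtools subtract -a callable_regions.bed -b stdin > hrr.bed
```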
Filtering variants from different pipelines
Different callers report variants with different minimum read depths. We applied depth filtering prior to the lower-bound reproducibility calculation so that the variants entering the calculation share the same minimum read depth. We also filtered variants with very high read depth, using a cutoff of the mean read depth plus three times the standard deviation of the read depth across all variants. Specifically, we used a minimum read depth of 8 for all samples, a maximum read depth of 223 for the Chinese Quartet samples and NA12878, and a maximum read depth of 350 for the HapMap trio samples. To assess the upper bound of reproducibility, we selected variants only within the HRR; specifically, the vcffilter command of RTG Tools was used to filter out variants outside the HRR.
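Assuming the filters were expressed as rtg vcffilter options of the following form (option names are taken from the RTG Tools manual but should be treated as an assumption; thresholds shown are those for a Chinese Quartet sample):

```bash
# Keep variants with 8 <= DP <= 223 that fall inside the sample's HRR
rtg vcffilter -i sample.vcf.gz -o sample.filtered.vcf.gz \
  --min-read-depth 8 --max-read-depth 223 \
  --include-bed HRR_sample.bed
```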
Reproducibility calculation
We calculated four types of reproducibility: technical reproducibility, lab reproducibility, aligner reproducibility, and caller reproducibility.
A reproducibility value between two variant sets a (N^a variants) and b (N^b variants) was calculated using the vcfeval command of RTG Tools (v3.9). First, we indexed the reference genome into SDF format with RTG’s format command. We then took set a as the query and set b as the baseline for the calculation, which outputs four sets of variants: variants unique to a, variants of a found in b (n^a), variants unique to b, and variants of b found in a (n^b). Basic information such as variant type and number was counted with RTG’s vcfstats command. All counts were extracted with a shell script and used to calculate the reproducibility R with equation (1).
$$ R=\frac{1}{2}\left(\frac{n^a}{N^a}+\frac{n^b}{N^b}\right) $$
(1)
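A sketch of one pairwise calculation (file names are placeholders; tp.vcf.gz and tp-baseline.vcf.gz are standard vcfeval outputs holding, respectively, query variants confirmed in the baseline and the matched baseline variants):

```bash
# One-time: index the reference into SDF format
rtg format -o GRCh38.d1.vd1.sdf GRCh38.d1.vd1.fa

# Compare set a (query, -c) against set b (baseline, -b)
rtg vcfeval -t GRCh38.d1.vd1.sdf -b set_b.vcf.gz -c set_a.vcf.gz -o eval_ab

# Count records and evaluate equation (1)
na=$(zcat eval_ab/tp.vcf.gz | grep -vc '^#')            # variants of a found in b
Na=$(zcat set_a.vcf.gz | grep -vc '^#')
nb=$(zcat eval_ab/tp-baseline.vcf.gz | grep -vc '^#')   # variants of b found in a
Nb=$(zcat set_b.vcf.gz | grep -vc '^#')
echo "scale=4; ($na/$Na + $nb/$Nb) / 2" | bc            # reproducibility R
```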
Calculation of precision, recall, and F-score
Precision, recall, and F-score were calculated by comparing a set of variants with its corresponding HRVs using equations (2)–(4). The comparison was performed with RTG’s vcfeval command.
$$ \mathrm{Precision}=\frac{Q^c}{Q^c+{Q}^u} $$
(2)
$$ \mathrm{Recall}=\frac{Q^c}{Q^c+{H}^u} $$
(3)
$$ \mathrm{F_{score}}=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} $$
(4)
where Q^c is the number of common variants, Q^u is the number of variants in the query set but not in the HRVs, and H^u is the number of variants in the HRVs but not in the query set.
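In practice, vcfeval reports these three metrics directly; a sketch with placeholder file names:

```bash
# Compare a pipeline's calls (-c) against the sample's HRVs (-b); summary.txt
# reports precision, sensitivity (recall), and F-measure for the comparison
rtg vcfeval -t GRCh38.d1.vd1.sdf -b HRV_sample.vcf.gz -c pipeline_calls.vcf.gz -o eval_vs_hrv
cat eval_vs_hrv/summary.txt
```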