Assessing reproducibility of inherited variants detected with short-read whole genome sequencing

Background: Reproducible detection of inherited variants with whole genome sequencing (WGS) is vital for the implementation of precision medicine and is a complicated process in which each step affects variant call quality. Systematically assessing the reproducibility of inherited variants detected with WGS, and the impact of each step in the process, is needed for understanding and improving the quality of inherited variants from WGS.

Results: To dissect the impact of factors involved in detection of inherited variants with WGS, we sequence triplicates of eight DNA samples representing two populations on three short-read sequencing platforms using three library kits in six labs and call variants with 56 combinations of aligners and callers. We find that bioinformatics pipelines (callers and aligners) have a larger impact on variant reproducibility than WGS platform or library preparation. Single-nucleotide variants (SNVs), particularly outside difficult-to-map regions, are more reproducible than small insertions and deletions (indels), which are least reproducible when > 5 bp. Increasing sequencing coverage improves indel reproducibility but has limited impact on SNVs above 30×.

Conclusions: Our findings highlight sources of variability in variant detection and the need for improvement of bioinformatics pipelines in the era of precision medicine with WGS.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-021-02569-8.

Four-letter codes are used for samples (HM-A: NA10835; HM-C: NA12248; HM-D: NA12249; HG001: GIAB reference sample NA12878; CQ-5 and CQ-6: Chinese quartet twin daughters; CQ-7: Chinese quartet father; CQ8: Chinese quartet mother). The left eight and right eight stacked bars show the base composition (left y-axis) of the highly reproducible insertions in the coding regions and non-coding regions (x-axis label), respectively. The four bases are color coded (blue for G, red for C, yellow for A, and purple for T). The black dashed line represents the frequency of G and C in the reference genome. The solid white circles show the percentage of insertions (left y-axis) in the coding region (1.06-1.11%) and non-coding region (98.89-98.94%). The solid cyan diamonds depict the number of insertions (right y-axis, in log10 units) in the coding region  and non-coding region (183772-236155).

The left eight and right eight stacked bars show the base composition (left y-axis) of the highly reproducible deletions in the coding regions and non-coding regions (x-axis label), respectively. The four bases are color coded (blue for G, red for C, yellow for A, and purple for T). The black dashed line represents the frequency of G and C in the reference genome. The solid white circles show the percentage of deletions (left y-axis) in the coding region (1.09-1.12%) and non-coding region (98.88-98.91%). The solid cyan diamonds depict the number of deletions (right y-axis, in log10 units) in the coding region (1706-2242) and non-coding region.
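The base-composition bars described in these captions could be computed along the following lines. This is a minimal sketch, assuming the composition is pooled over all inserted (or deleted) bases in a call set; the source does not state how it was calculated, and the function names and toy sequences are hypothetical.

```python
from collections import Counter


def base_composition(sequences):
    """Return the pooled fraction of A, C, G and T across all sequences.

    Assumption: bases are pooled over every indel sequence in the set,
    rather than averaged per variant. Non-ACGT characters are ignored.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}


def gc_fraction(composition):
    """Combined G and C fraction, comparable to a reference GC frequency."""
    return composition["G"] + composition["C"]


# Made-up inserted sequences, for illustration only (not data from the study).
toy_insertions = ["A", "AT", "GC", "TTT"]
composition = base_composition(toy_insertions)
print(composition)               # {'A': 0.25, 'C': 0.125, 'G': 0.125, 'T': 0.5}
print(gc_fraction(composition))  # 0.25
```

Applying the same function separately to coding and non-coding indels for each sample would give one stacked bar per sample and region.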
Four-letter codes are used for samples (HM-A: NA10835; HM-C: NA12248; HM-D: NA12249; HG001: GIAB reference sample NA12878; CQ-5 and CQ-6: Chinese quartet twin daughters; CQ-7: Chinese quartet father; CQ8: Chinese quartet mother). The repeatability distributions from the call sets without or with bed file filtering in the original study are plotted with solid and dashed curves, respectively, while the repeatability distributions from the call sets without or with bed file filtering in the confirmatory study are plotted with dotted and dash-dot curves, respectively.

Three comparison pairs are presented in different colors: red for the comparison between reproducibility from lab pair ARD_NVG and reproducibility from lab pair ARD_WUX; green for the comparison between reproducibility from lab pair ARD_NVG and reproducibility from lab pair NVG_WUX; blue for the comparison between reproducibility from lab pair ARD_WUX and reproducibility from lab pair NVG_WUX.

Fig. S29.
GATK impact on repeatability of original HapMap data without filtering by HRRs. The points represent repeatability for pipelines with GATK realignment and without GATK realignment for various sample sets. Three types of variants are shown in three different colors: red for SNV, green for insertion, and blue for deletion. Different samples are shown with different shapes. The x-axis depicts repeatability calculated from pipelines without GATK realignment in their variant calling process, while the y-axis represents repeatability from the corresponding pipeline (the same aligner and caller) with GATK realignment.
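The pairing behind such a scatter plot can be sketched as follows: each point matches a pipeline without GATK realignment (x) to the pipeline with the same aligner and caller with GATK realignment (y). This is only an illustration under stated assumptions; the record layout, field names, pipeline names, and repeatability values below are hypothetical and not taken from the study.

```python
from collections import defaultdict


def pair_by_realignment(records):
    """Pair repeatability values of pipelines that differ only in GATK realignment.

    Each record is a dict with the (hypothetical) keys: aligner, caller,
    sample, variant_type, gatk_realignment (bool) and repeatability.
    Returns {key: (x, y)} where x is the value without realignment and
    y is the value with realignment; unpaired records are dropped.
    """
    grouped = defaultdict(dict)
    for rec in records:
        key = (rec["aligner"], rec["caller"], rec["sample"], rec["variant_type"])
        axis = "y" if rec["gatk_realignment"] else "x"
        grouped[key][axis] = rec["repeatability"]
    return {
        key: (values["x"], values["y"])
        for key, values in grouped.items()
        if "x" in values and "y" in values
    }


# Made-up records, for illustration only.
records = [
    {"aligner": "alignerA", "caller": "callerX", "sample": "HM-A",
     "variant_type": "SNV", "gatk_realignment": False, "repeatability": 0.90},
    {"aligner": "alignerA", "caller": "callerX", "sample": "HM-A",
     "variant_type": "SNV", "gatk_realignment": True, "repeatability": 0.92},
    # No matching record with realignment, so this one yields no point.
    {"aligner": "alignerA", "caller": "callerX", "sample": "HM-A",
     "variant_type": "Insertion", "gatk_realignment": False, "repeatability": 0.70},
]
points = pair_by_realignment(records)
```

Points above the diagonal (y > x) would then correspond to pipelines whose repeatability is higher with GATK realignment than without it.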

Fig. S30.
GATK impact on repeatability of original CQ data without filtering by HRRs. The points represent repeatability for pipelines with GATK realignment and without GATK realignment for various sample sets. Three types of variants are shown in three different colors: red for SNV, green for insertion, and blue for deletion. Different samples are shown with different shapes. The x-axis depicts repeatability calculated from pipelines without GATK realignment in their variant calling process, while the y-axis represents repeatability from the corresponding pipeline (the same aligner and caller) with GATK realignment.

Fig. S31.
GATK impact on repeatability of original HapMap data with filtering by HRRs. The points represent repeatability for pipelines with GATK realignment and without GATK realignment for various sample sets. Three types of variants are shown in three different colors: red for SNV, green for insertion, and blue for deletion. Different samples are shown with different shapes. The x-axis depicts repeatability calculated from pipelines without GATK realignment in their variant calling process, while the y-axis represents repeatability from the corresponding pipeline (the same aligner and caller) with GATK realignment.

Fig. S32.
GATK impact on repeatability of original CQ data with filtering by HRRs. The points represent repeatability for pipelines with GATK realignment and without GATK realignment for various sample sets. Three types of variants are shown in three different colors: red for SNV, green for insertion, and blue for deletion. Different samples are shown with different shapes. The x-axis depicts repeatability calculated from pipelines without GATK realignment in their variant calling process, while the y-axis represents repeatability from the corresponding pipeline (the same aligner and caller) with GATK realignment.

Fig. S33.
Scatter plot of the impact of GATK realignment on lab reproducibility for original CQ data without filtering by HRRs. The points represent lab reproducibility of variants called from alignments with GATK realignment and without GATK realignment. Three types of variants are shown in three different colors: red for SNV, green for insertion, and blue for deletion. Different samples are shown with different shapes. The x-axis depicts lab reproducibility calculated from variants obtained without GATK realignment, while the y-axis represents lab reproducibility of variants obtained with GATK realignment.

Fig. S34.
Scatter plot of the impact of GATK realignment on lab reproducibility for original CQ data with filtering by HRRs. The points represent lab reproducibility of variants called from alignments with GATK realignment and without GATK realignment. Three types of variants are shown in three different colors: red for SNV, green for insertion, and blue for deletion. Different samples are shown with different shapes. The x-axis depicts lab reproducibility calculated from variants obtained without GATK realignment, while the y-axis represents lab reproducibility of variants obtained with GATK realignment.

Fig. S40.
Scatter plot for the impact of GATK realignment on caller reproducibility of original CQ data without filtering by HRRs. The points represent caller reproducibility for calling pipelines with GATK realignment and without GATK realignment for various sample sets (CQ-5 and CQ-6: Chinese quartet twin daughters; CQ-7: Chinese quartet father; CQ8: Chinese quartet mother). Three types of variants are shown in three different colors: red for SNV, green for insertion, and blue for deletion. Different samples are shown with different shapes. The x-axis depicts caller reproducibility without GATK realignment in variant calling, while the y-axis represents caller reproducibility with GATK realignment.

Fig. S41.
Scatter plot for the impact of GATK realignment on caller reproducibility of original HapMap data with filtering by HRRs. The points represent caller reproducibility for calling pipelines with GATK realignment and without GATK realignment for various sample sets (HM-A: NA10835; HM-C: NA12248; HM-D: NA12249). Three types of variants are shown in three different colors: red for SNV, green for insertion, and blue for deletion. Different samples are shown with different shapes. The x-axis depicts caller reproducibility without GATK realignment in variant calling, while the y-axis represents caller reproducibility with GATK realignment.

Fig. S42.
Scatter plot for the impact of GATK realignment on caller reproducibility of original CQ data with filtering by HRRs. The points represent caller reproducibility for calling pipelines with GATK realignment and without GATK realignment for various sample sets (CQ-5 and CQ-6: Chinese quartet twin daughters; CQ-7: Chinese quartet father; CQ8: Chinese quartet mother). Three types of variants are shown in three different colors: red for SNV, green for insertion, and blue for deletion. Different samples are shown with different shapes. The x-axis depicts caller reproducibility without GATK realignment in variant calling, while the y-axis represents caller reproducibility with GATK realignment.