Reproducible germline variant detection with whole genome sequencing (WGS) is vital for the implementation of precision medicine. However, the detection of variants in repetitive and difficult regions of the genome remains challenging, even though these regions harbor known disease-associated genes of clinical importance. The WGS workflow is also lengthy and complex, with each step, from sample preparation and sequencing to bioinformatic analysis, affecting the detection of germline variants.
To evaluate the detection of germline variants, SEQC2 performed WGS on reference genomes from two human populations using most major platforms and methods, including PCR-free, short-read, long-read, whole-genome, and targeted exome sequencing methods. Variants were then detected from the resulting sequencing data using more than fifty combinations of alignment and variant-calling bioinformatic tools. Performance was evaluated according to read alignment and coverage, error rates, and the sensitivity and specificity for correctly detecting known germline and structural variants in the reference genomes. These metrics were then stratified across genomic regions, including repeats, transposons, duplications, and other challenging regions of the human genome [8].
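The stratified evaluation described above can be illustrated with a minimal sketch. It is not the SEQC2 pipeline: the function name `stratified_metrics`, the exact-match comparison of variants, and the toy coordinates are all assumptions made for illustration. The sketch also reports precision alongside sensitivity, on the assumption that true negatives are not counted directly when comparing call sets; the study itself reports sensitivity and specificity.

```python
from collections import defaultdict


def stratified_metrics(truth, calls, regions):
    """Compare called variants to a truth set within named genomic strata.

    truth, calls: sets of (chrom, pos, ref, alt) tuples (exact-match comparison,
                  a simplifying assumption; real benchmarking normalizes variants).
    regions: dict mapping a stratum name to a list of half-open
             (chrom, start, end) intervals. Variants outside every
             interval are assigned to the stratum "other".
    """
    def stratum_of(variant):
        chrom, pos = variant[0], variant[1]
        for name, intervals in regions.items():
            for region_chrom, start, end in intervals:
                if region_chrom == chrom and start <= pos < end:
                    return name
        return "other"

    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for variant in truth | calls:
        stratum = stratum_of(variant)
        if variant in truth and variant in calls:
            counts[stratum]["tp"] += 1   # known variant, correctly called
        elif variant in truth:
            counts[stratum]["fn"] += 1   # known variant, missed by the caller
        else:
            counts[stratum]["fp"] += 1   # called variant absent from the truth set

    metrics = {}
    for name, c in counts.items():
        tp, fp, fn = c["tp"], c["fp"], c["fn"]
        metrics[name] = {
            "tp": tp,
            "fp": fp,
            "fn": fn,
            "sensitivity": tp / (tp + fn) if (tp + fn) else None,
            "precision": tp / (tp + fp) if (tp + fp) else None,
        }
    return metrics


# Hypothetical toy data: one repeat region on chr1, four known variants.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr1", 1500, "AT", "A"),
    ("chr1", 1600, "G", "GA"),
}
calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr1", 1500, "AT", "A"),
    ("chr1", 1700, "C", "G"),
}
regions = {"repeat": [("chr1", 1000, 2000)]}

for name, m in sorted(stratified_metrics(truth, calls, regions).items()):
    print(name, m)
```

In this toy example both errors (one missed indel and one spurious call) fall inside the repeat stratum, so a genome-wide summary would understate how poorly the caller performs there. This is the motivation for reporting metrics per region rather than only in aggregate.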
The analysis found that the bioinformatic workflow, including alignment and variant-calling tools, had the largest impact on reproducibility between laboratories. For example, most errors were false negatives, that is, true variants that variant callers failed to report. The detection of insertions and deletions (indels) was particularly challenging, and larger, complex structural variants were routinely missed. These findings highlight the primary sources of variability in the detection of germline variants and the need for improved and standardized bioinformatics workflows to support the use of WGS in precision medicine.
These studies showed that the reliable detection of variants in difficult, repetitive, or polymorphic regions of the human genome remains challenging. Given that natural genomes are unable to provide a clear reference standard for these difficult regions, SEQC2 developed synthetic controls that provide an unambiguous representation of difficult sequences, including complex variants, viral and transposon insertions, duplications, translocations, haplotype blocks, and immune receptors. These synthetic controls were used to benchmark the performance of diverse sequencing technologies in resolving these difficult regions and to provide best-practice guidelines for optimizing analysis, with the ultimate aim of expanding the diagnostic yield of WGS into these regions.