Samplot provides a quick and straightforward platform for rapidly identifying false positives and enhancing the analysis of true-positive SV calls. Samplot images are concise SV visualizations that highlight the most relevant evidence in the variable region and hide less informative reads. This view provides easily curated images for rapid SV review. Samplot supports all major sequencing technologies and excels at comparisons between samples and technologies. Users generally require fewer than 5 seconds to interpret a Samplot image [12], making Samplot an efficient option for reviewing thousands of SVs. These simple images contrast with existing tools such as IGV, bamsnap, and svviz, which allow more in-depth, but more complex and time-consuming, SV-region plotting (see Fig. 1, Additional file 1: Figures S1-S5).
Samplot is also designed for easy application to various types of SV study, such as comparing the same region across different samples (Additional file 1: Figure S6) and sequencing technologies (Additional file 1: Figure S7) for family, case-control, or tumor-normal studies. Annotations such as genes, repetitive regions, or other functional elements can be added to help add context to SV calls (Fig. 1).
Samplot supports short-read sequencing from Illumina, long-read sequencing from Pacific Biosciences or Oxford Nanopore Technologies, and linked-read sequencing from 10X Genomics. Samplot works well for most SV types with each of these sequencing technologies and can also plot images without specifying a variant type, enabling review of complex or ambiguous SV types, or non-SV regions.
Producing images that appropriately summarize the evidence supporting an SV without overwhelming the viewer is an intricate task. Samplot includes the three most essential categories of SV evidence: split reads, discordant pairs, and coverage anomalies. To reduce confusion, we distinguish between sequences and alignments. A sequence (also called a read) is a series of nucleotides produced by a short- or long-read sequencing platform. An alignment describes how a sequence (or read) maps to the reference genome. Sequences that originate from a region of a sample’s genome that does not include an SV will have a single complete, continuous alignment. When a sequence includes an SV, it will produce multiple alignments or unaligned segments. The configuration of these alignments indicates the SV type: deletions create gaps between alignments, duplications create overlapping alignments, and inversions produce alignments that switch between strands. Similarly, a read pair that spans an SV in the unsequenced region between its paired-end reads will produce a discordant alignment whose configuration indicates the SV type.
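The mapping from alignment configuration to SV type described above can be sketched for the simplest case of a read split into two alignments on the same chromosome. The function and tuple layout below are illustrative only, not Samplot's internal API; real callers must also handle multi-segment splits, inter-chromosomal events, and mapping-quality filters.

```python
# Illustrative sketch: infer an SV type from two split-read alignments on
# the same chromosome. Each alignment is (ref_start, ref_end, strand),
# listed in the order the segments appear along the read.

def infer_sv_type(aln1, aln2):
    start1, end1, strand1 = aln1
    start2, end2, strand2 = aln2
    if strand1 != strand2:
        # A strand switch between consecutive segments suggests an inversion.
        return "INV"
    if start2 > end1:
        # A reference gap between consecutive segments suggests a deletion.
        return "DEL"
    if start2 < start1:
        # The second segment maps upstream of the first, as in a tandem duplication.
        return "DUP"
    return "UNKNOWN"

print(infer_sv_type((100, 200, "+"), (500, 600, "+")))  # DEL
print(infer_sv_type((100, 200, "+"), (50, 150, "+")))   # DUP
print(infer_sv_type((100, 200, "+"), (300, 400, "-")))  # INV
```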
Samplot identifies, color-codes, and elevates split or discordant alignments so that users can clearly and quickly distinguish between normal reads and reads supporting different SV types (Fig. 2, Additional file 1: Figure S8, Additional file 1: Supplemental Note 1). These plots often include scatterings of misaligned reads that can fool automated tools, but visual review can generally determine quickly whether groups of reads support an SV, allowing rapid, high-confidence variant review.
Coverage depth is also an essential piece of data for evaluating the SVs that affect genomic copy number (copy number variants or CNVs) and can, in some cases, provide the best signal of a CNV. Samplot includes a background track with up to base-pair resolution of the fluctuations in coverage depth across the plot region. Samplot follows a minimal decision-making strategy and makes no computational attempt to assign reads or coverage deviations to putative variant coordinates; this task is left instead to the user via visual curation.
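A base-pair-resolution depth track of the kind described above can be computed efficiently with a difference array over alignment intervals. This is a minimal sketch of the idea, not Samplot's implementation; intervals are half-open `[start, end)` in region coordinates.

```python
# Sketch of a per-base coverage computation: mark where each alignment
# enters and leaves the region, then take a running sum to recover depth.

def coverage_track(alignments, region_start, region_end):
    depth = [0] * (region_end - region_start + 1)
    for start, end in alignments:
        s = max(start, region_start) - region_start
        e = min(end, region_end) - region_start
        if s < e:
            depth[s] += 1   # alignment enters coverage here
            depth[e] -= 1   # alignment leaves coverage here
    # The running sum converts the difference array into per-base depth.
    for i in range(1, len(depth)):
        depth[i] += depth[i - 1]
    return depth[:-1]

# Three reads over a 10-bp window.
print(coverage_track([(0, 5), (2, 7), (2, 7)], 0, 10))  # [1, 1, 3, 3, 3, 2, 2, 0, 0, 0]
```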
Samplot is implemented in Python and uses the pysam [17] module to extract read information from alignment (BAM or CRAM) files, then plots reads for review in static images. Speed is a key goal of Samplot, in keeping with its overall focus on simple and rapid SV review, so plots are created with the Matplotlib library, which is optimized for the rapid creation of high-quality images.
Filtering and viewing SV call sets with Samplot VCF
When working with large SV call sets, especially multi-sample VCF files, users often need to review evidence for SVs in multiple samples together. Samplot provides a VCF-specific option to interrogate such call sets using cohort genotypes, an optional pedigree file in family-based cohorts, and additional annotation fields for filtering and plotting multiple SVs across multiple samples. This enables users to focus on rare variation, variants in certain genome regions, or other criteria related to a research goal. A simple query language that is inspired by slivar [18] allows users to customize filters based on variant annotations in the VCF file. From the chosen variants, a web page is dynamically created with a table of variant information, additional filtering options, and quick access to Samplot images for visual review (Fig. 3).
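A slivar-inspired filter of the kind described above can be sketched as an expression evaluated against each variant's annotations. The field names (`SVTYPE`, `SVLEN`, `AF`) follow common VCF conventions, but the expression syntax below is illustrative Python, not Samplot VCF's actual query language.

```python
# Minimal sketch of an annotation-based variant filter: evaluate a
# user-supplied expression against each variant's annotation fields.

def filter_variants(variants, expression):
    compiled = compile(expression, "<filter>", "eval")
    for v in variants:
        # Expose only the variant's own annotation fields to the expression.
        if eval(compiled, {"__builtins__": {}}, dict(v)):
            yield v

variants = [
    {"SVTYPE": "DEL", "SVLEN": -1200, "AF": 0.002},
    {"SVTYPE": "DUP", "SVLEN": 800,   "AF": 0.150},
    {"SVTYPE": "DEL", "SVLEN": -300,  "AF": 0.300},
]

# Keep rare deletions at least 500 bp long.
rare_dels = list(filter_variants(variants, "SVTYPE == 'DEL' and SVLEN <= -500 and AF < 0.01"))
print(rare_dels)  # only the 1200-bp deletion passes
```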
Samplot VCF can be readily adapted to experimental needs common in SV studies. For example, a team attempting to identify a causal SV in a familial rare disease study might include a small number of control samples as well as the affected family and use built-in filtering options to plot only variants which appear uniquely in the offspring, with controls included in the resulting images for comparison. Samplot VCF is equally well-suited for other problems such as cohort-based analysis of common SVs or tumor-normal comparison (potentially with multiple samples in each category).
Automated SV curation with Samplot-ML
Convolutional neural networks (CNNs) are an effective tool for image classification tasks. Since Samplot generates images that allow the human eye to adjudicate SVs, we were motivated to test whether a CNN could discern the same patterns. To that end, we developed Samplot-ML, a CNN built on top of Samplot to automatically classify putative deletions, the most common SV type. The workflow for Samplot-ML is simple: given a whole-genome sequenced sample (BAM or CRAM [19]) and a set of putative deletions (VCF [20]), Samplot-ML re-genotypes each putative deletion using the Samplot-generated image. The result is a call set in which most false positives are flagged.
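The re-genotyping loop can be sketched as follows. Both `render_image` and `cnn_probabilities` are hypothetical stand-ins for Samplot's plotting step and the trained Samplot-ML model; only the overall shape of the workflow (image in, genotype probabilities out, reference-genotype calls flagged) follows the text.

```python
# Workflow sketch: re-genotype each putative deletion from its image and
# flag likely false positives. The stub model below is NOT the real CNN;
# it reads a score planted in a fake "image" purely for illustration.

def cnn_probabilities(image):
    # A real model would return P(ref), P(het), P(hom_alt) from pixels.
    score = image["support"]
    return {"ref": 1.0 - score, "het": score * 0.6, "alt": score * 0.4}

def regenotype(calls, render_image=lambda call: {"support": call["support"]}):
    annotated = []
    for call in calls:
        probs = cnn_probabilities(render_image(call))
        # Flag the call when the reference genotype is most probable.
        annotated.append(dict(call, probs=probs,
                              flagged=probs["ref"] >= max(probs.values())))
    return annotated

calls = [{"chrom": "1", "start": 1000, "end": 2200, "support": 0.9},
         {"chrom": "2", "start": 5000, "end": 5600, "support": 0.1}]
for c in regenotype(calls):
    print(c["chrom"], "flagged" if c["flagged"] else "kept")
```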
Using Samplot-ML, we demonstrate a 51.4% reduction in false positives while keeping 96.8% of true positives on average across short-read samples from the Human Genome Structural Variation Consortium (HGSVC) [21]. We also trained a long-read model with the same architecture and reduced false positives by 27.8%. Our model is highly general and can classify SVs in sequences generated by libraries that differ in depth, read length, and insert size from the training set. The Samplot-ML classification process is completely automated and runs at about 100 SVs per second using a GPU and 10 SVs per second using only a CPU. Most single-genome SV call sets from methods such as LUMPY [22] and MANTA [23], which typically yield between 7000 and 10,000 SVs, will finish in about 1 min. The result is an annotated VCF with the classification probabilities encoded in the FORMAT field.
While Samplot-ML could support any SV type, the current model only includes deletions. There are too few called duplications, insertions, inversions, and translocations in the available data to train a high-quality model. For example, the 1000 Genomes Project phase 3 SV call set [4] included 40,922 deletions, 6006 duplications, 162 insertions, 786 inversions, and no translocations.
To evaluate the short-read model, we considered the samples from the HGSVC with long-read-validated SVs. First, we called SVs in HG00514, HG00733, and NA19240 using LUMPY/SVTYPER [10] (via smoove [24]) and MANTA. Next, we filtered those SVs using the heuristic-based method duphold [11], the graph-based SV genotyper Paragraph [25], the support vector machine classifier SV2 [26], and our CNN. In each case, we measured the number of true positives and the number of false positives with respect to the long-read-validated deletions using Truvari [27] (Fig. 4a–c, Additional file 2: Table S1). In all cases, both duphold and Samplot-ML removed hundreds of false positives while retaining nearly every true positive. Paragraph and SV2 removed most of the false positives but retained far fewer true positives. Paragraph, like other graph-based methods, is also highly sensitive to breakpoint precision (Additional file 1: Figure S9), which explains the differences in its performance between LUMPY and MANTA calls. On average, duphold reduces false positives by 32.6% and true positives by 1.1% (Additional file 1: Figures S10-S11). Samplot-ML reduces false positives by 53.4% and true positives by 2.4%. Paragraph reduces false positives by 62.5% and true positives by 29.9%, while SV2 reduces them by 84.2% and 63.4%, respectively. A more refined analysis that evaluates performance by genotype could measure the extent to which the model learns one-copy and two-copy loss states, but this truth set did not include genotypes. The Genome in a Bottle (GIAB) truth set [28], discussed next, does include genotypes, and one- and two-copy loss results are decomposed below.
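The true-positive/false-positive accounting above can be illustrated with a simplified reciprocal-overlap match against a truth set. This is only a sketch of the idea: real Truvari matching also weighs breakpoint distance, size similarity, and genotype, and the intervals and threshold below are made up for illustration.

```python
# Simplified benchmarking sketch: a called deletion counts as a true
# positive if it reciprocally overlaps some truth deletion by >= 50%.

def reciprocal_overlap(a, b):
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    if end <= start:
        return 0.0
    overlap = end - start
    # Reciprocal overlap: the smaller of the two overlap fractions.
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def benchmark(calls, truth, min_ro=0.5):
    tp = sum(1 for c in calls
             if any(reciprocal_overlap(c, t) >= min_ro for t in truth))
    return tp, len(calls) - tp  # (true positives, false positives)

truth = [(1000, 2000), (5000, 5400)]
calls = [(1050, 1980), (5000, 5400), (7000, 7300)]
print(benchmark(calls, truth))  # (2, 1)
```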
The long-read model uses the same architecture and process as the short-read model, except it is trained on genomes sequenced using PacBio Single Molecule, Real-Time (SMRT) Sequencing. Since training used the HGSVC samples, the evaluation is based on the GIAB truth set [28], which includes multiple validations, including visual review, for the long-read sample HG002. We called SVs using Sniffles [29], filtered those SVs using the CNN, and measured the number of true positives and false positives with Truvari (Fig. 4d, Additional file 2: Table S1). Samplot-ML reduces false positives by 27.8% and true positives by only 1.4%.
Generality can be an issue with machine learning models. A distinct advantage of training and classifying with Samplot is that its images are relatively consistent across different sequencing properties, so the models still perform well when applied to different sequencing libraries. For example, our short-read model was trained on paired-end sequences at 20X coverage with 150-bp reads and a 400-bp insert size, while the samples in the evaluations above (Fig. 4a–c) had shorter reads (126 bp), a larger insert size (500 bp), and greater depth (68X). Additionally, we considered two libraries from the same Genome in a Bottle sample (HG002), where one was sequenced at 20X coverage with 150-bp reads and a 550-bp insert size and the other at 60X coverage with 250-bp reads and a 400-bp insert size (Fig. 5a). The model performed equally well across all libraries, clearly demonstrating that new models are not required for each library. Additionally, between LUMPY and MANTA, Samplot-ML correctly genotyped 91.28% of hemizygous deletions (one-copy losses) and 97.26% of homozygous deletions (two-copy losses) for the 20X run (Fig. 4a). For the 60X run (Fig. 4b), Samplot-ML correctly genotyped 94.57% of hemizygous deletions and 97.26% of homozygous deletions. These results clearly show that the model has learned both copy loss states.
Samplot-ML is intended for the evaluation of germline deletion calls, but may also be useful in some somatic variant call sets. Calling SVs in tumor samples can be a challenge when subclones and normal tissue contamination produce variants with a wide range of allele balances (the ratio of reads from the variant allele to the total number of reads). The result is fewer discordant alignments and a less distinct change in coverage, which has a direct effect on the Samplot images (Additional file 1: Figure S12). To test how well our model performs in these instances, we mixed sequences from two homozygous diploid cell lines (CHM1 and CHM13) at different rates (Fig. 5b), then reclassified SVs from a truth set [8] using duphold, SV2, and Samplot-ML. Paragraph was omitted from this experiment due to unresolved runtime errors. For each combination, we compared how many true-positive SVs each method recovered from the minor allele. The recovery rates of Samplot-ML and duphold were similar, ranging from over 70% when the samples were equally mixed (0.5 allele balance) to less than 40% when the SV minor allele was at 0.1 (Additional file 3: Table S2), but Samplot-ML improved over duphold as the minor allele became more rare, peaking at a 12.9% improvement when CHM1 was the minor allele at 20%. SV2's low sensitivity resulted in poor performance as the minor allele balance decreased. Low-frequency SVs are clearly difficult to detect and filter, and some cases, such as intrachromosomal translocations mis-identified as deletions, may prove overly complex for automated SV evaluation tools. Samplot-ML's classifier, however, is robust to fluctuations in evidence depth, further demonstrating the generality of the model.
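The arithmetic behind the weakening evidence in the mixture experiment can be made concrete. Because both cell lines are homozygous diploids, every read drawn from the minor line at a variant private to it supports the variant, so the expected number of supporting reads shrinks linearly with the mixture fraction. The 30X depth below is an illustrative value, not a figure from the experiment.

```python
# Arithmetic sketch: expected variant-supporting reads at a deletion
# private to the minor cell line, as a function of the mixture fraction.

def expected_support(total_depth, minor_fraction):
    # Every read sampled from the (homozygous) minor line carries the variant.
    return total_depth * minor_fraction

for frac in (0.5, 0.3, 0.2, 0.1):
    print(f"allele balance {frac}: ~{expected_support(30, frac):.0f} of 30 reads support the SV")
```

At a 0.1 allele balance, only about 3 of 30 reads carry the deletion signal, which is why discordant-pair and coverage evidence become faint in the corresponding Samplot images.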