Overview of Inspector
Inspector is a tool for evaluating long-read de novo assembly results. As shown in Fig. 1, inspector consists of the following main functions: (1) standard assembly metrics; (2) structural error identification; (3) small-scale error identification; and (4) assembly error correction. Inspector also introduces a Quality Value (QV) to estimate the overall assembly quality. Given a reference genome, Inspector can assess synteny by aligning contigs to the reference genome. The detailed methods and implementation are described below.
Contig continuity and read alignment
Inspector first calculates standard assembly statistical metrics and then evaluates contig continuity based on the lengths of all contigs. Standard statistical metrics include number of contigs, total bases in the assembly, longest and second longest contig lengths, and N50, which reflect continuity of assembly results.
The statistics of read-to-contig alignments are also calculated to assess assembly quality, including read mapping rate, read splitting rate, and average alignment depth. Read mapping rate indicates the proportion of reads that can be aligned to assembled contigs. A higher read mapping rate suggests better completeness of the assembly, while a lower mapping rate suggests that parts of the genome have not been reconstructed in the assembly. The read splitting rate is the proportion of aligned reads that have split alignments. A low read splitting rate indicates better consistency between reads and assemblies and fewer large assembly errors. In contrast, a high splitting rate suggests that there are more assembly errors which have caused the divergence between reads and assembled contigs. The average alignment depth is calculated as total length of aligned reads divided by total contig length. For good assembly, average alignment depth should be similar to sequencing depth of input reads.
Structural assembly errors
Inspector detects structural assembly errors (≥ 50 bp) based on disagreement between reads and assembled contigs. The first step is to scan all read alignments for raw error signals of expansion (gap in read alignment), collapse (extra sequence in read), and inversion (inverted read alignment). Density-based clustering is then performed independently for each type of structural error. Instead of setting a fixed window size for clustering raw signals, Inspector’s density-based clustering utilizes adjustable window size to tolerate larger shifts of raw signal positions within repetitive regions while keeping tight window size for clear genomic regions. Expansions and collapses are merged to identify haplotype switches, in which expansions overlap with collapses. To remove noise caused by sequencing errors or incorrect read alignments, Inspector filters out candidates with numbers of supporting reads below a threshold value (three by default).
To remove false-positive candidates caused by genetic variants, Inspector includes a filter based on the ratio of error-supporting read, local coverage, and read mapping quality. The ratio of error-supporting read is the fundamental criterion and computed with the number of error-supporting reads divided by the local coverage. As shown in Additional file 1: Fig. S9, read alignments at homozygous variants do not show inconsistency with the contig, as both haplotypes are the same as the contig sequence. Heterozygous variant regions show an alternative allele in about 50% of reads (from one haplotype). However, at true assembly error regions, both haplotypes are different from the contig, including the haplotype switch, leading to a theoretical ratio of about 100% for error-supporting reads. The ratio of error-supporting read for assembly errors can be lower than 100% in practice due to sequencing errors or inaccurate read alignments but are still higher than heterozygous variants, as shown in Additional file 1: Fig. S10. The filter also discards candidates with extremely high coverage or poor average read mapping quality to ensure the reported assembly errors are confident. By default, Inspector reports coordinates on contigs for all assembly errors in BED format, which can be easily loaded to visualization tools such as IGV [47].
Small-scale assembly errors
Inspector identifies small-scale assembly errors (< 50 bp) to estimate the base accuracy of an assembly. Samtools [48] is used to generate pileup information for each contig based on read-to-contig alignments. Inspector then scans pileup results for candidate small-scale errors in regions that are enriched with mismatches or indels. All bases with less than 20% of reads supporting a small-scale error were excluded to remove most noise caused by sequencing errors. Similar to structural errors, a true small-scale error is expected to be supported by reads from both haplotypes (100% of reads), while mismatches or indels caused by heterozygous variants are supported by only one haplotype (50% of reads). For a given position on the assembly, each aligned read is treated as an independent experiment, containing either the same or a different base (or indel) with the base in the contig. All bases in the reads at this position follow a binomial distribution, with n being the number of reads and p being the probability that the base is a different base from the contig. Inspector performs a one-tailed binomial test for each candidate position to distinguish small-scale errors from genetic variants. The null hypothesis of the binomial test is that the probability of a read that contains a different base against the contig is 0.5 (genetic variant at this location), and the alternative hypothesis is that the probability is higher than 0.5 (small-scale error at this location). A significant p value from the binomial test would reject the null hypothesis and support that there is a small-scale error at the tested position. The p value of binomial test is computed as:
$$ p\_\mathrm{value}=\sum \limits_{i={n}_{\mathrm{supp}}}^{n_{\mathrm{reads}}}\mathrm{Binomial}\left(i|p=0.5,n={n}_{\mathrm{reads}}\right) $$
where nreads is the total number of reads aligned to this position and nsupp is the number of reads supporting the mismatch/indel. The probability of a read to support an error used in binomial test is set to 0.5 for high-accuracy HiFi data, and set to 0.4 for low-accuracy data (CLR and Nanopore), considering the sequencing error rate of 15–20%. Candidates with significant p values (< 0.01 for HiFi and < 0.05 for CLR and Nanopore data) are reported as small-scale errors. Similar to structural errors, small-scale errors are also reported in BED format.
Assembly quality estimation
Structural and small-scale assembly errors are used to estimate the overall accuracy of an assembly result. Given a list of structural errors and small-scale errors of the assembly, the total bases of assembly error, NErr, can be calculated as:
$$ {N}_{Err}={N}_{Exp}+{N}_{Col}+{N}_{Her}+{N}_{Small}+{n}_{Inv} $$
where NExp , NCol, NHer, and NSmall are the total bases affected by expansions, collapses, haplotype switches, and small-scale errors, while nInv is the total number of inversion errors. Since the number of total bases in an assembly, Nasm, is usually very large, NErr can be considered as the expectation of incorrect bases. Thus, the estimated error rate, E, can be defined as:
$$ E=\frac{N_{\mathrm{Err}}}{N_{\mathrm{asm}}}=\frac{N_{\mathrm{Exp}}+{N}_{\mathrm{Col}}+{N}_{\mathrm{Her}}+{N}_{\mathrm{Small}}+{n}_{\mathrm{Inv}}}{N_{\mathrm{asm}}} $$
The Phred quality score is computed as QV = − 10log10E.
Assembly error correction
Inspector includes an error correction module to address identified structural and small-scale assembly errors. For small-scale errors, Inspector substitutes problematic bases with bases supported by the majority of reads. For structural assembly errors, Inspector collects the error-supporting reads and performs a local de novo assembly with Flye (v2.8.3) [15] for each assembly error. In particular, for haplotype switches, Inspector only collects reads from one haplotype to perform the local assembly. For each structural error, the local assembly uses the reads from the region around the error and from the same haplotype, which simplifies the assembly process and can therefore generate a more accurate contig than whole-genome de novo assembly. For structural errors located within repetitive regions, Inspector collects reads only from the current repeat unit without interference from other repeat units, increasing the accuracy of local assembly at repetitive regions. Inspector aligns the new contigs from local assemblies to the original contigs and substitutes the sequences flanking each error with new sequences from the local assembly results.
Reference-based mode
To assess the synteny of an assembly with a known reference genome, Inspector includes a reference-based module to evaluate assembly quality. The module aligns contigs to the reference genome with minimap2 [40] preset parameter “-x asm5.” Statistics for contig-to-reference alignment are calculated, including contig alignment NA50, contig mapping rate, and reference genome coverage. A Dotplot is generated based on contigs and reference alignment results. In addition, structural errors and small-scale errors are detected. Inspector reports coordinates on the reference genome and on the contig for all assembly errors. Note that assembly errors detected from contig-to-reference alignment also include genetic variants of the sequenced genome (including SVs, SNPs, and indels) and substitutions.
Simulation benchmark
To benchmark the evaluation accuracy of Inspector, testing used a simulated human whole-genome assembly containing both structural and small-scale assembly errors. A total of 1,000,000 SNPs and 20,000 SVs (deletions and insertions) were introduced into autosomes and X chromosome of human reference genome hg38. In total, 67% of all variants were randomly assigned as heterozygotes and 33% as homozygotes. PBSIM [41] was used to simulate 50X PacBio CLR-like and HiFi-like reads with options “--data-type CLR --model_qc model_qc_clr --length-mean 15000 --length-sd 3000 --accuracy-mean 0.85” and “--data-type CCS --model_qc model_qc_ccs --length-mean 15000 --length-sd 3000 --accuracy-mean 1.00,” respectively. The mean base accuracy was 0.85 for CLR-like reads and 0.98 for HiFi-like reads according to the log file from PBSIM. Assembled contigs were simulated by splitting the simulated human genome at “N” bases. Small fragments shorter than 10,000 bp were excluded. A total of 2000 structural errors (900 expansions, 900 collapses, 190 haplotype switches, and 10 inversions) and about 580,000 small-scale errors (50% base substitution, 25% 1-bp expansion, and 25% 1-bp collapse) were spiked in as the ground truth. A haploid human genome was also simulated by selecting only haplotype 1 from the diploid simulation.
Inspector was applied with default settings. The reported structural and small-scale errors were compared to the ground truth to calculate recall, precision, and F1 score (\( \frac{2\ast \mathrm{recall}\ast \mathrm{precision}}{\mathrm{recall}+\mathrm{precision}} \)). Human reference genome hg38 was provided to QUAST-LG as the reference. Although the minimum length for structural errors was 50 bp in simulated assemblies, QUAST-LG can only report the coordinates of extensive misassemblies longer than 85 bp. These extensive misassemblies were compared with a subset of ground-truth structural errors that were longer than 85 bp to assess the accuracy of QUAST-LG. Since Merqury requires high-accuracy reads as input data, the simulated HiFi dataset (with sequencing error rate < 2%) was provided to Merqury to identify erroneous k-mers that were only present in the assembly but not in the input reads. A series of overlapping k-mers were merged into one single event for the benchmark.
Whole-genome de novo assembly of HG002
Whole-genome de novo assembly was performed for HG002 with PacBio CLR, HiFi (15-20 kb), and Nanopore datasets. The expected genome size was set to 3.1G for all assemblers. Canu (v2.0) was run with options “-pacbio” for the PacBio CLR and “-pacbio-hifi” for the PacBio HiFi dataset. The Canu assembly of the Nanopore dataset was obtained from a previous publication [17]. Contigs marked with “suggestBubble = yes” were removed from evaluation. Flye (v2.8.2) was run with options “--pacbio-raw” for the CLR, “--pacbio-hifi” for the HiFi, and “--nano-raw” for the Nanopore dataset, respectively. Wtdbg2 (v2.5) was run with options “-p 17” for the CLR and Nanopore datasets, and preset “-x ccs” for the HiFi dataset. Hifiasm (v0.13) was only applied to PacBio HiFi datasets with the default settings. The Shasta assembly of Nanopore dataset was also obtained from a previous publication [17]. All assemblies were evaluated by Inspector with default settings. CLR assemblies were evaluated with the raw CLR dataset, HiFi assemblies were evaluated with the HiFi dataset (15-20 kb), and Nanopore assemblies were evaluated with the raw Nanopore dataset.
Other assembly evaluation tools
QUAST-LG (v5.0.2), a reference-based approach, and Merqury (v1.1), a k-mer based approach, were also used to evaluate assemblies. For QUAST-LG, GRCh38 was provided as the reference genome. QUAST-LG was run with command:
“quast-lg.py contig.fa -o output/ -r hg38.fa -m 10000 - × 86”
The number of misassemblies included both extensive and local misassemblies, and number of mismatches included both mismatches and indels.
For Merqury, a meryl database was first generated with approximately 50× Illumina paired-end reads with k-mer size of 21 bp. Merqury was then run based on the Illumina meryl database to evaluate HG002 assemblies with default settings:
“meryl k = 21 count output read-db.meryl allread.fa”
“merqury.sh read-db.meryl contig.fa output”
The assembly-only k-mers were collected from Merqury’s output and the overlapping k-mers were merged into a single event.
Benchmark of assembly error in HG002
The false discovery rate of assembly errors was calculated by comparing reported assembly errors to the genetic variant callset of HG002. Coordinates of assembly errors were projected to the human reference genome based on contig-to-reference alignment. Matched base pairs between contigs and the reference genome were stored in a hash table. The corresponding reference coordinate of an assembly error can be inferred from the hash table according to its assembly coordinate. Small-scale errors were compared to the small variant callset (v4.2.1) from GIAB. Since the high-confidence SV callset is only available in “benchmark regions” of HG002 [43], structural assembly errors located only in benchmark regions were compared to the SV callset to calculate FDRs.
Coordinates of misassemblies reported by QUAST-LG were extracted from filtered contig alignment. Misassemblies located within benchmark regions were compared to the SV callset for FDR assessment. Assembly-only k-mers from Merqury’s output were merged and projected to the reference genome. FDR was computed by comparing the locations of k-mers to the merged variant callset (SVs and small variants).
Down-sampling of HG002
To evaluate the robustness of Inspector, three HiFi datasets (11 kb, 15 kb, and 15–20 kb) of HG002 were merged to generate a HiFi dataset with an ultra-high depth. It was then downsampled to a series of depths, ranging from 10× to 100×, by randomly selecting reads. Depth was determined as total number of base pairs in reads divided by the human genome size (3.1 Gbp). Inspector was applied to identify assembly errors using default settings to validate its robustness in addressing datasets of varying depth.
Repeat annotation of assembly errors
Coordinates of assembly errors were projected to the human reference genome. Those assembly errors located in unaligned parts of the assembly cannot be projected to the reference genome and therefore were excluded from analysis. Repeat annotation of all assembly errors was performed by a custom Python script, which compared reference coordinates of assembly errors to the genomic repeat annotation downloaded from UCSC Genome Browser [49].
Polishing of HG002 assemblies
Inspector correction and other polishing methods were tested on HG002 assemblies. The error correction module of Inspector was tested with PacBio CLR (70×), PacBio HiFi (15–20 kbp, 51×), and Nanopore (53×) datasets with default settings. The input datatype was specified for each dataset to enable accurate local assembly in the structural error correction process. Racon (v1.4.20) and Pilon (v1.24) were tested with PacBio CLR, PacBio HiFi, and Nanopore datasets with default settings. GCpp (v 2.0.2) was tested with downsampled raw subreads of PacBio HiFi dataset (70×). Medaka (v 1.4.3) polished HG002 assemblies with Nanopore datasets with the options “--model r941_min_high_g303 --batch 200 --bam_chunk 2000000.” Nanopolish (v0.13.3) was tested with Nanopore dataset using default settings. CONSENT (v2.2.2) polished HG002 assemblies with PacBio CLR datasets with options “--windowSize 50000.” Nanopolish and CONSENT were tested on only one contig (10Mbp in length) per assembly due to the excessive requirement of computational resources for whole-genome correction. The input read alignment files for Racon, Pilon, Medaka, and Nanopolish were aligned by minimap2 and sorted by Samtools sort. The read alignment files provided to GCpp were aligned by pbmm2 and sorted by Samtools. All polishing tools were tested with only one round of the polishing process. We also polished the HG002 assemblies with Illumina dataset (downsampled to 50×) to assess the improvement of assembly quality from short reads. The original and polished assemblies were evaluated using Inspector with a merged HiFi dataset (11 kbp and 15 kbp, total of 58×) and using Merqury with meryl database generated from Illumina dataset.
Whole-genome assembly of Anna’s hummingbird sample
The PacBio CLR (~70×) data of Anna’s hummingbird (Calypte anna) was downloaded from the Vertebrate Genomes Project and used to for whole-genome de novo assembly with Canu, Flye, and wtdbg2 with genome size of 1.1 Gbp. Inspector was run with default settings to evaluate and correct errors for the three assemblies. The curated assembly was obtained from GenomeArk as the ground truth. The uncorrected and corrected assemblies were compared to curated assembly with Mauve [50] to visualize structural errors before and after Inspector error correction.