The sensitive detection of rare genetic variants in a population of cells is critical for multiple applications in biology and medicine, including industrial microbial engineering [1], drug-resistance management in infectious disease [2], and in oncology for the early detection [3] or non-invasive diagnosis [4] (known as liquid biopsy) of cancers. In these scenarios, it is highly desirable to detect bona fide mutations present at minuscule frequencies. Deep DNA sequencing by next-generation sequencing (NGS) technology holds great promise, but the sequencing accuracy remains a bottleneck for these applications. For example, the overall DNA sequencing error rate was reported to be > 1000 per million (pm) from 2011 to 2018 [2, 5,6,7]. We recently discovered that the overall NGS error rate can be computationally suppressed nearly 100-fold to between 10 pm and 100 pm through modeling of alignment artifacts and quality variation [8]. This in turn enabled a series of applications requiring highly sensitive detection of low-frequency events [9, 10]. In these reports, overall error rate (oER) is measured by a “reference DNA” method [8] where the DNA library is assumed to be mutation-free. However, generating mutation-free DNA is itself a challenge. For example, the highest-fidelity polymerase Q5 was reported to have an error rate (pER, for PCR error rate) of 0.53 pm [11], which can lead to genetically heterogeneous DNA molecules upon PCR amplification (Fig. 1a). In addition, human cells are estimated to have a mutation rate of ~ 10−8 (0.01 pm) per position per haploid genome [12, 13]. As a result, DNA extracted from cell populations can have bona fide low-frequency mutations (Fig. 1b). Therefore, in reference DNA-based methods, the sequencing readout is a product of mutations in cells or misincorporations during PCR amplification, as well as errors induced in sequencers (Fig. 1a, b). As a result, it remains unknown how to precisely measure sequencer error rates (sER), which are necessary to make informed decisions about platform (e.g., HiSeq vs NovaSeq) and sequencer (i.e., actual instrument) choices for deep sequencing applications, to diagnose sequencer problems, and to improve the accuracy of DNA sequencing.
In this work, we present a novel computational method, SequencErr, to precisely measure sER. The key idea is to utilize the paired-end sequencing methodology (Fig. 1c), which was designed to double the sequencing yield by sequencing the input DNA molecule from both ends. When the input DNA molecule is short, forward and reverse reads overlap and the overlapping base pairs are sequenced twice. Identical readouts are expected if there are no sequencer errors, and discordance between forward and reverse reads must be a result of an error in the sequencer (Fig. 1c). We note that overlapping reads have been extensively utilized to reduce errors in the literature [14,15,16,17,18], and the novelty here is to use overlapping reads to investigate the accuracy of the sequencer, flow cells, etc. We investigated error patterns associated with platforms, sequencers, flow cells, and tiles in flow cells (see Additional file 1: Supplementary Note 1 for cartoon illustration) by using 3777 datasets from 75 research institutions in 18 countries (Additional file 2: Table S1; see the “Methods” section). Our results provide critical insights into sequencer accuracy and suggest future directions to enhance instrument accuracy.
Calculating sequencer error rate with SequencErr
For a given read pair r, we denote the number of overlapping base pairs between forward and reverse reads as nr. The number of sequenced bases in this region is 2nr, where r = 1, …, K, and K is the number of read pairs in each evaluation unit (e.g., one tile). We denote the number of base pairs with a mismatch between forward and reverse reads as mr. Considering the nested structure of reaction units in a sequencing run, where a read belongs to a tile, a tile belongs to a swath, a swath belongs to a lane, a lane belongs to a surface, and a surface belongs to a flow cell (Additional file 1: Supplementary Note 1), we can define the sequencer error rate at different granularity scales. For example, the tile-level sER can be calculated as:
$$ {e}_t=\frac{\sum_r{m}_r}{\sum_r2{n}_r}\ \mathrm{where}\ \mathrm{read}\ \mathrm{pair}\ r\in \mathrm{tile}\ t $$
(1)
Similarly, the flow cell-level sER is defined as:
$$ {e}_f=\frac{\sum_r{m}_r}{\sum_r2{n}_r}\ \mathrm{where}\ \mathrm{read}\ \mathrm{pair}\ r\in \mathrm{flow}\ \mathrm{cell}\ f $$
(2)
and the surface-level sER is defined as:
$$ {e}_s=\frac{\sum_r{m}_r}{\sum_r2{n}_r}\ \mathrm{where}\ \mathrm{read}\ \mathrm{pair}\ r\in \mathrm{surface}\ s $$
(3)
Physical location information of reads, such as sequencer and flow cell identifiers and tile numbers, is stored in the read name, which is critical for our analysis (see the “Methods” section). We do not specifically analyze the lane effect because it is custom configurable (see the “Methods” section; Additional file 1: Fig. S1).
Measuring overall sequencing error rate at base pair level
When a reference DNA library has been deeply sequenced (e.g., with > 1,000,000× depth), the known wild-type bases can be used to calculate overall error rate (oER) in a site-specific fashion [8] as follows:
$$ {\mathrm{error}\ \mathrm{rate}}_i\left(g>m\right)=\frac{\#\mathrm{reads}\ \mathrm{with}\ \mathrm{nucleotide}\ m\ \mathrm{at}\ \mathrm{position}\ i}{\mathrm{Total}\#\mathrm{reads}\ \mathrm{at}\ \mathrm{position}\ i} $$
(4)
where g indicates the reference allele at genomic locus i, and m represents each of the three possible substitutions caused by sequencing error. For example, at a given site with reference allele A, we can calculate oER for the three possible mismatches, A>C, A>G, and A>T. Note that oER is a product of bona fide cellular mutations, PCR errors (pER), and sequencer error (sER). It is different from the sER measured in Eqs. 1, 2, and 3. The oER can be used to compare datasets generated by different sequencers, or to compare datasets with or without removing outlier tiles as discussed later.
Datasets for benchmarking SequencErr
Many datasets across a broad spectrum of platforms, sequencers, flow cells, and samples are needed to test the efficacy of our method. For this purpose, we analyzed datasets from the public repository NCBI Sequence Read Archive (SRA) (see the “Methods” section) for three major platforms—HiSeq, NextSeq, and NovaSeq (Fig. 1d, Additional files 3, 4, 5: Tables S2, S3, S4). HiSeq is the most common of these due to its earlier release (2010), followed by NextSeq (2014) and NovaSeq (2017). A significant challenge is that read names in many datasets have been reformatted in NCBI SRA (see the “Methods” section, Additional file 1: Supplementary Notes 2-3, Additional files 3, 4, 5: Tables S2, S3, S4), possibly to save storage space. This resulted in only 5.2% of the public datasets (n = 72,447) being suitable for our study (Fig. 1d). These were generated in 75 research institutes across 18 countries (Additional file 2: Table S1).
We analyzed the datasets according to flow cells considering that multiple samples (i.e., multiplexing with barcode) can be pooled in one flow cell, and a sample may be sequenced in multiple flow cells. In fact, different samples pooled in the same flow cell tend to have similar tile-level error rates, indicating a minimal sample effect (Additional file 1: Fig. S1; Additional file 6: Table S5). This resulted in 632, 54, and 24 flow cells from 108 HiSeq, 20 NextSeq, and 13 NovaSeq sequencers, respectively (Fig. 2a, Additional file 7: Table S6). Tile-level error rates (Eq. 1) of representative flow cells are illustrated in Fig. 1e–g, where the variability of error rates among sequencers, flow cells, surfaces, and tiles can be observed.
Comparison of overall error rate and sequencer error rate measures
As illustrated in Fig. 1a–c, the overall error rate (oER, Eq. 4) is a measure of sequencing errors with a mixture of error sources, including PCR artifacts and sequencer artifacts. On the other hand, the sequencer error rate (sER, Eqs. 1–3) is a direct measure of errors specific to a sequencer. We, therefore, compared these measures by using a previously published amplicon sequencing dataset (ENA project ID PRJEB35986) [8]. In this dataset, genomic loci flanking the spike-in somatic mutations are known to be wild-type and are used to measure the oER (Eq. 4). On the other hand, the sER was calculated with Eq. 2. Because exactly the same DNA library (generated by PCR enzyme Kapa and Q5, respectively) was sequenced by different NovaSeq sequencers from different sequencing providers (SJ: St. Jude Children’s Research Hospital; HAIB: HudsonAlpha Institute for Biotechnology) [8], it also provided a unique opportunity to benchmark the instruments. Consistent with the expectation that sER is a subset of the oER, the measured oER is consistently higher than sER (Fig. 1h). Strikingly, data generated by SJ demonstrated a significantly (two-sided Wilcoxon rank-sum test, P < 0.01) higher oER than that generated by HAIB, indicating a strong contribution of sequencer errors. Indeed, the sER of SJ is also significantly (two-sided Wilcoxon rank-sum test, P < 0.01) higher than that of HAIB. The consistent significantly (two-sided Wilcoxon rank-sum test, P < 0.01) lower overall error rate of Q5 than that of Kapa is consistent with our previous findings [8]. This data supports the value of measuring sER because lower sequencer error rate can result in lower overall error rate and measuring sER might help choosing the best sequencers for deep sequencing applications. For example, NovaSeq and NextSeq are on average preferred over HiSeq sequencers. Because different sequencers can have dramatically different error rates (such as the two NovaSeq sequencers studied here), specific sequencers with lower error rates are preferred.
To understand the effect of computational error suppression in sER, we performed a similar analysis as above except without computational error suppression (i.e., no quality filtering on mapping quality and Phred scores). As a result (Fig. 1h), the sER is now close to 1000 pm (10−3), which is consistent with previous reports [2, 6]. This result reinforced our previous observation that computational error suppression can lead to 10- to 100-fold error rate reduction [8]. With this observation, we will apply computational error suppression to calculate sER hereafter unless otherwise stated.
Comparison of sequencer error rates between platforms
We first studied the general sequencer error rate (sER) patterns associated with the HiSeq, NextSeq, and NovaSeq sequencing platforms. For this purpose, we summarized flow cell-level sER using Eq. 2. As can be seen from Fig. 2a and Additional file 7: Table S6, HiSeq, NovaSeq, and NextSeq have an average sER of 32.9, 11.1, and 1.5 pm, respectively. Because HiSeq and NovaSeq have the highest throughput, and possibly the most popular usage, we conclude that the current sER is ~ 10 pm.
We noticed that many flow cells in HiSeq and NextSeq platforms demonstrate elevated sER (red dots in Fig. 2a). To test if there are systematic error sources, we reorganized the data by focusing on sequencers that have data from multiple flow cells (Fig. 2b, Additional file 1: Fig. S2 and Additional file 7: Table S6). Only a few sequencers have sER greater than 100 pm where all flow cells appear to be affected (red dots in Fig. 2b and Additional file 1: Fig. S2). Therefore, we define outliers by using 100 pm as threshold hereafter. Interestingly, tiles in physical proximity in outlier sequencers tend to have concordant error rate patterns between flow cells (Additional file 1: Fig. S3, Additional file 8: Table S7), indicating a sequencer problem. On the other hand, flow cell-level sER appears to be highly stable across runs within the non-outlier sequencers (Fig. 2b, Additional file 1: Fig. S2), indicating that a successful initial sequencing experiment may ensure the generation of high-quality data across many flow cells. We identified two sequencers (E00332 and NS500183, 1.4% of 141 sequencers; Fig. 2b, Additional file 7: Table S6) as outlier sequencers and corresponding datasets were omitted from further analyses. In the non-outlier sequencers (Additional file 7: Table S6), 17 flow cells (2.5%, n = 661) have marginally high error rate (between 100 pm and 150 pm). One (0.15%, n = 661) flow cell (C5E39ANXX, Additional file 7: Table S6) has a very high error rate (15,225 pm) and was omitted from further analyses.
Flow cell surfaces
Because it appears that the top surface has a lower sER than the bottom surface in the representative flow cells (Fig. 1e–g), we next calculated sER in the top and bottom surfaces (Eq. 3) for each flow cell. As can be seen in Fig. 2c and Additional file 9: Table S8, the top surfaces have significantly (two-sided Wilcoxon rank-sum test, P < 0.01) lower median sER than bottom surfaces for HiSeq and NextSeq. For NovaSeq, the top surface tends to have lower median sER than the bottom surface, although statistical significance is not reached. This data indicates a systematic problem in the bottom surfaces of flow cells and an apparent quality improvement of bottom surfaces in the newer sequencers.
Flow cell tiles
Because there are outlier tiles with dramatically elevated sER at the flow cell level (Fig. 1e–g), we next studied the extent of outlier tiles. We defined a tile as an outlier if its sER (Eq. 1) is > 100 pm, with the observation that 96.3% flow cells have sER < 100 pm (Fig. 2b, Additional file 1: Fig. S2, Additional file 7: Table S6). As can be seen in Fig. 2d and Additional file 1: Fig. S4 and Additional file 10: Table S9, 6 out of 580 (1%) HiSeq flow cells have more than 10% outlier tiles in the top surface, while 397 out of 580 (68%) HiSeq flow cells have more than 10% outlier tiles in the bottom surface. For NextSeq, 1 out of 51 (2%) flow cells in the top surface and 5 out of 51 (10%) flow cells in the bottom surface have more than 10% outlier tiles. None of the 24 NovaSeq flow cells have more than 10% outlier tiles. This data indicates that a high number of HiSeq flow cells have quality problems originating from the bottom surface. An improvement of bottom surface quality from HiSeq (68% flow cells) to NextSeq (10% flow cells) and NovaSeq (0% flow cells) is observed (Fig. 2c). Notably, 44.4% and 88.9% of NovaSeq flow cells have at least one outlier tile in the top and bottom surface (n = 18; Additional file 1: Fig. S4, Additional file 11: Table S10). Overall, 94.2% (n = 580), 45% (n = 51), and 95.8% (n = 24) of HiSeq, NextSeq, and NovaSeq flow cells have at least one outlier tile, respectively (Additional file 9: Table S8).
We next asked if the outlier error-prone tiles in non-outlier sequencers demonstrate patterns in physical locations. As it turns out, the outlier tiles have a roughly uniform distribution of physical positions (Additional file 1: Fig. S5; Additional file 12: Table S11), although the enrichment of outlier tiles with higher position number is observed in HiSeq top surfaces while enrichment of outlier tiles with lower position number is observed in NovaSeq bottom surfaces. It should be noted that the current total number of publicly accessible datasets for NextSeq and NovaSeq is quite limited, and more robust estimates can be achieved when more datasets are evaluated.
Effect of sequencer on the overall sequencing error rate
We next studied the effect of different sequencers on oER (Eq. 4) in more detail. For this purpose, we utilized the COLO829 dilution datasets (see the “Methods” section) published previously [8], where a common reference DNA library was sequenced by two sequencing centers, HAIB (HudsonAlpha Institute of Biotechnology, Huntsville, AL) and SJ (St. Jude Children’s Research Hospital, Memphis, TN) using two NovaSeq sequencers, A00363 (HAIB) and A00214 (SJ). As can be seen in Fig. 3a and Additional file 13: Table S12, these sequencers have a 10-fold error rate difference, by which we expect the dataset generated by HAIB to have a lower oER. Notably, the HAIB dataset has a bi-modal distribution of tile-level error rates. In fact, the lower error rate tiles are located on the top surface and the higher error rate tiles are located on the bottom surface (Additional file 1: Fig. S6a, Additional file 13: Table S12), reinforcing the observation of elevated error rates on bottom surfaces (Fig. 2c).
We calculated the site-specific overall sequencing error rate (oER) as previously described (Eq. 4; Additional file 14: Table S13). As can be seen in Fig. 3b, the oER of HAIB is between 4- and 7-fold lower than that of SJ for A>C/T>G and C>G/G>C misincorporations, and the oER difference is less prominent for other error types. This result indicates two possibilities: (a) the sequencer might have elevated misincorporation types, such as A>C/T>G, or (b) the reduction of sER has a negligible effect because PCR error rate (pER) is an order of magnitude greater than sER. To test these hypotheses, we compared the sER (Eq. 2) against oER (Eq. 4). As it turned out, we found a statistically significant negative correlation between sER and oER in datasets from both A00214 and A00363 (Additional file 1: Fig. S6c-d), indicating dramatically different misincorporation types between NovaSeq sequencer and PCR enzymes.
Removing outlier poor-quality tiles improves overall sequencing error rate
Because there appear to be extreme outlier tiles in HAIB sequencer A00363 (tile #2159) and SJ sequencer A00214 (tile # 2201; Fig. 3a), we next studied the effect of removing outlier tiles on the overall sequencing error rate (oER). For this purpose, we compared site-specific oER (Eq. 4; Additional file 14: Table S13) by excluding outlier tiles. As can be seen in Fig. 3c and Additional file 1: Fig. S7a and S8a, 7.6% of C>G/G>C errors have > 2-fold error rate reduction, with a maximum reduction of 24-fold, followed by 6% of A>T/T>A errors with > 2-fold error rate reduction in the HAIB dataset. On the other hand, the allele fractions of spike-in true mutations (red dots, Fig. 3c, see the “Methods” section) are not affected by the removal of outlier tiles. Therefore, we conclude that the removal of outlier error-prone tiles can further reduce the overall error rate. Notably, the numeric error suppression in A00214 appears to have a much less dramatic effect than in A00363 (which has lower sER; Additional file 1: Fig. S7 and S8), indicating that removal of outlier tiles is most effective when the instrument/flow cell is of higher accuracy.
Comparison of SequencErr with FastQC
Because SequencErr is designed to understand sequencing quality, which is conceptually similar to FastQC [19]—a frequently used quality control tool for sequencing datasets—we also benchmarked our method against FastQC. However, these two methods are fundamentally different in their design principles. FastQC operates on Phred scores of sequenced bases (regardless of the actual sequence identity) and measures a sample-level quality metric as the percent of bases with the Phred score above a given threshold, e.g., Q30 percentage. In contrast to FastQC, SequencErr compares the actual sequence identity of aligned forward and reverse reads, although quality filters including Phred score distribution are applied to remove poor-quality data (see the “Methods” section). We used the previously published dataset [8] for this comparison. As seen in Additional file 1: Fig. S9a, the FastQC evaluation shows that the Q30 percentage is > 98% across all sequencing cycles, which would be considered good quality. However, by using our Eq. 1 at the sequencing cycle level, we observed a stable error rate of ~ 11 per million across the sequencing cycles (Additional file 1: Fig. S9b). Strikingly, we were able to observe the effect of problematic tiles on the sequencing cycles (Additional file 1: Fig. S9b,c,f), which cannot be observed by reviewing FastQC output file (Additional file 1: Fig. S9a,e).
Effect of DNA sequencing features on sequencer error rate
We next studied the effect of DNA sequence features on sequencer error rates, by focusing on (1) GC content, (2) read length, and (3) overall base quality, measured as a percentage of bases with Phred score greater than 30 (Q30 percentage). Because a large number of samples are needed to draw a robust conclusion, here we focused on 1167 publicly available HiSeq datasets (Additional file 3: Table S2). As seen from Additional file 1: Fig. S10, we did not detect a significant correlation between sER score and GC content and read length features, although a marginally significant negative correlation with Q30 percentage is observed, indicating that excessive percentage of low-quality bases can lead to unreliable sequencing output, even if after stringent quality filtering. This indicates that our SequencErr metric is being highly robust in measuring sequencer reliability in a wide range of parameter settings.
Comparison of SequencErr with error correction methods
In addition to our error suppression method, which operates by identifying and filtering (i.e., suppression) unreliable reads, considerable efforts have been devoted to error correction methods, which operates by identifying DNA contexts (i.e., k-mer) that are error-prone and followed by modifying (i.e., correction) corresponding readout. To benchmark with these methods, we focused on two error correction methods, Lighter (v1.1.2) [20] and Musket (v1.1) [21], which are considered to have the top performance in a recent study by Mitchell et al. [22]. We focused the analysis by using two representative samples (ERR3781298 and ERR3790800, Additional file 15: Table S14) sequenced using SJ and HAIB sequencers from previously published dataset [8] where the mutant and wild-type sites are well-defined (Additional file 14: Table S13). Here, the raw FastQ files were run through Lighter and Musket to correct errors, followed by standard pileup (see the “Methods” section). As can be seen in Additional file 1: Fig. S11a-b, the overall sequencing error rate (oER) obtained by SequencErr far outperforms that of Lighter and Musket for all 12 possible nucleotide changes, although for C>T/G>A changes the difference is least dramatic. Interestingly, the overall error rate of Lighter and Musket does not appear to show improvement compared to the direct standard pileup. We next hypothesized that DNA sequence context-based modifications may lead to overcorrection, which could be reflected in overlapping forward-reverse reads which are expected to have perfect matches. An increase in the mismatch rate would indicate overcorrection. Indeed, we observed ~ 10-fold increased forward-reverse mismatches in this dataset by Lighter or Musket than no correction (SequencErr measurement) (Additional file 1: Fig. S11c-d). Taken together, we have demonstrated that our error suppression method outperforms error correction methods.
Application of SequencErr on a non-human dataset
Although we have demonstrated SequencErr method can be used to measure sequencer fidelity in the above studies, it remains unknown whether it can be applied to non-human datasets. For this purpose, we applied SequencErr on a recently published Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) dataset (NCBI SRA BioProject PRJNA625551) to study the error rate of the corresponding sequencer. As can be seen in Additional file 1: Fig. S12, the involved flow cells/sequencer demonstrated a comparable sER with those observed in human datasets (red dashed line). We, therefore, conclude that our method can also be applied to non-human datasets, although we believe that a more extensive study on many different non-human datasets could significantly strengthen this conclusion, which is beyond the scope of this work.