Measuring Illumina Size Bias Using REcount: A Novel Method for Highly Accurate Quantification of Engineered Genetic Constructs

Quantification of DNA sequence tags associated with engineered genetic constructs underlies many genomics measurements. Typically, such measurements are done using PCR to enrich sequence tags and add adapters, followed by next-generation sequencing (NGS). However, PCR amplification can introduce significant quantitative error into these measurements. Here we describe REcount, a novel PCR-free direct counting method for NGS-based quantification of engineered genetic constructs. By comparing measurements of defined plasmid pools to droplet digital PCR data, we demonstrate that this method is highly accurate and reproducible. We further demonstrate that the REcount approach is amenable to multiplexing through the use of orthogonal restriction enzymes. Finally, we use REcount to provide new insights into clustering biases due to molecule length across different Illumina sequencing platforms.


13
Engineered genetic constructs underlie many experimental techniques in genetics and 14 genomics. For example, targeted perturbation of gene function using RNA interference or next-generation sequencing (NGS) of the small hairpin RNA (shRNA) (Sims et al. 2011;

Results
In order to characterize the REcount method, we constructed a pool of 20 synthetic plasmids containing REcount barcodes, mixed at an equimolar abundance (5% per plasmid) 49 based on fluorometric DNA concentration measurements. This pool was digested with MlyI and 50 sequenced on an Illumina MiSeq. All 20 barcodes were detected at relative abundances ranging A barcode-containing, Illumina adapter-flanked construct is liberated with a restriction enzyme (MlyI) digest and directly sequenced. B) Accuracy and reproducibility of REcount. C) Analogous measurements of the same plasmid pool shown in panel B using varying PCR cycle numbers. D) Root mean squared deviation from expected values (5% per construct) when the plasmid pool is measured using REcount, and varying cycles of PCR amplification of either the barcode construct (BC) or another variable sequence in these plasmids (V4). E) Pearson correlation heatmap comparing REcount measurements with droplet digital PCR data and with conventional PCR amplification of either the BC or V4 amplicons.    Figure S1). To generate a more accurately pooled reference standard 53 for subsequent experiments, we used this sequencing data as the basis for re-pooling the 20 54 plasmids and digested the new pool with MlyI and sequenced. The range of relative 55 abundances of the re-pooled plasmids was narrower, ranging from 4.52% to 5.58% (CV = 0.06),

56
indicating that the initial sequencing data was predictive in improving the accuracy of pooling as 57 assessed by REcount (Supplemental Figure S1). To assess the reproducibility of these 58 measurements, we digested and sequenced two additional replicates of the even plasmid pool.

59
The replicate REcount measurements were highly reproducible with an average CV of 0.02 60 ( Figure 1B).

61
Next, we compared REcount measurements of the even plasmid pool to PCR-based 62 measurements, either of the barcode construct (BC) or another construct-specific sequence 63 (V4). PCR-based measurements exhibited substantial construct-specific deviations from the 64 expected 5% values, the extent of which increased with greater numbers of PCR cycles ( Figure   65 1C-D). Furthermore, the construct-specific deviations from expected values were uncorrelated 66 for the BC and V4 amplicon measurements, suggesting that the PCR biases were a function of 67 template sequence (Supplemental Figure S2).

68
In order to independently measure the relative template concentrations in the even    Illumina DNA sequencers with no intervening clean-up step, to ensure that no material was lost.

122
Representative data from a single MiSeq run is shown in Figure 3B. Since each normalization 123 barcode is present at an equimolar ratio to the corresponding size standard (as they are on the 124 same plasmid), this allows any inaccuracies in plasmid pooling to be accounted for. Within a previous anecdotal observations ( Figure 3D). However, the magnitude of this effect and the shapes of the size bias curves differ substantially between the MiSeq, HiSeq 2500, HiSeq 4000, addition, we observed an effect of molecule length on sequencing quality score, with a general 132 trend towards longer molecules having lower quality scores (Supplemental Figure S5). The

152
It is also likely that a portion of the variability between flow cells is due to differences in 153 the size distributions of the libraries being sequenced together with the synthetic size standards, 154 as competition for clustering will occur between all molecules in the sequencing lane. We 155 observed a shift in the curve corresponding to a decreased representation of the larger size 156 standards when they were sequenced together with a library containing a significant amount of 157 material that was smaller than 300 bp on the HiSeq 4000 (Supplemental Figure 6). Although the 158 size standards were sequenced together with different libraries across the different instruments, 159 this context-dependent clustering is not sufficient to explain the large differences we see 163 Surprisingly, we also detected an instance of construct-specific size bias, specifically on 164 the HiSeq 2500 platform in Rapid Run mode (Supplemental Figure S5). In contrast to the 165 MiSeq, HiSeq 2500 High Output, HiSeq4000, NextSeq, and NovaSeq where no systematic 166 construct-specific biases were observed, the size bias curves for the 16S, GAPDH, and alpha-Tubulin constructs separated as size increased, with 16S showing much less of a drop-off with 168 increased molecule size. One possible explanation for this difference is that the 16S rRNA gene has substantial secondary structure (Woese et al. 1980), which may serve to shorten the 170 effective length of the molecule during the clustering process. This phenomenon may be due to libraries-between-sequencing-platforms.html). The HiSeq and MiSeq also have different

269
Golden Gate reactions were cycled with the following conditions: 10 cycles of 37°C for 5 270 minutes, 21°C for 5 minutes, then 1 cycle 37°C for 10 minutes, then 1 cycle 80°C for 20   The re-pooled even plasmid mix was quantified using a Quant-iT PicoGreen dsDNA assay 320 (Thermo Fisher Scientific), diluted to 1 ng/µl, and further diluted 1:10,000 to bring the pool to the 321 correct concentration for digital quantification. The following ddPCR reactions were prepared: 5 322 µl template DNA, 0.44 µl primer 1 (10 µM), 0.44 µl primer 2 (10 µM), 5.12 µl water, and 11 µl  The following MlyI digests were set up for PCR-free quantification: 200-500 ng even or 343 staggered pool DNA, 2 µl Cutsmart buffer (NEB), 1 µl MlyI (NEB), and volume was adjusted to minutes at 65ºC. 30 µl of water was added to each digest (to bring the volume up to 50 µl). 30 µl beads were collected on a magnet and the supernatant was transferred to a new tube (discarded beads). 80 µl (1x) of AmpureXP beads was added, washed 2x for 30 seconds using and libraries were normalized to 2 nM for sequencing.

360
The following primers were used to amplify the BC constructs:

431
(https://github.com/darylgohl/REcount). Analysis of the V4 amplicon data was performed using 432 the reference-based mapping pipeline described here: https://bitbucket.org/jgarbe/gopher-433 pipelines/wiki/metagenomics-pipeline.rst, using the reference file in Supplemental File 9 to build 434 the bowtie2 index (Langmead and Salzberg 2012). For the analysis of quality scores 435 (Supplemental Figure S5), the data for all runs on a given platform was concatenated into a 436 single fastq file, the split into individual fastq files for each individual construct, based on the 20 437 bp sequence barcodes in each construct. Next, the reads were trimmed to 50 bp using cutadapt 438 (Martin 2011), so that all constructs and sequencing runs could be compared in a standardized 439 manner. Mean quality scores were calculated for each construct that was represented by at 440 least 100 reads in the data set.

I. Supplemental Figures
Supplemental Figure 1 Initial and re-pooled even plasmid pool data.
Supplemental Figure 2 Lack of correlation between BC and V4 PCR.
Supplemental Figure 3 Droplet digital PCR assay validation and data.
Supplemental Figure 4 Assessment of REcount measurements of a staggered plasmid pool.
Supplemental Figure 5 Illumina size standard pool composition and data.
Supplemental Figure 6 Context-specific effects on clustering of size standards.  Figure 1. Initial and re-pooled even plasmid pool data. REcount measurements of an initial attempt at even plasmid pooling based on PicoGreen data, and a subsequent re-pooling informed by the initial pool sequencing data.

II. Supplemental Files
Supplemental Figure 2. Lack of correlation between BC and V4 PCR. Scatterplots of BC and V4 abundance data for the even plasmid pool, when amplified for A) 10, B) 20, C) 30, or D) 40 PCR cycles.  Supplemental Figure 3. Droplet digital PCR assay validation and data. A) Schematic depicting the two ddPCR assays that were developed for each construct in the plasmid pool. B) qPCR data showing the specificity of each assay for the target construct as assessed by amplification of each individual plasmid, the even plasmid pool, or a negative control, with each primer pair. C) Correlation between ddPCR data and REcount quantification for the original even plasmid pool. D) ddPCR counts for the re-pooled even plasmid pool. Bars are the average of triplicates of the forward and reverse ddPCR assays where data could be generated for both assays, or just the forward or reverse assay in the case where one assay failed. *For plasmid 16, both the forward and reverse assays failed and thus no ddPCR information is available for this construct. E) Correlation between ddPCR data and REcount quantification for the re-pooled even plasmid pool. F-I) Correlation between ddPCR data and BC PCR-based quantification of the re-pooled even plasmid pool amplified for F) 10, G) 20, H) 30, I) 40 PCR cycles.    size standard constructs, which consist of three different backbone molecules (16S rRNA, GAPDH, and Tubulin), ranging from 150 bp to 1500 bp in length. B) Between lane and between flow cell differences in size bias profiles for HiSeq2500 Rapid Run (on-board clustering) and HiSeq2500 High Output (cBot clustering). C) Template-specific size biases observed on the HiSeq2500 in Rapid Run mode. D) Platform and construct-specific mean quality scores for the Illumina size standard constructs for the first 50 bp of read 1. Figure