PCIP-seq: simultaneous sequencing of integrated viral genomes and their insertion sites with long reads

The integration of a viral genome into the host genome has a major impact on the trajectory of the infected cell. Integration location and variation within the associated viral genome can influence both clonal expansion and persistence of infected cells. Methods based on short-read sequencing can identify viral insertion sites, but the sequence of the viral genomes within remains unobserved. We develop PCIP-seq, a method that leverages long reads to identify insertion sites and sequence their associated viral genome. We apply the technique to exogenous retroviruses HTLV-1, BLV, and HIV-1, endogenous retroviruses, and human papillomavirus. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-021-02307-0.

When we looked at this position in reads mapped to the provirus without first sorting based on insertion site (referred to as bulk) we saw a C called 36 and 38% of the time respectively in the Nanopore data. In the bulk Illumina data, generated from the same sample, we saw the C is called 0% of the time indicating a technical artifact. As a consequence, SNPs from this position were excluded. (b) In animal 233 we found 16 proviruses (provirus inclusion was based on the less stringent criteria of >10 reads covering the position, not filtered for PCR duplicates) carrying a T-to-C transition within the Tax ORF at position 8154, this variant does not change the amino acid. Shown are screen captures for 4 of the proviruses carrying the SNP, Illumina and Nanopore bulk sequencing from the same sample show C is called at a 2% frequency in Nanopore, while with Illumina C is called at a 1% frequency. This indicates that the SNPs observed in these proviruses are not a technical artifact.  In the chr2 provirus a T-to-C changes ATG to ACG and the first methionine to a threonine. In the chrX provirus an A-to-T changes CAT to CTT replacing a histidine at position 13 with a leucine. (b) Two proviruses, chr7:100.5 & chr19:34.9 identified as the products of recombination between major chrX and chr2 proviruses. IGV screen shot shows proviral reads from all four proviruses mapped to a full length proviral genome (the sequence of the chrX provirus was used as the reference). The colored vertical lines indicate SNPs and identify sequences derived from the chr2 provirus, highlighted in green. Sequences originating from the chrX provirus is highlighted in blue. Sequences of the provirus from chr19 and chr7 that match either chrX or chr2 provirus are highlighted in the appropriate color.

Fig. S7
Three PCIP-seq libraries were prepared in parallel using 5 μg of template DNA, all used the same guides and primers. Following sequencing and demultiplexing the Jurkat negative control produced 12,137 reads, Jurkat + U1 0.01% produced 234,421 reads and Jurkat + U1 0.1% 252,913 reads. (a) The resultant reads were mapped to the human genome, the major integration sites observed in U1 on chr2 and chrX are shown. (b) The reads were also mapped the HIV-1 genome. No reads of pure HIV-1 or chimeric HIV-1/host reads were observed in the Jurkat negative control. In Jurkat + U1 0.01% samples 12.6% of the reads were chimeric HIV-1/host, in Jurkat + U1 0.1% this rose to 43.2%. Red box at top indicates level of zoom on provirus.
This experiment gives us a rough estimation of the efficiency of PCIP-seq. In the dilution experiment, we started with 5ug of DNA. Assuming a diploid human genome size of 6.51 picograms this should equate to the DNA of approximately ~768,000 cells. At 0.01% the DNA from approximately ~77 U1 cells is present. This corresponds to ~154 proviruses as U1 contains two proviruses per cell (these numbers are probably inflated as cell lines often display extensive aneuploidy). Following circularization, the CRISPR cut and reaction clean-up, we are left with approximately ~12% of the DNA we started with, dropping the number of U1 genomes represented in the DNA to ~9, or ~18 proviruses. As can be seen from the resultant library we are able to identify both the proviruses in U1 and in the case of the provirus on chr2 we observe amplification from 4 molecules based on observing different shear sites. This means we captured 5 proviruses, this equates to an efficiency of 3.2%.

Fig. S8
Examples of HIV-1 proviruses from patient 02006 (a) Screen shot from IGV shows a proviral integration site in the host genome as well as the associated provirus sequence. Two reads have been highlighted (out of many) that are found both upstream and downstream of the provirus integration site. With a full-length provirus this would not be possible, however, with a provirus carrying a large deletion including the 5' or 3' regions targeted by the guides a single read can encompass the truncated provirus as well as host DNA from both upstream and downstream of the integration site. The provirus associated with this integration site has a 5' deletion that removes ~6.5kb. (b) Reads from the provirus inserted in chr16:29362672-29362679 mapped to the 02006 HIV-1 consensus sequence. Red box at top indicates level of zoom on provirus. This provirus has a ~115 bp deletion affecting the region containing the packaging signal (Ψ). This provirus was also amplified via clone specific PCR. In addition to the ~115 bp deletion the SNPs in the clone specific PCR mirror those observed from the PCIP-seq library. The elevated coverage towards the middle is where the two PCR products overlap. (c). Large deletions affecting both the 5' and 3' end of the provirus were frequently observed. Shown are reads from both PCIP-seq and clone specific nested PCR mapped to the 02006 HIV-1 consensus sequence. The pattern of SNPs in the clone specific PCR mirrors those observed from the PCIP-seq library (d) Partial conformation of two proviruses. In both cases the nested PCR for the 5' end did not work. The provirus at chr9:20608766-20608771 appears full length, the provirus at chr11:119279088-119279144 appears to have a 5' deletion. Pattern of SNPs in the clone specific PCR again mirrors those observed from the PCIP-seq library.      Numbers of SNPs identified in each sample.  Clinical information for the HIV-1 patients

HPV integration sites identified in patients HPV18_PX and HPV18_PY
Estimated read count refers to number of reads after PCR duplicates have been removed, see https://github.com/GIGA-AnimalGenomics-BLV/PCIP/blob/master/README.md

Supplementary Note 1 Rationale behind the use of CRISPR-cas9 to cleave circular DNA
It is established practice to linearize plasmids (generally via cutting with a restriction enzyme) prior to their use as template in PCR. It is believed that this avoids supercoiling and thereby increases PCR efficiency 1 . Following the same logic, we speculated that linearizing our circularized DNA could also increase PCR efficiency.
As we wanted to cut specific sequences we used the CRISPR-cas9 system. In many cases CRISPR-cas9 cleavage is not 100% effective. As can be seen in the gel below, for G2 a fraction of the target DNA remains uncut. In order to increase efficiency of cutting while also including redundancy against any SNPs in the targeted regions we pooled ~3 guides that targeted a relatively small region (generally a few hundred bases). We found that by pooling guides all the target DNA was successfully cut.
We next applied these pools to circular DNA. The gel below shows an experiment carried out using 8ug of DNA from a BLV infected sheep with a proviral load of 82.6%. The DNA was circularized and linear DNA was eliminated (to prevent PCR amplification/recombination involving the remaining linear fragments) using plasmid safe DNase (see methods for a complete description). One quarter of the resultant DNA was subject to CRISPR-cas9 cleavage using the Pool A guides, the second quarter was cleaved using the Pool B guides, the remaining half was kept aside. The linearized DNA was cleaned and used as template in 2x 50ul PCR reactions using the appropriate primer pairs for Pool A (PA) or Pool B (PB). For the uncut DNA half was used as template for 2x 50ul PCR reactions using the PA primers and the other half was used for 2x 50ul PCR reactions using the PB primers. Following 25 PCR cycles, 10ul of each reaction were loaded on a 1% agarose gel. As can be seen in the gel below, the band intensity for the CRISPR-cas9 cut samples is higher. It should be noted that in lane 3 the PCR smear is shifted down, we generally discard these types of products as the fraction of host-virus fragments is low. (A=unshared genomic DNA, B=genomic DNA sheared to 8kb) Following clean up and elution in ~40 ul of H2O we took an equal volume (3ul) of each library and indexed them via PCR, in a 50 ul reaction volume and using 8 cycles. Again, following clean up, an equal volume of library was pooled and a nanopore library (LSK-109) was prepared and sequenced on a r9.4 flow cell. Base calling and demultiplexing was carried out as described in the methods. The results are outlined in the table below. In addition the coverage of the resultant reads is shown. The table shows that libraries prepared with the CRISPR cut generally produced more raw reads and a much larger fraction of them is composed of the desired chimeric reads containing proviral and host DNA. The CRISPR cut libraries also identified a large number of integration sites. The comparison with an Illumina based library prepared from the same timepoint, using ~4ug of template, shows that PCIP can identify more integration sites. This experiment also shows that only libraries with a size distribution that mirrors that observed in the sheared DNA should be sequenced. Libraries with a preponderance of shorter fragments mainly represent nonspecific amplification.

Supplementary Note 2 Effect of coverage on SNP calling
One of the issues often raised regarding Nanopore sequencing is the error rate and the effect sequencing depth has on SNP calls. In this manuscript we have chosen to be quite conservative by only calling SNPs in proviruses with more than 10 reads after PCR duplicate removal. As base calling and variant calling algorithms continue to improve, such a cutoff is likely to be overly conservative. Without carrying out excessive and expensive validation via clone specific PCR it is difficult to decide on an optimal coverage cutoff for calling SNPs in a provirus due to the difficulty of distinguishing between false positives/negatives in the provirus as one adjust thresholds.
In order to get an estimation of the effect coverage has on variant calling we instead decided to look at the part of the host genome captured by PCIP-seq. We sequenced the two HIV-1 patient PCIP-seq libraries used to generate the Nanopore data with Illumina short reads on a MiSeq instrument. This produced ~2.3 million paired-end reads for each library. The shearing necessary to generate the Illumina library prevents us linking viral reads to a specific provirus/integration site, precluding a comparison of variants within the provirus using the two technologies and at different depths. Instead we compared variants found in the DNA flanking the integration sites sequenced by both technologies.
We concentrated on proviruses that were not clonally expanded. We did this for two reasons. Firstly, in the two HIV-1 samples used in this study the majority of proviruses fall into this category. Second, in such cases all the reads are PCR duplicates and cover the same part of the host genomes, at an even coverage, making the downsampling more consistent. We selected proviruses/integrations where the Illumina coverage was on average greater to or equal to ~50X to insure accurate variant calling. This left us with 118 sites in total, 77 sites for patient 02006 and 41 for patient 06042 (both represent 59% of the non-clonally expanded sites in the patients). The flanking host DNA associated with these proviruses covers ~202kb, while the median coverage from the Illumina sequencing was 274X.
We first called SNPs in the Illumina data using lowfreq. The resultant VCF was filtered with SnpSift to retain SNPs with an allele frequency in the reads of >0.6, leaving 157 SNPs. We compared these SNPs to a set of common human variants (ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/) and found that 141 (90%) represent common SNPs segregating in the human population. Of the remaining 10% (16 SNPs), 13 had a high allele frequency in the reads (average AF= 0.98), suggesting that these are true SNPs at a low frequency in the population. (This assumes that the original Watson and Crick strands both generally contribute to the amplified pool of products and having the same base substitution on both strands in the very early rounds of amplification is rare). For the remaining 3 SNPs, the allele frequency average was ~0.65, suggesting PCR errors that had occurred in the first round of amplification on either the Watson and Crick strand, that had come to dominate the pool of products, lifting it over our 0.6 cutoff.
We next called SNPs in the Nanopore data from the same 118 regions (median coverage 547X). After filtering for an allele frequency in the reads of >0.6 we called 142 SNPs. All overlapped with the SNPs called in the Illumina data. This left 15 SNPs called in the Illumina data but absent from the Nanopore calls (9.6% false negatives). Of these, 10 had been called as SNPs by lowfreq in the Nanopore data but the allele frequency (average AF= 0.49) did not exceed the 0.6 cutoff. (Two of the potential false positives SNPs called in the Illumina data, with AF=~0.65, were not called as SNPs in the Nanopore data, the third false positive was also called in the Nanopore data.) For the remaining 5 SNPs, where lowfreq did not produce calls, examination of the reads in IGV showed them to be imbedded within a homopolymer (a known weakness of the R9.4 Nanopore data), although their presence could be deduced by examining the raw reads in IGV. Rebase calling the data with the latest high accuracy Nanopore base callers or using the R10.3 flow cell would help with these homopolymer regions. Alternative technologies like Pacbio Hi-Fi reads could also be employed if high single molecule accuracy is desired. Calling SNPs in the host DNA flanking HIV-1 proviruses with both Nanopore and Illumina technologies and examining the effect of coverage on number of SNPs called.
As a consequence, we can conclude that higher coverage helps reduce false negatives, but even with relatively modest coverage false positives are not a major issue, instead we are more likely to get false negatives as the coverage goes down.

PCR validation and Illumina sequencing
Clone specific PCR products as well as the PCIP-seq libraries generated from the HIV-1 patients were sheared to ~400bp using the Bioruptor Pico (Diagenode) and Nextera XT indexes added as previously described 2 . Illumina PCIPseq libraries were generated in the same manner. Sequencing was carried out on either an Illumina MiSeq or NextSeq 500. Clone specific PCR products sequenced on Nanopore were indexed by PCR, multiplexed and libraries prepared using the Ligation Sequencing Kit 1D (SQK-LSK108) and sequenced on a MinION R9.4 flow cell. Oligos used can be found in Supplementary Dataset 3.

Measure of HIV-1 DNA content in CD4 T-cell DNA isolates by digital droplet PCR (ddPCR)
The DNA was subjected to a restriction digest with EcoRI (Promega, Leiden, The Netherlands) for one hour, and diluted 1:2 in nuclease free water. HIV-1 DNA was measured in triplicate using 4 µL of the diluted DNA as input into a 20µL reaction, while the RPP30 reference gene was measured in duplicate using 1 µL as input. Primers and probes are summarized in Supplementary Dataset 3. Thermocycling conditions were as follows: 95˚C for 10 min, followed by 40 cycles of 95˚C for 30 s and 56°C for 60 s, followed by 98°C for 10 min. In the case of HIV-1 due to the large amount of variation generally seen within proviruses we found it necessary to generate guides and primers tailored to the patient. We first carried out nested PCR using 250 ng of template DNA to amplify the regions upstream and downstream of the 5' and 3' LTR. The resultant PCR products were then sequenced via Nanopore (Illumina could equally be used). A consensus of the resultant reads was then used to select amongst the existing HIV-1 guides and primers we had already generated or when necessary to design new ones using CHOP CHOP and Primer 3.  Shown below is the part of the gel from the Supplementary Note showing 10 ul of PCR1 loaded on a 1% agarose gel. Lanes 1,2 and 4 show PCRs that produced a high molecular weight band at about ~8 kb, when sequenced this will produce a good quality library. Lane 3 shows a PCR that has a preponderance of shorter fragments, sequencing of such PCR product is not advised as these molecules are mainly derived from the host and yield very few integration sites (see the Supplementary Note table). It should be noted that lanes 3 and 4 used identical input DNA and PCR mix, about ~10% of PCR reactions (in all viruses targeted) generate these patterns of short fragment stochastically, probably due to nonspecific amplification. PCRs that produce such a pattern should be repeated. The Supplementary Note details the DNA concentrations obtained for these libraries, the number of virus/host chimeric reads and number of integration sites captured. (The products of PCR2 when run on a 1% agarose gel should produce the same pattern.)
Quantify via nanodrop spectrophotometer and calculate the volume required for 50 ng in 23 μL H20 Save PCR1 as a backup.

3) Long range indexing by PCR, LongAmp Taq DNA Polymerase (New England Biolabs) # m0323
Options: Indexing of the samples can be done via a second PCR using the PCR Barcoding Expansion KIT 1-96 (EXP-PBC096) from Oxford Nanopore or via ligation of barcodes using the Native Barcoding Expansion 1-12 (EXP-NBD104) Oxford Nanopore.