Cell line and DNA extraction
Genomic DNA was prepared by ATCC using cell expansions from master banks of cells for the HCC1395 (ATCC, CRL-2324) and HCC1395BL (ATCC, CRL-2325) cell lines. Both cell lines were validated at ATCC using multiple cell-specific markers for each cell line. Karyotyping is described in the “Cell line karyotyping” section.
Homo sapiens breast carcinoma HCC1395 cells (expanded from ATCC CRL-2324) were cultured in ATCC-formulated RPMI-1640 Medium (ATCC 30-2001) supplemented with fetal bovine serum (ATCC 30-2020) to a final concentration of 10%. Cells were maintained at 37 °C with 5% carbon dioxide (CO2) and were subcultured every 2 to 3 days, per ATCC recommended procedures, using 0.25% (w/v) Trypsin-0.53 mM EDTA solution (ATCC 30-2101) until appropriate densities were reached. Additionally, an Epstein-Barr virus (EBV) transformed B-lymphoblast cell line (HCC1395BL) (expanded from ATCC CRL-2325) was cultured in ATCC-formulated Iscove’s modified Dulbecco’s medium (ATCC Catalog No. 30-2005) supplemented with fetal bovine serum (ATCC 30-2020) to a final concentration of 20%. Cells were maintained at 37 °C with 5% CO2 and were subcultured every 2 to 3 days, per ATCC recommended procedures, using centrifugation with subsequent resuspension in fresh medium until appropriate densities were reached. Final cell suspensions were spun down and resuspended in PBS for nucleic acid extraction.
All cellular genomic material was extracted using a modified phenol-chloroform-isoamyl alcohol extraction approach. Briefly, cell pellets were resuspended in TE, subjected to lysis in a 2% TritonX-100/0.1% SDS/0.1 M NaCl/10 mM Tris/1 mM EDTA solution, and extracted with a mixture of glass beads and phenol-chloroform-isoamyl alcohol. Following multiple rounds of extraction, the aqueous layer was further treated with chloroform-isoamyl alcohol and finally underwent RNase treatment and DNA precipitation using sodium acetate (3 M, pH 5.2) and ice-cold ethanol. The final DNA preparation was resuspended in TE and stored at −80 °C until use.
Cell line karyotyping
Karyotyping was performed by Cell Line Genetics (Madison, Wisconsin) essentially as described previously [38]. Cells were treated with Colcemid (Gibco) for 40 min, exposed to 0.075 M KCl for 23 min at 37 °C, and then fixed with 3:1 methanol:glacial acetic acid. Slides were stained with Leishman’s stain before observation. During observation, roughly 20 metaphase cells were counted under the microscope, and numerical and structural chromosome aberrations were recorded. An analysis of 5–10 cell bands was performed at the microscope with a ×100 objective, with an effort to karyotype at least two cells from each clone.
Illumina WGS library preparation
The TruSeq DNA PCR-Free LT Kit (Illumina, FC-121-3001) was used to prepare samples for whole-genome sequencing. WGS libraries were prepared at six sites with this kit according to the manufacturer’s protocol. Unless otherwise specified, 1 μg of fresh DNA was used as input for the TruSeq PCR-free libraries. All sites used identical fragmentation conditions for WGS, shearing DNA on a Covaris instrument to a 350-bp target size. Each WGS replicate was prepared on a different day.
The concentration of the TruSeq DNA PCR-Free libraries for WGS was measured by qPCR with the KAPA Library Quantification Complete Kit (Universal) (Roche, KK4824). The concentration of all the other libraries was measured by fluorometry either on the Qubit 1.0 fluorometer or on the GloMax Luminometer with the Quant-iT dsDNA HS Assay kit (Thermo Fisher Scientific, Q32854). The quality of all libraries was assessed by capillary electrophoresis either on the 2100 Bioanalyzer or TapeStation instrument (Agilent) in combination with the High Sensitivity DNA Kit (Agilent, 5067-4626) or the DNA 1000 Kit (Agilent, 5067-1504) or on the 4200 TapeStation instrument (Agilent) with the D1000 assay (Agilent, 5067-5582 and 5067-5583).
For the tumor purity study, 1 μg tumor:normal dilutions were made in the following ratios using Resuspension Buffer (Illumina): 1:0, 3:1, 1:1, 1:4, 1:9, 1:19, and 0:1. Each ratio was diluted in triplicate. DNA was sheared using the Covaris S220 to target a 350-bp fragment size (peak power 140 W, duty factor 10%, 200 cycles per burst, 55 s, temp 4 °C). NGS library preparation was performed using the TruSeq DNA PCR-free protocol (Illumina) following the manufacturer’s recommendations.
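The expected tumor DNA fraction for each dilution follows directly from the mixing arithmetic, assuming equal DNA concentrations in the tumor and normal stocks (a sketch; the function name is illustrative):

```python
# Expected tumor DNA fraction for each tumor:normal mixing ratio
# used in the purity study (1:0, 3:1, 1:1, 1:4, 1:9, 1:19, 0:1).
ratios = [(1, 0), (3, 1), (1, 1), (1, 4), (1, 9), (1, 19), (0, 1)]

def tumor_fraction(tumor_parts, normal_parts):
    """Fraction of tumor DNA in the mixture, assuming equal stock concentrations."""
    return tumor_parts / (tumor_parts + normal_parts)

fractions = [tumor_fraction(t, n) for t, n in ratios]
# e.g. the 3:1 dilution corresponds to 75% tumor, 1:19 to 5% tumor
```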
Whole-genome libraries were sequenced on a HiSeq 4000 instrument (Illumina) at 2 × 150bp read length with HiSeq 3000/4000 SBS chemistry (Illumina, FC-410-1003), and on a NovaSeq instrument (Illumina) at 2 × 150bp read length using the S2 configuration (Illumina, PN 20012860). Sequencing was performed following the manufacturer’s instructions.
10X Genomics library preparation
10X Genomics library preparation was performed following the 10X Genomics protocol, which generates long-range information across the length of individual DNA molecules. Approximately 1.25–1.5 ng (~375–450 haploid genome equivalents) of template gDNA is required as input for the GEM Generation & Barcoding step. It is critical to quantify template HMW gDNA accurately, using the Qubit® Fluorometer system, to load the correct amount into the Sample Master Mix; the diluted gDNA solution should be at 0.8–1.2 ng/μl before preparing the GEM Reagent Master Mix. Isothermal incubation of the GEMs produces barcoded fragments ranging from a few to several hundred base pairs. After incubation, the GEMs are broken and the pooled fractions recovered. The Read 1 sequence and the 10x™ Barcode are added to the molecules during GEM incubation; the P5 and P7 primers, Read 2, and Sample Index are added during library construction via end repair, A-tailing, adapter ligation, and amplification. The final libraries contain the P5 and P7 primers used in Illumina® bridge amplification.
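The stated ~375–450 haploid genome equivalents for 1.25–1.5 ng of input corresponds to the conventional ~3.3 pg mass of one human haploid genome; a sketch of that conversion (the 3.3 pg constant is our assumption, not stated in the protocol):

```python
HAPLOID_GENOME_PG = 3.3  # approx. mass of one human haploid genome, in picograms

def genome_equivalents(ng_input):
    """Approximate number of haploid genome copies in a gDNA input given in ng."""
    return ng_input * 1000.0 / HAPLOID_GENOME_PG

# 1.25-1.5 ng of input gDNA corresponds to roughly 380-455 haploid genomes
low, high = genome_equivalents(1.25), genome_equivalents(1.5)
```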
PacBio library preparation
Fifteen micrograms of material was sheared to 40kb with a Megaruptor (Diagenode). Per the Megaruptor protocol, the samples were diluted to below 50ng/μl. A 1× AMPure XP bead cleanup was performed. Samples were prepared as outlined in the PacBio protocol titled “Preparing >30kb SMRTbell Libraries Using Megaruptor Shearing and BluePippin Size-Selection for PacBio RS II and Sequel Systems”. After library preparation, the library was run overnight on the BluePippin (Sage Science) for size selection, with the instrument set to select the size range of 15–50kb. After collection, a 1× AMPure XP bead cleanup was performed.
Dovetail SELVA library preparation
Three Dovetail Hi-C libraries were prepared for each sample in a similar manner as described previously [39]. Briefly, for each library, chromatin was fixed in place with formaldehyde in the nucleus and then extracted. Fixed chromatin was digested with DpnII, the 5’ overhangs were filled in with biotinylated nucleotides, and then the free blunt ends were ligated. After ligation, crosslinks were reversed, and the DNA was purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. The DNA was then sheared to ~350 bp mean fragment size and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library.
NuGEN ovation universal RNA-seq preparation and sequencing
We isolated total RNA in bulk from HCC1395 and HCC1395BL cells using the miRNeasy Mini kit (QIAGEN, 217004). The NuGEN Ovation universal RNA-seq kit was used for library preparation. Briefly, 100 ng of total RNA was reverse transcribed and then made into double-stranded cDNA (ds-cDNA) by the addition of a DNA polymerase. The ds-cDNA was fragmented to ~200 bp using the Covaris S220 and then underwent end repair to blunt the ends, followed by barcoded adapter ligation. The remainder of the library preparation followed the manufacturer’s protocol. All the libraries were quantified with a TapeStation 2200 (Agilent Technologies) and a Qubit 3.0 (Life Technologies). We sequenced the libraries on a NextSeq 550 with 75 bp paired-end sequencing and on a HiSeq 4000 with 100 bp paired-end sequencing.
DNA extraction and MinION sequencing
For HCC1395BL cells, the Miltenyi dead cell removal kit (MACS Miltenyi Biotech, #130-090-101) was used to remove dead cells following the manufacturer’s protocol. Briefly, HCC1395BL cells were pelleted and the supernatant was aspirated completely. Dead cell removal beads were used to resuspend the cells (100 μl per 1 × 107 total cells). After incubation at room temperature for 15 min, the mixture was loaded onto a Miltenyi LS column (MACS Miltenyi Biotech, #130-042-401) in 3 ml 1X binding buffer. Ten milliliters of 1X binding buffer was used to elute/rinse the column following the passage of the cell/bead mixture. The flow-through containing live cells was collected and assessed using an automated cell counter (Countess II, Thermo Fisher). Cell viability measured above 95% after the dead cell removal procedure. HCC1395 cells were detached with Accutase (Innovative Cell Technologies, CA) followed by three washes with PBS. Their viability was greater than 90% without use of the Miltenyi dead cell removal kit.
Genomic DNA from HCC1395 and HCC1395BL cell lines was extracted using QIAGEN MagAttract HMW DNA Kit (QIAGEN, Hilden, Germany). One microgram of initial DNA without fragmentation was used for library construction using SQK-LSK109 ligation sequencing kit (Oxford Nanopore Technologies, Oxford, UK). All library preparations were conducted as per the protocols provided by ONT with the exception of the end-prep step where samples were incubated for 20 min at 20 °C and 5 min at 65 °C. Each library was sequenced separately on an individual MinION FLO-MIN106D R9.4 flowcell. Prior to sequencing, flowcell pore counts were measured using the MinKNOW Platform QC script (Oxford Nanopore Technologies, Oxford, UK). About 300 ng of completed libraries was loaded as per instructions from ONT. Raw sequence reads were basecalled in real time via MinKNOW. Basecalled data passing quality parameters (qmean > 7) were converted to fastq. Only reads designated as pass were included in further analyses.
Affymetrix CytoScan array validation
We obtained DNA from two reference cell lines from ATCC (HCC1395, SCCRL2324_D; HCC1395BL, ATCC® SCCRL2325_D). DNA concentration was measured with a NanoDrop (Life Technologies), and integrity was evaluated with a TapeStation 4200 (Agilent). Two hundred fifty nanograms of gDNA was used to proceed with the Affymetrix CytoScan Assay kit (Affymetrix). The workflow consisted of restriction enzyme digestion with Nsp I, ligation, PCR, purification, fragmentation, and end labeling. DNA was then hybridized for 16 h at 50 °C on a CytoScan array (Affymetrix), followed by washing and staining in the Affymetrix Fluidics Station 450 (Affymetrix). Scanning was performed with the Affymetrix GeneChip Scanner 3000 G7 (Affymetrix). CytoScan Array CEL files were processed and analyzed with the Affymetrix Chromosome Analysis Suite (ChAS, Affymetrix, Inc.). Array-specific annotations (NetAffx annotation release 36, built with human hg38 annotation, NetAffxGenomicAnnotations.Homo_sapiens.hg38.na20170803.db) were used in the analysis workflow module of ChAS. Karyoview plots and segment data were then exported in the ChAS browser with default parameters: losses ≥ 25 markers, gains ≥ 50 markers.
PCR-based SV validation
For further validation of CNVs and other types of structural variants (balanced translocations, tandem duplications, inversions), PCR was used to target the breakpoints associated with a subset of putative SVs. qPCR was also used as an additional means of quantifying duplication events that could not be definitively described as tandem. Target loci were chosen to broadly represent the size ranges observed for each of the SV types, as well as the consensus scores associated with them in the case of SVs detected in multiple datasets. PCR products from the cancer cell line were compared to the normal cell line to identify cancer-specific variants.
All primers were designed using Primer3 v4.1 (http://primer3.ut.ee/) to limit oligo lengths to 20–27bp, temperatures to 55–62°C, and product sizes to 250-450bp. All other settings were left as the defaults, and self and pair complementarity scores were minimized to zero where possible. Products were amplified on either a C1000 or S1000 thermal cycler (Bio-Rad) from 1 ng of sample DNA using Enhanced PCR Mix (EPM, Illumina) according to the manufacturer’s instructions. Amplification conditions were as follows: 95°C for 3min; 35 cycles of 30s @98°C, 30s @55–62°C, 1min @72°C; 5min @72°C; hold @8°C. Products were run on 2% E-gels (Invitrogen) for 17 min and visualized on an AlphaImager (ProteinSimple).
Wherever possible, primers were designed so that targeted regions were free of repetitive elements and segmental duplications. To capture deletions and translocations, primer pairs were selected in the region ±500bp of the breakpoint as identified in the cancer cell line. For validation of small (<3.5kb) deletions, products of different expected sizes were observed in the cancer vs normal cell line. Larger deletions and translocations showed the expected product only in the cancer cell line. To capture inversions, primer pairs were selected in the region ±500bp of either inversion breakpoint, with validated products being detected only in the cancer cell line. To capture duplications, multiple regions were selected within the duplicated locus and amplified by endpoint PCR as described above to ensure the expected product was observed in both the cancer and normal cell lines. The primer pairs that clearly generated the expected products were then used for qPCR amplification of the target to determine the difference in quantitation cycles (dCq) between the cancer cell line and the normal cell line. In some instances, copy number differences could also be visualized qualitatively by endpoint PCR.
When the approach described above failed to generate the expected PCR products, alternate primers were designed to target the unaltered locus in the normal cell line, rather than the site of the breakpoint in the cancer cell. In these instances, diminished product was expected in the cancer cell line compared to the normal cell line and was verified by qPCR when not readily observed by endpoint PCR.
Optical genome mapping for SV validation
Ultra-high molecular weight (UHMW) DNA was extracted from cryopreserved cells in frozen medium containing DMSO following the manufacturer’s protocols (Bionano Genomics, USA). Cells were digested with Proteinase K and RNAse A in a lysis binding buffer containing detergents. DNA was precipitated with isopropanol and bound with nanobind magnetic disk. Bound UHMW DNA was resuspended in the elution buffer and quantified with Qubit dsDNA assay kits (Thermo Fisher Scientific).
DNA labeling was performed following manufacturer’s protocols (Bionano Genomics, USA). Standard Direct Labeling Enzyme 1 (DLE-1) reactions were carried out using 750 ng of purified ultra-high molecular weight DNA. The fluorescently labeled DNA molecules were counter-stained and imaged across nanochannels on a 2nd-generation Saphyr instrument. Genomic coverage of approximately 400X was achieved for all tested samples.
Genome analysis was performed using software provided by Bionano Genomics (Bionano Solve [40] 3.5, Access 1.5). Rare Variant Analysis was performed to sensitively capture somatic SVs occurring at low allelic fractions. Briefly, molecules of a given sample dataset were first aligned against the public Genome Reference Consortium GRCh38 human assembly. SVs were identified based on discrepant alignment between sample molecules and GRCh38, with no assumption about ploidy. Consensus genome maps (*.cmaps) were then assembled from clustered sets of molecules that identified the same variant. Finally, the CMAPs were realigned to GRCh38, with SV data confirmed by consensus forming final SV calls.
Finally, fractional copy number analysis was performed from alignment of molecules and labels against GRCh38 (alignmolvref). A sample’s raw label coverage was normalized against relative coverage from normal human controls and segmented, and the baseline CN state was estimated from the mode of coverage of all labels. If Y chromosome molecules were present, baseline coverage in the sex chromosomes was halved. With a baseline estimated, CN states of segmented genomic intervals were assessed for significant increase/decrease from the baseline, and corresponding duplication and deletion copy number variant calls were output. Certain SV and CN calls were masked if they occurred in GRCh38 regions found to have high variance (gaps, segmental duplications, etc.).
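The baseline estimation step can be sketched as follows (a simplified illustration; the function and argument names are ours, and real label coverage is normalized and segmented before this step):

```python
from collections import Counter

def baseline_cn(label_coverages, has_chrY=False, sex_chrom=False):
    """Estimate the baseline copy-number state as the mode of (normalized)
    label coverage; halve it on sex chromosomes when Y molecules are present."""
    mode_cov = Counter(label_coverages).most_common(1)[0][0]
    if has_chrY and sex_chrom:
        return mode_cov / 2  # male sample: one copy each of chrX/chrY
    return mode_cov
```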
DNA sequencing
Illumina sequencing
Illumina whole-genome libraries were sequenced on HiSeq 4000 and NovaSeq instruments at 2 × 150 bp read length, as described in the “Illumina WGS library preparation” section.
The sequencing depth was ~800M read pairs per library, for a total of 21 libraries.
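As a rough consistency check, ~800M 2 × 150 bp read pairs correspond to roughly 77× raw coverage per library, assuming a ~3.1 Gb genome (the genome size is our assumption):

```python
def raw_coverage(read_pairs, read_len=150, genome_size=3.1e9):
    """Approximate raw sequencing depth from a paired-end read-pair count."""
    return read_pairs * 2 * read_len / genome_size

wgs_depth = raw_coverage(800e6)  # roughly 77x per WGS library
```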
FASTQ sequence files for whole-genome and whole-exome sequencing were generated from the Illumina sequencer images using the Illumina RTA 1.18.66.3 (HiSeq 2500) or 2.7.7 (HiSeq 4000) and bcl2fastq 2.17.1.14 software.
PacBio sequencing
The samples were loaded on the PacBio Sequel (Pacific Biosciences) following the protocol titled “Protocol for loading the Sequel.” The recipe for loading the instrument was generated by the PacBio SMRT Link software v5.0.0. Libraries were prepared using Sequel chemistry kits v2.1, SMRTbell template kit 1.0 SPv3, the magbead v2 kit for magbead loading, sequencing primer v3, and SMRTbell cleanup columns v2. Libraries were loaded at between 4 and 8 pM.
10X Genomics sequencing
The 10X Genomics libraries contained the P5 and P7 primers, which are compatible with Illumina sequencing platforms and are used in Illumina bridge amplification. The final purified product was quantitated by qPCR before cluster generation and sequencing. Sequencing was performed on the HiSeq 4000 instrument using a 2 × 150 bp paired-end run. The sequencing depth was ~940M read pairs per library, for a total of 22 libraries.
Hi-C sequencing
The Dovetail Hi-C libraries were sequenced on an Illumina HiSeq X with a 2 × 75 bp paired-end run. The sequencing depth was ~220M read pairs per library, for a total of six libraries.
Bioinformatics pipelines
In this section, we describe the step-by-step bioinformatics analysis pipelines that were used to call SVs from each method and to integrate the multiple call sets into the high-confidence somatic SV call set. The bioinformatics workflow is shown in Fig. 1b. The exact commands are documented in the specific section for each task.
Reference genome
The reference genome we used was the decoy version of the GRCh38/hg38 human reference genome (https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files; GRCh38.d1.dv1.fa), which was utilized by the Genomic Data Commons (GDC). The gene annotation (GTF) file was downloaded from the 10X Genomics website as refdata-cellranger-GRCh38-1.2.0.tar.gz, which corresponds to the GRCh38 genome and Ensembl v84 transcriptome. All of the following bioinformatics data analyses are based on the above reference genome and gene annotation.
Illumina WGS read preprocessing and alignment
Illumina bcl2fastq2 (v2.17) [41] was used to demultiplex and convert binary base calls and quality scores to FASTQ format. FastQC (v0.11.2) [42] was run on the raw reads to assess basecall quality, adapter content, G/C content, sequence length, and duplication level. In addition, FASTQ_screen (v0.5.1) and miniKraken (v0.10.0) [43] were run to detect possible cross-contamination with other species. A MultiQC (v1.3) [44] run report was generated for each sample set. The sequencing reads were trimmed of adapters and low-quality bases using Trimmomatic (v0.30) [45]. The trimmed reads were mapped to the human reference genome GRCh38 (see the reference genome section) using BWA-MEM (v0.7.12) [26] in paired-end mode. In addition, the DNA Damage Estimator (v3) [46] was used to calculate the GIV score, based on the imbalance between R1 and R2 variant frequencies of the sequencing reads, to estimate the level of DNA damage introduced during sample/library preparation. Post-alignment QC was performed on the BWA alignment BAM files: genome-mapped percentages and mapped-read duplication rates were calculated by BamTools (v2.2.3) [47] and Picard (v1.84) [48]. Genome coverage, exome target region coverage, mapped-read insert sizes, and G/C content were profiled using Qualimap (v2.2) [49] and custom scripts. Preprocessing QC reports were generated during each step of the process, and MultiQC (v1.3) [44] was run to generate an aggregated report in HTML format. A standard QC metrics report was generated from a custom script. For all alignments, we used BWA-MEM v0.7.17 [26] with the -M flag for downstream Picard compatibility.
Illumina short-read SV calling
SVs were detected by four independent pipelines. Illumina short reads (untrimmed) from 11 pairs of tumor/normal samples were aligned to the GRCh38 human reference genome (see the reference genome section above) using BWA-MEM (version 0.7.17-r1194) with default parameters. All alignment BAM files were duplicate-marked by Picard before SV calling. Reads with a mapping quality of at least 20 were retained for SV detection.
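The mapping-quality filter can be illustrated as a simple pass over SAM records (a sketch, not the pipeline's actual implementation):

```python
def passes_mapq(sam_line, min_mapq=20):
    """True if a SAM alignment line has mapping quality >= min_mapq.
    MAPQ is the 5th tab-separated field of a SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    return int(fields[4]) >= min_mapq

# Illustrative SAM record (fields abbreviated); MAPQ here is 37, so it is kept.
rec = "read1\t0\tchr1\t10468\t37\t150M\t*\t0\t0\tACGT\tFFFF"
```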
In the first pipeline, SV calls were generated from the mapped data using Delly [50] (v0.7.8) with default parameters, producing 11 Delly SV call sets following the procedures for somatic SV calling described in Delly’s GitHub repository (https://github.com/dellytools/delly). Delly detects deletions, inversions, tandem duplications, insertions, and inter-chromosomal translocations.
In the second pipeline, SV calls were generated from the mapped data using NovoBreak [51] (v1.1.3rc) with default parameters, producing 11 NovoBreak SV call sets following the detailed procedures described in the tool’s GitHub repository (https://github.com/czc/nb_distribution).
In the third pipeline, TNscope [52] (v201711.03) was used to process the BWA-MEM alignment BAM file to call somatic SVs, and DNAscope (v201711.03) was used to call germline SVs.
For TNscope (v201711.03), we used the version implemented in Seven Bridges’ CGC. Sentieon TNscope is a somatic variant caller for SNVs, indels, and SVs; here, we used TNscope only for SV calling. TNscope identifies statistically confident breakends that occur only in the tumor sample.
Command:
sentieon driver -r GRCh38.d1.vd1.fa -i $tumor.bam -i $normal.bam --algo TNscope --tumor_sample $tumor_sample_name --normal_sample $normal_sample_name --disable_detector snv_indel $output.vcf.gz
Sentieon DNAscope is a germline variant caller for SNVs, indels, and SVs; here, we used DNAscope only for SV calling. DNAscope identifies statistically confident germline breakends. We used version 201711.03 implemented in Seven Bridges’ CGC with the following commands:
Command:
sentieon driver -r GRCh38.d1.vd1.fa -i $tumor.bam -i $normal.bam --algo DNAscope --var_type bnd tmp.vcf.gz && sentieon driver -r GRCh38.d1.vd1.fa --algo SVSolver -v tmp.vcf.gz $output.sv.vcf.gz
The fourth pipeline used Manta [53] as integrated in the Dragen pipeline [27]. The BWA-MEM alignment BAM files were used to call somatic SVs as follows:
Command:
dragen -f -r GRCh38.d1.vd1_hash --sv-reference GRCh38.d1.vd1.fa --output-file-prefix $tumor_vs_normal --tumor-bam-input $tumor.bam --bam-input $normal.bam --enable-map-align false --enable-map-align-output false --enable-sv true
10X Genomics linked read SV analysis
The 10X Genomics Chromium FASTQ files were mapped to the hg38/GRCh38 reference genome, and reads were phased, using the LongRanger v2.2.2 pipeline [54]. The linked reads were aligned using the Lariat aligner [55], which uses BWA [26] to generate alignment candidates; duplicate reads were marked after alignment. SV calls within 10 kb of gaps or of new sequences introduced in GRCh38 were filtered, because such calls likely represent reference misassemblies.
The Manta methods integrated in the Dragen pipeline were also used to call SVs from the 10X Genomics Chromium fastq files. The alignment BAM files were generated by the Dragen pipeline, and then used for somatic SV calling as follows:
Command:
dragen -f -r GRCh38.d1.vd1_hash --tumor-fastq-list-sample-id $tumor --tumor-fastq-list fastq_list.csv --fastq-list-sample-id $normal --fastq-list fastq_list.csv --enable-duplicate-marking true --enable-variant-caller true --output-file-prefix $tumor_vs_$normal --enable-map-align-output true --enable-bam-indexing true --dbsnp dbsnp_146.hg38.vcf.gz --cosmic COSMIC_82_hg38.vcf.gz --panel-of-normals 1KG.2504.plus.TCGA.ACgrt0.vcf.gz --bin_memory=60000000000
dragen -f -r GRCh38.d1.vd1_hash --sv-reference GRCh38.d1.vd1.fa --output-file-prefix $tumor_vs_normal --tumor-bam-input $tumor.bam --bam-input $normal.bam --enable-map-align false --enable-map-align-output false --enable-sv true
PacBio single-molecule long-read SV analysis
NGM-LR and Sniffles
The raw BAM files were merged per sample and then aligned using smrttools v5.0.1, which includes PBSV. PBSV used NGM-LR [56] as the default aligner and hg38/GRCh38 as the reference genome. The PBSV align command also marked or removed duplicate reads originating from the same ZMW; the base-pair tolerance for duplicate removal was set to 100 bp.
The resulting BAM file was used with Sniffles [56] to call SVs per sample from the PacBio data using the following command:
Command:
sniffles -m tumor.merge.ngmlr.dedup.bam -v tumor.merge.ngmlr.dedup.vcf -t 16 --tmp_file ./ -l 30
Minimap2 and PBSV
Raw merged BAM files were aligned using smrttools v6.0.1 pbmm2, which uses minimap2 [57] as the default aligner. SVs were called per sample using the PBSV discover and call commands.
pbmm2 align --sort --sort-memory 24G --total-threads 24 --sort-threads-perc 30 --preset SUBREAD normal.merge.raw.bam hg38.fa normal.pbmm2.bam
pbsv discover --log-file discover.log -s normal normal.pbmm2.bam normal.svsig.gz
pbsv call --log-file call.log -j 24 hg38.fa normal.filter.svsig.gz normal.pbsv.smrtv6.vcf
Somatic SVs from PacBio
The resulting variant VCF file was divided into two files: large SVs (translocations and SVs >30kb) and small SVs (no translocations, SVs <30kb). Analysis with SURVIVOR was performed as for the 10X Genomics large SVs and small deletions to obtain somatic large and small SVs.
Oxford Nanopore sequencing data SV analysis
Basecalling was performed using Guppy v3.3.3. Data passing quality parameters (qmean > 7) were converted to FASTQ. The raw FASTQ files were merged for each sample and then mapped to the hg38 reference using minimap2 v2.16.
minimap2 -ax map-ont --MD hg38.fa tumor.fastq.gz | samtools view -bS - | samtools sort -m 16G -@16 - > tumor.ont.bam
Mapped BAM files were used to call SVs with Sniffles v1.0.11 and NanoSV v1.2.4. Sniffles was run with the parameters -s 3 -l 50 -t 36 --report_BND:
sniffles -s 3 -l 50 -t 36 --report_BND -m tumor.ont.bam -v tumor.ont.sniffles.vcf
NanoSV was run with default config parameters:
NanoSV -t 36 -o tumor.ont.nanosv.vcf -c config.txt -b default.bed tumor.ont.bam
Dovetail SELVA data SV analysis
A pair of normal and tumor libraries was sequenced to depths of 34X and 37X, respectively, on an Illumina NextSeq for gene fusion identification. The target sequencing depth was determined by estimating library complexity from a MiSeq QC run.
Reads were aligned to the GRCh38 human reference genome. Because of the potentially long separation distance between paired-end reads, the mapping pipeline maps each read separately with BWA-MEM [26], treating paired-end data as two single-end reads. In a post-processing step, we combined the two single-end reads into a single paired-end read in the BAM file, removed unmapped reads, and removed PCR duplicates with Picard.
Chromosomal rearrangements and gene fusions were assessed by dividing the reference genome into non-overlapping bins of width w and tabulating Nij, the number of read pairs that map with high confidence (MAPQ > 20) to bins i and j, respectively.
To automatically identify genomic rearrangement junctions, we defined a statistic that identifies local contrasts in Nij characteristic of rearrangements. Assuming Poisson-distributed local read counts, we computed two z-scores at each bin pair i,j: \({Z}_{ij}^{+}=\frac{{N}_{ij}^{+}-{N}_{ij}^{-}}{\sqrt{{N}_{ij}^{-}}}\) and \({Z}_{ij}^{-}=\frac{{N}_{ij}^{-}-{N}_{ij}^{+}}{\sqrt{{N}_{ij}^{+}}}\),
where \({N}_{ij}^{+}\) is the local sum of \({N}_{kl}\) over the north-east and south-west quadrants up to a maximum range R: \({N}_{ij}^{+}=\sum_{k=i,\,l=j}^{k=i+R,\,l=j+R}{N}_{kl}+\sum_{k=i,\,l=j}^{k=i-R,\,l=j-R}{N}_{kl}\), and \({N}_{ij}^{-}\) is the analogous sum over the north-west and south-east quadrants: \({N}_{ij}^{-}=\sum_{k=i,\,l=j}^{k=i-R,\,l=j+R}{N}_{kl}+\sum_{k=i,\,l=j}^{k=i+R,\,l=j-R}{N}_{kl}\). All positions i,j for which max(\({Z}_{ij}^{+}\), \({Z}_{ij}^{-}\)) > Zmin = 1 and max(\({Z}_{ij}^{+}\), \({Z}_{ij}^{-}\)) is a local maximum (no positions within a range of 3w have a higher value) were defined as candidate fusion junctions.
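The quadrant statistic can be prototyped on a small contact matrix; the following plain-Python sketch is illustrative (function names are ours, and the guard against division by zero when a quadrant sum is empty is our addition):

```python
def quadrant_sum(N, i, j, R, di, dj):
    """Sum of contact counts N over the quadrant anchored at (i, j), extending
    R bins in directions di, dj (each +1 or -1), clipped at the matrix edge."""
    total = 0
    for k in range(i, i + di * (R + 1), di):
        for l in range(j, j + dj * (R + 1), dj):
            if 0 <= k < len(N) and 0 <= l < len(N[0]):
                total += N[k][l]
    return total

def fusion_z_scores(N, i, j, R):
    """Z+ and Z- contrast statistics at bin pair (i, j), assuming
    Poisson-distributed local read counts (variance ~ mean)."""
    n_plus = quadrant_sum(N, i, j, R, +1, +1) + quadrant_sum(N, i, j, R, -1, -1)
    n_minus = quadrant_sum(N, i, j, R, -1, +1) + quadrant_sum(N, i, j, R, +1, -1)
    z_plus = (n_plus - n_minus) / max(n_minus, 1) ** 0.5
    z_minus = (n_minus - n_plus) / max(n_plus, 1) ** 0.5
    return z_plus, z_minus
```

A rearrangement junction concentrates read pairs in one diagonal pair of quadrants, so one of the two z-scores becomes large and positive at the junction bin.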
After identifying candidate fusions at an initial bin size w0 = 50,000, we classified candidate fusions into positive and negative classes using a convolutional neural network (CNN) model [58]; a candidate was called a positive fusion when the model probability exceeded 0.7. After classification, we refined the breakpoint position using two kernels with values decaying from the center as an exponential function: one kernel had values only on the north-east and south-west quadrants, and the other only on the north-west and south-east quadrants. Each kernel was convolved over a contact matrix centered at the candidate fusion and extending 2Mb to the left and right. After convolving, we calculated the global maxima of the two convolved images and selected the maximum as the refined breakpoint. Fusion event classification was done by selecting a new contact matrix with the same dimensions, centered at the refined breakpoint, and calculating the coverage of the four quadrants. Depending on the quadrant values, fusion events were classified as a reciprocal translocation, non-reciprocal translocation, deletion, inversion, or segmental duplication.
Bionano SV analysis
SV calls were made using Bionano Saphyr software (Access 1.5) [40]. A non-redundant set of Bionano calls (SVs and CNVs) was retained only if a call appeared in at least two of the three replicates and was present in the tumor but not in the paired normal libraries. These calls were then compared against the consensus call set from the other technologies by clustering analysis: an SV called by another technology and validated by Bionano yields the same call from both, and the two calls are clustered together. The criteria for determining whether two calls can be clustered were:
(1)
Insertion/deletion: only calls > 500 bp in size were considered, because the Bionano detection resolution is 500 bp; the two calls must have similar size (>50% size similarity) and be no more than 10 kb apart.
(2)
Inversion: only calls > 500 bp in size were considered; the two calls must have similar size (>50% size similarity) and be no more than 10 kb apart.
(3)
Duplication: only calls > 500 bp in size were considered; the two calls must have similar size (>50% size similarity) and be no more than 10 kb apart. The direction of the duplication (direct or inverted) was not considered, so calls could be clustered regardless of direction.
(4)
Translocation: the two calls' breakpoints must be no more than 50 kb apart. The direction of the translocation was not considered, so calls could be clustered regardless of direction.
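The four criteria above can be expressed as a single clustering predicate. The call representation (a dict with `type`, `size`, and `pos` fields) is a simplifying assumption for illustration:

```python
def can_cluster(sv_a, sv_b):
    """Return True if two calls of the same SV type satisfy the Bionano
    clustering criteria described above. Each call is a dict with
    'type', 'size' (bp), and 'pos' (breakpoint position in bp)."""
    if sv_a["type"] != sv_b["type"]:
        return False
    if sv_a["type"] == "translocation":
        # Translocations: breakpoints no more than 50 kb apart;
        # direction is ignored.
        return abs(sv_a["pos"] - sv_b["pos"]) <= 50_000
    # Insertions/deletions/inversions/duplications: only calls > 500 bp
    # (Bionano resolution), >50% size similarity, and <= 10 kb apart.
    if min(sv_a["size"], sv_b["size"]) <= 500:
        return False
    similarity = min(sv_a["size"], sv_b["size"]) / max(sv_a["size"], sv_b["size"])
    return similarity > 0.5 and abs(sv_a["pos"] - sv_b["pos"]) <= 10_000
```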
Multi-platform SV integration
Somatic SV events such as insertions, deletions, intra-chromosomal inversions, inter-chromosomal translocations, and complex breakpoint events were called in a tool-specific manner as described in the platform/tool-specific SV analysis sections above. If somatic calling mode was not supported by a given tool, SURVIVOR [25] v1.0.3 was used to retain the SVs present in the tumor sample but absent from the normal sample for each tumor-normal pair.
Split VCF files
The SVs in each tool-specific VCF file were split based on 5 window sizes (50 to 100 bp, 101 to 500 bp, 501 to 1000 bp, 1001 to 30,000 bp, and above 30,000 bp). Translocation events (TRA) were extracted from each split VCF file and saved into a separate file for each window bin. The Survivor filter function was used to split the VCF files as follows:
SURVIVOR filter $input.vcf NA $min_size $max_size 0 -1 $window_bin/$output.vcf && \
cat $window_bin/$output.vcf | grep -w -v "SVTYPE=TRA" | vcf-sort > $window_bin/$output.noTRA.vcf && \
cat $window_bin/$output.vcf | grep -w "SVTYPE=TRA" | vcf-sort > $window_bin/$output.only.TRA.vcf
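The five size bins used for splitting can be sketched as a lookup table; the tuple representation and the handling of the 30,000-bp boundary are assumptions for illustration:

```python
# Size bins (bp) used to split each tool-specific VCF; upper bound is
# inclusive, and None means unbounded ("30,000 bp and above").
BINS = [(50, 100), (101, 500), (501, 1000), (1001, 30_000), (30_001, None)]

def size_bin(sv_size):
    """Return the (min, max) bin an SV size falls into, or None for
    SVs below the 50-bp minimum."""
    for lo, hi in BINS:
        if sv_size >= lo and (hi is None or sv_size <= hi):
            return (lo, hi)
    return None
```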
Somatic large SV
To obtain the tumor-only (somatic) SVs, the large-SV (SV > 30 kb) VCF files for each pair were merged using Survivor. SVs were merged if their breakpoints were <10 kb apart. SVs present only in the tumor sample and not in the normal sample were taken as the somatic SVs for each tumor-normal pair. The command used for merging the VCF files for SV integration with a 10,000-bp window size was:
Command: SURVIVOR merge $input.vcf.list 10000 1 0 0 0 10000 $output.vcf
Somatic small SV
To obtain tumor-specific small deletions, insertions, duplications, and inversions, the VCF file for each sample was split into 4 files based on SV size: 50 to 100 bp, 101 to 500 bp, 501 bp to 1 kb, and greater than 1 kb but smaller than 30 kb. Each pair of samples was then merged using Survivor, with the minimum SV length of each bin used as the maximum allowed distance between breakpoints. The resulting tumor-specific SVs from all four files were then merged into the tumor-specific SV set for each pair. The command used for merging the VCF files for SV integration by window size (50, 100, 500, or 1000) was:
SURVIVOR merge $input.vcf.list $window_size 1 0 0 0 $window_size $output.vcf
Merging somatic call sets
Large SVs (SV > 30 kb or translocations) from the tumor-only filtered calls from all samples and all tools (11 LongRanger, 11 GrocSV [59], 11 TNScope, 11 Novobreak, 11 Delly, 1 Sniffles and 1 PBSV for PacBio data, 1 Sniffles and 1 NanoSV for ONT data, and 1 Dovetail Hi-C) were merged using the same Survivor command as for large SV integration:
Command: SURVIVOR merge $input.vcf.list 10000 1 0 0 0 10000 $output.vcf
Small SVs (SV < 30 kb, excluding translocations) from the tumor-only calls from all samples and all tools (11 LongRanger for short deletions, 11 TNScope, 11 Novobreak, 11 Delly, 1 Sniffles and 1 PBSV for PacBio data, 1 Sniffles and 1 NanoSV for ONT data) were merged using the same Survivor command as for small SV integration:
SURVIVOR merge $input.vcf.list $window_size 1 0 0 0 $window_size $output.vcf
In addition, SVs on ALT contigs or in 10X Genomics' blacklist regions were removed. For the high-confidence SV call set, we selected SVs represented in calls from at least 2 technologies, using a 10-kb window to merge SVs of the same type.
Clustering analysis for merged SVs
The merged SVs were clustered based on SV type and size using a clustering algorithm that merges SVs of the same type, requiring the two calls to have similar size (>50% similarity) and to be no farther apart than the defined window size. We split the SVs into separate files based on SV size and defined the clustering window sizes as follows:
- 50 to 100 bp SV files: 50 bp clustering window
- 101 to 500 bp SV files: 100 bp clustering window
- 501 bp to 1 kb SV files: 500 bp clustering window
- 1 to 30 kb SV files: 1 kb clustering window
- SVs above 30 kb or translocation events: 10 kb clustering window
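The size-to-window mapping above can be written as a small lookup function; the function name and the translocation flag are illustrative:

```python
def clustering_window(sv_size, is_translocation=False):
    """Clustering window (bp) for an SV of a given size, per the
    size-to-window mapping above. Translocations and SVs above 30 kb
    share the widest (10 kb) window."""
    if is_translocation or sv_size > 30_000:
        return 10_000
    if sv_size <= 100:
        return 50
    if sv_size <= 500:
        return 100
    if sv_size <= 1_000:
        return 500
    return 1_000
```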
Subsampling analysis
Illumina short-read data downsampling
Sequencing reads were downsampled using SAMtools version 1.6. A workflow called "Multi downsample BAM" was created in BGL; it ran the "SAMtools view" tool on all SAM or BAM files in a directory and included an option to downsample the reads by a given fraction, corresponding to the "-s" parameter of SAMtools view. The workflow then indexed the resulting BAM files using "SAMtools index." This workflow was used to generate all downsampled BAM files and index files, creating subsets with defined read coverage.
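In SAMtools view, the "-s" argument packs a random seed into its integer part and the fraction of reads to keep into its fractional part. A sketch of assembling that argument for a target coverage (the seed value and helper name are illustrative):

```python
def samtools_subsample_arg(current_cov, target_cov, seed=42):
    """Build the SAMtools 'view -s' argument: integer part = seed,
    fractional part = fraction of reads to keep."""
    frac = target_cov / current_cov
    if not 0 < frac < 1:
        raise ValueError("target coverage must be below current coverage")
    return f"{seed}{f'{frac:.4f}'[1:]}"

# Example: downsample a 100X BAM to 30X
# samtools view -b -s 42.3000 input.bam > downsampled_30x.bam
```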
BAM files from BWA [26] alignments of three replicate WGS runs at 100X coverage on HCC1395 and HCC1395BL were merged using SAMtools [60] (version 1.8) to obtain 200X and 300X coverage, respectively. The newly created BAM files were then indexed and regrouped using Picard Tools [48] (version 2.17.11).
10X Genomics data downsampling
Three replicates of the tumor sample sequenced at Novartis were selected for subsampling analysis. The three samples were merged using the .mro files produced by 10X Genomics LongRanger. The resulting BAM file was subsampled to 100x using the bamtofastq tool provided by 10X Genomics. Each subsequent subsampling used the next-higher-depth BAM file as input, applying the same subsampling method from 100x to 50x, 50x to 30x, 30x to 20x, and 20x to 10x.
PacBio long-read data downsampling
The NGMLR-aligned tumor BAM file, with mean coverage ~39x, was used for the subsampling analysis. The BAM file was subsampled to 30x, 20x, and 10x using the reformat.sh command from BBMap:
reformat.sh in=tumor.merge.ngmlr.dedup.bam out=pb.tumor.10x.bam mappedonly=t samplereadstarget=4172820 ref=hg38.fa -Xmx31g
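The samplereadstarget value corresponds to the read count needed for the desired mean coverage. A sketch of the arithmetic, with an assumed genome size and mean PacBio read length (both illustrative, not taken from the text):

```python
def reads_for_coverage(target_cov, genome_size, mean_read_len):
    """Reads needed for a target mean coverage, from
    coverage = reads * mean_read_length / genome_size."""
    return round(target_cov * genome_size / mean_read_len)

# With an assumed ~3.1 Gb genome and ~7.4 kb mean read length, 10x
# coverage needs roughly 4.2 million reads, consistent in magnitude
# with the samplereadstarget=4172820 used above.
reads = reads_for_coverage(10, 3_100_000_000, 7_400)
```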
The SVs from each subsampling level were split and merged using Survivor with the same window-size-based method previously described for large and small SV integration. The merged SVs were taken as the union of the call sets rather than the filtered SVs.
Annotation of high-confidence SVs
The high-confidence SV call set was annotated using the AnnotSV software [29, 30]. AnnotSV provides breakpoint annotations, including overlap with repeated sequences and G/C content. It first detects the genomic overlaps between the input and the annotation features, then constructs an annotation for the full-length SV and one for each gene within the SV. The significant structural variants were compared with the Cancer Gene Census Project of the Catalogue Of Somatic Mutations In Cancer (COSMIC) database (GRCh38 · COSMIC v90) [61] to identify variant genes overlapping the Cancer Gene Census list.
Fusion gene analysis
We performed RNA-seq fusion gene analysis to further investigate fusion gene and translocation events in the genome, using STAR-Fusion [62] and Arriba [63] for fusion event detection. Each pipeline produces a set of fusion candidates; gene fusion events found by both tools were passed onto the next step as our high-quality fusion events. For tumor-specific fusion genes, fusion genes that were also detected in the normal samples were subtracted. The high-confidence fusion genes were then compared with the high-confidence SV events to identify overlapping genes and produce the final integrated SV-fusion gene set.
Commands:
STAR --runThreadN 8 --genomeDir $ref_genome.fa.star.idx \
  --readFilesIn $input_R1_trimmed.fastq $input_R2_trimmed.fastq \
  --limitBAMsortRAM 31532137230 --outReadsUnmapped None \
  --outSAMtype BAM SortedByCoordinate \
  --alignIntronMax 200000 --alignMatesGapMax 200000 \
  --alignSJDBoverhangMin 10 --alignSJstitchMismatchNmax 5 -1 5 5 \
  --chimSegmentMin 12 --chimJunctionOverhangMin 12 \
  --chimSegmentReadGapMax 3 --twopassMode Basic
STAR-Fusion --genome_lib_dir $ref_genome -J $Start_Chimeric.out.junction --output_dir Star_fusion_outdir
arriba -x Star_Fusion_Aligned.sortedByCoord.out.bam -c Star_Fusion_Chimeric.out.sam \
  -g $ref_gencode.v24.annotation.gtf -a $ref_GRCh38.d1.dv1.fa \
  -b $ref_blacklist_filter.bed -d $output_SV_for_arriba.tsv \
  -o $output_arriba.tumor.with.sv.tsv
Reconstruction of cancer karyotypes for cancer cell line using RCK tool
HATCHet was used to generate allele-specific and clone-specific copy number values for the matched tumor-normal samples, using the recommended settings from the user documentation (http://compbio.cs.brown.edu/hatchet/). The HATCHet-generated segment values were converted for processing with RCK using rck-scnt-x2rck, following the author's recommendations. Estimated copy number values within telomeric segments were extended from the closest estimated value reported by HATCHet. The high-confidence structural variants and the converted HATCHet segment data were used as input for RCK, which was run with default parameters using the following command:
rck --scnt segs_extend.tsv --acnt adj_X_removed.tsv
Calculating SV calling frequency and selecting the high-confidence call set
The frequency of a variant is defined as the number of replicates in which it was called divided by the total number of technical replicates tested, which include both library replicates and software (tool) replicates.
For both small and large SVs, the frequency of each SV was calculated on three levels: (1) per tool frequency, (2) per platform frequency, and (3) general consensus score.
Per-tool frequency was calculated as the number of replicates in which the SV was detected by a given software tool, divided by the total number of replicates on that platform. For the 10X Genomics and Illumina data sets, the occurrence count was divided by 11 (replicates). For the PacBio and Hi-C samples, the frequency was either 1 or 0, representing an SV called or missed by the software tool.
Per-platform frequency was calculated as the number of times the SV was detected on a specific platform, divided by all of that platform's replicates. For instance, the 10X Genomics platform-specific SV frequency is the count of detections by its two software tools across 22 replicates. Since there was only one replicate and two software tools for the PacBio data, the frequency of detection was either ½, 1, or 0. For Illumina, there were 44 technical replicates (11 from TNscope, 11 from Delly, 11 from Novobreak, and 11 from Manta); the frequency was therefore the count of detections by these 4 software tools divided by 44. The consensus score was assigned as the sum of the per-tool frequencies for each SV. SVs called by at least two platforms were taken as high-confidence calls.
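The three frequency levels can be sketched as follows. The replicate counts come from the description above, but only a subset of the tool/platform table is shown, and the function names and data layout are illustrative:

```python
# Subset of the tool/replicate table described above:
# (platform, tool) -> number of technical replicates.
TOOL_REPLICATES = {
    ("Illumina", "TNscope"): 11, ("Illumina", "Delly"): 11,
    ("Illumina", "Novobreak"): 11, ("Illumina", "Manta"): 11,
    ("PacBio", "Sniffles"): 1, ("PacBio", "PBSV"): 1,
}

def per_tool_frequency(calls, platform, tool):
    """Fraction of a tool's replicates in which the SV was called.
    'calls' maps (platform, tool) -> number of replicates calling the SV."""
    return calls.get((platform, tool), 0) / TOOL_REPLICATES[(platform, tool)]

def per_platform_frequency(calls, platform):
    """Fraction of all of a platform's replicates (across its tools)
    in which the SV was called."""
    total = sum(n for (p, _), n in TOOL_REPLICATES.items() if p == platform)
    called = sum(c for (p, _), c in calls.items() if p == platform)
    return called / total

def consensus_score(calls):
    """Consensus score: sum of per-tool frequencies across all tools."""
    return sum(per_tool_frequency(calls, p, t) for (p, t) in TOOL_REPLICATES)

# Example: an SV called in all 11 TNscope replicates and by Sniffles.
example_calls = {("Illumina", "TNscope"): 11, ("PacBio", "Sniffles"): 1}
```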