CopywriteR : DNA copy number detection from off-target sequence data

3 Analysis workflow 2 3.1 preCopywriteR() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 3.2 BiocParallel for parallel computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.3 CopywriteR() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.4 plotCNA() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6


Overview
To extract copy number information from targeted sequencing data while circumventing problems related to the use of on-target reads, we developed the CopywriteR package. Unlike other currently available tools, CopywriteR is based on off-target reads. We have shown that this approach has several advantages relative to current approaches [1] and it constitutes a viable alternative for copy number detection from targeted sequencing data.

CopywriteR workflow
CopywriteR uses .bam files as an input. These files are processed in several steps to allow copy number detection from off-target reads of targeted sequencing efforts. These steps include: • Removing of low-quality and anomalous reads CopywriteR: DNA copy number detection from off-target sequence data.
• Peak calling (in reference when available, otherwise in sample itself) • Discarding of reads in peak regions • Counting reads on bins of pre-defined size • Compensating for the difference in effective bin size upon discarding peaks in peak regions • Correcting for GC-content and mappability; applying a blacklist filter for CNV regions

Analysis workflow
The full analysis of copy number data with CopywriteR includes three sequential steps, using the preCopywriteR(), CopywriteR() and plotCNA() functions respectively.
In this analysis workflow, we will use CopywriteR to extract copy number information from whole-exome sequening data from a murine small-cell lung cancer. The complete dataset can be downloaded from the European Nucleotide Archive (http://www.ebi.ac.uk/ena/home) using the accession number PRJEB6954. In this vignette, only the sequence reads on chromosome 4 are used. The .bam file, which has been pre-filtered to minimize disk usage, is contained in the SCLCBam package. Before starting the analysis, GC-content and mappability helper files need to be created for the desired bin size and reference genome. This is done using the preCopywriteR() function.

preCopywriteR()
CopywriteR uses binned read count data as a basis for copy number detection. We have preassembled 1 kb bin GC-content and mappability information for the hg18, hg19, hg38, mm9 and mm10 reference genomes in the CopyhelpeR package. The preCopywriteR() function allows to create the necessary helper files. These files can be created for any custom bin size that is a multiple of 1000 bp, and for the above-mentioned reference genomes. The helper files are saved in a new folder that is names after the relevant reference genome, the bin size, and the prefix (all separated by '_'). The helper files are required to run CopywriteR. If previously created helper files for a specific bin size / reference genome / prefix combination are available, the preCopywriteR() step can be omitted.
In this example, CopywriteR will be applied to a bin size of 20 kb. We recommend to use a bin size of 20 kb when trying to analyze whole-exome sequencing data, while we would start analyzing at 50 kb resolution for targeted sequencing on smaller gene panels. As the .bam file contained in the SCLCBam package is mapped to the mm10 reference genome, we run preCopywriteR() as follows: Please not that the "mm10_4" reference genome exclusively contains helper files for chromosome 4, and therefore should only be used when replicating the analysis in this vignette. The custom helper files are placed in a folder that is named after the reference genome and the bin size (in this example: mm10_4_20kb). This folder, on its turn, is placed inside the folder specified by the output.folder argument. The bin.size argument specifies the custom bin size (in bp), and the ref.genome argument can be either "hg18", "hg19", "hg38", "mm9" or "mm10". The default chromosome notation is "1", "2", ..., "X", "Y". The prefix argument is optional and can be used to add a prefix to these names.
If you are interested in getting helper files for other genomes, please contact me by the email addresses supplied at the beginning of this vignette.

CopywriteR()
The CopywriteR() function allows to calculate binned read count data based on the helper files created by preCopywriteR(). CopywriteR uses a peak calling algorithm to remove 'ontarget' reads. For every sample that is to be analyzed by CopywriteR, the user can specify on which sample the peak regions should be identified. The argument that controls this is sample.control. This matrix or data.frame should contain path names to 'samples' in the first column, and their corresponding 'controls', in the second column. The peaks called in a 'control' are used to remove reads from the corresponding 'sample'. Please note that any sample that is to be used by the downstream plotCNA function (including those that are only to be used as a reference) needs to be analyzed by the CopywriteR function.
In the example used here, no matched germline control is available, and peaks are called on the sample itself. Therefore the paths specified in the samples and controls columns of the sample.control data.frame below are identical. The following samples will be analyzed: sample: T43 _ 4.bam ; matching control: T43 _ 4.bam The bin size for this analysis is 20000 The capture region file is not specified This analysis will be run on 1 cpus Total calculation time of CopywriteR was: 56.79833 The destination.folder argument determines in which folder the output folder is going to be placed, and the reference.folder points to the location of the custom bin size helper files. The keep.intermediary.files is an optional argument that determines whether intermediary .bam, .bai and peak .bed files are kept. The default value is FALSE to limit disk space usage; it can be set to TRUE however for troubleshooting purposes. Finally, the capture.regions.file argument is optional, and can be used to define the location of a capture regions bed file. If this file is specified, statistics of the overlap of these regions with the called peaks is provided.
CopywriteR will create a new folder named CNAprofiles in the directory specified by destination.folder. This folder contains the following files: > cat(list.files(path = file.path(data.folder, "CNAprofiles")), sep = "\n") CopywriteR.log input.Rdata log2 _ read _ counts.igv qc read _ counts.txt The read_counts.txt file contains both uncompensated and compensated read counts for every sample, as well as the corresponding fraction.of.bin values. The 'fraction of bin' indicates what fraction of the bin is not on peaks, and therefore effectively contributes to the read count. In our example, the read_counts.txt has the following content: The input.Rdata file contains a number of variables that are required to run the last function of the CopywriteR package, plotCNA(). The CopywriteR.log file contains log information of the R commands that have been used to perform the various subfunctions, and specifications of the input material. Finally, the qc folder contains two types of quality control plots.
> cat(list.files(path = file.path(data.folder, "CNAprofiles", "qc")), sep = "\n") fraction.of.bin.T43 _ 4.bam.pdf The fraction.of.bin files contain the empirical cumulative distribution function for the 'fractions of bins'. The read.counts.compensated files contain the plots and the loesses that are used for GC-content and mappability corrections.

plotCNA()
The plotCNA() function allows segmentation of the copy number data using DNAcopy [2], and subsequent plotting. We run the plotting function as follows: The samples are plotted per chromosome, as well as in a genome-wide fashion. We provide here the result for chromosome 4: In addition to the plots, the raw segmented values can be obtained from the DNAcopy object 'segment.Rdata'.
We are interested in further improving CopywriteR, so if CopywriteR fails to analyze your samples, please don't hesitate to contact me. In this case, please provide the full command line output and the log-file.

Troubleshooting
We maintain a troubleshooting GitHub page where various issues are discussed that have been raised by CopywriteR users. Please follow this link: https://github.com/PeeperLab/ CopywriteR#troubleshooting.