Systematic biases in DNA copy number originate from isolation procedures
© van Heesch et al.; licensee BioMed Central Ltd. 2013
Received: 26 November 2012
Accepted: 24 April 2013
Published: 24 April 2013
The ability to accurately detect DNA copy number variation in both a sensitive and quantitative manner is important in many research areas. However, genome-wide DNA copy number analyses are complicated by variations in detection signal.
While GC content has been used to correct for this, here we show that coverage biases are tissue-specific and independent of the detection method as demonstrated by next-generation sequencing and array CGH. Moreover, we show that DNA isolation stringency affects the degree of equimolar coverage and that the observed biases coincide with chromatin characteristics like gene expression, genomic isochores, and replication timing.
These results indicate that chromatin organization is a main determinant for differential DNA retrieval. These findings are highly relevant for germline and somatic DNA copy number variation analyses.
KeywordsCopy number variation DNA isolation Technological bias Tissue specificity
The ability to accurately detect DNA copy number variation (CNV) in both a sensitive and quantitative manner is important in many research areas. While the detection of CNVs previously relied on low resolution techniques like quantitative PCR or MLPA, high-resolution array-based comparative genomic hybridization (aCGH) and next-generation sequencing (NGS)-based depth of read coverage (DOC) approaches  now allow for detailed genome-wide analyses. However, both aCGH and DOC are complicated by the presence of 'wave patterns' in the raw data where the measurement deviates systematically from equimolar representation. These regions span up to tens of megabases and pose challenges on CNV calling. To reduce the number of false-positive calls introduced, algorithms were designed to suppress wave effects [2–6]. In these studies, quantity of DNA during hybridization, dye-biases, enzymatic effects, and correlations with GC content were proposed as the main contributors to the wave patterns. However, understanding the source of the observed patterns is important for reliable genome-wide analyses based on aCGH and NGS techniques.
Results and discussion
These results demonstrate that the observed wave patterns are the result of combined tissue-specific DNA isolation biases. As the magnitude, but not the pattern, of the biases decreases with longer proteinase K treatment (Additional file 6), we postulate that DNA retrieval effects are due to differences in degradation of DNA/protein complexes, which subsequently results in depletion of stable aggregates by early precipitation or separation into the phenol phase. Densely packed heterochromatic regions, but also nuclear lamina attached chromatin, are likely to be most affected by such process.
To test whether the DNA in the under-represented genomic regions was simply absent from the sample, or just inaccessible for subsequent applications like sequencing or aCGH, we modulated the DNA isolation experiments even further. First, DNA was extracted after only 10 min of lysis in the presence of proteinase K. After one initial round of phenol/chloroform extraction and precipitation, the sample was divided in half. One part was treated with proteinase K for 2 additional hours, while the other was used as a control and left untreated. We subsequently extracted the DNA from both samples using a second round of phenol/chloroform extraction. Surprisingly, the NGS data show that an additional 2 h of treatment dramatically improves the evenness of the genome-wide coverage as compared to the control sample (Figure 4D), now resembling the read distribution of samples that were treated for a minimum of 2 h. This suggests either that inaccessible DNA was present after the first phenol/chloroform extraction and made accessible by the additional proteinase K treatment, or that the second phenol/chloroform purification step removed additional protein-bound DNA from the control sample. In any case, these experiments further demonstrate that equimolar DNA representation is affected by differences in DNA isolation conditions.
We demonstrate that DNA isolation procedures can introduce a systematic bias that contributes to the wave effects in aCGH data and the variation in coverage depth in NGS data. We show that extended lysis with proteinase K treatment results in: (1) more even representation of NGS reads across the genome; (2) more similar representation of DNA derived from different tissue sources; and (3) improved DNA content uniformity for a previously undertreated DNA sample. Our data show that the basis for the observed bias is tissue-specific and related to specific chromatin characteristics. Interestingly, from the four tissues that we sampled in this study, brain showed the lowest variation in NGS read coverage. This could reflect the diversity of cell types within this tissue and the associated increased variety of chromatin conformations. More homogeneous tissues like blood and liver exhibited the largest bias in read coverage (Additional file 7), again supporting a cell type-specific origin of the effects rather than primary DNA characteristics. Tissue-specific chromatin characteristics could originate from protein-DNA interactions, 3D organization, and epigenetic modifications.
The observations presented in this study are relevant for a wide range of genomics techniques. Obviously, the described artifacts affect the accuracy of CNV detection [16, 17], in particular somatic CNV analyses such as in cancer where sample heterogeneity requires accurate detection of relatively small changes. Furthermore, genome-wide nucleotide variation analyses using next-generation sequencing may also be affected, as depleted regions will have lower sequencing coverage, which results in lower reliability of variant calling. Accurate experimental reflection of the original amounts of DNA is also important for other genomics techniques, as was recently demonstrated for ChIP-seq experiments . As none of the methods or conditions tested could completely remove the signal bias, special care should be taken to control for potential DNA isolation and tissue-specific effects in experiments involving quantitative DNA interpretation. Furthermore, detection of somatic copy number variation will require independent measurements, for example, using allele imbalances [7, 8].
Materials and methods
Isolation of genomic DNA
Tissues were collected from BN (BN/Crl, Charles River), WU (HsdCpb:WU), and ACI rats (ACI/Seg/Hsd, Harlan Laboratories B.V., The Netherlands) 6 weeks of age, snap frozen and powdered. A total of 30 mg input material was used for strain and tissue comparisons, 100 mg input material was used for DNA isolation methods comparisons. DNA was isolated using standard phenol/chloroform extraction (1:1, pH 7.9) or Qiagen DNeasy Blood & Tissue kit (cat.no. 69506). Tissues were lysed in lysis buffer (100 mM Tris-HCI pH 8.0 200 mM NaCl, 0.2% SDS, 5 mM EDTA) using a Kontes Dounce tissue grinder (Kimble and Chase, 885300-0002) and incubated for 2 h at 50°C in the presence of RNase A (50 μg/mL) and proteinase K (100 μg/mL). For WU rat brain and liver, additional proteinase K conditions were tested (10 min, 30 min, 60 min, 120 min, and overnight in lysis buffer). These time series were followed by two rounds of standard phenol/chloroform extraction with in-between precipitation of the DNA (1x phenol, 1x phenol/chloroform, and 1x chloroform). DNA precipitation was done with 3 volumes of pure ethanol in the presence of 1/10 volume sodium acetate (3 M). Pelleted DNA was washed with 70% ethanol and dissolved in 10 mM Tris pH 8.0. For the additional proteinase K treatment experiment, 50% of the DNA that was extracted after a 10-min lysis was treated for an additional 120 min of proteinase K (100 μg/mL) in the lysis buffer described above. Next, both samples were cleaned during a second round of phenol/chloroform and ethanol precipitation, similar to the other samples in the time series.
For isolation of blood DNA with the Qiagen kit, all steps were performed exactly according to manufacturer's instructions (Qiagen DNeasy Blood & Tissue handbook, 07/2006). DNA quality and quantity of all isolations were measured using NanoDrop ND-1000 (Thermo scientific) and a Qubit Quant-iT™ dsDNA broad range assay (Invitrogen).
Array comparative genomic hybridization (aCGH)
NimbleGen whole genome tiling path arrays covering the complete, non-repetitive part of the rat genome were used. The 2.1 M probe arrays had an average probe spacing of 1 probe per 1.3 kb and a GC-content close to 50%. For strain and tissue comparisons, DNA derived from tissues of BN and ACI rats was used for hybridization. The exact quantity of DNA recommended by NimbleGen was used (2 μg input for sonication, 1 μg input for exo- klenow mediated Cy3 and Cy5 labeling, 13 μg for hybridization). DNA labeling (NimbleGen dual-color DNA labeling kit), array hybridization (HX1 mixers, NimbleGen hybridization system 4), washing (NimbleGen wash buffer kit), and scanning were performed exactly according to manufacturer's instructions (NimbleGen Arrays User's Guide - CGH analysis Version 6.0). Image analysis, data normalization, and plotting were performed using NimbleScan 2.4 software using parameters preset by the manufacturer. For platform and extraction method comparisons, Agilent custom designed tiling path arrays (4 × 44 k, ± 1.5 kb probe spacing) were designed for the complete rat chromosome 14 (RNO14). aCGH DNA preparation steps and array hybridization were performed according to manufacturer's instructions (Agilent Oligonucleotide Array-Based CGH for Genomic DNA Analysis V4.0).
SOLiD mate-pair sequencing and depth of coverage analysis for the sequencing aCGH comparison
For the mate-pair sequencing data presented in Figure 1, 118 microgram of genomic DNA was fragmented by incomplete digestion during a time series of 15 s to 25 min with the Alu I restriction enzyme (Promega, R6281). Time points were pooled and the fragmented DNA was loaded on a 1% agarose gel for gel excision of 1-3 kb fragments. Mate-pair libraries were prepared according to the Applied Biosystems User's Guide (12/2007 4391587 Rev. B) and sequenced on SOLiD V2. Sequencing data were mapped against Rat genome assembly RGSC3.4 by BWA 0.5.9 software  (parameters -c -l 25 -n 2 -k 6). We calculated the number of reads using genomic windows containing 100 kb of genome sequence (excluding sequence gaps). The number of reads in each sample was normalized as reads per million sequenced reads. These normalized values were used for the calculation of log ratios and plotting. In total, 6,325,428 reads were used for blood, 6,600,672 for brain, 6,764,588 for liver, and 7,183,059 for testis. These numbers equal a low-pass genome coverage ranging between 0.075× and 0.086×.
SOLiD fragment sequencing and depth of coverage analysis for the time series
For the time series experiments presented in Figure 5, barcoded fragment libraries were produced on an automated system (BioMek), introducing no variation in the library preparation procedure. One microgram of DNA was used as input and libraries were prepared exactly according to manufacturer's instructions for SOLiD 5500XL library preparation. SOLiD libraries were pooled equimolary, quality assessed, and size selected on the Caliper XT system. Sequencing reads were aligned to the Rat reference genome RGSC3.4 using BWA 0.5.9  (parameters -c -k 2 -l 25 -n 10). PCR duplicates were marked in the alignments and were not used in the analysis, resulting in 10 to 35 M unique and unambiguously mapped reads per time point. For tissue comparisons, the coverage of each library was normalized by random removal of reads to 10 M of unambiguously mapped tags (0.2× genome coverage), which corresponds to the liver library with the least amount of mapped reads. For the additional proteinase K treatment comparisons, only brain samples were used and these could thus be normalized to 14.8 M reads (0.3× genome coverage; limited by the brain library with the least number of reads).
The genome was partitioned into windows each containing 20 kb of NGS-accessible sequence (excluding repeats and gaps). The read count and GC content were determined for each window and library. GC-correction: read counts were adjusted for each library by normalization against the median read count in 100 genomic windows with most similar GC content using the following formula: Ncorr = Nmed*Nobs/NmedGC where: Ncorr, GC-corrected number of reads; Nobs, observed number of reads; Nmed, median reads per window for this library; and NmedGC, median number of reads in 100 windows with the most similar GC content. After GC correction, potential somatic copy number changes were determined using a dynamic window approach (DWAC-seq).
Correlation of DNA content variability with various genome characteristics and gene expression
All sequencing data are available from the Sequence Read Archive (SRA) at EBI under accession number (ERP001927). Array CGH data are available from the Gene Expression Omnibus (GEO) database at NCBI under accession number (GSE45308). Whole genome aCGH plots for tissue comparisons are available as Additional file 8 (ACI blood versus liver and brain versus testis; BN blood versus liver and liver versus blood).
- Array CGH/aCGH:
array-comparative genome hybridization
chromatin immunoprecipitation sequencing
copy number variation
depth of coverage
multiplex ligation-dependent probe amplification
polymerase chain reaction
short interspersed nuclear element.
We thank Ewan Birney for his suggestion to explore DNA isolation artifacts as a potential source for the putative somatic CNV patterns observed in our experiments. This work was financially supported by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement No. HEALTH-F4-2010-241504 (EURATRANS) to EC and TJA and the NWO-CW TOP grant (700.58.303) to EC, the NGI Horizon program grant (93519030) to VG, and the NIH grant R01-CA77876 to JS. TJA acknowledges intramural MRC Programme support.
- Ceulemans S, van der Ven K, Del-Favero J: Targeted screening and validation of copy number variations. Methods Mol Biol. 2012, 838: 311-328. 10.1007/978-1-61779-507-7_15.PubMedView ArticleGoogle Scholar
- Diskin SJ, Li M, Hou C, Yang S, Glessner J, Hakonarson H, Bucan M, Maris JM, Wang K: Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 2008, 36: e126-10.1093/nar/gkn556.PubMedPubMed CentralView ArticleGoogle Scholar
- Komura D, Shen F, Ishikawa S, Fitch KR, Chen W, Zhang J, Liu G, Ihara S, Nakamura H, Hurles ME, Lee C, Scherer SW, Jones KW, Shapero MH, Huang J, Aburatani H: Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Res. 2006, 16: 1575-1584. 10.1101/gr.5629106.PubMedPubMed CentralView ArticleGoogle Scholar
- Marioni JC, Thorne NP, Valsesia A, Fitzgerald T, Redon R, Fiegler H, Andrews TD, Stranger BE, Lynch AG, Dermitzakis ET, Carter NP, Tavare S, Hurles ME: Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biol. 2007, 8: R228-10.1186/gb-2007-8-10-r228.PubMedPubMed CentralView ArticleGoogle Scholar
- van de Wiel MA, Brosens R, Eilers PH, Kumps C, Meijer GA, Menten B, Sistermans E, Speleman F, Timmerman ME, Ylstra B: Smoothing waves in array CGH tumor profiles. Bioinformatics. 2009, 25: 1099-1104. 10.1093/bioinformatics/btp132.PubMedView ArticleGoogle Scholar
- Benjamini Y, Speed TP: Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012, 40: e72-10.1093/nar/gks001.PubMedPubMed CentralView ArticleGoogle Scholar
- Jacobs KB, Yeager M, Zhou W, Wacholder S, Wang Z, Rodriguez-Santiago B, Hutchinson A, Deng X, Liu C, Horner MJ, Cullen M, Epstein CG, Burdett L, Dean MC, Chatterjee N, Sampson J, Chung CC, Kovaks J, Gapstur SM, Stevens VL, Teras LT, Gaudet MM, Albanes D, Weinstein SJ, Virtamo J, Taylor PR, Freedman ND, Abnet CC, Goldstein AM, Hu N, et al: Detectable clonal mosaicism and its relationship to aging and cancer. Nat Genet. 2012, 44: 651-658. 10.1038/ng.2270.PubMedPubMed CentralView ArticleGoogle Scholar
- Laurie CC, Laurie CA, Rice K, Doheny KF, Zelnick LR, McHugh CP, Ling H, Hetrick KN, Pugh EW, Amos C, Wei Q, Wang LE, Lee JE, Barnes KC, Hansel NN, Mathias R, Daley D, Beaty TH, Scott AF, Ruczinski I, Scharpf RB, Bierut LJ, Hartz SM, Landi MT, Freedman ND, Goldin LR, Ginsburg D, Li J, Desch KC, Strom SS, et al: Detectable clonal mosaicism from birth to old age and its relationship to cancer. Nat Genet. 2012, 44: 642-650. 10.1038/ng.2271.PubMedPubMed CentralView ArticleGoogle Scholar
- Baillie JK, Barnett MW, Upton KR, Gerhardt DJ, Richmond TA, De Sapio F, Brennan PM, Rizzu P, Smith S, Fell M, Talbot RT, Gustincich S, Freeman TC, Mattick JS, Hume DA, Heutink P, Carninci P, Jeddeloh JA, Faulkner GJ: Somatic retrotransposition alters the genetic landscape of the human brain. Nature. 2011, 479: 534-537. 10.1038/nature10531.PubMedPubMed CentralView ArticleGoogle Scholar
- Liu GE, Hou Y, Robl JM, Kuroiwa Y, Wang Z: Assessment of genome integrity with array CGH in cattle transgenic cell lines produced by homologous recombination and somatic cell cloning. Genome Integr. 2011, 2: 6-10.1186/2041-9414-2-6.PubMedPubMed CentralView ArticleGoogle Scholar
- Ryba T, Hiratani I, Lu J, Itoh M, Kulik M, Zhang J, Schulz TC, Robins AJ, Dalton S, Gilbert DM: Evolutionarily conserved replication timing profiles predict long-range chromatin interactions and distinguish closely related cell types. Genome Res. 2010, 20: 761-770. 10.1101/gr.099655.109.PubMedPubMed CentralView ArticleGoogle Scholar
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.PubMedView ArticleGoogle Scholar
- Smit AF: Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 1999, 9: 657-663. 10.1016/S0959-437X(99)00031-3.PubMedView ArticleGoogle Scholar
- Veal CD, Freeman PJ, Jacobs K, Lancaster O, Jamain S, Leboyer M, Albanes D, Vaghela RR, Gut I, Chanock SJ, Brookes AJ: A mechanistic basis for amplification differences between samples and between genome regions. BMC Genomics. 2012, 13: 455-10.1186/1471-2164-13-455.PubMedPubMed CentralView ArticleGoogle Scholar
- Schubeler D, Scalzo D, Kooperberg C, van Steensel B, Delrow J, Groudine M: Genome-wide DNA replication profile for Drosophila melanogaster: a link between transcription and replication timing. Nat Genet. 2002, 32: 438-442. 10.1038/ng1005.PubMedView ArticleGoogle Scholar
- Lee KH, Chiu S, Lee YK, Greenhalgh DG, Cho K: Age-dependent and tissue-specific structural changes in the C57BL/6J mouse genome. Exp Mol Pathol. 2012, 93: 167-172. 10.1016/j.yexmp.2012.04.013.PubMedView ArticleGoogle Scholar
- Piotrowski A, Bruder CE, Andersson R, Diaz de Stahl T, Menzel U, Sandgren J, Poplawski A, von Tell D, Crasto C, Bogdan A, Bartoszewski R, Bebok Z, Krzyzanowski M, Jankowski Z, Partridge EC, Komorowski J, Dumanski JP: Somatic mosaicism for copy number variation in differentiated human tissues. Hum Mutat. 2008, 29: 1118-1124. 10.1002/humu.20815.PubMedView ArticleGoogle Scholar
- Chen Y, Negre N, Li Q, Mieczkowska JO, Slattery M, Liu T, Zhang Y, Kim TK, He HH, Zieba J, Ruan Y, Bickel PJ, Myers RM, Wold BJ, White KP, Lieb JD, Liu XS: Systematic evaluation of factors influencing ChIP-seq fidelity. Nat Methods. 2012, 9: 609-614. 10.1038/nmeth.1985.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.PubMedPubMed CentralView ArticleGoogle Scholar
- Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, Gordon L, Hendrix M, Hourlier T, Johnson N, Kahari AK, Keefe D, Keenan S, Kinsella R, Komorowska M, Koscielny G, Kulesha E, Larsson P, Longden I, McLaren W, Muffato M, Overduin B, Pignatelli M, Pritchard B, Riat HS, et al: Ensembl 2012. Nucleic Acids Res. 2012, 40: D84-90. 10.1093/nar/gkr991.PubMedPubMed CentralView ArticleGoogle Scholar
- Pohl AA, Sugnet CW, Clark TA, Smith K, Fujita PA, Cline MS: Affy exon tissues: exon levels in normal tissues in human, mouse and rat. Bioinformatics. 2009, 25: 2442-2443. 10.1093/bioinformatics/btp414.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.