Whole exome capture in solution with 3 Gbp of data
© Bainbridge et al.; licensee BioMed Central Ltd. 2010
Received: 14 April 2010
Accepted: 17 June 2010
Published: 17 June 2010
We have developed a solution-based method for targeted DNA capture-sequencing that is directed to the complete human exome. This approach allows the discovery of greater than 95% of all expected heterozygous single base variants, requires as little as 3 Gbp of raw sequence data and constitutes an effective tool for identifying rare coding alleles in large-scale genomic studies.
DNA sequence capture is an effective technique for enrichment of significant subfractions of the genome for targeted analysis. Sequence capture is generally conducted with either a solid-phase substrate, usually a glass microarray, or alternatively, in solution [1–4]. Solution-capture performance, however, has not been thoroughly compared to microarray with respect to uniformity of representation of the targeted DNA bases or evenness of DNA sequencing coverage depth between regions [5, 6]. Further, solution-based sequence capture has not been demonstrated to work effectively at the scale of a human exome (approximately 30 Mbp) but typically has been limited to targets <5 Mbp in size. Despite this, solution capture has several advantages when compared to microarrays: the reagent cost is lower; less DNA is required; and, because the capture method can be conducted entirely in small laboratory tubes, it is readily scaled and automated. Before solution-capture sequencing can be widely adopted, however, the reproducibility of the method must first be demonstrated, and targets should show similar levels of coverage from capture to capture. Ideally, solution-capture methods should also be able to be coupled to different sequencing technology platforms, and reliably produce suitable levels of enrichment that routinely enable the discovery of rare genetic variants.
A comparison of different capture methodologies
To test the reproducibility of our recent innovations of liquid DNA capture, technical replicate capture experiments were performed and subsequently sequenced on the SOLiD platform. Capture followed by Illumina sequencing was also performed with both a fragment (frag) and paired-end (PE) library to test the merits of employing PE data versus single-ended reads. Finally, we used each of these data sets to test the ability to discover single nucleotide variants across the exome.
Results and discussion
Here we report the performance of newly developed methods for sequence-capture in solution. The procedures were tested with respect to reproducibility of capture, agnosticism to sequencer platform, and the ability to discover genetic variation in human gene coding regions. In total, six captures of the Consensus Coding Sequence (CCDS) exons [9, 10] were performed using DNA from one HapMap sample (NA12812). Four separate, replicate capture libraries were prepared for SOLiD sequencing and one additional library was prepared for each of the Illumina frag and PE sequencing runs. In total, 23 Gbp of data were uniquely aligned to the human reference genome (Table S1 in Additional file 1). Variants were called using algorithms tailored to each sequence platform and corresponding data type, and compared to known variants in the HapMap sample (see Materials and methods).
One technical artifact of capture-sequencing procedures is the generation of duplicate DNA sequencing reads that represent the repeated sequencing of copies of the same molecule. These duplicates generally arise when there are too few total molecules present at any stage of the technical manipulations - especially immediately prior to any PCR step. Detection of the duplicate reads by computational analysis is not trivial, and generally relies on observation of the alignment positions. Unfortunately, these artifactual duplicates are difficult to distinguish from exactly overlapping reads that naturally occur within deep sequence samples.
Alignment statistics for SOLiD frag sequencing libraries. Rows reported: total reads aligned; total data aligned (Gbp); reads on target (%); duplicate reads (%); mean coverage (×); median coverage (×); targets hit (%); bases ≥1× coverage (%); bases ≥10× coverage (%); bases ≥20× coverage (%).
Alignment statistics for Illumina PE and frag sequencing libraries. Rows reported: total reads aligned; total data aligned (Gbp); reads on target (%); duplicate reads (%); mean coverage (×); median coverage (×); targets hit (%); bases ≥1× coverage (%); bases ≥10× coverage (%); bases ≥20× coverage (%).
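The coverage rows listed in the tables above (mean, median, and the percentage of bases at or above a depth threshold) can be derived from per-base depth over the target regions. The following is a minimal sketch of that calculation, not the authors' pipeline; the function name `coverage_summary` and the toy depth values are hypothetical.

```python
from statistics import mean, median


def coverage_summary(depths, thresholds=(1, 10, 20)):
    """Summarize per-base depth over targeted bases.

    depths: one integer depth per targeted base (hypothetical input format).
    Returns mean, median, and percent of bases at or above each threshold.
    """
    n = len(depths)
    if n == 0:
        raise ValueError("no targeted bases supplied")
    summary = {"mean": mean(depths), "median": median(depths)}
    for t in thresholds:
        covered = sum(1 for d in depths if d >= t)
        summary[f"pct_ge_{t}x"] = 100.0 * covered / n
    return summary


# Toy example: ten targeted bases with made-up depths.
toy_depths = [0, 5, 12, 25, 30, 8, 22, 15, 40, 3]
toy_summary = coverage_summary(toy_depths)
print(toy_summary)
```

In practice the depth vector would come from a pileup restricted to the capture targets, after duplicate removal.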
To test these theories we also analyzed the PE data as if they were generated from a single-ended frag library. This caused the on-target alignment rate to drop slightly to 73% and the duplicate rate to nearly quadruple to 27.6%, virtually identical to the Illumina frag library duplicate rate. The net effect of using PE data instead of frag data was a significant increase in on-target coverage, which resulted in >90% of the targeted bases covered at 10-fold or higher using just 2.8 Gbp of data, a single 2 × 75 bp lane of Illumina sequencing.
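One way to see why treating PE data as frag data inflates the duplicate rate is to compare the keys used to call two reads identical. The sketch below assumes position-based duplicate marking, with frag reads keyed on chromosome, strand and start, and PE reads keyed additionally on the mate start; the read records are invented for illustration and this is not the authors' duplicate-removal code.

```python
from collections import Counter


def count_duplicates(keys):
    """Count reads beyond the first that share an identical key."""
    counts = Counter(keys)
    return sum(n - 1 for n in counts.values())


# Hypothetical aligned read pairs: (chrom, strand, read1_start, mate_start)
pairs = [
    ("chr1", "+", 1000, 1180),
    ("chr1", "+", 1000, 1180),  # same start and same mate: likely PCR duplicate
    ("chr1", "+", 1000, 1215),  # same start, different fragment end
    ("chr1", "+", 1000, 1242),  # same start, different fragment end
    ("chr1", "+", 2000, 2190),
]

# Single-ended view: only the first read's position is available.
frag_keys = [(chrom, strand, start) for chrom, strand, start, _ in pairs]
# Paired-end view: both ends of the fragment are available.
pe_keys = list(pairs)

frag_dups = count_duplicates(frag_keys)
pe_dups = count_duplicates(pe_keys)
print(frag_dups, pe_dups)
```

With only one end known, three of the five toy reads are flagged as duplicates; with both ends known, only the one exact fragment copy is flagged, so more independent reads are retained.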
To assess the effect of DNA sequencing coverage depth on our ability to correctly identify variants in the exonic region of NA12812, we conducted variant discovery using both approximately 3.3 Gbp and approximately 10 Gbp of SOLiD capture data and 2.8 and 2.5 Gbp of Illumina PE and frag data, respectively. Only a subfraction of these data, non-duplicate sequence reads mapped to target regions, was used for variant discovery, and it is these data, not the total, that ultimately affect variant discovery quality. Discovered variants were compared to known HapMap SNPs in this sample and to dbSNP. Here, the concordance to HapMap is used to measure the false negative discovery rate, and the proportion of discovered variants that are also present in dbSNP129 is used to approximate the false positive discovery rate. Others have typically found approximately 90% of CCDS variants to be present in dbSNP for Europeans, and significant deviation below 90% may indicate an increased false positive discovery rate.
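The two quality measures described above reduce to simple set operations. The following sketch assumes variants are represented as `(chromosome, position, alternate allele)` tuples; the function name `evaluate_calls` and all variant records are hypothetical and only illustrate the calculation.

```python
def evaluate_calls(called, hapmap_known, dbsnp):
    """Return (HapMap concordance %, % of called variants present in dbSNP).

    called:       set of variants discovered in the sample
    hapmap_known: set of known HapMap variants for the sample (on target)
    dbsnp:        set of variants catalogued in dbSNP
    """
    if not called or not hapmap_known:
        raise ValueError("called and hapmap_known must be non-empty")
    concordance = 100.0 * len(called & hapmap_known) / len(hapmap_known)
    in_dbsnp = 100.0 * len(called & dbsnp) / len(called)
    return concordance, in_dbsnp


# Toy data: four known HapMap variants, five calls.
hapmap_known = {("chr1", 100, "A"), ("chr1", 200, "C"),
                ("chr2", 300, "G"), ("chr2", 400, "T")}
called = {("chr1", 100, "A"), ("chr1", 200, "C"), ("chr2", 300, "G"),
          ("chr3", 500, "A"), ("chr3", 600, "C")}
dbsnp = hapmap_known | {("chr3", 500, "A")}

concordance, in_dbsnp = evaluate_calls(called, hapmap_known, dbsnp)
print(concordance, in_dbsnp)
```

A low first number points to missed variants (false negatives); a second number well below the roughly 90% expected for Europeans suggests an excess of false positive calls.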
Variant discovery and HapMap concordance for different sequencing types and varying amounts of sequence data. Columns include PE (high stringency). Rows reported: bases produced (Gbp); bases on target after duplicate removal (Gbp); HapMap variant concordance (%); variant concordance at >9× coverage (%).
This work demonstrates the practicality of genomic target enrichment using capture-sequencing in solution. For the first time, this technology is used at the scale of the whole exome, comprising over 36 Mbp across >170,000 individual targets. Using four technical replicate libraries, we show that capture performance is consistent, with the average coverage of each target having >98% correlation between technical replicates. Thus, it is practical to obtain consistent sequence coverage distributions and reproducible variant discovery for a variety of genomic screening experiments. This work also shows the feasibility of using either SOLiD or Illumina-based sequencing after capture. PE data were shown to be superior to frag data, both increasing the number of on-target reads and greatly improving the correct identification of duplicates. Development of PE sequencing on the SOLiD platform should show a similar effect. Interestingly, Illumina sequencing consistently shows higher levels of enrichment than SOLiD sequencing. This is unexpected because both sequencing platforms yield similar coverage distributions in whole genome sequencing data, that is, without enrichment. Further, the capture-sequencing protocols for both methods are almost identical; therefore, we suspect that differences in efficiency are due to an increase in initial library complexity arising from better annealing efficiencies of the Illumina adapter. This is probably explained by the fact that Illumina adapter sequences contain an A/T overhang, whereas the SOLiD adapters rely on less efficient blunt-end ligation. We are currently developing an A/T overhang-type SOLiD sequencing adapter for use in capture to test whether we can improve levels of enrichment for SOLiD-based capture.
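The replicate-to-replicate consistency reported above can be checked by correlating per-target mean coverage between two captures. The text does not state which correlation statistic was used; the sketch below assumes a Pearson correlation, and the coverage values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy per-target mean coverage for the same four targets in two replicates.
replicate_1 = [10, 20, 30, 40]
replicate_2 = [12, 19, 33, 41]

r = pearson(replicate_1, replicate_2)
print(round(r, 4))
```

For a real comparison each list would hold one value per capture target (over 170,000 entries), in the same target order for both replicates.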
Finally, we examine the amount of sequence data required to fully interrogate the single nucleotide variants of the exome in a HapMap sample using either SOLiD or Illumina sequencing. Using approximately 10 Gbp of SOLiD data, approximately 93% of all HapMap variants are discovered and over 88% of all variants are present in dbSNP. Illumina-based sequencing, however, discovers 96% of HapMap variants, with approximately 85% of variants in dbSNP, using only 3 Gbp of sequence data. This result is achieved because our Illumina protocol yields higher overall coverage on the target regions even while producing less raw sequencing data. SOLiD variant calling appears to be more sensitive at <9× coverage, typically obtaining 20% higher concordance. Overall performance between both platforms is similar, however, and the majority of the observed difference is likely due to differences in the variant discovery pipeline software.
Sequence capture in solution is easier to automate, has higher throughput and is less expensive than microarray-based techniques but has not been extensively used because of performance issues. Here we show the high reproducibility and scalability of our capture method and demonstrate that liquid capture can be used in large-scale experiments to yield reliably high levels of coverage that are consistent at a target-by-target level and are similar to microarray-based techniques. Further, we establish that the entire CCDS exome can be interrogated with just 2.8 Gbp of sequence data, approximately 3% of the data required for whole genome shotgun experiments. At this level of cost and scalability, solution capture-sequencing becomes an attractive technique for rare variant discovery in its own right and as follow-up to genome-wide association studies, especially in studies where much of the heritability of the disease remains unexplained and thus may be due to rare mutations. Capture-sequencing in solution reduces the cost and increases the throughput of rare mutation discovery by focusing on coding regions of the genome and will prove to be a significant addition to the geneticist's tool chest.
Materials and methods
All sequence data are available from the Short Read Archive with the following accession numbers [SRA012614 to SRA012615].
Sequence alignment and variant discovery
SOLiD sequence data were aligned using ABI's corona_lite package (version: 4.0r2.0) with a maximum allowed mismatch of 6; all other parameters were set at default. Pileup-style files were generated with samtools and were filtered to require a variant score of at least 40, or of at least 30 with the variant observed on both strands, and the variant to be present in at least 15% of all reads. Illumina data were aligned using BWA (v 0.5.3). The base quality was recalibrated using GATK (downloaded 2 October 2009). Variants were discovered with a minimum LOD of 5 (unless otherwise stated), and were filtered with the following recommended parameters: -X AlleleBalance:low = 0.25, high = 0.75 -X ClusteredSnps.
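The SOLiD filtering rule can be written as a small predicate. The sketch below assumes one particular reading of the sentence above: the score condition is satisfied either by a score of at least 40, or by a score of at least 30 together with support on both strands, and the 15% read-fraction requirement applies in both cases. The function name and argument names are hypothetical and this is not the authors' filtering script.

```python
def passes_solid_filter(score, on_both_strands, variant_reads, total_reads):
    """Apply the assumed reading of the SOLiD pileup filter.

    score:           variant score from the pileup-style file
    on_both_strands: True if the variant is seen on forward and reverse reads
    variant_reads:   number of reads supporting the variant
    total_reads:     total reads covering the position
    """
    if total_reads <= 0:
        return False
    score_ok = score >= 40 or (score >= 30 and on_both_strands)
    fraction_ok = variant_reads / total_reads >= 0.15
    return score_ok and fraction_ok


# Toy candidates: (score, both strands, variant reads, total reads)
candidates = [
    (45, False, 5, 20),   # high score, 25% of reads
    (35, True, 5, 20),    # moderate score rescued by two-strand support
    (35, False, 5, 20),   # moderate score on one strand only
    (45, True, 2, 20),    # high score but only 10% of reads
]
results = [passes_solid_filter(*c) for c in candidates]
print(results)
```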
Library and capture
The experimental procedures for preparation of pre- and post-capture libraries are described in Additional file 1 and are available on-line for the SOLiD and Illumina platforms. Briefly, 5 μg genomic DNA is sheared, end-repaired and ligated with either Illumina (frag or PE) platform-specific or SOLiD platform-specific adaptors. The library is amplified by pre-capture LM-PCR (linker mediated-PCR) and hybridized to NimbleGen SeqCap EZ Exome libraries. After washing, amplification by post-capture LM-PCR and a quantitative PCR-based quality check, the successfully captured DNA is ready for sequencing.
The CCDS (build 36.2) exome capture oligonucleotide pool was designed by targeting 174,984 exons of 16,008 high-confidence protein-coding genes in CCDS. Chromosomal coordinates were obtained from the UCSC genome browser (human build hg18). Target exons were padded to a minimum length of 80 bp, and consolidated to remove redundant overlaps. Coordinates for 528 human miRNA genes were obtained from miRBase (release 10), padded by 25 bp on each end, and likewise consolidated. In sum, the coding and miRNA targets comprised 36 Mb of non-redundant sequence, against which 1.9 million probes were selected on the genomic forward strand, with a median probe length of 75 bp and median start-to-start spacing of 34 bp. A rebalancing algorithm (described in Additional file 1) was used to improve uniformity of coverage across target exons. Probe pools were manufactured by Roche NimbleGen (Madison, WI, USA).
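The padding and consolidation steps described above amount to extending short intervals and then merging overlaps. The sketch below assumes half-open coordinates on a single chromosome and symmetric padding around each short exon; the text does not say how the padding was distributed, so that choice, the function names and the toy exons are all illustrative assumptions.

```python
def pad_to_min_length(start, end, min_len=80):
    """Extend a half-open interval [start, end) to at least min_len bases.

    Padding is split between both sides (an assumption) and clipped at 0.
    """
    length = end - start
    if length >= min_len:
        return start, end
    left = (min_len - length) // 2
    new_start = max(0, start - left)
    return new_start, new_start + min_len


def merge_overlaps(intervals):
    """Merge overlapping or abutting half-open intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


# Toy exons on one chromosome: a 30 bp exon close to a 160 bp exon, plus one more.
exons = [(1000, 1030), (1040, 1200), (5000, 5200)]
padded = [pad_to_min_length(s, e) for s, e in exons]
targets = merge_overlaps(padded)
total_bp = sum(e - s for s, e in targets)
print(targets, total_bp)
```

Here the 30 bp exon is padded to 80 bp, which makes it overlap its neighbour, and the two are consolidated into a single target so that no base is counted twice.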
CCDS: Consensus Coding Sequence; SNP: single nucleotide polymorphism.
The authors would like to thank Svasti Haricharan for editing the manuscript. This project was supported by Award Number U54HG003273 from the National Human Genome Research Institute.
- Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, Richmond TA, Middle CM, Rodesch MJ, Packard CJ, Weinstock GM, Gibbs RA: Direct selection of human genomic loci by microarray hybridization. Nat Methods. 2007, 4: 903-905. 10.1038/nmeth1111.
- Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, Middle CM, Rodesch MJ, Albert TJ, Hannon GJ, McCombie WR: Genome-wide in situ exon capture for selective resequencing. Nat Genet. 2007, 39: 1522-1527. 10.1038/ng.2007.42.
- Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J: Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009, 461: 272-276. 10.1038/nature08250.
- Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME: Microarray-based genomic selection for high-throughput resequencing. Nat Methods. 2007, 4: 907-909. 10.1038/nmeth1109.
- Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, Gabriel S, Jaffe DB, Lander ES, Nusbaum C: Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol. 2009, 27: 182-189. 10.1038/nbt.1523.
- Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, LeProust EM, Peck BJ, Emig CJ, Dahl F, Gao Y, Church GM, Shendure J: Multiplex amplification of large sets of human exons. Nat Methods. 2007, 4: 931-936. 10.1038/nmeth1110.
- Smith DR, Quinlan AR, Peckham HE, Makowsky K, Tao W, Woolf B, Shen L, Donahue WF, Tusneem N, Stromberg MP, Stewart DA, Zhang L, Ranade SS, Warner JB, Lee CC, Coleman BE, Zhang Z, McLaughlin SF, Malek JA, Sorenson JM, Blanchard AP, Chapman J, Hillman D, Chen F, Rokhsar DS, McKernan KJ, Jeffries TW, Marth GT, Richardson PM: Rapid whole-genome mutational profiling using next-generation sequencing technologies. Genome Res. 2008, 18: 1638-1642. 10.1101/gr.077776.108.
- Bennett S: Solexa Ltd. Pharmacogenomics. 2004, 5: 433-438. 10.1517/146224126.96.36.1993.
- Flicek P, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Eyre T, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Holland R, Howe KL, Howe K, Johnson N, Jenkinson A, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Megy K, et al: Ensembl 2008. Nucleic Acids Res. 2008, 36: D707-714. 10.1093/nar/gkm988.
- Consensus Coding Sequence. [http://www.ncbi.nlm.nih.gov/CCDS]
- Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloglu A, Ozen S, Sanjad S, Nelson-Williams C, Farhi A, Mane S, Lifton RP: Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci USA. 2009, 106: 19096-19101. 10.1073/pnas.0910672106.
- Lupski JR, Reid JG, Gonzaga-Jauregui C, Rio Deiros D, Chen DC, Nazareth L, Bainbridge M, Dinh H, Jing C, Wheeler DA, McGuire AL, Zhang F, Stankiewicz P, Halperin JJ, Yang C, Gehman C, Guo D, Irikat RK, Tom W, Fantin NJ, Muzny DM, Gibbs RA: Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. N Engl J Med. 2010, 362: 1181-1191. 10.1056/NEJMoa0908094.
- Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009, 106: 9362-9367. 10.1073/pnas.0903103106.
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM: Finding the missing heritability of complex diseases. Nature. 2009, 461: 747-753. 10.1038/nature08494.
- Samtools. [http://samtools.sourceforge.net]
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.
- GATK. [http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit]
- SOLiD Protocol. [http://www.hgsc.bcm.tmc.edu/documents/Preparation_of_SOLiD_Capture_Libraries.pdf]
- Illumina Protocol. [http://www.nimblegen.com/products/seqcap/ez.html]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.