Hybrid selection for sequencing pathogen genomes from clinical samples
- Alexandre Melnikov1,
- Kevin Galinsky1,
- Peter Rogov1,
- Timothy Fennell1,
- Daria Van Tyne2,
- Carsten Russ1,
- Rachel Daniels1,
- Kayla G Barnes2,
- James Bochicchio1,
- Daouda Ndiaye3,
- Papa D Sene3,
- Dyann F Wirth2,
- Chad Nusbaum1,
- Sarah K Volkman2,
- Bruce W Birren1,
- Andreas Gnirke1 and
- Daniel E Neafsey1
© Melnikov et al.; licensee BioMed Central Ltd. 2011
Received: 3 May 2011
Accepted: 11 August 2011
Published: 11 August 2011
We have adapted a solution hybrid selection protocol to enrich pathogen DNA in clinical samples dominated by human genetic material. Using mock mixtures of human and Plasmodium falciparum malaria parasite DNA as well as clinical samples from infected patients, we demonstrate an average of approximately 40-fold enrichment of parasite DNA after hybrid selection. This approach will enable efficient genome sequencing of pathogens from clinical samples, as well as sequencing of endosymbiotic organisms such as Wolbachia that live inside diverse metazoan phyla.
The falling cost of DNA sequencing means that sample quality, rather than expense, is now the blocking issue for many infectious disease genome sequencing projects. Pathogen genomes are generally very small relative to that of their human host, and are typically haploid in nature. Therefore, even a modest number of nucleated human cells present in infectious disease samples may result in the pathogen DNA representation being dwarfed relative to the host human DNA. This difference in representation poses a significant challenge to achieving adequate sequence coverage of the pathogen genome in a cost-effective manner. Separation of host and pathogen cells prior to DNA extraction can be difficult or inconvenient, particularly in field settings common to clinics in developing countries.
This barrier to the efficient sequencing of pathogen genomes comes at a time when the potential motivations and rewards for large-scale sequencing of pathogens are becoming increasingly clear. Examples abound to demonstrate how whole-genome analyses of pathogen population structure from large numbers of isolates can help to identify the source of disease outbreaks or hidden subpopulations. Whole genome sequencing of 35 Salmonella enterica samples was recently performed by the United States Food and Drug Administration in order to identify the source of a foodborne illness outbreak that affected approximately 300 individuals in 2009 and 2010 [1]. Whole genome sequencing of 20 isolates of pathogenic Coccidioides spp. fungi identified gene flow in select genomic regions between the recently diverged Coccidioides immitis and Coccidioides posadasii [2]. Whole genome sequencing and comparative SNP analysis of unculturable Mycobacterium leprae isolates were used to demonstrate that a third of leprosy infections in the United States derive from armadillos [3]. So-called 'third generation' sequencing was successfully employed to identify the origin of the recent Haitian cholera outbreak strain via de novo sequencing of 5 isolates and comparison of those sequences to 23 previously sequenced isolates of Vibrio cholerae [4]. In addition, the increasing use of genome-wide association studies to determine the genetic basis of important infectious disease phenotypes, such as drug resistance in malaria parasites [5, 6], will require sequencing or genotyping hundreds to thousands of pathogen isolates, making a shortage of quality specimens an acute problem. All of these studies could have been performed more expediently if a culturing step were not required to eliminate DNA derived from the host or environment.
Existing methods for dealing with DNA contamination in infectious disease samples typically require significant time, money, and/or special handling of samples at the time of collection. Taking Plasmodium falciparum as a model case, malaria parasite samples in blood may be adapted to in vitro culture and sustained in a pure medium of DNA-free human red blood cells. The adaptation process, however, can take more than 6 weeks, requires considerable expertise and expense [7], and may select for culturable variants. To remedy this, DNA-containing white blood cells may be depleted directly from malaria patient blood samples prior to cell lysis via differential density centrifugation or column filtration [8–10]. While depletion methodologies may reduce white cell abundance to levels useful for biochemical assays, the 100-fold disparity in genome size between human and malaria means that even a modest number of host cells can compromise a sample for genome sequencing. In addition, white blood cell depletion currently requires a significant volume of blood to be drawn from patients (approximately 5 ml), and the blood must then be stored at -70°C in a special medium to preserve cellular integrity. This could preclude sample collection for genome sequencing from many clinical trials due to protocol limitations or lack of equipment in the field. Furthermore, pathogens that infect or closely associate with nucleated host cells, such as Plasmodium vivax, Trypanosoma cruzi, or Chlamydia trachomatis, are not amenable to purification by white cell depletion. Endosymbionts such as Wolbachia, which influence host fertility and other traits in filarial worms, insect disease vectors, and diverse other taxa, may only be cultured in an intracellular system [11], precluding easy isolation of their genomic DNA for sequencing except by elaborate methods [12].
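The effect of a few nucleated host cells on parasite DNA representation can be illustrated with a rough calculation. The sketch below is not part of the published protocol. It assumes one haploid 23 Mb genome per parasite, a haploid human genome of about 3.1 Gb (an approximate, commonly quoted figure), diploid host cells, and DNA mass proportional to base pairs. The function name and the cell counts are hypothetical.

```python
# Illustrative only: genome sizes are approximate and the cell counts are made up.
PARASITE_GENOME_BP = 23e6        # haploid P. falciparum genome (about 23 Mb)
HUMAN_HAPLOID_GENOME_BP = 3.1e9  # assumed approximate haploid human genome size
HUMAN_PLOIDY = 2                 # nucleated human cells are diploid


def parasite_dna_fraction(n_parasites: float, n_human_nucleated_cells: float) -> float:
    """Fraction of total DNA mass contributed by the parasite.

    DNA mass is taken as proportional to base pairs per cell times cell count.
    """
    parasite_bp = n_parasites * PARASITE_GENOME_BP
    human_bp = n_human_nucleated_cells * HUMAN_PLOIDY * HUMAN_HAPLOID_GENOME_BP
    total = parasite_bp + human_bp
    if total == 0:
        raise ValueError("at least one cell is required")
    return parasite_bp / total


if __name__ == "__main__":
    # Hypothetical parasite-to-white-cell ratios.
    for parasites_per_white_cell in (1, 10, 100, 1000):
        frac = parasite_dna_fraction(parasites_per_white_cell, 1)
        print(f"{parasites_per_white_cell:>5} parasites per white cell: "
              f"{100 * frac:.2f}% parasite DNA")
```

Under these assumptions a sample needs a few hundred parasites per white cell before parasite DNA makes up half of the total, which is why incomplete white cell depletion can still leave a sample dominated by host DNA.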
To address this problem we have adapted a solution hybrid selection approach originally developed for the purification of resequencing targets in the human genome [13]. In brief, biotinylated RNA probes complementary to the pathogen genome ('baits') are hybridized to pathogen DNA in solution and pulled down with magnetic streptavidin-coated beads. Host DNA is washed away and the captured pathogen DNA may then be eluted and amplified for sequencing or genotyping. We experimented with two approaches to bait design: synthetic 140-bp oligos targeting specific regions of the P. falciparum 3D7 reference genome assembly and 'whole genome baits' (WGBs) generated from pure P. falciparum DNA. Using this protocol, we achieved significant enrichment of P. falciparum DNA, to a level that allowed us to conduct whole genome sequencing on samples that otherwise would have been prohibitively expensive to sequence.
Results and discussion
Hybrid selection on a mock clinical malaria sample
[Table: Quantitative PCR enrichment measurements from 12 clinical samples. Columns include parasite DNA concentration (pg/μl); rows include Th231.08 (round 1) and Th231.08 (round 2).]
Sequencing of the hybrid-selected samples revealed a significant increase in representation of Plasmodium DNA in every case. The synthetic baits respectively yielded an average of 41-fold and 44-fold parasite DNA enrichment for unamplified and WGA simulated clinical samples in genomic regions targeted by the baits, as measured by qPCR. Whole genome baits yielded parasite genome-wide average enrichment levels of 37-fold and 40-fold for the unamplified and WGA input samples, respectively.
[Table: Quantitative PCR enrichment measurements at wash stringencies of 0.10 ×, 0.25 ×, 0.50 × and 0.75 × SSC.]
In summary, both bait strategies performed effectively and now offer investigators a method to sequence either targeted regions or complete genomes of pathogens in clinical samples dominated by host DNA. Pairing this hybrid selection protocol with WGA further expands the range of clinical samples now eligible for efficient pathogen genome sequencing. For example, for Plasmodium it should now be possible to sequence the parasite genome directly from dried blood spots on filter paper, an easily collectable and storable sample format.
Hybrid selection on authentic clinical samples
We conducted a second round of hybrid selection on the Th231.08 clinical sample to determine whether the Plasmodium DNA titer in the sample could be boosted above approximately 7%. The second round of hybrid selection was carried out under identical hybridization and wash conditions. qPCR analysis indicates this yielded a sample in which 47.5% of the genetic material was Plasmodium by mass (a 6.7-fold enrichment). This lower fold enrichment is consistent with our previous observation that fold enrichment is inversely proportional to initial parasite DNA titer, but in this case an additional round of hybrid selection yields a sample even more amenable to cost-efficient and deep sequencing.
Although sequencing has become considerably less expensive in recent years, it remains financially impractical to sequence pathogen genomes from clinical samples at scale due to the gross excess of host DNA typically present. The simplest way to compensate for host DNA contamination is to augment sequencing coverage depth. However, this strategy can be costly for all but the most lightly contaminated samples. In contrast, the cost of purification by hybrid selection using whole genome baits is approximately US$250, which is roughly equivalent to the current cost of generating 20-fold coverage of the 23 Mb P. falciparum genome from pure template using a fraction of an Illumina HiSeq lane. For augmented coverage to be an affordable strategy relative to hybrid selection for a target coverage level of 40 × in a genome of this size, samples must contain at least 50% pathogen DNA. This titer of parasite DNA is rarely found in clinical samples unless white cell depletion is performed prior to DNA extraction. For a more typical clinical sample containing only 1% P. falciparum DNA, hybrid selection resulting in 40-fold enrichment enables 40 × coverage depth for a dramatically lower total price (approximately $1,000) than deeper sequencing of the unpurified sample (approximately $40,000).
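The cost comparison above can be approximated with a simple linear model. The sketch below is an illustration, not the authors' costing. It takes the US$250 hybrid selection cost and the US$250 per 20-fold coverage figure from the text, and assumes that sequencing cost scales inversely with the parasite DNA fraction and that enrichment multiplies that fraction up to a ceiling of 100%. The function names are hypothetical.

```python
# Simplified cost model; prices come from the text, the linear scaling is an assumption.
COST_PER_FOLD_PURE = 250 / 20   # US$ per 1x coverage of the 23 Mb genome from pure template
HYBRID_SELECTION_COST = 250.0   # US$ per sample using whole genome baits


def direct_cost(target_coverage: float, parasite_fraction: float) -> float:
    """Cost of reaching the target parasite coverage by sequencing deeper."""
    if not 0 < parasite_fraction <= 1:
        raise ValueError("parasite_fraction must be in (0, 1]")
    return COST_PER_FOLD_PURE * target_coverage / parasite_fraction


def hybrid_selection_cost(target_coverage: float, parasite_fraction: float,
                          fold_enrichment: float) -> float:
    """Cost of hybrid selection plus sequencing of the enriched sample."""
    # Assumption: enrichment multiplies the parasite fraction, capped at 100%.
    enriched_fraction = min(1.0, parasite_fraction * fold_enrichment)
    return HYBRID_SELECTION_COST + direct_cost(target_coverage, enriched_fraction)


if __name__ == "__main__":
    for fraction in (0.01, 0.07, 0.50):
        print(f"{fraction:.0%} parasite DNA: "
              f"direct ${direct_cost(40, fraction):,.0f}, "
              f"hybrid selection ${hybrid_selection_cost(40, fraction, 40):,.0f}")
```

For a 1% sample and a 40 × target, this model gives about $50,000 for direct sequencing and about $1,500 with 40-fold enrichment. These values are of the same order as the approximate figures quoted above but do not match them exactly, since the model ignores lane granularity and any other pricing details behind the published estimates.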
The modest cost and high performance of this hybrid selection purification protocol will facilitate sequencing of archival clinical samples of malaria parasites and other pathogens previously considered unfit for sequencing by any methodology. This may enable sequencing of important samples stored on filter papers or diagnostic slides predating the spread of drug resistance or associated with historic outbreaks. This purification protocol also broadens the accessibility of sequencing for clinical samples of infectious organisms for which in vitro culture is possible but costly or inconvenient, such as class IV 'select agents' recognized by the Centers for Disease Control and Prevention. This protocol is not limited to pathogens, and should be equally useful in sequencing commensal or symbiotic organisms closely associated with their host, such as intracellular Wolbachia bacteria, as was recently demonstrated by Kent et al. in their application of an array-based capture protocol [15]. The reduction in sample quality and quantity requirements permitted by hybrid selection will simplify protocol design in future large-scale clinical studies and help realize the benefits of inexpensive, massively parallel sequencing technologies for studying infectious diseases in diverse contexts.
Materials and methods
Mock clinical samples were generated by mixing Homo sapiens NA15510 DNA with a pure preparation of P. falciparum 3D7 parasite DNA at a ratio of 99:1 (H. sapiens:P. falciparum) by mass. Samples were fluorescently quantified prior to mixing using a PicoGreen assay [16]. Authentic clinical samples were collected in 2008 from symptomatic patients at a clinic in Thies, Senegal under an approved institutional review board protocol. Samples consisted of whole blood dried and stored on a Whatman FTA card (fast technology for analysis of nucleic acids) and/or frozen whole blood stored in glycerolyte 57 solution. DNA was extracted using a DNeasy kit (Qiagen, Hilden, North Rhine-Westphalia, Germany). Whole frozen blood samples yielded sufficient DNA for hybrid selection, but samples from FTA cards typically yielded less than 100 ng of DNA and required WGA. WGA was performed using the Repli-G kit (Qiagen).
Bait design and preparation
Synthetic 140-bp oligos were obtained from Agilent and designed to capture exonic regions of the P. falciparum genome as defined in the 3D7 v.5.0 reference assembly. The final bait set included 24,246 oligos (3.4 Mb) with unique BLAT matches to the P. falciparum 3D7 reference genome assembly and no homology to the human genome. Baits and locations are listed in Additional file 2. To generate synthetic single-stranded biotinylated RNA bait, in vitro transcription was performed with biotin-labeled UTP using the MEGAshortscript T7 kit (Ambion, Austin, Texas, United States) as described previously [13].
WGB was generated at the Broad Institute. For input, 3 μg of P. falciparum 3D7 DNA was sheared for 4 minutes on a Covaris E210 instrument set to duty cycle 5, intensity 5 and 200 cycles per burst. The mode of the resulting fragment size distribution was 250 bp. End repair, addition of a 3'-A, adapter ligation and reaction clean-up followed Illumina's genomic DNA sample preparation kit protocol, except that the adapter consisted of oligonucleotides 5'-TGTAACATCACAGCATCACCGCCATCAGTCxT-3' ('x' refers to an exonuclease I-resistant phosphorothioate linkage) and 5'-[PHOS]GACTGATGGCGCACTACGACACTACAATGT-3'. The ligation products were cleaned up (Qiagen) and amplified by 8 to 12 cycles of PCR on an ABI GeneAmp 9700 thermocycler in Phusion High-Fidelity PCR master mix with HF buffer (NEB, Ipswich, Massachusetts, United States) using PCR forward primer 5'-CGCTCAGCGGCCGCAGCATCACCGCCATCAGT-3' and reverse primer 5'-CGCTCAGCGGCCGCGTCGTAGTGCGCCATCAGT-3' (ABI, Carlsbad, California, United States). Initial denaturation was 30 s at 98°C. Each cycle was 10 s at 98°C, 30 s at 50°C and 30 s at 68°C. PCR products were size-selected on a 4% NuSieve 3:1 agarose gel followed by QIAquick gel extraction. To add a T7 promoter, size-selected PCR products were re-amplified as above using the forward primer 5'-GGATTCTAATACGACTCACTATACGCTCAGCGGCCGCAGCATCACCGCCATCAGT-3'. Qiagen-purified PCR product was used as template for whole genome biotinylated RNA bait preparation with the MEGAshortscript T7 kit (Ambion) [13].
Hybrid selection using either synthetic bait or WGB was carried out as described previously [13]. Hybridization was conducted at 65°C for 66 h with 2 μg of 'pond' libraries carrying standard or indexed Illumina paired-end adapter sequences and 500 ng of bait in a volume of 30 μl. After hybridization, captured DNA was pulled down using streptavidin Dynabeads (Invitrogen, Carlsbad, California, United States). Beads were washed once at room temperature for 15 minutes with 0.5 ml 1 × SSC/0.1% SDS, followed by three 10-minute washes at 65°C with 0.5 ml pre-warmed 0.1 × SSC/0.1% SDS, re-suspending the beads once at each washing step. Hybrid-selected DNA was eluted with 50 μl 0.1 M NaOH. After 10 minutes at room temperature, the beads were pulled down, the supernatant was transferred to a tube containing 70 μl of 1 M Tris-HCl, pH 7.5, and the neutralized DNA was desalted and concentrated on a QIAquick MinElute column and eluted in 20 μl.
Quantitative PCR enrichment measurement
Enrichment of malaria DNA in samples was assessed using a panel of malaria qPCR primers designed to conserved regions of the P. falciparum 3D7 v.5.0 reference genome. Enrichment for each amplicon was calculated as the ratio of the amount of target DNA present after hybrid selection to the amount present before, with threshold cycle (cT) counts corrected for qPCR efficiency using a standard curve for each amplicon. All qPCR reactions utilized 1 μl of template containing 1 ng of total DNA. Estimated enrichment for each sample was calculated as the mean enrichment observed across all tested amplicons. Primer sequences and locations are listed in Additional file 3. Quantification of human DNA in the clinical samples was performed prior to sequencing using the Taqman RNase P Detection Reagents kit (Applied Biosystems, Carlsbad, California, United States).
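The enrichment calculation described above can be sketched as follows. This is a minimal illustration rather than the authors' analysis code. It assumes each amplicon has a linear standard curve of the form cT = slope × log10(quantity) + intercept, and all function names and numeric values are hypothetical.

```python
from statistics import mean


def quantity_from_ct(ct: float, slope: float, intercept: float) -> float:
    """Invert a standard curve of the form cT = slope * log10(quantity) + intercept."""
    return 10 ** ((ct - intercept) / slope)


def amplicon_enrichment(ct_pre: float, ct_post: float,
                        slope: float, intercept: float) -> float:
    """Fold enrichment for one amplicon: quantity after selection over quantity before."""
    pre = quantity_from_ct(ct_pre, slope, intercept)
    post = quantity_from_ct(ct_post, slope, intercept)
    return post / pre


def sample_enrichment(amplicons: list) -> float:
    """Mean fold enrichment across all tested amplicons of a sample."""
    return mean(amplicon_enrichment(**a) for a in amplicons)


if __name__ == "__main__":
    # Hypothetical measurements; a slope near -3.32 corresponds to 100% PCR efficiency.
    panel = [
        {"ct_pre": 25.0, "ct_post": 20.0, "slope": -3.32, "intercept": 30.0},
        {"ct_pre": 26.1, "ct_post": 20.9, "slope": -3.45, "intercept": 31.2},
    ]
    print(f"Estimated enrichment: {sample_enrichment(panel):.1f}-fold")
```

Because both reactions use the same amount of total template (1 ng), the ratio of target quantities reflects the change in the parasite fraction of the sample. Using a per-amplicon slope rather than assuming perfect doubling is what the efficiency correction amounts to in this sketch.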
Each sample was sequenced at the Broad Institute using one lane of Illumina 76-bp paired-end reads. The libraries of pure P. falciparum DNA and hybrid-selected artificial clinical samples were each sequenced with one Illumina GAIIx lane. The hybrid-selected authentic clinical sample (Th231.08) was sequenced with one Illumina HiSeq lane. Sequence data have been deposited in the NCBI Short Read Archive under accession number [SRA029706].
Quality scores on Illumina reads were rescaled using the MAQ sol2sanger utility [17]. Reads were then aligned to P. falciparum 3D7 (PlasmoDB 5.0) using BWA [18]. Sequenced reads were sorted and the consensus sequence was determined using the SAMtools utilities [19]. %GC was calculated from 140-bp windows across the P. falciparum genome.
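The windowed %GC calculation can be illustrated with a short sketch. The text does not say whether the 140-bp windows overlapped. The version below assumes non-overlapping windows, drops a trailing partial window, and counts ambiguous bases in the denominator. The function name is hypothetical.

```python
def gc_percent_windows(sequence: str, window: int = 140) -> list:
    """Return (start, %GC) for consecutive non-overlapping windows.

    A trailing partial window is dropped. Bases other than G or C,
    including N, count toward the window length.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, 100.0 * gc / window))
    return results


if __name__ == "__main__":
    # Toy sequence: one GC-rich window followed by one AT-rich window.
    toy = "GC" * 70 + "AT" * 70
    for start, gc in gc_percent_windows(toy):
        print(f"window starting at {start}: {gc:.1f}% GC")
```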
The human:P. falciparum DNA ratio in each sequence dataset was estimated from sequencing data by randomly sampling 50,000 pairs of mated reads and measuring the fractions that uniquely mapped to the human versus the P. falciparum reference genome assembly.
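A minimal sketch of this sampling estimate is shown below. It assumes each read pair has already been assigned a label by an upstream alignment step: "human" or "pf" for pairs that map uniquely to one reference, and anything else for unmapped or ambiguous pairs. The labels, the function name and the seed are hypothetical.

```python
import random
from collections import Counter


def estimate_fractions(pair_labels: list, sample_size: int = 50_000,
                       seed: int = 0) -> dict:
    """Estimate human and P. falciparum fractions from a random sample of read pairs.

    Only pairs labeled "human" or "pf" (uniquely mapped) are counted;
    all other labels are ignored.
    """
    rng = random.Random(seed)
    n = min(sample_size, len(pair_labels))
    sampled = rng.sample(pair_labels, n)  # sampling without replacement
    counts = Counter(label for label in sampled if label in ("human", "pf"))
    informative = counts["human"] + counts["pf"]
    if informative == 0:
        raise ValueError("no uniquely mapped pairs in the sample")
    return {
        "human": counts["human"] / informative,
        "pf": counts["pf"] / informative,
    }


if __name__ == "__main__":
    # Toy dataset: 93% human, 6% parasite, 1% unmapped.
    labels = ["human"] * 93_000 + ["pf"] * 6_000 + [None] * 1_000
    print(estimate_fractions(labels))
```

The fractions here are by read pair count, which approximates the mass ratio only if fragments from both genomes are sequenced and mapped with similar efficiency.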
Simulated sequencing read coverage for the mock clinical sample prior to hybrid selection was generated by randomly sampling 1% of the read data produced for the pure P. falciparum sample, under the tested assumption that read coverage scales closely with parasite DNA fraction.
Principal components analysis was performed using Eigensoft software [20] on 8,300 non-singleton SNPs with coverage of at least 10-fold in all strains and consensus quality scores of at least 30.
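The SNP filter described above might look like the following sketch. It is an interpretation, not the authors' pipeline. It treats 'non-singleton' as meaning that the second most common allele is seen in at least two strains, and it applies the coverage and quality thresholds to every strain. The data layout and function name are hypothetical.

```python
from collections import Counter


def keep_snp(calls: list, min_coverage: int = 10, min_quality: int = 30) -> bool:
    """Decide whether a SNP site passes the filters.

    calls: one (allele, coverage, consensus_quality) tuple per strain.
    """
    # Every strain must meet the coverage and consensus quality thresholds.
    if any(cov < min_coverage or qual < min_quality for _, cov, qual in calls):
        return False
    allele_counts = Counter(allele for allele, _, _ in calls)
    if len(allele_counts) < 2:
        return False  # site is not variable among these strains
    # Non-singleton: the second most common allele occurs in at least two strains.
    second_most_common = sorted(allele_counts.values(), reverse=True)[1]
    return second_most_common >= 2


if __name__ == "__main__":
    site = [("A", 12, 40), ("A", 15, 35), ("T", 11, 31), ("T", 20, 50)]
    print(keep_snp(site))
```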
Abbreviations
qPCR: quantitative polymerase chain reaction
SNP: single nucleotide polymorphism
WGA: whole genome amplification
WGB: whole genome bait.
Acknowledgements
This project has been funded in part with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under contract number HHSN27220090018C. Funding was also supplied by a Global Health Program grant (number 49764) from the Bill and Melinda Gates Foundation and a grant from the National Human Genome Research Institute (number HG03067-05). We thank the Broad sequencing platform for sequence data generation.
References
- Lienau EK, Strain E, Wang C, Zheng J, Ottesen AR, Keys CE, Hammack TS, Musser SM, Brown EW, Allard MW, Cao G, Meng J, Stones R: Identification of a salmonellosis outbreak by means of molecular sequencing. N Engl J Med. 2011, 364: 981-982. 10.1056/NEJMc1100443.
- Neafsey DE, Barker BM, Sharpton TJ, Stajich JE, Park DJ, Whiston E, Hung C-Y, McMahan C, White J, Sykes S, Heiman D, Young S, Zeng Q, Abouelleil A, Aftuck L, Bessette D, Brown A, FitzGerald M, Lui A, Macdonald JP, Priest M, Orbach MJ, Galgiani JN, Kirkland TN, Cole GT, Birren BW, Henn MR, Taylor JW, Rounsley SD: Population genomic sequencing of Coccidioides fungi reveals recent hybridization and transposon control. Genome Res. 2010, 20: 938-946. 10.1101/gr.103911.109.
- Truman RW, Singh P, Sharma R, Busso P, Rougemont J, Paniz-Mondolfi A, Kapopoulou A, Brisse S, Scollard DM, Gillis TP, Cole ST: Probable zoonotic leprosy in the southern United States. N Engl J Med. 2011, 364: 1626-1633. 10.1056/NEJMoa1010536.
- Chin C-S, Sorenson J, Harris JB, Robins WP, Charles RC, Jean-Charles RR, Bullard J, Webster DR, Kasarskis A, Peluso P, Paxinos EE, Yamaichi Y, Calderwood SB, Mekalanos JJ, Schadt EE, Waldor MK: The origin of the Haitian cholera outbreak strain. N Engl J Med. 2011, 364: 33-42. 10.1056/NEJMoa1012928.
- Mu J, Myers RA, Jiang H, Liu S, Ricklefs S, Waisberg M, Chotivanich K, Wilairatana P, Krudsood S, White NJ, Udomsangpetch R, Cui L, Ho M, Ou F, Li H, Song J, Li G, Wang X, Seila S, Sokunthea S, Socheat D, Sturdevant DE, Porcella SF, Fairhurst RM, Wellems TE, Awadalla P, Su X-zhuan: Plasmodium falciparum genome-wide scans for positive selection, recombination hot spots and resistance to antimalarial drugs. Nat Genet. 2010, 42: 268-271. 10.1038/ng.528.
- Van Tyne D, Park DJ, Schaffner SF, Neafsey DE, Angelino E, Cortese JF, Barnes KG, Rosen DM, Lukens AK, Daniels RF, Milner DA, Johnson CA, Shlyakhter I, Grossman SR, Becker JS, Yamins D, Karlsson EK, Ndiaye D, Sarr O, Mboup S, Happi C, Furlotte NA, Eskin E, Kang HM, Hartl DL, Birren BW, Wiegand RC, Lander ES, Wirth DF, Volkman SK, Sabeti PC: Identification and functional validation of the novel antimalarial resistance locus PF10_0355 in Plasmodium falciparum. PLoS Genet. 2011, 7: e1001383. 10.1371/journal.pgen.1001383.
- Trager W, Jensen JB: Human malaria parasites in continuous culture. Science. 1976, 193: 673-675. 10.1126/science.781840.
- Mons B, Boorsma EG, Ramesar J, Janse CJ: Removal of leucocytes from malaria-infected blood using commercially available filters. Ann Trop Med Parasitol. 1988, 82: 621-623.
- Williamson J, Cover B: Removal of white blood cells from gametocyte-, schizont-, trophozoite- and ring stages of Plasmodium falciparum. Trans R Soc Trop Med Hyg. 1971, 65: 416.
- Richards WH, Williams SG: The removal of leucocytes from malaria infected blood. Ann Trop Med Parasitol. 1973, 67: 249-250.
- O'Neill SL, Pettigrew MM, Sinkins SP, Braig HR, Andreadis TG, Tesh RB: In vitro cultivation of Wolbachia pipientis in an Aedes albopictus cell line. Insect Mol Biol. 1997, 6: 33-39. 10.1046/j.1365-2583.1997.00157.x.
- Rasgon JL, Gamston CE, Ren X: Survival of Wolbachia pipientis in cell-free medium. Appl Environ Microbiol. 2006, 72: 6934-6937. 10.1128/AEM.01673-06.
- Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, Gabriel S, Jaffe DB, Lander ES, Nusbaum C: Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol. 2009, 27: 182-189. 10.1038/nbt.1523.
- Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan M-S, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DMA, et al: Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002, 419: 498-511. 10.1038/nature01097.
- Kent BN, Salichos L, Gibbons JG, Rokas A, Newton IL, Clark ME, Bordenstein SR: Complete bacteriophage transfer in a bacterial endosymbiont (Wolbachia) determined by targeted genome capture. Genome Biol Evol. 2011, 3: 209-218. 10.1093/gbe/evr007.
- Singer VL, Jones LJ, Yue ST, Haugland RP: Characterization of PicoGreen reagent and development of a fluorescence-based solution assay for double-stranded DNA quantitation. Anal Biochem. 1997, 249: 228-238. 10.1006/abio.1997.2177.
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008, 18: 1851-1858. 10.1101/gr.078212.108.
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.
- Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet. 2006, 2: e190. 10.1371/journal.pgen.0020190.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.