Effective detection of rare variants in pooled DNA samples using cross-pool tail-curve analysis
- Tejasvi S Niranjan†1,2
- Abby Adamczyk†1
- Hector Corrada Bravo†3,4
- Margaret A Taub5
- Sarah J Wheelan5,6
- Rafael Irizarry5
- Tao Wang1
© Niranjan et al; licensee BioMed Central Ltd. 2011
Published: 19 September 2011
Rare genetic variants of large effect may confer a substantial genetic risk for common diseases and complex traits. There is considerable interest in sequencing limited genomic regions, such as candidate genes and target regions identified by genetic linkage and/or association studies. Next-generation sequencing of pooled DNA samples is an efficient way to identify rare variants in large sample sets. Although sample pooling reduces the labor and cost of sequencing, it also reduces the sensitivity and specificity with which rare variants can be identified, and the available computational genomics tools have not resolved these problems.
We have developed an effective Illumina-based sequencing strategy using pooled samples and have optimized a novel base-calling algorithm, Srfim, and a variant-calling algorithm, SERVIC4E (Sensitive Rare Variant Identification by Cross-pool Cluster, Continuity & Tail-Curve Evaluation). SERVIC4E analyzes base composition by cycle, or tail-curves, across sample pools and employs multiple filtering strategies, including quality and continuity cluster analysis, average quality filtering, tail-curve filtering and error proximity filtering, to accurately identify rare sequence variants.
We validated these algorithms using two independent Illumina sequence datasets generated from different pool sizes, read lengths and sequencing chemistries. Using these programs, we identified 32 coding variants, including 14 present only once, across 24 exon-containing regions in one sample cohort (n = 480), and 41 coding variants, including 16 present only once, in the same regions in an unrelated cohort (n = 480). Validation of these variants by Sanger sequencing revealed an excellent combination of sensitivity (97.8% and 96.4%) and specificity (84.9% and 93.8%) for variant detection in pooled samples from the two cohorts, respectively.
Data from these studies showed that our algorithms compare favorably with the available programs, including SAMtools, SNPSeeker, CRISP and Syzygy, for the effective and reliable detection of rare variants in pooled samples.
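To make the validation figures above concrete, the following minimal Python sketch shows how sensitivity and specificity are conventionally computed from Sanger-validation counts, using the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The function name `sensitivity_specificity` and the example counts are hypothetical; they are not taken from this study and serve only to illustrate the arithmetic.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from validation counts.

    tp: variant calls confirmed by Sanger sequencing (true positives)
    fn: real variants the caller missed (false negatives)
    tn: positions correctly reported as non-variant (true negatives)
    fp: variant calls not confirmed by Sanger sequencing (false positives)
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true variant and one true non-variant")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts for illustration only (NOT data from this article).
sens, spec = sensitivity_specificity(tp=9, fn=1, tn=8, fp=2)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

In a pooled design, a false negative is a variant carried by some individual in a pool that the caller fails to report, so sensitivity for variants present only once is the most demanding test of a rare-variant caller.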
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.