Effective detection of rare variants in pooled DNA samples using Cross-pool tailcurve analysis
- Tejasvi S Niranjan†1, 2,
- Abby Adamczyk†1,
- Héctor Corrada Bravo†3, 4,
- Margaret A Taub5,
- Sarah J Wheelan5, 6,
- Rafael Irizarry5 and
- Tao Wang1Email author
© Niranjan et al.; licensee BioMed Central Ltd. 2011
Received: 20 March 2011
Accepted: 28 September 2011
Published: 28 September 2011
Sequencing targeted DNA regions in large samples is necessary to discover the full spectrum of rare variants. We report an effective Illumina sequencing strategy utilizing pooled samples with novel quality (Srfim) and filtering (SERVIC 4 E) algorithms. We sequenced 24 exons in two cohorts of 480 samples each, identifying 47 coding variants, including 30 present once per cohort. Validation by Sanger sequencing revealed an excellent combination of sensitivity and specificity for variant detection in pooled samples of both cohorts as compared to publicly available algorithms.
Next-generation sequencing and computational genomic tools permit rapid, deep sequencing for hundreds to thousands of samples [1–3]. Recently, rare variants of large effect have been recognized as conferring substantial risks for common diseases and complex traits in humans . There is considerable interest in sequencing limited genomic regions such as sets of candidate genes and target regions identified by linkage and/or association studies. Sequencing large sample cohorts is essential to discover the full spectrum of genetic variants and provide sufficient power to detect differences in the allele frequencies between cases and controls. However, several technical and analytical challenges must be resolved to efficiently apply next-generation sequencing to large samples in individual laboratories. First, it remains expensive to sequence a large number of samples despite a substantial cost reduction in available technologies. Second, for target regions of tens to hundreds of kilobases or less for a single DNA sample, the smallest functional unit of a next-generation sequencer (for example, a single lane of an Illumina Genomic Analyzer II (GAII) or HiSeq2000 flow cell) generates a wasteful excess of coverage. Third, methods for individually indexing hundreds to thousands of samples are challenging to develop and limited in efficacy [5, 6]. Fourth, generating sequence templates for target DNA regions in large numbers of samples is laborious and costly. Fifth, while pooling samples can reduce both labor and costs, it reduces sensitivity for the identification of rare variants using currently available next-generation sequencing strategies and bioinformatics tools [1, 3].
We have optimized a flexible and efficient strategy that combines a PCR-based amplicon ligation method for template enrichment, sample pooling, and library indexing in conjunction with novel quality and filtering algorithms for identification of rare variants in large sample cohorts. For validation of this strategy, we present data from sequencing 12 indexed libraries of 40 samples each (total of 480 samples) using a single lane of a GAII Illumina Sequencer. We utilized an alternative base-calling algorithm, Srfim , and an automated filtering program, SERVIC 4 E (Sensitive Rare Variant Identification by Cross-pool Cluster, Continuity, and tailCurve Evaluation), designed for sensitive and reliable detection of rare variants in pooled samples. We validated this strategy using Illumina sequencing data from an additional independent cohort of 480 samples. Compared to publicly available software, this strategy achieved an excellent combination of sensitivity and specificity for rare variant detection in pooled samples through a substantial reduction of false positive and false negative variant calls that often confound next-generation sequencing. We anticipate that our pooling strategy and filtering algorithms can be easily adapted to other popular platforms of template enrichment, such as microarray capture and liquid hybridization [8, 9].
Results and discussion
An optimized sample-pooling strategy
Data analysis and variant calling
Quality assessment and base calling using SRFIM
To overcome this problem, we utilized Srfim, a quality assessment and base-calling algorithm based on a statistical model of fluorescence intensity measurements that captures the technical effects leading to base-calling biases . Srfim explicitly models cycle-dependent effects to create read-specific estimates that yield a probability of nucleotide identity for each position along the read. The algorithm identifies nucleotides with highest probability as the final base call, and uses these probabilities to define highly discriminatory quality metrics. Srfim increased the total number of mapped reads by 1% (to 11.2 million), reflecting improved base-calling and quality metrics, and reduced the number of variant calls by 20% (308 variants across 12 pools; 33 variant calls present in only a single pool).
Cross-pool filtering using SERVIC 4 E
Further validation by Sanger sequencing indicated the persistence of a few false positive calls from this dataset. Analysis of these variant calls allowed us to define statistics that capture regularities in the base calls and quality values at false positive positions compared to true variant positions. We developed SERVIC 4 E, an automated filtering algorithm designed for high sensitivity and reliable detection of rare variants using these statistics.
SERVIC 4 E uses a clustering analysis of continuity and weighted allele frequency to filter variant calls between pools. We use k-medioid clustering and decide the number of clusters using average silhouette width . For common variants, negative pools tend to cluster and are filtered out while all other pools are retained as positives (Figure 7a, b). Rare variant pools, due to their lower allele frequency, will have a narrower range in continuity and weighted allele frequency. Negative pools will appear to cluster less, while positive pools cluster more. SERVIC 4 E will retain as positive only the cluster with highest continuity and weighted allele frequency (Figure 7c, d).
The third filtering step used by SERVIC 4 E captures persistent cycle-dependent errors in variant tailcurves that are not eliminated by Srfim. Cycle-specific nucleotide proportions (tailcurves) from calls in the first half of sequencing cycles are compared to the proportions from calls in the second half of sequencing cycles. The ratio of nucleotide proportions between both halves of cycles is calculated separately for plus and minus strands, thereby providing the tailcurve ratio added sensitivity to strand biases. By default, variant calls are filtered out if the tailcurve ratio differs more than ten-fold; we do not anticipate that this default will need adjustment with future sequencing applications, as it is already fairly generous, chiefly eliminating variant pools with clearly erroneous tailcurve ratios. This default was used for all our datasets.
The combination of filtering by average quality and tailcurve structure eliminates a large number of false variant calls. Additional file 3 demonstrates the effect of these filtering steps applied sequentially on two sets of base call data.
Effect of sequential filtering by SERVIC 4 E on variant output
Number of variant-pools after filtering
Prior to cluster analysis
After cluster analysis
After average quality filtering
After tailcurve filtering
Finally, SERVIC 4 E provides a trim parameter that masks a defined length of sequence from the extremes of target regions from variant calling. This allows for SERVIC 4 E to ignore spurious variant calling that may occur in primer regions as a result of the concatenation of amplicons. By default, this parameter is set to 0; for our datasets, we used a trim value of 25, which is the approximate length of our primers.
Reliable detection of rare variants in pooled samples
Using SERVIC 4 E, we identified 68 unique variants (total of 333 among 12 pools), of which 34 were exonic variants in our first dataset of 480 samples (Additional file 4). For validation, we performed Sanger sequencing for all exonic variants in individual samples in at least one pool. A total of 4,050 medium/high-quality Sanger traces were generated, targeting approximately 3,380 individual amplicons. Total coverage in the entire study by Sanger sequencing was approximately 930 kb (approximately 7.3% of total coverage obtained by high-throughput sequencing). Sanger sequencing confirmed 31 of the 34 variants. Fifteen rare exonic variants were identified as heterozygous in a single sample in the entire cohort.
A comparison with available variant calling algorithms
Validation analysis of variant calling from first cohort samples
SERVIC 4 E
SERVIC 4 E
Variant identification and validation
Statistical analysis (%)
Rare exonic variant detection and validation
Detected total variants (n = 15)
Detection rate (%)
To call variants with SAMtools , we used the deprecated Maq algorithms (SAMtools pileup -A -N 80), as the regular SAMtools algorithms failed to identify all but the most common variants. As a filtering cutoff we retained only the top 95th percentile of variants by consensus quality and SNP quality score (cq ≥ 196 and sq ≥ 213 for standard Illumina base calls, Figure 4a; cq ≥ 161 and sq ≥ 184 for Srfim base calls, Figure 4b).
SNPSeeker  uses large deviation theory to identify rare variants. It reduces the effect of sequencing errors by generating an error model based on internal negative controls. We used exons 6 and 7 as the negative controls in our analysis (total length = 523 bp) as both unfiltered SAMtools analysis and subsequent Sanger validation indicated a complete absence of variants in both exons across all 12 pools. Only Illumina base calls were used in this comparison because of a compatibility issue with the current version of Srfim. The authors of SNPSeeker recently developed a newer variant caller called SPLINTER , which requires both negative and positive control DNA to be added to the sequencing library. SPLINTER was not tested due to the lack of a positive control in our libraries.
CRISP  conducts variant calling using multiple criteria, including the distribution of reads and pool sizes. Most importantly, it analyzes variants across multiple pools, a strategy also employed by SERVIC 4 E. CRISP was run on both Illumina base calls and Srfim base calls using default parameters.
Syzygy  uses likelihood computation to determine the probability of a non-reference allele at each position for a given number of alleles in each pool, in this case 80 alleles. Additionally, Syzygy conducts error modeling by analyzing strand consistency (correlation of mismatches between the plus and minus strands), error rates for dinucleotide and trinucleotide sequences, coverage consistency, and cycle positions for mismatches in the read . Syzygy was run on both Illumina and Srfim base calls, using the number of alleles in each pool (80) and known dbSNP positions as primary input parameters.
SERVIC 4 E was run using a trim value of 25 and a total allele number of 80. All other parameters were run at default. The focus of our library preparation and analysis strategy is to identify rare variants in large sample cohorts, which necessitates variant calling software with very high sensitivity. At the same time, specificity must remain high, primarily to ease the burden during validation of potential variants. In addition to calculating sensitivity and specificity, we calculated the Matthews correlation coefficient (MCC; see Materials and methods) for each method (Table 2) in order to provide a more balanced comparison between the nine methods.
For validation of our dataset, we focused primarily on changes in the exonic regions of our amplicons. Any intronic changes that were collaterally sequenced successfully were also included in our final analysis (Table 2). Sixty-one exonic positions were called as having a variant allele in at least one pool by one or more of the nine combinations of algorithms tested. We generated Sanger validation data in at least one pool for 49 of the 61 positions identified. Genotypes for validated samples are indicated in Additional file 5.
SNPSeeker (with Illumina base calls) performed with the highest specificity (97.3%), but with the worst sensitivity (62.2%), identifying less than half of the 15 valid rare exonic variants (Table 2). This is likely due to an inability of this algorithm to discriminate variants with very low allele frequencies in a pool; 84% of SNPSeeker's true positive calls have an allele frequency ≥ 1/40, while only 13% of the false negative calls have a frequency ≥ 1/40 (Additional files 4 and 6). SNPSeeker's MCC score was low (61.8%), due in large part to its very low false positive rate.
SAMtools alone with Illumina base calls achieved a 92.2% sensitivity, identifying all 15 rare exonic variants; however, these results were adulterated with the highest number of false positives, resulting in the worst specificity (56.2%) and MCC score (52.8%) among the nine methods (Table 2). Incorporation of Srfim base calls cut the number of false positives by 60% (from 32 to 13) without a sizeable reduction in the number of true positive calls (from 83 to 80). Fourteen of the fifteen valid rare exonic variants were successfully identified, which while not perfect, is an acceptably high sensitivity (Table 2). Srfim made noticeable improvements to individual base quality assessment as reflected in a substantial reduction in low quality variant calls (Figure 4) by reducing the contribution of low quality base calls to the average quality distribution (Figure 8b) and by reducing the tailcurve effect that leads to many false positives (Additional file 3a, b). Most low quality variant calls eliminated when transitioning to Srfim were not valid; nonetheless, three low quality valid variant calls were similarly affected by Srfim, and their loss resulted in a slight reduction in the true positive rate.
CRISP using Illumina base calls achieved a sensitivity slightly lower than SAMtools (87.8% versus 92.2%). Additionally, CRISP identified only 13 of the 15 valid rare exonic variants. Though this is lower than SAMtools, it is a large improvement over SNPSeeker; for the purposes set forth in our protocol, the > 75% sensitivity for extremely rare variants achieved by CRISP (using either base-calling method) is acceptable (Table 2).
Syzygy achieved the second highest sensitivity (94.4%) using Illumina base calls, but specificity remained low (67.1%). Fourteen of the fifteen rare exonic variants were successfully identified. CRISP and Syzygy achieved relatively average MCC values (50.5% and 65.0%, respectively), reflecting better performance than SAMtools with Illumina base calls.
SERVIC 4 E using Illumina base calls achieved the highest sensitivity (97.8%) and identified all 15 valid rare exonic variants. Both sensitivity and specificity were improved over SAMtools, CRISP, and Syzygy (Table 2), reflected in the highest MCC score of all the tested methods (84.2%). Taken together, the combination of SERVIC 4 E with either base-calling algorithm provides the highest combination of sensitivity and specificity in the dataset from pooled samples.
As previously mentioned, Srfim greatly improved variant calling in SAMtools, as is reflected in the 19% increase in SAMtools' MCC value (from 52.8% to 71.4%). CRISP, Syzygy, and SERVIC 4 E benefited little from using Srfim base calls: the MCC value for CRISP improved by only 6% (from 50.5% to 56.5%), Syzygy diminished by 4.6% (from 65.0% to 60.4%), and SERVIC 4 E diminished by 6.5% (from 84.2% to 77.7%). Importantly, use of Srfim base calls with Syzygy diminished its capacity to detect rare variants by a third. These three programs are innately designed to distinguish low frequency variants from errors using many different approaches. As such, it can be inferred from our results that any initial adjustments to raw base calls and quality scores by the current version of Srfim will do little to improve that innate capacity. In contrast, SAMtools, which is not specifically built for rare variant detection and would therefore have more difficulty distinguishing such variants from errors, benefits greatly from the corrective pre-processing provided by Srfim.
In addition to performance metrics like sensitivity and specificity, we analyzed annotated SNP rates, transition-transversion rates, and synonymous-non-synonymous rates of the nine algorithms on a variant-pool basis (Additional file 7).
The variant pools with the greatest discrepancies between the various detection methods tended to have an estimated allele frequency within the pool that is less than the minimum that should be expected (1/80; Additional files 4, 6, and 8). Such deviations are inevitable, even with normalization steps, given the number of samples being pooled. This underscores the importance of having careful, extensive normalization of samples to minimize these deviations as much as possible, and the importance of using variant detection methods that are not heavily reliant on allele frequency as a filtering parameter or are otherwise confounded by extremely low allele frequencies.
Validation using data from an independent cohort of samples
To further assess the strength of our method and analysis software, we sequenced the same 24 GRIP2 exons in a second cohort of 480 unrelated individuals. The same protocol for the first cohort was followed, with minor differences. Firstly, we pooled 20 DNA samples at equal concentration into 24 pools. The first 12 pools were sequenced in one lane of a GAII and the last 12 pools were sequenced in a separate lane (Additional file 9). Additionally, the libraries were sequenced using the 100-bp paired-end module, and sequencing was conducted using a newer version of Illumina's sequencing chemistry. These 24 libraries occupied approximately 5% of the total sequencing capacity of the two lanes. The remaining capacity was occupied by unrelated libraries that lacked reads originating from the GRIP2 locus
To map reads from this dataset, we initially used Bowtie's strict alignment parameters (-v 3), as we had done with our first dataset, but this resulted in a substantial loss of coverage in the perimeters of target regions. This is likely due to reads that cross the junctions between our randomly concatenated amplicons; such reads, which have sequence from two distant amplicons, appear to have extensive mismatching that would result in their removal. This effect became pronounced when using long read lengths (100 bp), but was not noticeable when using the shorter reads in our first dataset (Additional file 10). This effect should not be an issue when using hybridization enrichment, where ligation of fragments is not needed.
In order to improve our coverage, we used Bowtie's default parameter, which aligns the first 28 bases of each read, allowing no more than two mismatches. To focus on GRIP2 alignments, we provided a fasta reference of 60 kb covering the GRIP2 locus. A total of 6.4 million reads (5.6% of all reads) aligned to our reference template of the GRIP2 locus. The depth of coverage for each amplicon pool is shown in Additional file 11. For exonic positions, the average allelic coverage was 60.8×, and the minimum coverage was 10×; 99.9% of exonic positions were covered at least 15× per allele, and 98.5% were covered at least 30× per allele.
We did not apply Srfim base calls to our variant calling as Srfim has not yet been fully adapted to the newer sequencing chemistry used with this cohort. For variant calling, we tested Syzygy and SERVIC 4 E, the two most sensitive software identified in our first dataset when using only the standard Illumina base calls (Table 2). Syzygy was provided with a template-adjusted dbSNP file and a total allele number of 40 as input parameters. All other parameters were run at default. Syzygy made a total of 474 variant calls across 24 pools (74 unique variant calls). Of the 74 unique calls made, 36 were exonic changes. SERVIC 4 E was run using a trim value of 25 and a total allele number of 40. All other parameters were run at default. SERVIC 4 E made a total of 378 variant calls across 24 pools (68 unique variant calls). Of the 68 unique calls made, 33 were exonic changes. Between Syzygy and SERVIC 4 E, a total of 42 unique exonic sequence variant calls were made (Additional files 12 and 13).
Validation analysis of variant calling from second cohort samples
SERVIC 4 E
Variant identification and validation
Statistical analysis (%)
Rare exonic variant detection and validation
Detected total variants (n = 16)
Positive detection rate (%)
We have developed a strategy for targeted deep sequencing in large sample cohorts to reliably detect rare sequence variants. This strategy is highly flexible in study design and well suited to focused resequencing of candidate genes and genomic regions from tens to hundreds of kilobases. It is cost-effective due to substantial cost reductions provided by sample pooling prior to target enrichment and by the efficient utilization of next-generation sequencing capacity using indexed libraries. Though we utilized a PCR method for target enrichment in this study, other popular enrichment methods, such as microarray capture and liquid hybridization [8–10], can be easily adapted for this strategy.
Careful normalization is needed during sample pooling, PCR amplification, and library indexing, as variations at these steps will influence detection sensitivity and specificity. While genotyping positive pools will be needed for validation of individual variants, only a limited number of pools require sequence confirmation as this strategy is intended for discovery of rare variants.
SERVIC 4 E is highly sensitive to the identification or rare variants with minimal contamination by false positives. It consistently outperformed several publicly available analysis algorithms, generating an excellent combination of sensitivity and specificity across base-calling methods, sample pool sizes, and Illumina sequencing chemistries in this study. As sequencing chemistry continues to improve, we anticipate that our combined sample pooling, library indexing, and variant calling strategy should be even more robust in identifying rare variants with allele frequencies of 0.1 to 5%, which are within the range of the majority of rare deleterious variants in human diseases.
Materials and methods
Sample pooling and PCR amplification
De-identified genomic DNA samples from unrelated patients with intellectual disability and autism, and normal controls were obtained from Autism Genetics Research Exchange (AGRE), Greenwood Genomic Center, SC, and other DNA repositories . An informed consent was obtained from each enrolled family at the respective institutions. The Institutional Review Board at the Johns Hopkins Medical Institutions approved this study.
DNA concentration from each cohort of 480 samples in 5 × 96-well plates was measured using a Quant-iT™ PicoGreen® dsDNA Kit (Invitrogen, Carlsbad, CA, USA) in a Gemini XS Microplate Spectrofluorometer. These samples were normalized and mixed at equal molar ratio into 12 pools of 40 samples each (first cohort) or 24 pools of 20 samples each (second cohort). For convenience, first cohort samples from the same column of each 5 × 96-well plate were pooled into a single well (Figure 1). The same principle was applied to the second cohort, with the first two and a half plates combined into the first 12 pools, and the last two and a half plates combined into the last 12 pools (Additional file 9). PCR primers for individual amplicons were designed using the Primer3 program. PCR reaction conditions were optimized to result in a single band of the expected size. Phusion Hot Start High-Fidelity DNA Polymerase (Finnzymes, Thermo Fisher Scientific, Waltham, MA, USA) and limited amplification cycles (n = 25) were used to minimize random errors introduced during PCR amplification. PCR reactions were carried out in a 20- μl system containing 50 ng of DNA, 200 μM of dNTP, 1× reaction buffer, 0.2 μM of primers, and 0.5 units of Phusion Hot Start High-Fidelity Polymerase in a thermocycler with an initial denaturation at 98°C for 30 seconds followed by 25 cycles of 98°C for 10 seconds, 58 to 66°C for 10 seconds, and 72°C for 30 seconds. The annealing temperature was optimized for individual primer pairs. Successful PCR amplification for individual samples was then verified by agarose gel electrophoresis. The concentration for individual PCR products was measured using the Quant-iT™ PicoGreen® dsDNA Kit (Invitrogen) on Gemini XS Microplate Spectrofluorometer, and converted to molarity. PCR amplicons intended for the same indexed library were combined at equal molar ratio, purified using QIAGEN (Hilden, Germany) QIAquick PCR Purification Kit, and concentrated using Microcon YM-30 columns (Millipore, Billerica, MA, USA).
Amplicon ligation and fragmentation
The pooled amplicons were ligated using a Quick Blunting and Quick Ligation Kit (NEB, Ipswich, MA, USA) following the manufacturer's instructions. For blunting, a 25- μl reaction system was set up as follows: 1× blunting buffer, 2 to 5 μg of pooled PCR amplicons, 2.5 μl of 1 mM dNTP mix, and 1 μl of enzyme mix including T4 DNA polymerase (NEB #M0203) with 3' → 5' exonuclease activity and 5' → 3' polymerase activity and T4 polynucleotide kinase (NEB #M0201) for phosphorylation of the 5' ends of blunt-ended DNA. The reaction was incubated at 25°C for 30 minutes and then the enzymes were inactivated at 70°C for 10 minutes. The blunting reaction products were purified using a MinElute PCR purification column (QIAGEN) and then concentrated using a Microcon YM-30 column (Millipore) to 5 μl volume in distilled water. For ligation, 5 μl of 2× Quick-ligation buffer was mixed with 5 μl of purified DNA. Quick T4 DNA ligase (1 μl; NEB) was added to the reaction mixture, which was incubated at 25°C for 5 minutes and then chilled on ice. The reaction product (0.5 μl) was checked for successful ligation using 1.5% agarose gel electrophoresis. The ligation products were then purified using a MinElute PCR purification column (QIAGEN). Random fragmentation of the ligated amplicons was achieved using either one of the two methods: (1) nebulization in 750 μl of nebulization buffer at 45 psi for 4 minutes on ice following a standard protocol (Agilent); or (2) using a NEBNext dsDNA Fragmentase Kit following the manufacturer's instructions (NEB). One-twentieth of the product was analyzed for successful fragmentation to a desired range using 2% agarose gel electrophoresis.
Library construction and Illumina sequencing
The Multiplexing Sample Preparation Oligonucleotide Kit (Illumina PE-400-1001) was used to generate 1 × 12 (first cohort) and 2 × 12 (second cohort) individually indexed libraries following the manufacturer's instructions. The indexed libraries were quantified individually and pooled at equal molar quantity. The concentration of the final pooled library was determined using a Bioanalyzer (Agilent). All 12 pooled libraries from the first cohort were run in one lane of a flow cell on an Illumina Genomic Analyzer II (GAII). The first 12 pooled libraries from the second cohort were run in one lane of a GAII, while the last 12 pooled libraries were run in another lane in the same flow cell. Illumina sequencing was done at the UCLA DNA Sequence Core and Genetic Resource Core Facility at the Johns Hopkins University.
Sequence data analysis
Raw intensity files and fastq-formatted reads were provided for both cohort datasets. Output had been calibrated with control lane PhiX DNA to calculate matrix and phasing for base calling. A custom script was used on first cohort sequence data to identify the 12 Illumina barcodes from the minimum edit distance to the barcode and assign a read to that pool if the distance index was unique (demultiplexing). Second cohort sequence data were provided to us already demultiplexed. Read mapping was done independently on each pool using BOWTIE (options: -v 3 for first cohort, default for second cohort). As reference templates, hg19 was used for the first cohort and a 60-kb fragment of the GRIP2 regions was used for the second cohort (GRIP2 region- chr3:14527000-14587000).
Variant calling using SAMtools was done independently on each pool using SAMtools' deprecated algorithms (options: pileup -vc -A -N 80). Variants identified were first filtered by eliminating non-GRIP2 variants, and then filtered by consensus quality and SNP quality scores (cq ≥ 196 and sq ≥ 213 for Illumina base calls; cq ≥ 161 and sq ≥ 184 for Srfim base calls). Deprecated (Maq) algorithms were used, as the current SAMtools variant-calling algorithms failed to call all but the most common SNPs. Quality cutoff is based on the 95th percentile of scores in the quality distributions observed amongst all reported SAMtools variants in the GRIP2 alignment region, after excluding variants with the maximal quality score of 235). Reads were base-called using Srfim using default filtering and quality parameters.
SERVIC 4 E was given the location of sorted alignment (BAM) files. Though alignment files are maintained separately for each pool, the locations of each file are given all together. A trim value was set at 25. This trims 25 bases away from the ends of aligned amplicons, so that variant calling is focused away from primer regions. Use of shorter primers during library preparation allows for a smaller trim value. Hybridization enrichment will always result in a trim value of zero, regardless of what trim value is actually set. The total number of alleles in each pool was also provided as input (80 alleles for the first cohort; 40 alleles for the second cohort). SERVIC 4 E (release 1) does not call insertions or deletions.
SNPSeeker was run on first cohort data using author recommended parameters. Reads (Illumina base calls) were converted to SCARF format. Srfim base calls could not be used due to an unknown formatting issue after SCARF conversion. Alignment was conducted against GRIP2 template sequences. Exons 6 and 7 reference sequences were merged so that their alignments could be used as a negative control to develop an error model. All 47 cycles were used in the alignment, allowing for up to three mismatches. Alignments were tagged and concatenated, and an error model generated using all 47 cycles, allowing for up to three mismatches, and using no pseudocounts. The original independent alignment files (pre-concatenation) were used for variant detection. As per recommendation by the authors, the first third of cycles was used for variant detection (15 cycles). A P-value cutoff of 0.05 was used. Lower cutoffs generated worse results when checked against our validation database.
CRISP was run using default parameters. A CRISP-specific pileup file was generated using the author-provided sam_to_pileup.py script and not generated using the pileup function in SAMtools. A separate pileup was generated for each pool for both alignments from Illumina base calls and alignment from Srfim base calls. A BED file was provided to focus pileup at GRIP2 loci. CRISP analysis for variant detection was conducted using all 47 cycles and a minimum base quality of 10 (default). All other parameters were also kept at default.
Syzygy [3, 19] was run on both cohorts using 80 and 40 as the total number of alleles, respectively. A dbSNP file was provided for known chromosome 3 variants. A TGF file was provided to focus variant calling at GRIP2 target regions. Hg19 was used as the reference sequence for the first cohort, while the same abridged GRIP2 sequence that was used by SERVIC 4 E was also used by Syzygy for the second cohort. All other parameters were run at default.
Reads used for analysis, both Illumina and Srfim base calls, are available through the public data repository at the NCBI (accession number SRP007694). Srfim is available as an R package, while SERVIC 4 E is available as a set of R scripts. Both are available for download online .
Validation by Sanger sequencing
Sanger sequencing of positive pools for variant validation was conducted using the BigDye Terminator v3.1 Cycle Sequencing Kit on an ABI3100 automatic DNA analyzer (Applied Biosystems, Foster City, CA, USA) following the manufacturer's instructions.
Sanger sequencing was done on each sample within a pool separately (40 traces per pool with the first cohort, 20 traces per pool for the second cohort). Only traces with low quality or ambiguous calls were sequenced bidirectionally. In the event that a positive sample was verified at least once in the pool, further sequencing of that pool was halted. Sequencing primers were the same primers used in target enrichment to build the libraries for next-generation sequencing.
Standard sequence alignment software (CodonCode, MacVector) followed by manual investigations of the chromatograms was used to identify any variants that might have been missed by all nine combinations of programs.
Matthews correlation coefficient
The MCC is intended as a measure of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs), without being influenced by potential extreme sizes by one or more of the groups. An MCC = 1 indicates perfect correlation between predicted results (variants identified by next-generation sequencing and various combinations of base-calling and variant-calling algorithms) and the observed results (validation by Sanger sequencing). An MCC = 0 indicates that the algorithm is no better than random. An MCC = -1 indicates an inverse correlation. MCC = (TP × TN-FP × FN)/SQRT [(TP + FP) × (TP + FN) × (TN + FP) × (TN + FN)]. Sensitivity (true positive rate, recall): TP/(TP + FN). Specificity (true negative rate): TN/(FP + TN). Positive predictive value (precision): TP/(TP + FP). Negative predictive value: TN/(TN + FN). Accuracy: (TP + TN)/(TP + TN + FP + FN). False positive rate (fall-out): 1-True negative rate. False discovery rate: FP/(FP + TP).
consensus quality score generated by SAMtools pileup
Genome Analyzer II (Illumina Sequencing Machine)
- GRIP2 :
glutamate-receptor interacting protein 2
Matthews correlation coefficient
polymerase chain reaction
- SERVIC 4 E :
Sensitive Rare Variant Identification by Cross-pool Cluster: Continuity: and tailCurve Evaluation
single nucleotide polymorphism
SNP quality score generated by SAMtools pileup.
We thank Dr David Valle of Johns Hopkins University for critical reading of this manuscript, Autism Genetics Research Exchange (AGRE) and Greenwood Genetic Center (GGC) for DNA samples used in this study, and Dr Manuel Rivas of the Massachusetts Institute of Technology and the University of Oxford with assistance in running Syzygy. This work was supported in part by a pilot grant (#2487) from Autism Speaks (to TW), a research grant from the Brain Science Institute at Johns Hopkins University, R01HD52680 from NICHD (to TW), and R01HG005220 from NHGRI (to RI). TSN is a student of the Predoctoral Training Program in Human Genetics at Johns Hopkins University.
- Druley TE, Vallania FL, Wegner DJ, Varley KE, Knowles OL, Bonds JA, Robison SW, Doniger SW, Hamvas A, Cole FS, Fay JC, Mitra RD: Quantification of rare allelic variants from pooled genomic DNA. Nat Methods. 2009, 6: 263-265. 10.1038/nmeth.1307.PubMedPubMed CentralView ArticleGoogle Scholar
- Out AA, van Minderhout IJ, Goeman JJ, Ariyurek Y, Ossowski S, Schneeberger K, Weigel D, van Galen M, Taschner PE, Tops CM, Breuning MH, van Ommen GJ, den Dunnen JT, Devilee P, Hes FJ: Deep sequencing to reveal new variants in pooled DNA samples. Hum Mutat. 2009, 30: 1703-1712. 10.1002/humu.21122.PubMedView ArticleGoogle Scholar
- Calvo SE, Tucker EJ, Compton AG, Kirby DM, Crawford G, Burtt NP, Rivas M, Guiducci C, Bruno DL, Goldberger OA, Redman MC, Wiltshire E, Wilson CJ, Altshuler D, Gabriel SB, Daly MJ, Thorburn DR, Mootha VK: High-throughput, pooled sequencing identifies mutations in NUBPL and FOXRED1 in human complex I deficiency. Nat Genet. 2010, 42: 851-858. 10.1038/ng.659.PubMedPubMed CentralView ArticleGoogle Scholar
- McClellan J, King M-C: Genetic heterogeneity in human disease. Cell. 2010, 141: 210-217. 10.1016/j.cell.2010.03.032.PubMedView ArticleGoogle Scholar
- Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R: Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nat Methods. 2008, 5: 235-237. 10.1038/nmeth.1184.PubMedPubMed CentralView ArticleGoogle Scholar
- Craig DW, Pearson JV, Szelinger S, Sekar A, Redman M, Corneveaux JJ, Pawlowski TL, Laub T, Nunn G, Stephan DA, Homer N, Huentelman MJ: Identification of genetic variants using bar-coded multiplexed sequencing. Nat Methods. 2008, 5: 887-893. 10.1038/nmeth.1251.PubMedPubMed CentralView ArticleGoogle Scholar
- Bravo HC, Irizarry RA: Model-based quality assessment and base-calling for second-generation sequencing data. Biometrics. 2009, 66: 665-674.View ArticleGoogle Scholar
- Thomas AJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, Richmond TA, Middle CM, Rodesch MJ, Packard CJ, Weinstock GM, Gibbs RA: Direct selection of human genomic loci by microarray hybridization. Nat Methods. 2007, 4: 903-905. 10.1038/nmeth1111.View ArticleGoogle Scholar
- Bainbridge MN, Wang M, Burgess DL, Kovar C, Rodesch MJ, D'Ascenzo M, Kitzman J, Wu YQ, Newsham I, Richmond TA, Jeddeloh JA, Muzny D, Albert TJ, Gibbs RA: Whole exome capture in solution with 3 Gbp of data. Genome Biol. 2010, 11: R62-10.1186/gb-2010-11-6-r62.PubMedPubMed CentralView ArticleGoogle Scholar
- Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME: Microarray- based genomic selection for high-throughput resequencing. Nat Methods. 2007, 4: 907-909. 10.1038/nmeth1109.PubMedView ArticleGoogle Scholar
- Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, LeProust EM, Peck BJ, Emig CJ, Dahl F, Gao Y, Church GM, Shendure J: Multiplex amplification of large sets of human exons. Nat Methods. 2007, 4: 931-936. 10.1038/nmeth1110.PubMedView ArticleGoogle Scholar
- Markoulatos P, Siafakas N, Moncany M: Multiplex polymerase chain reaction: a practical approach. J Clin Lab Anal. 2002, 16: 47-51. 10.1002/jcla.2058.PubMedView ArticleGoogle Scholar
- Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, Swerdlow H, Turner DJ: A large genome center's improvements to the Illumina sequencing system. Nat Methods. 2008, 5: 1005-1010. 10.1038/nmeth.1270.PubMedPubMed CentralView ArticleGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-10.1186/gb-2009-10-3-r25.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.PubMedPubMed CentralView ArticleGoogle Scholar
- Rousseeuw PJ: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987, 20: 53-65. 10.1016/0377-0427(87)90125-7.View ArticleGoogle Scholar
- Bansal V: A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics. 2010, 26: i318-i324. 10.1093/bioinformatics/btq214.PubMedPubMed CentralView ArticleGoogle Scholar
- Vallania FLM, Druley TE, Ramos E, Wang J, Borecki I, Province M, Mitra RD: High-throughput discovery of rare insertions and deletions in large cohorts. Genome Res. 2010, 20: 1711-1718. 10.1101/gr.109157.110.PubMedPubMed CentralView ArticleGoogle Scholar
- Rivas M, Daly M: Syzygy Documentation Release 0.9. [http://www.broadinstitute.org/software/syzygy/sites/default/files/Syzygy.pdf]
- Mejias R, Adamczyk A, Anggono V, Niranjan T, Thomas GM, Sharma K, Skinner C, Schwartz CE, Stevenson RE, Fallin MD, Kaufmann W, Pletnikov M, Valle D, Huganir RL, Wang T: Gain-of-function glutamate receptor interacting protein 1 variants alter GluA2 recycling and surface distribution in patients with autism. Proc Natl Acad Sci USA. 2011, 108: 4920-495. 10.1073/pnas.1102233108.PubMedPubMed CentralView ArticleGoogle Scholar
- Niranjan T, Bravo HC: Download Links for SRFIM and SERVIC 4 E scripts. [http://www.cbcb.umd.edu/~hcorrada/secgen]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.