Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems
© Minoche et al.; licensee BioMed Central Ltd. 2011
Received: 16 July 2011
Accepted: 8 November 2011
Published: 8 November 2011
The generation and analysis of high-throughput sequencing data are becoming a major component of many studies in molecular biology and medical research. Illumina's Genome Analyzer (GA) and HiSeq instruments are currently the most widely used sequencing devices. Here, we comprehensively evaluate properties of genomic HiSeq and GAIIx data derived from two plant genomes and one virus, with read lengths of 95 to 150 bases.
We provide quantifications and evidence for GC bias, error rates, error sequence context, effects of quality filtering, and the reliability of quality values. By combining different filtering criteria we reduced error rates 7-fold at the expense of discarding 12.5% of alignable bases. While overall error rates are low in HiSeq data we observed regions of accumulated wrong base calls. Only 3% of all error positions accounted for 24.7% of all substitution errors. Analyzing the forward and reverse strands separately revealed error rates of up to 18.7%. Insertions and deletions occurred at very low rates on average but increased to up to 2% in homopolymers. A positive correlation between read coverage and GC content was found depending on the GC content range.
The errors and biases we report have implications for the use and the interpretation of Illumina sequencing data. GAIIx and HiSeq data sets show slightly different error profiles. Quality filtering is essential to minimize downstream analysis artifacts. Supporting previous recommendations, the strand-specificity provides a criterion to distinguish sequencing errors from low abundance polymorphisms.
Next generation sequencing (NGS) is revolutionizing molecular biology research with a wide and rapidly growing range of applications. These applications include de novo genome sequencing, re-sequencing, detection and profiling of coding and non-coding transcripts, identification of sequence variants, epigenetic profiling, and interaction mapping. Compared with microarrays, previously used for many of these applications, NGS offers a higher dynamic range, enabling the detection of rare transcripts and splice variants in the transcriptome as well as rare genomic polymorphisms - for example, somatic mutations present within cancer samples. The challenge remains to distinguish sequence variation from sequencing errors, and a thorough characterization of NGS data is required in order to detect method-inherent errors and biases. Systematic errors are platform-dependent. In the context of this work, we focus on Illumina data. According to market share analysis, almost two thirds of all NGS instruments presently in operation have been manufactured by Illumina .
Existing studies about Illumina data evaluation have revealed several biases, that is, a non-random distribution of the reads in the sequenced sample over the reference (reported for the Genome Analyzer (GA) I [2–5]) and a non-random distribution of errors (GAIIx ). Preferences of certain substitution errors and sequence context have been observed. For instance, wrong base calls are frequently preceded by base G  and frequencies of base substitutions vary by 10- to 11-fold, with A to C conversions being the most frequent error [2, 7]. Such errors might have profound implications on the interpretation of results: a non-random read distribution can bias profiling of transcripts and hamper the detection of sequence polymorphisms in regions of low sequence coverage. Errors in the reads can result in false positive variant calls or wrong consensus sequences.
The Illumina sequencing technology has been under constant development, relating to instrumentation, signal processing software, and sequencing chemistry, towards the production of more data and longer sequencing reads. The HiSeq2000 became commercially available in the second quarter of 2010 and uses sequencing-by-synthesis (SBS) chemistry similar to the Illumina GA series but at a two- to five-fold increased rate of data acquisition. A HiSeq flow cell can be imaged on both the top and bottom surface. To increase the HiSeq data collection rate, imaging is performed in a line scanning mode, in contrast to the area imaging in the GA. Instead of using only one camera, the HiSeq operates with a four camera system that detects the intensities of all four bases simultaneously. The Hiseq currently runs with lower cluster densities than the GA and with a maximal read length of 100 nucleotides for single reads or 2 × 100 nucleotides in paired-end mode.
Every development of a system can shift error profiles and can reveal new types of errors. Here, we evaluate Illumina sequencing data generated on the latest systems, the GAIIx and HiSeq2000, using current sequencing chemistry and up-to-date base-calling software. We focus on errors and biases that have an impact on common sequencing applications and we provide suggestions on how to trim and filter the reads in order to substantially reduce error rates. Since high quality reference sequences are not always available in a sequencing project, we first report properties of the unprocessed raw reads. Then we assess the error rates and biases after mapping against high quality reference sequences derived from two plants (Beta vulgaris and Arabidopsis thaliana) and the bacterial virus PhiX174.
Properties of the sequence data
Bv + PhiX
At + PhiX
Number of lanes
Number of sequenced pairs
Number of sequenced bases
Fraction of uncalled bases
Fraction of uncalled bases - read 1
Fraction of uncalled bases - read 2
Fraction of reads with at least one uncalled base
Fraction of entirely uncalled reads
Fraction of bases in B-tails
Fraction of uncalled bases in B-tails
Fraction of bases in B-tails - read 1
Fraction of bases in B-tails - read 2
Average length of B-tails
Fraction of reads with B-tail
Fraction of reads containing at least one uncalled base in B-tail
Fraction of both reads with B-tail
Average Q-score - read 1
Average Q-score - read 2
Q ≥ 30 bases
37.27 Gbp (79.68%)
10.56 Gbp (73.99%)
1.74 Gbp (64.29%)
Q ≥ 30 bases - read 1
18.70 Gbp (39.98%)
5.49 Gbp (38.42%)
0.90 Gbp (33.01%)
Q ≥ 30 bases - read 2
18.57 Gbp (39.71%)
5.08 Gbp (35.57%)
0.85 Gbp (31.28%)
Properties of raw reads and filtering criteria
As a first quality evaluation we analyzed the raw read sequences and their corresponding quality values assigned by the base-calling software. The Illumina base-calling software calculates a quality score for each base reflecting the probability that the called base is wrong. The calculation takes into account the ambiguity of the signal for the respective base as well as the quality of neighboring bases and the quality of the entire read. The quality score Q is defined by Q = -10 log10(P); for example, Q = 30 corresponds to the probability P = 0.001 that a base has been called incorrectly. The highest possible value for Q assigned by the base-calling software is 40, corresponding to P = 0.0001.
In the samples sequenced on the HiSeq, 80% (Bv + PhiX, read length 95 nucleotides) and 74% (At + Phix, 100 nucleotides) of all bases had quality scores of at least 30, whereas for the PhiX data (150 nucleotides) sequenced on the GAIIx this fraction was 64%. The average quality score was Q = 31.8 (Bv + PhiX) and Q = 30.2 (At + PhiX) for the HiSeq data and Q = 27.2 for the GAIIx data. For both platforms the first read of a read-pair had slightly better average quality scores than the second read. The difference of Q between both reads was in the range of 0.3 (HiSeq) and 1.7 (GAIIx), respectively.
Uncalled bases are represented by a 'dot' in the sequence and by a 'B' in the quality string (corresponding to a quality score of Q = 2; the quality values are represented by ASCII characters). In the entire HiSeq data set 1.4% of all bases were uncalled, affecting 2.4% of all reads, and 0.5% of all reads were entirely composed of uncalled bases. In the GAIIx data set we found 14% of all bases to be uncalled, affecting 16% of all reads, and 7% of all reads were entirely uncalled (Table 1).
The quality of the 3' end of a sequencing read can be low for reasons such as phasing artifacts. If most bases at the 3' end of a read have quality values of Q ≤ 15, the base-calling software considers the whole segment as unreliable and assigns values of Q = 2 to the bases of this segment (represented by a 'B' in the quality string, just like uncalled bases). Illumina recommends excluding this portion of the read in further analysis (CASAVA1.7 User Guide). In the following we use the term 'B-tail' for consecutive Bs at the 3' end of a read, including unreliably called bases as well as uncalled bases. The most extreme cases - that is, reads entirely composed of Bs or reads containing only one B at the 3' end - are also considered as B-tailed reads. The fraction of bases lying within B-tails was 13.8% in the HiSeq data and 25.8% in the GAIIx data. Among these B-tail bases, 10.3% (HiSeq) and 53.3% (GAIIx) were uncalled. The B-tail length distribution shows a slight increase towards short B-tails and a sharp increase towards reads entirely composed of B. The predominant size is the full-length B-tail even after the removal of reads entirely composed of uncalled bases (Figure S3 in Additional file 1). In both HiSeq data sets, on average, 32.8% of all reads we studied had B-tails, and 19.6% of all read pairs had a B-tail in both reads. In the GAIIx data 67.9% of all reads had a B-tail, and 53% of all read pairs had a B-tail in both reads. Excluding the reads entirely composed of uncalled bases, the average read length after B-tail trimming was reduced to 122 bases in GAIIx reads (original read length 150 nucleotides), to 85 bases for reads from the HiSeq Bv + PhiX sample (original read length 95 nucleotides) and to 74 bases for the HiSeq At + PhiX reads (original read length 100 nucleotides).
Expected error rates after filtering
Expected error ratea(percentage of bases discarded)
Bv + PhiX
At + PhiX
B-tail + N
B-tail + ChF
B-tail + ChF + C33
B-tail + ChF + A30
B-tail + ChF + N + C33
B-tail + ChF + N + A30
The GC content (%GC) of the unfiltered HiSeq reads was higher than expected: 40% for Bv + PhiX data and 45.5% for At + PhiX. The B. vulgaris reference sequence has a %GC of 35%  and that of the A. thaliana genome is 36% (calculated from TAIR10 ). The fraction of PhiX reads (44.7% GC) accounts for only 1 to 2% of the data. For the PhiX sample sequenced on the GAIIx the %GC of 45.7% is much closer to the expected value of 44.7%.
Mapping of raw reads against reference sequences
We evaluated the actual quality of the sequencing reads by mapping the reads against high-quality reference sequences. We used the 5-kbp bacteriophage PhiX sequence, the 110-kbp insert sequence of a B. vulgaris BAC clone, and the 30 Mbp chromosome 1 of A. thaliana as references (see Materials and methods). The small and gene-dense PhiX genome is commonly used in Illumina sequencing as quality control. Sugar beet has a highly repetitive genome, and from Arabidopsis we used the large sequence of an entire chromosome in order to include references of different lengths and properties in our study.
We mapped the whole data set against the PhiX reference genome (5,386 bp) and kept all read pairs that had passed the Illumina chastity filter, did not contain adapter sequence, and were matching the genome uniquely with correct read orientation and expected mapping distance. This resulted in 4,302,400 and 887,009 PhiX read pairs sequenced on the HiSeq (2 × 95 nucleotides together with sugar beet or 2 × 100 nucleotides together with Arabidposis, respectively) and 6,405,298 PhiX read pairs sequenced on the GAIIx (2 × 150 nucleotides). To distinguish these three PhiX data sets in the following analysis, we use the terms PhiX-95nt, PhiX-100nt, and PhiX-GAIIx.
The sugar beet sample is derived from a whole genome shotgun library that was sequenced in three HiSeq lanes. The reference is the BAC insert ZR-47B15 (109,563 bp), here called 'ZR', sequenced to finished quality  and previously used in a study on the quality of Illumina reads produced on the GA I sequencing instrument . We implemented filtering steps for sugar beet reads in order to exclude reads that mapped to ZR but originated from a different region of the genome (see Materials and methods). Such wrongly assigned reads could lead to erroneous conclusions on read coverage and read error rates - for instance, in the case of divergent repetitive regions. We obtained 53,101 reads covering ZR (26,495 pairs, 111 singletons). This read data set is referred to as Bv-95nt in the following.
The Arabidopsis whole genome shotgun sequencing data were mapped against the entire Arabidopsis genome sequence. Pairs were kept if they had passed the Illumina chastity filter and matched chromosome 1 uniquely with correct read orientation and expected mapping distance, resulting in 5,815,990 pairs (referred to as the At-100nt data set).
Read distribution over the reference sequence
Accumulation of reads with B-tails
Coverage drop after B-tail removal
Remaining per-base coverage after B-tail removal
Number of positions with this coverage in the ZR reference (Bv-95nt data)
Number of positions with this coverage in the PhiX reference (PhiX-95nt data)
90 to 100%
80 to 90%
60 to 80%
40 to 60%
20 to 40%
0 to 20%
The sequencing error analysis in the following paragraphs was performed after B-tail trimming.
Substitution error rates and distributions
Indels and substitution errors in B-tail trimmed reads
Uniquely aligned bases
Substitution errors (rate)
84 (1.7 E-5)
3,789 (4.9 E-6)
26,130 (2.4 E-5)
546 (3.2 E-6)
7,077 (4.0 E-6)
In the sequencing data, increased error rates were observed in some sequencing cycles (Figure 4; Figure S10 in Additional file 1). It turned out that only a fraction of the reads was affected by these peaks. When inspecting their spatial placement within the flow cell we found that they concentrated in certain regions (Figure S11 and text supplement T1 in Additional file 1). The increased error rates were consistently reflected in the average quality scores of the particular cycles and regions for HiSeq reads (Figures S16 and S17 in Additional file 1) as well as GAIIx reads (data not shown). Thus, taking quality values into account should safely prevent potential interfering effects caused by these outliers during downstream analysis.
For a miscalled base three different substitution errors are possible. It was reported previously that in GA I data particular base conversions were more frequently observed than others . We counted all conversion events in our HiSeq and GAIIx data and found again certain preferences. Summarized over all HiSeq data we found A replacing C or vice versa (29.2%) and G replacing T or vice versa (26.8%) to be the most frequent substitutions. The fluorophore groups attached to bases A and C are excited by the same laser and distinguished only by the emission at different wave lengths; the same is true for the fluorophores of bases G and T. The fact that these pairs of bases are exchanged at high frequencies suggests an impact of these detection settings; the emission spectra of bases excited by the same laser might not be perfectly separated.
The individual conversions show slight variation between different HiSeq samples and greater variation between HiSeq and GAIIx samples (Figure 6b). The most frequent conversion in GAIIx data (A into C) is the same as reported for GA I data . In three of four HiSeq samples G was the base most frequently appearing as a miscall (conversion of any other base into G: PhiX-95nt, 38%; PhiX-100nt, 32%; At-100nt, 32%); the Bv-95nt sample had A (33%) as the most frequent resulting base (Table S2 in Additional file 1). The correct base being most frequently called incorrectly was A in At-100nt, PhiX-100nt and PhiX-GAIIx, C in Bv-95nt, and T in PhiX-95nt. We analyzed positions of significantly elevated error rates mentioned above separately. Each of the 136 positions analyzed in the PhiX genome showed a mixture of the three possible substitution errors but in all cases one of them was clearly dominating (seen at fractions of 42.5% to 99.1%). This may lead to confusion with low abundance polymorphisms as observed in heterogeneous samples. Since the individual error rate for the dominating base differed greatly in many cases between the two strands (for 117 of 136 positions by at least 10-fold, for 125 positions by at least 5-fold) a strand-specific analysis can help to distinguish real polymorphisms from region-specific substitution errors by confirming the occurrence of a variation on both strands at about the same rate. Furthermore, as mentioned above, positions of elevated error rates are reflected in the quality values, which should also be taken into account. The conversion from A or T in the reference sequence to G or C in the read sequence was seen at 118 (87%) of the 136 positions as the dominant base substitution and, among these, in 102 cases (86%) the positions were preceded by a G, resulting in G[A/T] being the most frequent motif at positions of elevated error rates. However, this motif occurs many more times (992) in both the forward and reverse strand of the PhiX genome.
Insertions and deletions
Number of insertions and deletions
1 base of T, A, C or G
1 base of T or A
1 base of C or G
> 1 base
ELANDv2 performs multiseed and gapped alignments, allowing the detection of indels with a length of up to 20 bases. The description of the conditions of ELANDv2 indel calls implies that no indels are reported in terminal regions of the reads. Indeed, simulations showed that no indels were detected if they were located before position 5 or after position 89 within reads of 95 nucleotides. All indels between positions 21 and 76 were reported, and a fraction of the indels was reported for positions 5 to 20 and 77 to 89. Consequently, the indel error rates as displayed in Table 4 can be considered as slightly underestimated.
Assessment of quality values
Quality scores are relevant for SNP detection and consensus calling and they are also used by mapping programs such as BWA  and Bowtie . In all sequenced HiSeq samples the observed error rates fitted well with the expected error rates derived from the quality values assigned by the Illumina base-calling software. The At-100nt and PhiX-100nt data base-called with a newer software version scatter closer around the expected error rates than the Bv-95nt and PhiX-95nt data processed with an earlier version (Figure 6d). Correctly called bases have, on average, a high quality score of 35 to 37 (At-100nt, Q = 37; Bv-95, Q = 36; PhiX-95nt, Q = 35) and wrong called bases have, on average, a low quality score of 18 to 28 (At-100nt, Q = 18; Bv-95nt, Q = 28; PhiX-95nt, Q = 18). We found no major differences when analyzing reads 1 and 2 of the read pair separately (data not shown).
Expected and observed error rates after filtering of aligned reads
Percentage bases removed
Percentage bases removed
B-tail + ChF
B-tail + N
B-tail + C33
B-tail + A30
B-tail + ChF + C33
B-tail + ChF + A30
In this work we have analyzed several sequence data sets generated on state of the art Illumina second-generation sequencing instrumentation. Specifically, we analyzed data from the HiSeq2000 and the GAIIx. We determined base substitution and indel error frequencies, and assessed biases of read coverage, sequencing errors, and base quality score assignments.
Based on our analysis of the observed and expected error rates after application of different read filtering steps, we recommend to perform B-tail trimming and to remove reads containing adapter sequence prior to analysis. The accuracy of the sequencing data can be further improved by removing reads that have less than two-thirds of the bases with Q ≥ 30 in the first half of the read, reads not passing the chastity filter, and reads containing at least one uncalled base. However, rigorous quality filtering might reduce the local coverage in regions of accumulated low quality reads. This effect should be taken into account when performing quantitative analyses rather than comparative sequencing. For de novo assemblies the coverage loss might result in contig breaks but the accuracy of the consensus sequence will benefit greatly from using error-free input reads; a regionally divergent unfiltered read population will result in either contig breaks or an erroneous consensus.
Despite the improvements of Illumina cluster amplification kits and sequencing reagents, sequencing on the HiSeq and the GAIIx still shows a GC bias. The finding that templates such as ZR with a %GC varying from 24% to 47% (1st and 99th percentiles) show increased coverage of GC-rich regions when using Illumina standard protocols is in accordance with previous results [2, 4]. Aird et al.  analyzed a mixture of genomes covering a broad %GC spectrum and reported a read coverage increase for templates with %GC up to 47% followed by a coverage decrease in regions of higher %GC. Since the %GC of the PhiX is in a narrow interval around the angular point of %GC = 47% (1st and 99th percentiles of PhiX: 41 to 49%) the lack of correlation between read coverage and %GC in PhiX data is in line with the findings of Aird et al. The GC bias is also reflected in the %GC of the raw read sequences from the samples we sequenced, which differs from the reference %GC for the two plant species but is close to the %GC of the PhiX genome.
Sequence reads with low base quality can result from, for example, phasing discrepancies. The quality measurement is able to indicate these effects to a certain extent by assigning low quality values of Q = 2, depicted as 'B' in the quality string, typically at the 3' end of reads. Such B-tailed reads were expected to be randomly distributed across the reference genome. However, we observed regions in the reference genome in which the mapping reads have a lower average quality and B-tails accumulate. After removal of B-tails, such regions remained error-free or greatly error-reduced. We found single positions of significantly increased error rates remaining after B-tail trimming. These positions displayed one dominant base conversion, which mainly occurred on one strand.
We were not able to identify sequence-based criteria to predict such error-prone positions unambiguously. Most positions with increased error rates had an over-representation of G in close vicinity upstream and were located within regions of low average base quality values. We found over-representation for several G-containing motifs, especially GGG and CGG. Nakamura et al.  suggested that a GGC motif precedes error-prone regions. In Nakamura et al., the start of such a region is defined by positions of very high error rates (applying the same criteria, we find only one error-prone region in our PhiX-95nt data). In our analysis, we distinguish between two observations: error-prone regions (errors removable by B-tail trimming) and error-prone positions (remaining after B-tail trimming). There does not seem to be any universal short motif that co-occurs with elevated error rates.
We successfully eliminated most error-prone regions by trimming B-tails and retaining the parts of higher quality values. Still, we encountered single positions of elevated error rates, and neither the extension of B-tail trimming towards the 5' end nor complete exclusion of B-tailed reads could remove the error rate peaks at these positions. We speculate that the two phenomena, although related to each other, originate from two effects, one resulting in regions of accumulated errors and low quality, and another being responsible for single positions of drastically increased error rates.
As a single sequence motif could not be found, the co-occurring pattern is expected to be more complex. It has been suggested that folding effects due to inverted repeats might be a reason for the accumulation of errors . We agree that secondary structures might be a potential source of region-specific sequencing artifacts, although the details are not yet understood. The responsible sequence pattern(s) may be located in any part of the fragment to be sequenced, even beyond the actually sequenced end. Pairs of low quality peaks on different strands of the reference should be related to each other, as long as the distance between them does not exceed the library insert size. Closer inspection of low quality regions might reveal the factor(s) causing B-tail accumulation as well as error-prone single positions. For now, we have shown that error-prone regions can be efficiently cleaned by B-tail trimming, and error-prone single positions can be detected by the directionality of the reads and the quality value of the affected base. Low copy single-nucleotide polymorphisms (occurring in viral populations or arising from somatic mutations or RNA editing) have to be distinguished from such sequencing errors. Confirmation on both strands may help to find true variants. Data sets obtained from strand-specific sequencing of RNA are particularly sensitive to such errors as only sequences from one strand are available for analysis. Data interpretation might be complicated in situations when a polymorphic position coincides with an error-prone position and base conversions lead to alterations of allele ratios. Quality values of bases at error-prone positions were found to be clearly reduced, which can serve as an additional criterion for discrimination.
For the Illumina GA I sequencing platform, we previously reported an average error rate of 0.6% for sequencing of ZR with 27-nucleotide reads . For HiSeq data, after B-tail trimming and removal of reads that did not pass the chastity filter, we observed an error rate of 0.16% in reads of 95 bases, testifying to the advanced accuracy of this now matured second generation sequencing technology. In particular, by removing read pairs containing the sequencing adapter, neither HiSeq reads nor the GAIIx reads (data not shown) displayed an exponential increase of error rates towards read ends as reported for GA I data previously. The increase was found to be about two- to three-fold for HiSeq data (95 to 100 nucleotides) and about five- to ten-fold for GAIIx data (150 nucleotides).
For plant DNA sequenced with the HiSeq we obtained slightly higher base substitution rates and indel error rates than for the spiked in PhiX controls. Also, the ratio between insertions and deletions, the ratio between indels of one versus more than one base, and the ratio between A/T versus G/C single base indels were distinctively different in plant and PhiX sequencing data. These differences can potentially be explained by somatic variation present within the plant material from which the DNA was extracted or by the occurrence of consensus errors in the plant reference sequences, for example, within repeat regions, which are difficult to assemble.
For the successful application of sequencing technologies the read data quality is crucial. We here provided a resource of information regarding several error types as well as ways to detect and minimize bad quality. We showed how appropriate data filtering criteria, inferred from properties of raw reads, substantially reduces error rates. When comparing expected and observed error rates the quality scores assigned by the base-calling software were generally accurate. Within reads, a significantly increased error rate towards the end of the read was not observed after quality filtering. Within the reference sequence, we found regional accumulation of low quality bases and single positions of notably elevated error rates, which are important to consider when analyzing nucleotide variations. Supporting previous recommendations [6, 12], we conclude from our data that true variants should be confirmed on both strands and quality values should be taken into account. Error types found in GA data are also present in HiSeq data, such as %GC bias, preferred base conversions, or the presence of a preferred base preceding wrong base calls.
Materials and methods
DNA extraction, sequencing library preparation, sequencing
Leaf material from sugar beet (B. vulgaris) genotype KWS2320 and from A. thaliana accession Columbia (Col-0) were used for DNA extraction. These genotypes were chosen because the same accessions were used in the preparation of the reference genome sequences of the respective species ( and unpublished data). Genomic DNA was prepared using the Nucleospin Plant XL Kit (Macherey Nagel, Düren, Germany). The DNA of PhiX174 RF1 was purchased from New England Biolabs (Ipswich, MA, USA). Genomic DNA was fragmented in a Covaris instrument (Woburn, MA, USA) to an average size of 250 nucleotides (plant DNA) or 300 nucleotides (PhiX174 RF1). Library preparation was performed using standard Illumina protocols and Illumina paired-end adapters .
Sequencing on the Illumina GAIIx was performed with a paired-end cluster generation kit v4 and TruSeq SBS v5 sequencing kits. A library prepared from PhiX174 RF1 was loaded onto the flowcell at a concentration of 5 pM. Clusters were prepared using the Illumina cluster station according to the manufacturer's instructions. Sequencing was performed following a 2 × 150-nucleotide cycle recipe. For Hiseq sequencing, a PhiX kit v2 library (Illumina) was spiked into the B. vulgaris and A. thaliana sample libraries at a proportion of about 1% each. The total loading concentration was 7 pM. Amplification was performed in the cBOT (Illumina) using an Illumina HiSeq paired-end cluster generation kit PE-401-1001. For sequencing, a 200 cycle SBS kit FC-401-1001 was used, and 2 × 95 (B. vulgaris) or 2 × 100 (A. thaliana) cycles of sequencing were performed.
Data processing, mapping, and read filtering
The Illumina pipeline version 2.8 was used for base-calling of GAIIx data, and the HiSeq2000 control software version 1.1.37 for the B. vulgaris sample and version 188.8.131.52 for the A. thaliana sample. From the GAIIx run, we obtained a full lane of data consisting exclusively of reads from PhiX. HiSeq data consisted of genomic reads from sugar beet or Arabidopsis plus 1% of control PhiX that had been spiked into the genomic sample. HiSeq read pairs were mapped with ELANDv2 (within Casava 1.7) to a PhiX reference sequence provided by Illumina, and consecutively to a sugar beet genome reference sequence prepared and assembled by our group (unpublished data) and sugar beet BAC clone ZR-47B15 insert that had been previously sequenced to finished quality with Sanger dideoxy terminator sequencing chemistry ('ZR', GenBank: FJ752587) . The ZR genotype is the same as the genome reference we prepared (unpublished data). We used this draft genome assembly to select the portion of reads covering ZR. In the first step, we mapped all read pairs of the three sugar beet lanes against the B. vulgaris draft assembly. We kept only those pairs of which at least one read mapped to the part of the sugar beet genome corresponding to ZR and nowhere else in the genome. The resulting 37,696 pairs were mapped against the high-quality ZR sequence from Dohm et al.  and were kept if they had passed the Illumina chastity filter and matched ZR uniquely with the correct read orientation and expected mapping distance of less than 500 nucleotides. Reads passed the chastity filter if they had, within the first 25 cycles, no more than one cycle of a chastity below 0.6 (Chastity = Highest intensity/(Highest intensity + Next highest intensity)). To keep adapter-free pairs only, pairs were removed if the two reads showed the wrong matching order within the reference, that is, if the reverse matching read was found upstream of the forward matching read. This occurs if the read length is larger than the sequenced library insert, resulting in an overlapping read pair containing the Illumina 3' adapter. From the remaining 28,993 pairs we further excluded 4,885 reads that mapped to a region of 6 kbp in ZR at positions 30 to 36 kbp. Within this putatively repetitive region, we found that error rates were elevated five-fold compared to other ZR regions, and the read coverage was five times higher than seen on average in ZR (Figures S1 and S2 in Additional file 1). This region had passed the first filtering step because it was underrepresented in the draft assembly. The final read data set comprises 26,495 read pairs and 111 single reads.
Read match coordinates and information on mismatches were retrieved from the ELAND output file. The Illumina PhiX preparation that was sequenced on the HiSeq contained three base positions that did not correspond to the PhiX reference. Errors at these positions were ignored during analysis.
ELANDv2 performs multiseed and gapped alignment of paired reads. In a mulitseed alignment, in case the first seed (default first 32 bases) cannot be mapped with up to 2 mismatches, the next seed (default next 32 bases) is attempted to be mapped. For B. vulgaris/PhiX reads we reduced the seed length to 31 bases to allow up to three seeds in reads of length 95 bp. Starting from the matching seed, the alignment is extended to the full length of the read allowing for more mismatches and gaps (indels) of up to 20 bases. ELANDv2 only opens gaps if a gap corrects at least five mismatches downstream and if the ratio between the number of mismatches in the gapped versus ungapped alignment is above a certain threshold. The latter criterion permits gaps that improve noisy ungapped alignments to be distinguished from bona fide small insertions/deletions (CASAVA1.7 User Guide).
Scripting and data visualization
To create statistics on errors and quality values, we extracted and processed information from the ELAND output using scripts written in Perl v5.8.9  and R 2.9.0 . Plots were generated with R.
Calculation of per-base indel error rates in homopolymer sequences
Homopolymer sequences in the reference were categorized according to their length. For each of the homopolymer tracts we determined the number of spanning reads and the number of reads with indels. We ignored the first and last ten bases of each read within which indel errors are not reliably called by ELAND. The number of indel errors in homopolymers of a certain size was divided by the number of reads spanning homopolymers of that size. The obtained rates were divided by the homopolymer sizes to obtain the per-base indel error rate in homopolymers (necessary to detect if the indel error rate actually increases with longer homopolymers despite the fact that a longer stretch of sequence can accumulate more indel errors).
Sequence data of this study have been submitted to the Sequence Read Archive (SRP008975).
bacterial artificial chromosome
next generation sequencing
Sequencing was performed in the Ultrasequencing Unit of the CRG. This work was supported by GABI-FUTURE grant 'BeetSeq-a reference genome sequence for sugar beet (Beta vulgaris)', FKZ 0315069A. The members of the 'BeetSeq' consortium are aware of the fact that sugar beet genomic sequences were used in this work, and we are grateful for their constructive comments throughout this study.
- GenomeWeb. [http://www.genomeweb.com/]
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008, 36: e10510-View ArticleGoogle Scholar
- Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylie T, Schedl T, Wilson RK, Mardis ER: Whole-genome sequencing and variant discovery in C. elegans. Nat Methods. 2008, 5: 183-188. 10.1038/nmeth.1179.PubMedView ArticleGoogle Scholar
- Aird D, Ross MG, Chen W-S, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12: R1810-View ArticleGoogle Scholar
- Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods. 2009, 6: 291-295. 10.1038/nmeth.1311.PubMedPubMed CentralView ArticleGoogle Scholar
- Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak MC, Hirai A, Takahashi H, Altaf-Ul-Amin M, Ogasawara N, Kanaya S: Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 2011, 39: e90-10.1093/nar/gkr344.PubMedPubMed CentralView ArticleGoogle Scholar
- Qu W, Hashimoto S-I, Morishita S: Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing. Genome Res. 2009, 19: 1309-1315. 10.1101/gr.089151.108.PubMedPubMed CentralView ArticleGoogle Scholar
- Dohm JC, Lange C, Reinhardt R, Himmelbauer H: Haplotype divergence in Beta vulgaris and microsynteny with sequenced plant genomes. Plant J. 2009, 57: 14-26. 10.1111/j.1365-313X.2008.03665.x.PubMedView ArticleGoogle Scholar
- TAIR. [http://arabidopsis.org/]
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.PubMedPubMed CentralView ArticleGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R2510-Google Scholar
- Nielsen R, Paul JS, Albrechtsen A, Song YS: Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011, 12: 443-451. 10.1038/nrg2986.PubMedPubMed CentralView ArticleGoogle Scholar
- Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.View ArticleGoogle Scholar
- Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456: 53-59. 10.1038/nature07517.PubMedPubMed CentralView ArticleGoogle Scholar
- The Perl Programming Language. [http://www.perl.org/]
- The R Project for Statistical Computing. [http://www.r-project.org/]
- Milne I, Bayer M, Cardle L, Shaw P, Stephen G, Wright F, Marshall D: Tablet--next generation sequence assembly visualization. Bioinformatics. 2010, 26: 401-402. 10.1093/bioinformatics/btp666.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.