Teaser: Individualized benchmarking and optimization of read mapping results for NGS data

Mapping reads to a genome remains challenging, especially for non-model organisms with lower quality assemblies, or for organisms with higher mutation rates. While most research has focused on speeding up the mapping process, little attention has been paid to optimize the choice of mapper and parameters for a user’s dataset. Here, we present Teaser, a software that assists in these choices through rapid automated benchmarking of different mappers and parameter settings for individualized data. Within minutes, Teaser completes a quantitative evaluation of an ensemble of mapping algorithms and parameters. We use Teaser to demonstrate how Bowtie2 can be optimized for different data. Electronic supplementary material The online version of this article (doi:10.1186/s13059-015-0803-1) contains supplementary material, which is available to authorized users.

Recent and ongoing advances in sequencing technologies and applications [1,2] lead to a rapid growth of methods that align next generation sequencing (NGS) reads to a reference genome (read mapping). By mid 2015, nearly 100 different mappers are available, although not all are equally suited for a given application or dataset [3]. The large number of potential options, and the even larger number of potential parameter configurations makes it challenging to choose the most appropriate mapper for a given dataset. Consequently most users generally rely on the default, unoptimized, parameters for one of a few popular algorithms, even when this choice performs very poorly compared to an optimized approach. This may introduce substantial biases in the subsequent analysis, including reduced coverage, reduced rates of mutations or heterozygosity, false determination of allele-specific expression, or other artifacts [4,5].
Previous research [3,[6][7][8] has focused on benchmarking mappers for particular scenarios (for example, SNP calling, or data from a specific sequencing instrument) or for selected organisms. Although these surveys represent a valuable resource for certain tasks, they are most often performed using only the default parameters and versions of the software that can be outdated by the time they are published. More significantly, these evaluations may not capture the data types or genomes used in the study at hand, which may have substantially different characteristics. To choose the most suitable parameter settings for a given mapper requires in depth knowledge of the data as well as the mapper. This is extremely complex to do, and in many cases, even the author(s) of the software may not fully appreciate how to best optimize their own software for a given dataset.
Recent efforts to provide guidance for choosing the mapper and its parameter setting like GCAT focus on human data [8]. GCAT is an online resource that hosts simulated reads for a version of the human genome that users can download and analyze, using their own analysis pipelines. Afterwards the results can be uploaded and are compared to the gold standard. On a voluntary basis the parameter settings of the analysis are made publicly available for the benefit of the community. However, not all researchers work with the human genome, and instead many researchers face the challenge of markedly different genome and read characteristics including the SNP rate, error rate, read lengths, quality of the reference genome, and reference sequence complexity such as GC content and repetitive regions, all of which influence the ability of mappers to align reads [3]. For example, whereas some mappers efficiently map reads to a human reference genome, they might be less adequate when applied to a draft de-novo assembled genome with incomplete and/or fragmented contigs. In any case, the choice of the mapper parameters depends on the characteristics of the data.
Here we introduce Teaser, a method that assists users to determine the most suitable mapper and parameters given the core characteristics of their individual experiment. Teaser simulates read data, executes a number of popular mapping tools under an ensemble of parameter settings, and then evaluates and illustrates the results. Teaser can also be used to optimize the mapping of genuine NGS reads, especially to account for difficult to simulate characteristics such as run-specific error modes or sequencing biases. Teaser's short runtime enables users to evaluate a multitude of different scenarios. Furthermore, Teaser is highly flexible, easily allowing to: (1) extend the catalog of mappers; (2) customize mapper parameters; (3) provide your own simulation or select from a list of preconfigured simulation methods; and (4) finetune the evaluation of mappers. Teaser provides several summary statistics from the experiments such as the fraction of correctly/wrongly mapped reads, correctly mapped reads per second, precision and recall, F-measure (that is, the harmonic mean of precision and recall) [7], maximum memory usage, and runtime. In the end, Teaser generates an HTML-based report including interactive figures that can be viewed using common web browsers. Teaser is available open-source as a web application (teaser.cibiv.univie.ac.at), a virtual machine image, and as a standalone version (github.com/Cibiv/Teaser).

Teaser description and data characteristics
Teaser comprises three main steps. First, it simulates reads based on user-defined specifications (for example, sequencing technology, sequencing error model, read length, SNP rate) and a reference genome. Second, Teaser automatically executes the mappers, and monitors and evaluates their performance. Third, it summarizes the evaluation results and generates an HTML-based report. In the following we will describe each step.

Simulation of reads from subsampled reference genomes
The first stage of the simulation encompasses randomly subsampling regions from the reference genome. By default, the length of each region is 10-fold of the userspecified average read length or insert size in the case of paired-end sequencing. We introduce the following default sampling rates based on genome size. Sampling stops if the total length of sampled regions exceeds 50 % of the reference genome length for small (<100 mb) or 25 % of medium sized genomes (<500 mb). For larger genomes (for example, human) Teaser samples regions until reaching 1 % of its genome length. Overall Teaser samples at least 15 mb to guarantee a robust measurement. The sampled regions are then concatenated into an artificial chromosome. To avoid reads branching from one region to another, regions are separated using two times the specified average read length or insert size of N's as padding. Subsequently either DWGSIM [9] or Mason [10] generate simulated reads from the artificial chromosome using the user specified characteristics for read lengths, error models, and genome characteristics (rate of heterozygosity, proportion of indels, and so on). Moreover, Teaser optionally accepts fastq and SAM files [11], containing user-simulated reads and their presumably correct alignment positions and strand (used for evaluation). Finally, Teaser can also accept just a fastq file of genuine or simulated reads for evaluation, although only a subset of metrics will be available since the true mapping positions are not known.

Mapping of simulated reads
After read generation, Teaser executes the user-defined mappers with the corresponding parameter values to align the reads to the complete reference genome. By default, Teaser includes: BWA (version 0.7.12-r1039) [12], BWA-MEM (version 0.7.12-r1039) [13], BWA-SW (version 0.7.12-r1039) [14], Bowtie2 (version 2.2.5) [15], and NextGenMap (version 0.4.13) [16]. Teaser monitors the runtime and maximum memory consumption for each mapper. The reported runtimes are the times needed for mapping the reads only and do not include the time for preprocessing steps (that is, indexing the reference genome), reading in the reference sequence, and initializing the mapper. Thus, these runtimes provide a useful estimation for larger read sets.

Evaluation of simulated and real data
Teaser first checks if a mapped read exceeds the userdefined mapping quality threshold (default equals 0). A mapped read with mapping quality below that threshold is counted as not mapped. Reads passing the mapping quality check are considered correctly mapped if the following conditions are true: (1) The reported starting position of the aligned read is within X bp (by default 50 % of the defined read length) from the original starting position of the simulated read. (2) The read is mapped to the same strand as it was simulated from. If any of these conditions is violated the read is considered wrongly mapped. In case of multi-mapping reads, only the primary alignment (the one the mapper considers to be best [11]) is taken into account. Other methods have been suggested to evaluate read alignments than comparing the distance between the correct and the reported mapping position of a read [6]. However, we evaluate the mapped reads based on mapping quality, mapped strand, and position, which is a common strategy that is both fast and accurate [15][16][17].
Teaser further provides a way to assess the mapping rate of the mappers and parameter settings based on real data. To grant a short runtime, Teaser first subsamples the same number of reads as it would use for the simulation (see above). After the mapping, however, Teaser computes the overall percentage of mapped reads rather than the percentage of correctly mapped reads since that information is not available. We recognize that mapping additional reads does not necessarily indicate higher quality alignments, but nevertheless this information is valuable to assess the robustness of parameter choices. Other metrics, especially runtime and memory requirement are recorded as before, allowing for in-depth performance optimization.

Mapping summary and report
Teaser provides six statistics for further evaluation. First, Teaser outputs the number of correctly, wrongly, and not mapped reads. Next, Teaser reports the precision (fraction of correctly mapped reads compared to all mapped reads) and the recall rate (fraction of correctly mapped reads if compared to correctly mapped reads and not-mapped reads) for each mapper. Teaser computes the F-Measure, the harmonic mean of precision, and recall as suggested by Caboche et al. [7]. Thus, the F score is a measure of a mapper's accuracy, ranging from 0 (worst) to 1 (best). Finally, Teaser reports the correctly mapped reads per second, the overall runtime, and the peak memory requirement for each mapper.
All results are displayed as part of an HTML-based report providing easy to read tables and interactive figures. The report allows a direct comparison of the mapping results for different summary statistics and different mapping quality thresholds. Note 1 (see Additional file 1) describes the options in more detail. We also invite the reader to visit teaser.cibiv.univie.ac.at for further details and the presentation of results.

Parameter optimization
In addition to benchmarking different mappers, Teaser can evaluate different parameter sets for each mapper, defined as either specific parameter values or ranges of values to be evaluated for each mapper. The second option can be used to automatically explore the impact of key parameters, such as the k-mer length, on the mapping results. If parameter ranges are defined for more than one parameter, Teaser systematically tests every combination and reports the results for each combination separately.
Finally, to identify the optimal parameter set for the user specific genomic data, Teaser provides an additional plot that shows the correctly mapped reads in percentages and the number of reads processed per second for all evaluated mapper and parameter combinations.

Performance evaluation
To demonstrate the usefulness of Teaser we benchmarked five mappers for 10 simulated datasets (Table 1). Similar to the data studied by GCAT [8], we simulated two human Illumina-like read sets assuming a genomic SNP frequency of 0.1 % and a 0.03 % (H1) or 0.07 % (H2) probability for the occurrence of insertions and deletions. Read length was set to 100 bp, assuming a  Table 1.
To investigate the influence of the subsampling process on the performance of the mappers we further generated five datasets with different down-sampling rates for each organism: human, Mus musculus and Drosophila melanogaster. Mason was used to simulated Illumina-like reads for the entire genome, and Teaser was applied to the full dataset. Table S2 (see Additional file 1) lists the properties of each dataset.
Finally, to assess the performance of the mappers on a de-novo assembly we simulated two 100 bp long paired end Illumina-like datasets (C1, C2) from a de-novo assembly of the 1Gbp Cottus rhenanus genome (FJS, J. Cheng, J. Altmüller, AvH, A.W. Nolte, under consideration). This de-novo assembly is accessible at DDBJ/ EMBL/GenBank under the accession LKTN00000000. C1 and C2 were simulated with 0.9 % SNP rate, to mimic a closely related population (Cottus perifretum) to further assess cross-species mapping performance. Based on the observation that a portion of the reads from the real dataset had a higher sequencing error rate, we simulated C1 and C2 with a 0.6 % (default of Mason) and 1 % sequencing error, respectively. The combined datasets C1 and C2 were provided to Teaser using the import function described above.
Unless otherwise mentioned mappers were executed with their default parameters on a desktop computer with an Intel(R) Core(TM) i5-2500 K (3.30 GHz) quadcore CPU and 16 GB of RAM. Only the datasets used to verify the down-sampling process were computed on an Intel(R) Xeon(R) CPU X5650 (2.67 GHz) with 32 GB of RAM.

The influence of subsampling for benchmarking mappers
Benchmarking the five mappers with Teaser based on a full human genome took more than 10 h. To reduce computing time, Teaser randomly samples nonoverlapping subsequences from the genome, and from each subsequence reads were simulated as described. The simulated reads were then mapped to the entire reference genome and evaluated. Figure 1a displays the fractions of correctly mapped reads for Human, Mus musculus, and Drosophila melanogaster datasets comparing no subsampling (star) with the standard subsampling rate of Teaser per mapper. Figure 1a shows that the proportion of correctly mapped reads is not significantly affected by the subsampling rate. The percentage of correctly mapped reads varied by less than 0.5 % for the default sampling rates for the five mappers and the three reference genomes. Thus, investigating the performance of mappers on randomly sampled regions of the genome suffices. This has the great advantage that it saves computing time as shown in Fig. 1b. The saving can be quite substantial, for the human genome we observe a runtime reduction from over 10 h to 13 min with negligible differences in observed mapping characteristics.

Benchmarking different mappers
Teaser benchmarked the two human datasets in 26 min (Fig. 2a, H1 + H2). For both datasets, BWA-MEM outperformed the other mappers in terms of correctly mapped reads. This result is consistent with the results reported by GCAT. Furthermore, the ranking of the four mappers (NextGenMap was not evaluated) was also consistent. This shows that Teaser produced for the human data the same results as GCAT, although Teaser has much greater flexibility.
Next, we evaluated if BWA-MEM still performs best when using data from other model organisms or using different sequencing platforms and protocols. This evaluation is only possible with Teaser. Teaser required between 4 (D3) and 58 (M4) min to benchmark the data, running five mappers with their default parameters. The long runtime of M4 was mainly due to the runtime of BWA-MEM (41 min), which accounted for 70 % of the total runtime. However we note that BWA-MEM was not designed for reads shorter than 70 bp [13]. For the remaining data BWA-MEM used on average 17.9 % of the total runtime. These results show the benefit of using Teaser to find the best mapper for a specific dataset, especially considering that a 1 % change in performance translates to tens of millions of additional reads mapped correctly in a genome-wide evaluation.
As highlighted by Fonseca et al. [3] and several others, runtime is crucial when selecting the best mapper especially in large-scale projects. In addition, for datasets like D1/D2 and M1/M2, where NextGenMap and BWA-MEM show very similar accuracies, runtime can be used to break the tie. To account for this, Teaser further reports the number of correctly mapped reads per second. Here, NextGenMap and Bowtie2 show the best performance, while BWA-MEM ranges from second to fifth place depending on the data (Fig. 2b).

Automated parameter optimization and evaluation of mapping accuracy
Our previous results (Fig. 2a) showed that BWA-MEM and NextGenMap performed best in terms of mapping reads to their correct position, while in terms of mapped reads per second (Fig. 2b), Next-GenMap and Bowtie2 outperformed the other mappers. However, the efficiency of Teaser with respect to computing times allows another application; Teaser can be used to identify parameter combinations that increase the fraction of correctly mapped reads and the number of correctly mapped reads per second.
To show the versatility of Teaser, we ran Teaser for Bowtie2 with the default parameters, and eight parameter-options provided by Bowtie2 (for example, '-very-fast' through '-very-sensitive'), the so-called Bowtie Preset parameters. In addition, we defined a custom range of values for three critical parameters: the length of the seed (−L), extending the alignment (−D), and the maximum number of times a repetitive read will be reseeded (−R). Thus a total of 34 different combinations of Bowtie2 parameters were evaluated and compared to BWA-MEM and NextGenMap, when the default parameters were used. a) b) Fig. 1 Effect of subsampling on the percentage of correctly mapped reads (a) and the runtime (b) for five mappers. For the genomes of Human, Mus musculus, and Drosophila melanogaster we each generated sets of reads, once for the whole genome (shown as star symbol) and 25 times using the default subsampling rates of Teaser. The plots in (a) show, that subsampling genomic regions has an insignificant effect on the mapping rates. However, subsampling saves substantial computing time, which is most impressive for the human genome, where we observed a 60-fold reduction We see that changing the mapping parameters can lead to a large increase in the number of reads mapped per second and also in the fraction of correctly mapped reads. The plot can be divided in four quadrants relative to the position of Bowtie2 using the default parameter (black triangle): upper/lower by left/right. The lower left quadrant (N-) represents the parameter values that resulted in a lower amount of correctly mapped reads and a reduced throughput. Thus, they are outperformed by the default parameters. The lower right quadrant (N − +) encompasses the parameter settings that improve throughput, but correctly map fewer reads. The upper left quadrant (N + −) reflects parameter values that achieved a higher percentage of correctly mapped reads, but with reduced throughput. Those parameter combinations are interesting, but come at the expense of additional runtime. Thus, they may not be preferred in all applications. Finally, the upper right quadrant (N++) represents those parameter settings that outperformed the default parameter both in terms of speed and correctly mapped reads. Thus, the N++ parameter settings are always preferable to the default settings.
For the Drosophila resequencing experiment (D1) the default parameter values were substantially improved in terms of runtime (32 % faster) at a marginal loss of 0.3 % in accuracy. For the related species sequencing experiment (D2) and the Ion Torrent resequencing experiment (D3) some parameters settings were superior in terms of both speed and accuracy. For D2 we identified parameter settings that lead to a runtime a) b) Fig. 2 Mapping accuracy and mapping efficiency for different mappers and different input data. a Percentages of correctly mapped reads; and (b) number of correctly mapped reads per second for five mappers. Table 1 gives a detailed description of the data. A * indicates that at least 99.9 % of the reads were not mapped, due to limitations of the mapper. Teaser runtime in minutes is shown in parenthesis improvement of 21 % and that mapped an additional 1 % of the reads correctly. The higher error rate and the length of the reads provided by Ion Torrent (D3) posed another challenge to optimize Bowtie2. Among the parameter combinations testes, Teaser found parameter values that led to a speed-up of 16 % and a mapping accuracy that increased by 1.2 %, compared to the Bowtie2 default values. More remarkably, one parameter combination (A-arrow Fig. 3 D3) tested by teaser almost doubled the percentage of correctly mapped reads (from 53.3 % to 88.6 %), while achieving the same throughput as BWA-MEM (blue triangle).
Summarizing, systematically varying the parameter combinations of a particular mapper often leads to a substantial increase of accuracy and throughput. This optimization can be done within a few minutes and can be adapted to the specific data at hand. However, this analysis can be automatically extended to other mappers.
Here we have only shown how to improve Bowtie2's performance, but we could also easily attempt to optimize BWA-MEM, NextGenMap, or other mappers as well.

Application to real data
Finally, we investigated how the mappers perform on a draft de-novo assembly using genuine and simulated Illumina data. We investigated the mapping of Cottus perifretum reads to the de-novo assembly of Cottus rhenanus (FJS, J. Cheng, J. Altmüller, AvH, A.W. Nolte, under The black triangle shows the performance for the Bowtie2 default settings. While BWA-MEM and NextGenMap results using their default settings are shown as a blue triangle and a yellow triangle, respectively. Symbols above the dotted horizontal line show an increase in mapping accuracy if compared to Bowtie2 default, whereas symbols right of the dotted vertical line represent parameter settings with increased throughput. The 'A' arrow points to the parameter combination that achieved the highest accuracy, whereas the 'S' arrow marks the parameter combination with the highest throughput. The numbers in parenthesis behind the input data (D1, D2, D3) are the total runtimes needed to evaluate all parameter combinations on the respective data consideration). Teaser was used to map a subset of the real reads (SRP064498) to the de-novo assembly using the same preset and default parameter settings of Bow-tie2 and the default parameter settings of BWA-MEM and NextGenMap. Figure 4 shows the result for the parameter optimizations for real and simulated data. For real data, the true mapping positions are not available, so Teaser uses the percentage of mapped reads as the criteria to assess the performance of each run as described above. Teaser supports this statistic to allow a comparison of the optimized parameter settings and mappers for the real and simulated data. Thus recognizing if the parameters for the simulation were chosen reasonable or not. For example, if the simulated and genuine results strongly disagree, this can indicate if the genuine reads have unrecognized trimming and/or sequencing error issues.
Teaser automatically simulated and subsampled 100 k reads for the simulated and real data, respectively. In both datasets the parameter settings of Bowtie2 perform equally well. Only the fastest parameter (indicated by S) changed for Bowtie2 from a preset set to a custom parameter defined by Teaser for the real dataset. The overall throughput of the mappers and parameter settings changed between the simulation and the real data. This is expected to some degree as the simulation cannot fully mimic the complexity of the real data, especially when mapping to a de-novo assembly that may be incomplete and/or fragmented [18].

Software
Teaser is available as a web application (teaser.cibiv.univie.ac.at), a virtual machine image and a command line version (https://github.com/Cibiv/Teaser) to increase the usability for expert and non-expert users. To further boost the applicability and advantage of using Teaser we provide different parameter settings used in this study on our github page (https://github.com/Cibiv/Teaser). Teaser is licensed under the MIT License. Furthermore, we encourage the community to contribute parameter settings that improve the performance of mappers. We will incorporate such settings in the parameter files provided on the github page. Teaser is easy to use and at the same time extendable for other methods and parameters to be evaluated.

Discussion
Choosing the most suitable read mapper and its parameters for a particular dataset is far from trivial [3]. Improper mapper or parameter selection can result in many significant technical artifacts, including reduced coverage, reduced rates of heterozygosity, false determination of allele-specific expression, or other false results. Nonetheless, most current studies rely on default parameters of arbitrarily chosen mappers. Teaser seeks to overcome this deficiency by assisting in choosing the appropriate mapper and parameter setting by measuring the performance over an ensemble of different mappers and parameter combinations. This evaluation takes only Fig. 4 Benchmarking parameters on a real (a) and simulated (b) draft de-novo assembly. For the simulations Teaser sampled 100 k reads of the C. perifretum dataset and mapped it to the de-novo assembly of C. rhenanus. Simulation parameters were approximated based on a visual inspection of the real dataset. For a detailed explanation of the legend we refer to Fig. 3