A tandem simulation framework for predicting mapping quality

Read alignment is the first step in most sequencing data analyses. Because a read’s point of origin can be ambiguous, aligners report a mapping quality, which is the probability that the reported alignment is incorrect. Despite its importance, there is no established and general method for calculating mapping quality. I describe a framework for predicting mapping qualities that works by simulating a set of tandem reads. These are like the input reads in important ways, but the true point of origin is known. I implement this method in an accurate and low-overhead tool called Qtip, which is compatible with popular aligners. Electronic supplementary material The online version of this article (doi:10.1186/s13059-017-1290-3) contains supplementary material, which is available to authorized users.

1. Only alignments at or above an alignment score threshold S min are considered. These are called valid.

2.
Local similarity filters require that one or more substrings of the read to align with few or no differences. For instance, a filter might require that alignments contain a length-k stretch of nucleotides that match the reference exactly. For long enough k, this disqualifies otherwise valid alignments. 3. Repeat filters avoid the extensive and mostly unproductive work performed when many candidates pass the local similarity filter. Say we take a 20 nt read then use a fast index lookup to find all reference locations where the first 10 nt of the read occur exactly. Say there are 100,000 such places. Instead of examining each candidate, Bowtie 2 will examine a subset and ignore the rest. 4. Early stopping halts alignment work once it seems to have become unproductive.
For instance, the aligner might stop searching once a certain number of consecutive index queries have failed to yield a valid alignment.

Note 2 Prevalence of alignment errors
We defined three categories of alignment error: 1. The read is reported to have originated from a locus in the reference genome, but actually originates from a DNA sequence not included in the reference 2. No alignment to the reference is found, but the read actually originates from some locus in the reference 3. An alignment to locus L r in the reference is reported, but the read actually originates from a different locus in the reference, L t To measure prevalence of these errors in a realistic setting, we conducted three sets of simulation experiments. In all three, most simulated reads were derived from a known "target" reference genome, either human (GRCh38) or mouse (GRCm38), with genetic variation and sequencing errors added. These represent reads from a donor individual that have their true point of origin in the reference genome. We also simulated two types of foreign sequences, having their true point of origin outside the reference genome. The first type of foreign sequence consists of species present in laboratories and on humans that have been shown to contaminate sequencing datasets. The second consists of sequence from a CHM1 hydatidiform mole assembly not present in the GRCh38 assembly.
These represent sequences that are present in a donor individual but missing from the reference genome. Both types of foreign sequence are important components of Category-1 errors [3]. All reads are aligned to the target reference only. The foreign sequences from the contaminating genomes are included in all experiments; the CHM1 sequences are included in one of the two human experiments, as indicated in Table S1.
To obtain the first kind of foreign sequence, we simulated 7% of the reads from contaminating genomes: 1% each from mouse (GRCm38), the bacteria Propionibacterium acnes, Mycolpasma Hominis, Mycolpasma fermentans, Acholeplasma laidlawii, and Mycolpasma hyorhinis, and the fungus Malassezia globosa. We arrived at these contaminants after sur-veying studies on sequencing data contamination [5,11,13]. When a read from one of these genomes aligns to the GRCh38 primary assembly, we call this a Category-1a error.
For the two experiments that did not include foreign sequence from CHM1, we simulated the remaining 93% of reads from the reference genome. For the third experiment, we included CHM1-specific sequence by simulating 93% of the reads from a concatenation of the GRCh38 reference and a collection of sequences of length at least 260 present in the CHM1 assembly but not in GRCh38. Note 3 gives further detail on how the CHM1-specific sequences were obtained. When a read from the CHM1 assembly aligns to GRCh38, we call this a Category-1b error.
We simulated four samples for each of the three experiments: unpaired and pairedend, 100-nt and 250-nt samples. We ran Mason as described in Note 4. We aligned reads with Bowtie 2 v2.3.2 as described in Note 5. In one experiment we inverted the roles of the mouse and human genomes, using mouse as the target and human as one of the contaminant genomes. For each aligned sample, we counted errors of the three categories and tabulated them as a percentage of the simulated reads and as a percentage of the errors (Table S1).
As expected, longer reads yield fewer errors than shorter reads and paired-end reads yield fewer errors than unpaired reads. A large majority of the errors, ranging from 95. .7%, are of category 3. The next largest, ranging from 0.2-3.9% of errors, are of category 2. These numbers could be reduced using more sensitive alignment parameters. Category 1 errors range from 0.27-0.46% of errors in the more realistic experiment that included CHM1 sequence, and from 0.05-0.014% in the other two experiments. The large fraction of category 3 errors underscores the need for better methods for predicting mapping quality.  Table S1: Prevalence of errors of different categories when analyzing Mason-simulated samples using Bowtie 2. The "% reads" column gives percent of all the reads, both correctly and incorrectly aligned, having that error category, and "% errors" gives percent of errors having that error category.
Similarly, we used MUMmer and nucmer [9] to align the MHAP CHM1 assembly [2] assembly to the GRCh38 primary assembly. We analyzed the resulting alignment using Assemblytics [12], with the results available at: http://assemblytics.com/ analysis.php?code=sIOvqHUKNaCpU8NXdcTS. This assembly is like the one used above (accession: GCA 000772585), but with low-support contigs (having fewer than 50 supporting reads) removed. We also downloaded this version of the assembly from For the paired-end samples, we ran Mason with parameters -ll 3L -le L. These establish a mean (3L) and standard deviation (L) for the fragment length distribution.
This means that some simulated fragments will fall outside the range 2L − 4L, despite the fact that this is the range of fragment lengths that the aligners will consider "concordant" (Note 5). This is desirable; for our paired-end samples, we would like the input to include reads that yield all three alignment categories: conc, disc and bad-end. where the end that one would expect to find upstream would actually be found downstream. For example, in a typical Illumina paired-end protocols, the two ends of a fragment originating from the forward reference strand would align with (a) end 1 upstream of end 2, (b) end 1 aligning in the forward orientation, and (c) end 2 aligning in the reverse orientation. But for around 0.10-0.15% of the pairs in a Mason-simulated sample, we found that the upstream/downstream relationship between the ends was reversed.
These were filtered out prior to our experiments.
wgsim. wgsim is one of the simulators we evaluate in Note 8. To generate unpaired samples with wgsim, we ran it with these parameters: wgsim -S (seed) -1 (read_len) -2 (read_len) -N 4000000 \ (genome_fasta) (output_fastq) -S sets the pseudo-random seed, which we set arbitrarily for our experiments. -1 and -2 specify the read lengths for either end of a paired-end read. -N specifies the number of reads to simulate. The last two parameters specify the reference genome FASTA file and the output FASTQ file.

SNAP.
For paired-end samples, we ran SNAP with the -s parameter to set the minimum and maximum concordant fragment lengths. We specified -s L 3L indicating that the distance between the leftmost base of the upstream made and the leftmost base of the downstream made is allowed to vary from L to 3L, achieving the same result as we did above with Bowtie 2's -I and -X parameters. Additionally, Qtip always sets SNAP's -= parameter for reasons described in Note 11. When running Qtip together with SNAP, the user may also specify additional SNAP parameters on the command line.

Note 6 Modifications to read aligners
We modified three alignment tools -Bowtie 2, BWA-MEM , and SNAP-to output feature data needed by Qtip. To make the modifications, we studied each tool's source code and identified data that could be relevant to the mapping quality prediction task.
We started by studying how the tool itself predicts mapping quality, since the inputs to those routines are candidate features. In all three cases, we added more features beyond these, usually with the goal of capturing more information about the repetitiveness of the read and how many of its "seeds" either failed to align or aligned repetitively.
Here we examine each tool, discussing (a) how the tool predicts mapping quality, (b) which features we re-used from that process, and (c) which additional features we added.
Bowtie 2 Bowtie 2's mapping quality calculation uses a lookup table, with the lookup keys being functions of (a) the best alignment's score, (b) the second-best alignment's score, (c) whether end-to-end or local alignment were used, (d) whether the read aligned concordantly in a pair. We preserve (a) and (b) in the features we elect to report from Bowtie 2. We do not need to report either (c) or (d) because these follow naturally from how Qtip works. In the case of (c), the training data is tailored to match the alignment parameters used to align the input data; the training data will always match the input data in terms of the alignment mode used. In the case of (d), Qtip trains separate models for separate alignment categories (conc, disc, unp, bad-end), and always uses the appropriate model for prediction.
Following is a list of the unpaired features output in the modified version of Bowtie 2: • Best alignment score • Difference between the best and second-best alignment score, or NA if no secondbest alignment was found • Difference between the best and second-best edit distance, or NA if no second-best alignment was found • Number of seeds for which Bowtie 2 sought an alignment (attempted seeds) divided by the number of nucleotides in the read • Fraction of attempted seeds that aligned uniquely (to exactly one place in the reference) • Same as previous but limited to seeds that aligned in the same orientation as the overall reported alignment • Fraction of attempted seeds that aligned repetitively (to more than one place in the reference) • Same as previous but limited to seeds that aligned in the same orientation as the overall reported alignment • Mean number of alignments found for any attempted seed • Same as previous but limited to seeds that aligned in the same orientation as the overall reported alignment Following is a list of the paired-end features output in the modified version of Bowtie 2: • Sum of the alignment scores for both ends of the best paired-end alignment • Difference between the best and second-best paired-end alignment score (i.e. the sum of the scores of the two ends), or NA if no second-best alignment was found • Difference between the best and second-best paired-end edit distance (i.e. the sum of the edit distances of the two ends), or NA if no second-best alignment was found Note 12 discusses the relative importance of these features as we measured in our simulation experiments.

BWA-MEM.
BWA-MEM 's mapping quality calculation depends on, among other factors, (a) the best alignment's score, (b) the second-best alignment's score, (c) the number of alignments "tied" for second-best, (d) the number of aligned bases (i.e. excluding any soft-clipped bases), (e) the percent identity, summarized across the aligned bases, (f) the seed coverage, which roughly equals the mean number of times a read base is covered by an aligned seed. We use these features and others in Qtip.
Following is a list of the unpaired features output in the modified version of BWA-MEM : • Best alignment score • Edit distance of the best alignment • Difference between the best and second-best alignment score, or NA if no secondbest alignment was found • Number of alignments "tied" for second-best • Fraction of seeds that aligned repetitively • Seed coverage Following are the paired-end features output in the modified version of BWA-MEM : • Difference between the best and second-best paired-end alignment score (i.e. the sum of the scores of the two ends), or NA if no second-best alignment was found • Sum of alignment scores of the two ends of the best paired-end alignment • Number of paired-end alignments "tied" for second-best • Fraction of seeds from either end of the pair that aligned repetitively • Seed coverage over both ends of the pair Note 12 discusses the relative importance of these features as we measured in our simulation experiments.

SNAP.
For a given read, SNAP assigns a probability to each alignment found for that read. SNAP's procedure for calculating the best alignment's mapping quality mostly consists of dividing the best alignment's probability by the sum of the probabilities of all the other alignments found. The probability of a given alignment is function of the length of the read, the read's quality values, and a collection of parameters giving the prior probability of a variant, gap opening, gap extension, etc. We use these probabilities and other features in Qtip.
Following are the unpaired features we output in the modified version of SNAP: • Edit distance of the best alignment • Difference between best and second-best alignment's edit distance, or NA if no second-best alignment was found • Probability of the best alignment divided by the total probability of all the alignments found • Number of seeds ignored because they matched in the genome too many times Qtip parses feature data from the extra ZT:Z SAM field. So once we identified a feature, we enable Qtip to use that feature simply by printing it as an additional value in the ZT:Z field. This field contains comma-separated numeric data. The following is an example of an alignment produced by BWA-MEM with the ZT:Z field. The ZT:Z field is not printed for unaligned reads, and Qtip will ignore these.
Qtip also ignores the small number of chimeric alignments reported by BWA-MEM.
Proper handling of chimeric alignments is future work.
In addition to the features discussed above, which are reported by the read aligners in the ZT:Z field, Qtip adds a few additional features, specifically: • Length of the read or end • Sum of the base qualities of the aligned bases in the read or end • Sum of the base qualities of the soft-clipped bases in the read or end Finally, for a given end of a paired-end alignment, whether concordant or discordant, feature data for the opposite end is included in the record for the given end. I.e., in the record for a given end of a paired-end read, you will find aligner-specific unpaired feature data, aligner-specific paired-end feature data, generic features, and both aligner-specific and generic features for the opposite end.
When the training and test data are compiled within Qtip, columns in which all elements are equal are eliminated. E.g., if the input consists of reads all of the same length, then the read length will be identical for all records and the corresponding column will be dropped from the training and test matrices. Similarly, if the input consists of reads aligned in end-to-end alignment mode (Bowtie 2's default), then the sum of the base qualities of the soft-clipped bases will always be 0, and the column corresponding to that feature will again be dropped.  Table S2: Relative change in area under CID (RCA) and relative change in sum of squared error (RCE) for various aligners and two versions of the human reference genome, expressed as percent change. Some experiments simulated reads from just the reference ("GRCh37," "GRCh38") while other simulated from insertions in the CHM1 assembly relative to the reference assembly ("GRCh37+CHM1," "GRCh38+CHM1"). The experiments used 100 nt or 250 nt reads, and unpaired or paired-end reads, as indicated. Results are means and standard deviations over 10 random trials, repeated starting from the input modeling step.

Note 7 Varying aligner sensitivity
We used Qtip together with Bowtie 2 to align and predict mapping qualities for the Mason-simulated 100 nt and 250 nt unpaired and paired-end samples. We repeated each experiment four times, varying Bowtie 2's sensitivity parameter: --very-fast, --fast, --sensitive or --very-sensitive. We also tried Qtip both in its (default) end-to-end alignment mode and in its local alignment mode. We calculated RCA and RCE for all 10 trials of all experiments (Table S3) and plotted CSED for the first trial of the experiments that used the 100 nt reads ( Figure S1).
Qtip-predicted mapping qualities are superior overall to Bowtie 2's in every scenario tested, as indicated by the negative RCAs and RCEs (Table S3). Variability of RCAs and RCEs across trials is generally modest, but higher for --very-sensitive mode. Standard deviations range up to 5.03.
CSED curves ( Figure S1) again show that for some ranges of cutoffs, aligner-reported mapping qualities are better at minimizing cumulative squared error, corresponding to the portions of the CID curves that rise above y = 0. This is most clearly observed for unpaired --very-sensitive alignment. But Qtip provides superior predictions for the vast majority of cutoffs, including nearly all cutoffs for the paired-end experiments.  Table S3: Relative change in area under CID (RCA) and relative change in sum of squared error (RCE) when running Qtip and Bowtie 2 at various levels of sensitivity on Masonsimulated Illumina-like samples. Relative change is expressed as a percent. Each simulated sample consists of 4 million reads or pairs. Bowtie 2 is run in either end-to-end or local alignment mode as indicated. Results are means and standard deviations over 10 random trials, repeated starting from the input modeling step.   (Table S4) and plotted CID for the first trial ( Figure S2).   Table S4: Relative change in area under CID (RCA) and relative change in sum of squared error (RCE) for samples simulated by various simulators. Bowtie 2 end-to-end alignment was used in all cases. Relative change is expressed as a percent. The experiments used 100 nt or 250 nt reads, and unpaired or paired-end reads, as indicated. Results are means and standard deviations over 10 random trials, repeated starting from the input modeling step.

Note 9 Variant calling experiment
Alignment: We used Qtip v1.6.2 together with Bowtie 2 v2.3.2 to align the paired-end sequencing reads from ERR194147. The Qtip command used was: The Bowtie 2 arguments used (inserted where ${BOWTIE2_ARGS} appears above) are: -I 0 -X 550 -t -p 24. The first two parameters set minimum and maximum fragment length to 0 and 550 respectively. The -t parameter enables extra timing output.
The -p 24 --reorder parameters run Bowtie 2 with 24 threads while keeping output alignments in an order corresponding to the input reads.
With these parameters, the output directory from Qtip contains both an input.sam file with the original mapping qualities, and a final.sam file with the Qtip-predicted mapping qualities for the same alignments.
Variant calling: After converting the SAM output from the alignment run to sorted BAM using sambamba [15], we then used Freebayes v1.1.0 to call variants: The -X -u arguments ensure Freebayes calls only SNVs and indels, though the indels calls were not studied here. The --haplotype-length 0 argument ensures each SNV appears as a separate call in the VCF file, even in cases where nearby SNVs can be phased into a haplotype block. This helps make the SNVs more comparable between the The above command was run twice for every mapping quality threshold: once with ${INPUT_BAM} is set to the sorted BAM derived from the input.sam file, and once with with ${INPUT_BAM} set to the sorted BAM derived from input.sam.

Variant filtering:
We applied two filters to the variant calls output by Freebayes. First, we eliminated variant calls with spuriously high coverage, as suggested in prior work [10]. Spuriously high coverage tends to indicate the presence of a copy number change that confounds variant calling. This was accomplished with a simple awk script that removed VCF records with depth (as indicated by the DP field of the INFO column) at least four poisson standard deviations above the mean. Mean depth of coverage was calculated to be 52.79 using samtools flatstat, so the filter removed variants with depth greater than 82. This script, and all others used for these experiments, can be found at https://github.com/BenLangmead/qtip-experiments/tree/v1.6.1.
Next, we filtered the Freebayes output to eliminate variant calls outside genomics regions deemed "high confidence" by the Platinum Genomes [4] project. We used the vcfintersect tool from the vcflib (https://github.com/vcflib/vcflib) collection; the specific command was:  Finding the maximal F scores: All of the above experiments were carried out for all mapping-quality thresholds and for both the original and the Qtip-predicted mapping qualities. For the original mapping qualities and given a particular β , we calculated maximal F β by evaluating F β :
We then reported the maximal F β .

Note 10 Setting the number of tandem reads
Qtip has parameters that control the minimum number of tandem reads or pairs of each category (conc, disc, bad-end and unp) to generate. The default number of training reads/pairs generated for each category is 45 · √ x where x is the number of input alignments of that category. Both the scaling factor and the function are configurable via Qtip's --sim-function and --sim-factor parameters.
We conducted a set of simulation experiments on a few selected combinations of functions and coefficients. We used Mason to simulate two sets of 50M reads, one unpaired set and one paried-end. Both sets used 100 nt reads, with paired-end reads having simulated fragments lengths mostly between 200 and 400 nt, per the strategy described in Note 4.
We tried the following policies for setting a target number of tandem alignments: . 50, 000

100, 000
We say that these policies set a "target" number of alignments because the actual tandem simulation procedure could result in fewer or more alignments. This is because (a) not every tandem read will align successfully, (b) the target is inflated somewhat prior to simulation in order to account for alignment failures, and (c) the tandem simulator actually conducts a large number of binomial draws; the total across all the draws will be very close to, but not necessarily equal to, the inflated target.
We tried each of these policies for each of the following numbers of input reads: 1. 1,000,000 2. 5,000,000 3. 10,000,000 4. 50,000,000 The script used to drive these experiments is present in the qtip-experiments repository as experiments/simulated reads/train series.sh.
As seen in Figure

Note 11 Input modeling and tandem simulation
Scanning the input. The construct the input model, Qtip using reservoir sampling to obtain an appropriate-sized subset of the input alignments. It does this over the course of a single scan through the input alignments. For paired-end alignments, Qtip's scan matches up the two ends so that they can be considered together here. Reads that fail to align are ignored, as are non-primary alignments,. During the scan, Qtip distinguishes between the four different categories of alignment: conc, disc, unp and bad-end. A separate reservoir sampler is used for each category. The maximum number of alignments sampled in a category can be configured with Qtip's --input-model-size parameter (default: 30,000).
Templates. Every category of template contains at least the following fields: • Score SNAP does not report the MD:Z field, and the CIGAR field alone is not always sufficient to determine where mismatches are located. However, SNAP can optionally print a CIGAR string that does contain sufficient information to determine where mismatches are located (the -= parameter), so when running SNAP from Qtip, that parameter is always specified.
When that parameter is specified, the edit transcript can be inferred from the CIGAR field alone.
The 5 fields listed in the previous subsection are sufficient for representing an unpaired read template. A bad-end template includes two additional fields: • Whether aligned end is end 1 or 2 • Length of the (unaligned) opposite end These are inferred from the SAM FLAG and SEQ fields for the opposite end.
A conc template contains all of the fields listed above plus these additional fields: • Score of the opposite end Most of these additional fields are inferred straightforwardly from the SAM records for the two ends as described previously. Which end aligns upstream is determined by examining the SAM POS field for the two records. Fragment length is inferred from the SAM TLEN field.
A disc template is identical to a conc template. Note, though, that the fragment length field may not be interpretable for a discordantly aligned pair, since the two discordantlyaligned ends can align very far apart or even to different chromosomes.

Handling soft clipping. Input alignments with soft-clipped bases (S CIGAR operation)
pose a problem for our templates: because the soft-clipped bases failed to align, we do not know how to represent them in the edit transcript. Simply excluding those bases from the edit transcript -effectively trimming them from the tandem read -is not desirable. We would like the tandem reads to mimic the input reads in all important ways, including in the prevalence of soft-clipping.
One solution is to, for each instance of soft clipping, extract the corresponding stretch of reference bases and perform post-facto alignment of the soft-clipped bases. The alignment would likely be poor (hence the soft clipping), but would yield informative values for the edit transcript. This is slow in practice, however, since it requires that we extract a arbitrary stretch of the reference genome -requiring disk head movement in many cases -for each instance of soft clipping.
Instead, Qtip represents soft-clipped bases with special S characters in the edit transcript. When simulating the read, Qtip simply inserts a random base into each softclipped position. Advantages of this approach are: (a) it is fast and simple, requiring no additional alignment, and (b) random bases are a reasonable proxy for soft-clipped bases since' they are unlikely to align well to the reference, and so will likely be clipped. The disadvantage is that there are no guarantees about how successfully the random bases will act as a proxy for the soft-clipped bases. For example, the random bases could, by coincidence, match the reference genome well enough that they align without any soft clipping, or with less soft clipping than was needed for the corresponding input read.
This causes the tandem reads to depart somewhat from the character of the input reads, though Qtip's good performance on local read aligners suggests that the effect is likely minor.
Simulating from templates. Qtip uses the input models to create the tandem read set in the following way. The user specifies the location of the reference genome in FASTA format to Qtip. Qtip iterates through overlapping windows (chunks) of the reference genome.  (7) printing the read to a FASTQ file in preparation for the tandem alignment step.
A bad-end read is simulated similarly except that a dummy opposite end is synthesized by generating a random nucleotide sequence. The dummy opposite end is virtually guaranteed not to align, mimicking the situation for the input pair.
Paired-end reads are simulated similarly except that when extracting the substring from the reference genome, a substring corresponding to the entire fragment is extracted.
In the case of a conc pair, fragment length is dictated by the length given in the template. In the case of a disc pair, fragment length is set to a pre-determined value larger than the largest concordant fragment length simulated. This is configurable via Qtip's --max-allowed-fraglen parameter. Once a fragment substring is extracted, read sequences are extracted from either end of the fragment, choosing the upstream end and the orientations according to the template.
In order to simulate a long fragment, the entire fragment must be present in some chunk returned by the FASTA parser. To guarantee this will be the case, Qtip sets the overlap length O equal to the maximum fragment or read length in any of the input models.
Aligning tandem reads. After tandem reads are simulated, they are aligned (step 4) using the same read aligner and the same alignment parameters as were used to align the input reads in step 1. However, circumstances can cause the tandem alignments to depart from the template.
First, a tandem read may simply fail to align, decreasing the number of Qtip's training examples by one. This is relatively rare, since the templates themselves correspond to reads that aligned successfully, and are therefore unlikely to run seriously afoul of the aligner heuristics. However, the tandem read might be extracted from a qualitatively different genomic region than the original read. For example, the tandem read might be drawn from a repeat, whereas the original read was not. This increases the chance that, e.g., the aligner might stop early before finding a valid alignment for the tandem read.
Second, the tandem read might align successfully, but with a different alignment score and/or edit transcript from those in the template. E.g., if the tandem read is drawn from a repetitive portion of the genome, the edits induced by the transcript might cause the read to align with higher alignment score elsewhere. Alternately, the aligner might fail to align the read to its true point of origin and instead align it such that its score is lower than the template score. We so no strong evidence that Qtip's predictions are systematically biased in either direction, either producing tandem alignments with overall higher or overall lower scores than the input alignments.

Discussion.
We described a method for deriving input models from aligned reads. We also described a method for using the input models, together with the reference genome, to simulate a collection of tandem reads that mimic the input reads in key ways. The method has several advantages: (a) it is simple, not requiring an external read simulator, (b) input models are built during a single pass over the input alignments, (c) reads are simulated during a single pass over the reference genome, (d) since the input alignments are scanned one-or two-at-a-time, and since only a single chunk of the reference genome considered at a time during simulation, the added memory footprint is low.
It should also be noted that, because read sequences are drawn randomly from across the genome and are matched with templates randomly, the tandem reads taken as a whole are not consistent with any particular genome sequence. That is, the reads will not generally agree with each on, e.g., where any SNPs or other variants are located with respect to the reference genome. But since the read aligners themselves consider each read independently and one at a time, this should not affect the performance of the mapping quality model trained in step 6.

Note 12 Feature importances
We studied the feature importances reported by the RandomForestRegressor model for the simulation experiments described in the main text in Figure 2 and Table 2. Feature importances are calculated by scikit-learn and included in the object representing the fit model. Importances range from 0 to 1, with higher importance indicating the feature tends to appear higher in the decision trees.
Recall that a separate model is trained for each of the four categories of alignment: conc, disc, bad-end, and unp. The first three are relevant for the paired-end simulations and the last for unpaired simulations. We plot importances for the 6 features with the highest average importance across all trials and reference genomes. The boxplots show the importances across the 10 trials.
Recall that the random forest model is indifferent to scale. Some of the features described below should intuitively have a positive correlation with mapping quality (e.g. Score diff) while others should have a negative correlation (e.g. % repetitive). The random forest is not predisposed to one or the other. Figure S4 shows results for the Bowtie 2 experiments for 100 nt reads for all reference genomes and models. Figure S5 shows the same for 250 nt reads. The features mentioned in those plots are:

Bowtie 2
• Best score: Alignment score of the reported (best) alignment.
• Score diff: Score of best alignment minus score of second-best alignment, using the configured scoring scheme. If no second-best alignment is found, this is set to a value larger than the largest observed score difference.
• Score diff other: Like Score diff but for the opposite rather than the current end of the paired-end alignment.
• Conc score diff: Score of best concordant paired-end alignment minus score of secondbest concordant paired-end alignment, using the configured scoring scheme. The score of a concordant alignment is calculated as the sum of the alignment scores of both ends. If no second-best concordant paired-end alignment is found, this is set to a value larger than the largest observed difference.
• Score diff other: Same as Score diff but for the opposite rather than the current end of the paired-end alignment.
• Edit diff: Edit distance of best alignment minus edit distance of second-best alignment, ignoring the configured scoring scheme. If no second-best alignment is found, this is set to a value larger than the largest observed edit distance difference.
• Edit diff other: Same as Edit diff but for the opposite rather than the current end of the paired-end alignment.
• Conc edit diff: Edit distance of best concordant paired-end alignment minus edit distance of second-best concordant paired-end alignment, ignoring the configured scoring scheme. The edit distance of a concordant alignment is calculated as the sum of the edit distances of both ends. If no second-best concordant paired-end alignment is found, this is set to a value larger than the largest observed difference.
• % unique stranded: Fraction of seeds from the same strand as the reported alignment that matched the reference in exactly one location.
• % unique stranded other: Same as % unique stranded but for the opposite rather than the current end of the paired-end alignment.
• % unique: Fraction of seeds that matched the reference in exactly one location, regardless of seed's strand.
• % repetitive: Fraction of seeds that matched the reference in more than one location, regardless of seed's strand.
• Avg seed hits: Average number of times a seed matched the reference.
• Avg seed hits stranded: The average number of times a seed from the same strand as the reported alignment matched the reference.
• Frag length: Inferred fragment length as reported in the SAM TLEN field. Figure S6 shows results for the BWA-MEM experiments for 100 nt reads for all reference genomes and models. Figure S7 shows the same for 250 nt reads. The features mentioned in those plots are:

BWA-MEM
• Best score: Alignment score of the reported (best) alignment.
• Score diff: Score of best alignment minus score of second-best alignment, using the configured scoring scheme.
• Score diff o: Like Score diff but for allows the second-best alignment to be redundant with the first.
• Score diff other: Like Score diff but for the opposite rather than the current end of the paired-end alignment.
• Score diff o other: Like Score diff o but for the opposite rather than the current end of the paired-end alignment.
• Conc score diff: Score of best concordant paired-end alignment minus score of secondbest alignment, using the configured scoring scheme. The score of a concordant alignment is calculated as the sum of the alignment scores of both ends.
• # tied for 2nd: Number of alignments known to be tied at the second-best alignment score.
• # tied for 2nd o: Like # tied for second but for allows the second-best alignment to be redundant with the first.
• # tied for 2nd o other: Like # tied for second o but for the opposite rather than the current end of the paired-end alignment.
• # conc tied for second: Number of concordant alignments known to be tied at the second-best alignment score. The score of a concordant alignment is calculated as the sum of the alignment scores of both ends.
• Seed coverage: Fraction of read length covered by aligning seeds.
• % repetitive: Fraction of seeds that matched the reference in more than one location, regardless of seed's strand.
• Frag length: Inferred fragment length as reported in the SAM TLEN field. Figure S8 shows results for the SNAP experiments for 100 nt reads for all reference genomes and models. Figure S9 shows the same for 250 nt reads. The features mentioned in those plots are:

SNAP
• Prob best: SNAP's estimate of the probability the best alignment is correct.
• Prob best other: Like Prob best but for the opposite rather than the current end of the paired-end • Score diff: Edit distance of best alignment minus edit distance of second-best concordant paired-end alignment, ignoring the configured scoring scheme. (Note: SNAP's scoring scheme is edit distance, so "edit distance" and "score" are used interchangeably here.) If no second-best alignment was found, this is set to a value just smaller than the most negative observed difference.
• Score diff other: Like Score diff but for the opposite rather than the current end of the paired-end • Min hits: The minimum number of matches for any seed, regardless of strand.
• Min hits other: Like Min hits other but for the opposite rather than the current end of the paired-end • Min hits strand: The minimum number of matches for any seed taken from the same strand as the reported alignment.
• Min hits strand other: Like Min hits strand but for the opposite rather than the current end of the paired-end alignment.
• Pop seeds skip: Number of seeds ignored because they matched the reference too many times.
• Hits per lookup: Average number of seed hits per index query.
• Hash misses: Number of index queries that returned 0 seed hits.
• Frag length: Inferred fragment length as reported in the SAM TLEN field.  Figure S5: Feature importances for Bowtie 2 experiments on 250 nt reads for all reference genomes and models. Plotted are the 6 features with greatest average importance across all trials and reference genomes. Boxplots show importance measurements across 10 trials. The model for the unpaired simulation is labeled "Unp" and models for the paired-end simulations are labeled "Conc," "Disc" and "Bad-end." A more detailed description of the features is given in Note 12.    Figure S8: Feature importances for SNAP experiments on 100 nt reads for all reference genomes and models. Plotted are the 6 features with greatest average importance across all trials and reference genomes. Boxplots show importance measurements across 10 trials. The model for the unpaired simulation is labeled "Unp" and models for the pairedend simulations are labeled "Conc," "Disc" and "Bad-end." A more detailed description of the features is given in Note 12.  Figure S9: Feature importances for SNAP experiments on 250 nt reads for all reference genomes and models. Plotted are the 6 features with greatest average importance across all trials and reference genomes. Boxplots show importance measurements across 10 trials. The model for the unpaired simulation is labeled "Unp" and models for the pairedend simulations are labeled "Conc," "Disc" and "Bad-end." A more detailed description of the features is given in Note 12.

Note 13 Mapping-quality prediction model
Random forest model Qtip uses the scikit-learn Python module [14] to build its mapping quality prediction model. Specifically Qtip uses scikit-learn's RandomForestRegressor, a model consisting of an ensemble of decision trees. Each tree is trained on a bootstrap sample of the training data, and each tree contributes a "vote" as to the probability the given alignment is correct. The final prediction is an average of all trees' votes.
Random forests have many advantages in this setting. First, decision tree training is not sensitive to how features are scaled. When adding a new feature, an aligner author can simply report the data as-is, without first having to consider whether log scaling, or any other monotone scaling, is needed to bring the feature into agreement with other features. This is a major advantage over scale-sensitive models like support vector machines or neural networks.
Second, random forests are efficient. They have few hyperparameters, making the training process simpler and faster. E.g., the SNAP aligner takes over 2 hours to align the paired-end ERR050083 dataset, whereas the resulting random forests take about 2 minutes to train. This is a key advantage over, e.g., neural networks. Having said that, the prediction step -step 7 -is the single biggest contributor to Qtip's computational overhead. We plan to explore whether using a different random forest implementation could increase prediction throughput.
Finally, after a RandomForestRegressor model is fit, it optionally reports feature importances. A feature's importance is essentially the degree to which splitting on the feature helps to separate positive from negative training examples.
Hyperparameters Parameters governing the size, shape and splitting behavior of the random forest, often called hyperparameters, have to be chosen at model fitting time. For this model, the relevant parameters are: • Number of decision trees in the forest. Set to 30 by default in Qtip, but adjustable using the --num-trees parameter.
• Maximum number of leaf nodes per decision tree. Set to 35 by default in Qtip, but adjustable using the --max-leaf-nodes parameter.
• Maximum number of features to split on at each internal decision tree node. This is expressed as the fraction of the total number of features, and it's allowed to vary from 0.1 to 0.45 by default. It is adjustable using the --max-features parameter.
As mentioned, the --max-features hyperparameter is not fixed. In fact, all three hyperparameters can be configured to take values in a range; e.g., specifying the parameter --num-trees 30,35,40,45,50 will allow --num-trees to take on any of the specified values. For each hyperparameter allowed to vary in this way, a particular value is chosen at model-training time via a hyperparameter fitting procedure that performs hill-climbing with a configurable numeric tolerance (--optimization-tolerance).
The hyperparameters selected are those that yield the best out-of-bag score after training. The out-of-bag score is calculated and by the RandomForestRegressor model.
As an alternative to the out-of-bag score, cross validation can be used instead (via Qtip's --no-oob parameter).
Alternate models Qtip can optionally be configured to use the ExtraTreesRegressor class, which implements a variant on random forests called extremely randomized trees [6]. The hyperparameters are defined, configured, and fit in exactly the same manner as for the RandomForestRegressor.

Forming the training and test matrices
The test matrix is a matrix where each row is an input alignment and each column is a feature. Elements contain feature data. In the case of a paired-end alignment, the two ends appear as separate rows in the matrix. The training matrix is similar, except that the rows are tandem alignments rather than input alignments. For each alignment category, Qtip builds a training matrix and uses it to train the model (step 6). Qtip uses the model to predict mapping quality for the input alignments (step 7).
As mentioned, the columns of the matrices are features and the rows are alignments.
Qtip might simplify the matrix in certain ways. For example, if all elements in a column are identical in the training matrix, that column is removed from both the training and test matrices. Also, if two columns of the training matrix are identical, the rightmost of the two columns is removed from both the training and test matrices.
To maintain a low memory footprint, matrices are handled such that only a single fixed-sized block of consecutive rows needs to be present in memory at a time. For example, when Qtip parses test records derived from the input SAM file, the records are read in blocks, with only one block residing in memory at a time. By default, a single block contains at most 250K records. This is configurable via Qtip's --max-rows option. Qtip then uses the model to predict mapping qualities for the block and writes the corresponding mapping qualities to disk before deallocating the block and moving on.
Missing values As discussed in Note 6, some feature data can be "missing." For example, an important feature used in Qtip is the difference between the score of the best and second-best alignments found by the aligner. But if the aligner fails to find a second-best alignment, this feature will take the value NA, meaning missing or "not available." Qtip deals with NA values by automatically setting all of the NA values in the column equal to x + 1 where x is the maximum non-NA value in the column. The fact that NAs are set to be slightly larger than the largest non-NA value often suggests particular way of ordering the data values for a feature. For example, in the case of the difference between the best and second-best score, it makes sense for larger differences to be represented by larger numbers, so that when an NA is replaced by x + 1, this is interpreted by the model, sensibly, as a large difference.