Fig. 2 | Genome Biology

Fig. 2

From: Strobealign: flexible seed size enables ultra-fast and accurate read alignment

Seed uniqueness and time allocation in strobealign. A The expected number of hits from a seed randomly drawn from the reference (E-hits) for some popular seeding approaches (k-mers, minimizers, syncmers) compared to strobealign’s seeds. Minimizers and syncmers are both sampled at a sub-sample fraction of 1/5, and minimizers use a random hash function. For strobealign’s seeds, which have variable lengths, the median seed length is plotted. Strobealign’s seeds for read lengths 100–500 nt (typically two linked syncmers of length 20) reduce repetitiveness by an order of magnitude on hg38 compared to using a single syncmer or minimizer of length 20. B The fraction of seeds that would be hard masked in strobealign (those occurring over 1000 times). On hg38, strobealign’s seeds for read lengths 100–500 nt hard mask 2.6–6 times fewer seeds than syncmers of length 20. C The real time spent at various steps in strobealign using 16 threads for the SIM3 datasets of 10 million paired-end reads of different read lengths and the subsampled BIO150 and BIO250 datasets of 4 million paired-end reads of length 250. The label “reading” refers to reading the FASTQ files, “strobemers” refers to the time to generate strobemer seeds from the reads, “find_matches” refers to retrieving and creating merged matches from all strobemer seeds below the repetitive abundance threshold, “rescue” refers to finding merged matches in the rescue mode, “sort_matches” refers to sorting the matches with respect to the candidate map score, and “aln” refers to the base-level alignment, in which the large majority of the runtime consists of calling ssw and a small fraction is spent computing the Hamming distance. Writing the output to SAM was not logged in the experiments but typically takes less time than reading the input.
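
The two seed statistics in panels A and B can be illustrated with a minimal sketch. It assumes plain k-mers as seeds (not strobealign's linked syncmers), and it assumes that a seed "randomly drawn from the reference" is drawn uniformly over seed positions, so that the expected number of hits is the sum of squared seed counts divided by the total number of seed occurrences. The masked fraction is taken here over distinct seeds, which is also an assumption. The function names `kmer_counts`, `e_hits`, and `masked_fraction` are hypothetical and are not part of strobealign.

```python
from collections import Counter


def kmer_counts(reference: str, k: int) -> Counter:
    """Count every k-mer occurrence in the reference (illustrative seeds only)."""
    return Counter(reference[i:i + k] for i in range(len(reference) - k + 1))


def e_hits(counts: Counter) -> float:
    """Expected number of hits for a seed drawn uniformly over seed positions.

    A seed occurring c times is drawn with probability c / total and then
    yields c hits, so the expectation is sum(c^2) / total.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no seeds to sample from")
    return sum(c * c for c in counts.values()) / total


def masked_fraction(counts: Counter, threshold: int = 1000) -> float:
    """Fraction of distinct seeds occurring more than `threshold` times."""
    if not counts:
        raise ValueError("no seeds to evaluate")
    masked = sum(1 for c in counts.values() if c > threshold)
    return masked / len(counts)


if __name__ == "__main__":
    counts = kmer_counts("AAAC", k=2)  # seeds: AA, AA, AC
    print(e_hits(counts))
    print(masked_fraction(counts, threshold=1))
```

On a real genome the same functions would be applied to the counts of whichever seed type is being compared; a lower E-hits value means a randomly chosen seed is less repetitive.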
