Table 1 Glossary of terms

From: When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data

| Term | Definition |
| --- | --- |
| Bit-pattern observable | The run of 0s in a binary string |
| Bit vector | An array data structure that holds bits |
| Canonical k-mer | The smaller of the two hash values of a k-mer and its reverse complement |
| Hash function | A function that takes input data of arbitrary size and maps it to a bit string that is of fixed size and typically smaller than the input |
| Jaccard similarity | A similarity measure defined as the size of the intersection of two sets divided by the size of their union |
| K-mer decomposition | The process of extracting all sub-sequences of length k from a sequence |
| Minimizer | The smallest hash value in a set |
| Multiset | A set that allows for multiple instances of each of its elements (i.e. element frequency) |
| Register | A quickly accessible bit vector used to hold information |
| Sketch | A compact data structure that approximates a data set |
| Stochastic averaging | A process used to reduce the variance of an estimator |
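
Several of the terms above (k-mer decomposition, canonical k-mer, minimizer, Jaccard similarity) can be made concrete with a few lines of Python. The following is a minimal sketch, not taken from the article: the function names are hypothetical, the choice of BLAKE2b as the hash function is an assumption made only because it is deterministic across runs, and the input is assumed to contain only the uppercase bases A, C, G and T.

```python
import hashlib

# Assumption: sequences contain only uppercase A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (hash function chosen for illustration)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def decompose(seq, k):
    """K-mer decomposition: all sub-sequences of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def canonical_hash(kmer):
    """Canonical k-mer: the smaller hash of the k-mer and its reverse complement."""
    return min(hash64(kmer), hash64(reverse_complement(kmer)))


def canonical_hashes(seq, k):
    """Set of canonical k-mer hashes for a sequence."""
    return {canonical_hash(kmer) for kmer in decompose(seq, k)}


def minimizer(hashes):
    """Minimizer: the smallest hash value in a non-empty set."""
    return min(hashes)


def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B|.

    Two empty sets are given similarity 0.0 here; that is a convention
    of this sketch, not something the glossary specifies.
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Because each k-mer is hashed together with its reverse complement, `canonical_hashes(seq, k)` and `canonical_hashes(reverse_complement(seq), k)` produce the same set, so the Jaccard similarity of a sequence with its own reverse complement is 1.0 under this scheme. Note that this sketch compares full k-mer sets exactly; a sketching algorithm would keep only a compact summary of them.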