Figure 1 | Genome Biology

Figure 1

From: Determining the quality and complexity of next-generation sequencing data without a reference genome

Schematic overview of the main kPAL principles. (A) An overview of the procedure used by kPAL to assess the frequency of all k-mers within sequencing data. k-mers are identified and counted by a sliding window of size k. The k-mer spectrum can then be produced from the k-mer frequencies. The main functions of kPAL can be divided by their application to single or multiple profiles. For single k-mer profiles, dedicated functions report the number of nullomers, the total number of counts, the distribution of k-mer counts and the balance between sequencing information from the plus and minus strands. If needed, profiles can be manipulated by the balance, shuffle and shrink functions. The balance function sums the counts of k-mers and their reverse complements to enforce balance between sequence information from the minus and plus strands. The shuffle function is designed to produce random k-mer profiles without changing the overall distribution of counts. (B) kPAL processes k-mers efficiently because it encodes the sequences in binary, using specific keys that also allow a quick conversion to the reverse complement. Each nucleotide is represented by a binary code that is subsequently used to construct each k-mer. (C) The strand balance of a given k-mer profile is the overall distance measure between the frequency of each unique k-mer and that of its reverse complement. Thus, k-mer profiles are split into two sub-profiles that are reverse complements of each other, and these are used to calculate the strand balance. (D) By design, kPAL can shrink k-mer profiles of size k to any smaller size. Counts from k-mers that share the first (k – 1) nucleotides are merged to collapse k-mer profiles to size k – 1. (E) The smoothing function borrows the utility of shrinking and applies it locally, only to k-mers whose counts fall below a user-defined threshold. For the affected k-mers, counts are merged and collapsed to size k – 1. The smoothing function accepts thresholds for the minimum, maximum or average counts of k-mers that share the first (k – 1) nucleotides, and it also accepts user-defined functions. This process repeats until the threshold condition is met. Prof., profile.
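
The counting, encoding, balancing and shrinking steps in panels A, B and D can be illustrated with a minimal Python sketch. This is not kPAL's own code or API: the function names are hypothetical, and the 2-bit code (A=00, C=01, G=10, T=11) is an assumption chosen so that complementing a base is a single XOR; the keys kPAL actually uses may differ.

```python
from collections import Counter

# Assumed 2-bit code (not necessarily kPAL's); complement of a base is code ^ 0b11.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"


def encode(kmer):
    """Pack a k-mer into an integer, two bits per nucleotide."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def decode(value, k):
    """Unpack an integer back into a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def reverse_complement(value, k):
    """Reverse the order of the 2-bit groups and complement each one."""
    result = 0
    for _ in range(k):
        result = (result << 2) | ((value & 0b11) ^ 0b11)
        value >>= 2
    return result


def count_kmers(sequence, k):
    """Slide a window of size k over the sequence and count encoded k-mers."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        # Skip windows containing symbols outside A, C, G, T (for example N).
        if all(base in ENCODE for base in window):
            counts[encode(window)] += 1
    return counts


def balance(counts, k):
    """Add each k-mer's count to its own entry and to its reverse complement's."""
    balanced = Counter()
    for kmer, n in counts.items():
        balanced[kmer] += n
        balanced[reverse_complement(kmer, k)] += n
    return balanced


def shrink(counts):
    """Merge k-mers that share their first k - 1 nucleotides into (k - 1)-mers."""
    shrunk = Counter()
    for kmer, n in counts.items():
        shrunk[kmer >> 2] += n  # dropping the last two bits drops the last base
    return shrunk
```

For example, `shrink(count_kmers("ACGACT", 3))` merges the counts of ACG and ACT into a single entry for AC, while the total number of counts is unchanged.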
