Determining the quality and complexity of next-generation sequencing data without a reference genome

We describe an open-source kPAL package that facilitates an alignment-free assessment of the quality and comparability of sequencing datasets by analyzing k-mer frequencies. We show that kPAL can detect technical artefacts such as high duplication rates, library chimeras, contamination and differences in library preparation protocols. kPAL also successfully captures the complexity and diversity of microbiomes and provides a powerful means to study changes in microbial communities. Together, these features make kPAL an attractive and broadly applicable tool to determine the quality and comparability of sequence libraries even in the absence of a reference sequence. kPAL is freely available at https://github.com/LUMC/kPAL. Electronic supplementary material The online version of this article (doi:10.1186/s13059-014-0555-3) contains supplementary material, which is available to authorized users.


Supplementary Notes
The distribution of k--mers (DNA words of length k) in sequencing data can provide a unique view of the complexity and quality. It has been shown that features that occur too often or too rarely reflect crucial information about the functional and structural elements of the genome of the sequenced species (1--5). Thus, the number of overrepresented sequence motifs does not disappear by increasing the k size. Many studies have shown the unimodality of the genomic k--mer spectra of most species with the exception of mammalians (1). All mammalians exhibit a multimodal distribution of k--mer frequencies. This feature highlights the existence of common and extremely rare features that reflect the complexity of the genome in question. The multimodality of the k--mer spectrum is independent of the genome size. Chor et al. (1) have shown that while the genome of T. thermophila has a comparable size (97Mb) to that of the human chromosome 12 (108Mb), the k--mer spectra differ in modality (unimodal and multimodal, respectively). The modality of the human genome is also subjected its function. Strikingly, all coding regions, including the 5' un--translated regions (UTRs) exhibit a unimodal k--mer spectrum while the introns, 3' UTRs and other intergenic regions hold a multimodal distribution (1, 3). Moreover, the composition of k--mers in the spectrum is context--specific. Nullomers (missing k--mers) and rare k--mers tend to be GC--rich and often contain many CpGs (1,5). This is caused by the hypermutability of CpGs (C → T and G → A; also known as CpG suppression) that mutate at 10--20 times higher rate than other types of point mutations (5--8). In this work, we present the utility of k--mer spectrum in determining the quality and complexity of next--generation sequencing data that relies on shared and unique features across species as shown for estimating the level of relatedness between microbiomes.

kMer Methodology Index
The first step in any k--mer analysis is the generation of a profile ( Figure  1A), which is constructed by the indexing algorithm. The efficiency of the algorithm is improved by encoding the DNA string in binary following the map given in Figure 1B. Subsequently, the binary encoded k--mers are used as the index of a count table. This can be achieved by the concatenation of the binary code for each nucleotide in a given DNA string. This procedure eliminates the need to store the actual k--mer sequences since they can be retrieved from decoding the offset in the count table. The binary code for each nucleotide is chosen in such a way that the complement of the nucleotide can be calculated using the binary NOT operator. The indexing algorithm returns a profile that holds observed counts for all possible substrings of length k that can be stored for other analyses.

Distance (diff)
Since the k--mer profile is in essence a vector of almost independent values, we can use any metric defined for vectors to calculate the distance between two profiles. We have implemented two metrics which are the standard Euclidian distance measure and the multiset (9) distance measure (see Formula 1). The last metric is parameterised by a function that reflects the difference between a pair. We have implemented two pairwise distance functions (shown in Formulae 2 and 3).
For a multiset X, let S(X) denote its underlying set. For multisets X, Y with S(X), S(Y) ⊆ 1, 2, … , we define: (2) Pairwise function: (3) Pairwise function: When analysing sequencing data, which frequently consist of reads from both strands (e.g., due to non strand--specific sample preparation or paired--end sequencing), we can assume that the chance of observing a fragment originating from the plus and minus strands are equal. Additionally, if the sequencing depth is high enough, we expect a balance between the frequencies of k--mers and their reverse complement in a given k--mer profile. Every type of NGS data has an expected balance (i.e., SAGE is not expected to produce a balanced profile while whole genome shotgun sequencing is expected to have a perfectly balanced frequency between k--mers and their reverse complement). Thus, k--mer balance can indicate the quality of NGS data in respect to over--amplification, insufficient number of reads, or poor capture performance in the case of whole exome sequencing.
To calculate the balance, first we observe that every k--mer has a reverse complement. One of these is lexicographically smaller (or equal in the case of a palindrome) than the other. We first split a profile into two vectors, = ( ! , ! , … ) and = ( ! , ! , … ) where ! represents the reverse complement of ! and vice versa. The distance between these vectors can be calculated in the same way as described for pairwise comparison of two full k--mer profiles ( Figure 1C).
Additionally, kMer can forcefully balance the k--mer profiles (if desired) by adding the values of each k--mer to its reverse complement. This procedure can improve distance calculation if the sequencing depth is too low.

Shrink
A profile indexed at a certain k size contains information about k--mers of smaller lengths. This can be seen from the fact that a word over an alphabet has possible suffixes of length one. To calculate the number of occurrences of , we simply need to calculate the . This only holds when k is relatively small compared to the length of the indexed sequencing reads. Indeed, if a sequence of length is indexed at length k, then ( − + 1) k--mers are encountered per sequence. However, shrinking of a profile will yield ( − ) k--mers. Usually, this border effect is small enough to ignore, but should be taken into consideration when indexing large amounts of small (approaching length k) sequences. Shrinking is useful when trying to estimate the best k for a particular purpose. One can start with choosing a relatively large k and then reuse the generated profile to construct a profile of smaller k sizes ( Figure 1D).

Smoothing
Ideally, the samples that are used to generate profiles are sequenced with the same sample preparation, on the same platform, and most importantly at sufficient depth. However, in practice, this is rarely the case. When two similar samples are sequenced at insufficient depth, it will be reflected in a k--mer profile by zero counts for k--mers that are not expected to be nullomers. While this is not a problem in itself, the fact that most sequencing procedures have a random selection of sequencing fragments will result in a random distribution of these zero counts. When comparing two profiles, the pairwise distances will be artificially large. Scaling the profiles can partially compensate for differences in the sequencing depth but can not account for nullomers since no distinction can be made between true missing words and artificially missing words. An obvious solution would be to shrink the profile until nullomers are removed. This method is valid as long as all zero counts reflect artificial nullomers. Otherwise, shrinking will reduce the specificity and does not reflect the true complexity of the sequenced genome. To deal with this problem, we have developed the pairwise smoothing function. This method locally shrinks a profile only when necessary. In this way, we retain information if it is available in both profiles and discard missing data ( Figure 1E).
Let P and Q be sub--profiles of words over an alphabet of length (with devidable by ). Let t be a user--defined threshold and let f be a method of summarizing a profile. If min , > we divide the profiles in equal parts and recursively repeat the procedure for each part. If this is not the case, we collapse both P and Q to one word. Implemented methods of summarizing are minimum, mean, and median. In Figure  1E we show an example of how smoothing might work.
We have chosen f = min and t = 0 as default parameters. With this method, we can index a dataset with a large k and retain the overall specificity of the profile since this method can automatically select the optimal choice of k locally.