KrakenHLL: Confident and fast metagenomics classification using unique k-mer counts

False positive identifications are a significant problem in metagenomic classification. We present KrakenHLL, a novel metagenomic classifier that combines the fast k-mer based classification of Kraken with an efficient algorithm for assessing the coverage of unique k-mers found in each species in a dataset. On various test datasets, KrakenHLL gives better recall and F1-scores than other methods, and effectively classifies and distinguishes pathogens with low abundance from false positives in infectious disease samples. By using the probabilistic cardinality estimator HyperLogLog (HLL), KrakenHLL is as fast as Kraken and requires little additional memory.


Background 20
Metagenomic classifiers attempt to assign a taxonomic identity to each read in a data set. 21 Because metagenomics data often contain tens of millions of reads, classification is typically done using exact matching of short words of length k (k-mers) rather than alignment, which 23 would be unacceptably slow. The results contain read classifications but not their aligned 24 positions in the genomes [as reviewed by 1]. However, read counts can be deceiving. Sequence 25 contamination of the samples-introduced from laboratory kits or the environment during sample 26 extraction, handling or sequencing-can yield high numbers of spurious identifications [2,3]. 27 Having only small amounts of input material can further compound the problem of 28 contamination. When using sequencing for clinical diagnosis of infectious diseases, for example, 29 less than 0.1% of the DNA may derive from microbes of interest [4,5]. Additional spurious 30 matches can result from low-complexity regions of genomes and from contamination in the 31 database genomes themselves [6]. 32 33 Such false-positive reads typically match only small portions of a genome; e.g., if a species' 34 genome contains a low-complexity region, and the only reads matching that species fall in this 35 region, then the species was probably not present in the sample. Reads from microbes that are 36 truly present should distribute relatively uniformly across the genome rather than being 37 concentrated in one or a few locations. Genome alignment can reveal this information. However, 38 alignment is resource intensive, requiring the construction of indexes for every genome and a 39 relatively slow alignment step to compare all reads against those indexes. Some metagenomics 40 methods do use coverage information to improve mapping or quantification accuracy, but these 41 methods require results from much slower alignment methods as input [7]. Assembly-based 42 methods also help to avoid false positives, but these are useful only for highly abundant species 43 [8]. 44 Here, we present KrakenHLL, a novel method that combines very fast k-mer based classification 46 with a fast k-mer cardinality estimation. KrakenHLL is based on the Kraken metagenomics 47 classifier [9], to which it adds a method for counting the number of unique k-mers identified for 48 each taxon using the efficient cardinality estimation algorithm HyperLogLog [10][11][12]. By 49 counting how many of each genome's unique k-mers are covered by reads, KrakenHLL can 50 often discern false positive from true positive matches. Furthermore, KrakenHLL implements 51 additional new features to improve metagenomics classification: (a) searches can be done against 52 multiple databases hierarchically, (b) the taxonomy can be extended to include nodes for strains 53 and plasmids, thus enabling their detection, and (c) the database build script allows the addition 54 of >100,000 viruses from the NCBI Viral Genome Resource [13]. KrakenHLL provides a 55 superset of the information provided by Kraken while running equally fast or slightly faster, and 56 while using very little additional memory during classification. 57

58
KrakenHLL was developed to provide efficient k-mer count information for all taxa identified in 59 a metagenomics experiment. The main workflow is as follows: As reads are processed, each k-60 mer is assigned a taxon from the database ( Figure 1A). KrakenHLL instantiates a HyperLogLog 61 data sketch for each taxon, and adds the k-mers to it ( Figure 1B and Supplementary 62 Information). After classification of a read, KrakenHLL traverses up the taxonomic tree and 63 merges the estimators of each taxon with its parent. In its classification report, KrakenHLL 64 includes the number of unique k-mers and the depth of k-mer coverage for each taxon that it 65 observed in the input data ( Figure 1C).    Table 3). We place greater emphasis on the eleven biological datasets, which contain 133 more realistic laboratory and environmental contamination. In the first part of this section, we 134 show that unique k-mer counts provide higher classification accuracy than read counts, and in 135 the second part we compare KrakenHLL with the results of eleven metagenomics classifiers. We 136 ran KrakenHLL on three databases: 'orig', the database used by McIntyre et al.,'std',which 137 contains all current complete bacterial, archaeal and viral genomes from RefSeq plus viral 138 neighbor sequences and the human reference genome, and 'nt', which contains all microbial 139 sequences (including fungi and protists) in the non-redundant nucleotide collection nr/nt 140 provided by NCBI (see Suppl. Methods Section 2 for details). The 'std' database furthermore 141 includes the UniVec and EmVec sequence sets of synthetic constructs and vector sequences; and 142 low-complexity k-mers in microbial sequences were masked using NCBI's dustmasker with 143 default settings. We use two metrics to compare how well methods can separate true positives 144 and false positives: (a) F1 score, i. e. the harmonic mean of precision p and recall r, and (b) recall 145 at a maximum false discovery rate (FDR) of 5%. For each method, we compute and select the 146 ideal thresholds based on the read count, k-mer count or abundance calls. Precision p is defined 147 as the number of correctly called species (or genera) divided by the number of all called species 148 (or genera) at a given threshold. Recall r is the proportion of species (or genera) that are in the 149 test dataset and that are called at a given threshold. Higher F1 scores indicate a better separation 150 between true positives and false positives. Higher recall means that more true species can be 151 recovered while controlling the false positives. 152 Because the NCBI taxonomy has been updated since the datasets were published, we manually 154 updated the "truth" sets in several datasets (see Suppl. Methods Section 2.3 for all changes). Any 155 cases that might have been missed would result in a lower apparent performance of KrakenHLL. 156 Note that we exclude the over ten-year-old simulated datasets simHC, simMC and simLC from 157 Mavromatis et al. (2007)

Unique k-mer versus read count thresholds 165
We first looked at the performance of the unique k-mer count thresholds versus read count 166 thresholds (as would be used with Kraken). The k-mer count thresholds worked very well, 167 particularly for the biological datasets (Table 1 and Suppl. Table 3). On the genus level, the 168 average recall in the biological datasets increases by 4-9%, and the average F1 score increases 2-169 3%. On the species level, the average increase in recall in the biological sets is between 3 and 170 12%, and the F1 score increases by 1-2%. 171 On the simulated datasets, the differences are less pronounced and vary between databases, even 173 though on average the unique k-mer count is again better. However, only in two cases (genus 174 recall on databases 'orig' and 'std') the difference is higher than 1% in any direction. We find 175 that simulated datasets often lack false positives with a decent number of reads but a lower 176 number of unique k-mer counts, which we see in real data. Instead, in most simulated datasets 177 the number of unique k-mers is linearly increasing with the number of unique reads in both true 178 and false positives (Suppl. Figure 4). In biological datasets, sequence contamination and lower 179 read counts for the true positives make the task of separating true and false positives harder. 180 181 . KrakenHLL with database 'nt' has the highest average 187 recall and F1 score across the biological datasets, as shown in Table 2. As seen before, using 188 unique k-mer instead of read counts as thresholds increases the scores. While the database 189 selection proves to be very important (KrakenHLL with database 'std' is performing 10% worse 190 than KrakenHLL with database 'nt'), only Blast has higher average scores than KrakenHLL with 191 k-mer count thresholds on the original database. On the simulated datasets, KrakenHLL with the 192 'nt' database still ranks at the top, though, as seen previously there is more variation (Suppl. genomes (see Suppl. Methods Section 2.4). We randomly sampled between one hundred and one 215 million reads (logarithmically distributed) from each experiment, which gave 34 million read 216 pairs in total. Furthermore, we sub-sampled five read sets with between one to twenty million 217 reads. All read sets were classified with KrakenHLL using the 'std' database. the k-mer thresholds for the ideal F1 score and recall at a maximum of 5% FDR, respectively. In this dataset, a unique k-mer count in the range 10000-20000 would give the best threshold for 225 selecting true species. 226 227 Consistent with the results of the previous section, we found that unique k-mer counts provide 228 better thresholds than read counts both in terms of F1 score and recall in all test datasets (e.g. 229 In general, we find that for correctly identified species, we obtain up to approximately L-k 245 unique k-mers per each read, where L is the read length, because each read samples a different 246 location in the genome. (Note that once the genome is completely covered, no more unique k-247 mers can be detected.) Thus the k-mer threshold should always be several times higher than the 248 read count threshold. For the discovery of pathogens in human patients, discussed in the next 249 section, a read count threshold of 10 and unique k-mer count threshold of 1000 eliminated many 250 background identifications while preserving all true positives, which were discovered from as 251 few as 15 reads. 252 253

254
Metagenomics is increasingly used to find species of low abundance. A special case is the 255 emerging use of metagenomics for the diagnosis of infectious diseases [27,28]. In this 256 application, infected human tissues are sequenced directly to find the likely disease organism. 257 Usually, the vast majority of the reads match (typically 95-99%) the host, and sometimes fewer 258 than 100 reads out of many millions of reads are matched to the target species. Common skin 259 bacteria from the patient or lab personnel and other contamination from sample collection or 260 preparation can easily generate a similar number of reads, and thus mask the signal from the 261 pathogen. 262

263
To assess if the unique k-mer count metric in KrakenHLL could be used to rank and identify 264 pathogen from human samples, we reanalyzed ten patient samples from a previously described 265 series of neurological infections [4]. That study sequenced spinal cord mass and brain biopsies 266 from ten hospitalized patients for whom routine tests for pathogens were inconclusive. In four of 267 the ten cases, a likely diagnosis could be made with the help of metagenomics. To confirm the 268 metagenomics classifications, the authors in the original study re-aligned all pathogen reads to 269 individual genomes. 270 271 Table 3 shows the results of our reanalysis of the confirmed pathogens in the four patients, 272 including the number of reads and unique k-mers from the pathogen, as well as the number of 273 bases covered by re-alignment to the genomes. Even though the read numbers are very low in 274 two cases, the number of unique k-mers suggests that each read matches a different location in 275 the genome. For example, in PT8, 15 reads contain 1570 unique k-mers, and re-alignment shows 276 2201 covered base pairs. In contrast, Table 4 shows examples of identifications from the same 277 datasets that are not well-supported by k-mer counts. We also examined the likely source of the 278 false positive identifications by blasting the reads against the full nt database, and found rRNA of 279 environmental bacteria, human RNA and PhiX-174 mis-assignments (see Suppl. Methods for 280 details). Notably, the common laboratory and skin contaminants PhiX-174, Escherichia coli, 281 Cutibacterium acnes and Delftia were detected in most of the samples, too (see Suppl. Table 6). 282 However, those identifications are solid in terms of their k-mer counts -the bacteria and PhiX-283 174 are present in the sample, and the reads cover their genomes rather randomly. To discount 284 them, comparisons against a negative control or between multiple samples is required (e.g. with 285 Pavian [29]). 286 287 Table 3: Validated pathogen identifications in patients with neurological infections have high 288 numbers of unique k-mers per read. The pathogens were identified with as few as 15 reads, but the high number of unique k-mers indicates distinct locations of the reads along their genomes. 290 Re-alignment of mapped reads to their reference genomes (column "Covered Bases") 291 corroborates the finding of the unique k-mers (see also Suppl. Figure 5). Interestingly, the k-mer 292 count in PT5 indicates that there might be multiple strains present in the sample since the k-mers 293 cover more than one genome. Read lengths were 150-250 bp. 294  The additional features of KrakenHLL come without a runtime penalty and very limited 326 additional memory requirements. In fact, due to code improvements, KrakenHLL often runs faster than Kraken, particularly when most of the reads come from one species. On the test 328 dataset, the mean classification speed in million base-pairs per minute increased slightly from 329 410 to 421 Mbp/m (see Suppl. Table 3). When factoring in the time needed to summarize 330 classification results by kraken-report, which is required for Kraken but part of the classification 331 binary of KrakenHLL, KrakenHLL is on average 50% faster. The memory requirements increase 332 on average by 0.5 GB from 39.5 GB to 40 GB. 333 334 On the pathogen Id patient data, where in most cases over 99% of the reads were either assigned 335 to human or synthetic reads, KrakenHLL was significantly faster than Kraken (Suppl. Table 5). 336 The classification speed increased from 467 to 733 Mbp/m. The average wall time was about 337 44% lower, and the average additional memory requirements were less than 1GB, going from 338 118.0 to 118.4 GB. All timing comparisons were made after preloading the database and running 339 with 10 parallel threads. implements coverage-sensitive mapping, which should give better abundance calls, but it did not 352 perform very well in our tests. Going from alignments to a good taxonomic profile is difficult 353 because coverage information cannot be as easily computed for the LCA taxon and summarized 354 for higher levels in the taxonomic tree. In comparison, reads and unique k-mer counts can be 355 assigned to the LCA taxa, and summed to higher levels. Notably, KrakenHLL's k-mer counting 356 is affected by GC biases in the sequencing data the same way as other read classifiers and 357 aligners [32], and may underreport GC-rich or GC-poor genomes. 358 359 Conclusions 360 KrakenHLL is a novel method that combines fast k-mer based classification with an efficient 361 algorithm for counting the number of unique k-mers found in each species in a metagenomics 362 dataset. When the reads from a species yield many unique k-mers, one can be more confident 363 that the taxon is truly present, while a low number of unique k-mers suggests a possible false 364 positive identification. We demonstrated that using unique k-mer counts provides improved 365 accuracy for species identification, and that k-mer counts can help greatly in identifying false 366 positives. In our comparisons with multiple other metagenomics classifiers on multiple 367 metagenomics datasets, we found that KrakenHLL consistently ranked at the top. The strategy of 368 counting unique k-mer matches allows KrakenHLL to detect that reads are spread across a 369 genome, without the need to align the reads. By using a probabilistic counting algorithm, 370 KrakenHLL is able to match the exceptionally fast classification time of the original Kraken 371 program with only a very small increase in memory. The result is that KrakenHLL gains many of 372