Sequence classification algorithm
To classify a DNA sequence S, we collect all k-mers within that sequence into a set, denoted as K(S). We then map each k-mer in K(S), using the algorithm described below, to the LCA taxon of all genomes that contain that k-mer. These LCA taxa and their ancestors in the taxonomy tree form what we term the classification tree, a pruned subtree that is used to classify S. Each node in the classification tree is weighted with the number of k-mers in K(S) that mapped to the taxon associated with that node. Then, each root-to-leaf (RTL) path in the classification tree is scored by calculating the sum of all node weights along the path. The maximum scoring RTL path in the classification tree is the classification path, and S is assigned the label corresponding to its leaf (if there are multiple maximally scoring paths, the LCA of all those paths’ leaves is selected). This algorithm, illustrated in Figure 1, allows Kraken to consider each k-mer within a sequence as a separate piece of evidence, and then attempt to resolve any conflicting evidence if necessary. Note that for an appropriate choice of k, most k-mers will map uniquely to a single species, greatly simplifying the classification process. Sequences for which none of the k-mers in K(S) are found in any genome are left unclassified by this algorithm.
The use of RTL path scoring in the classification tree is necessary in light of the inevitable differences between the sequences to be classified and the sequences present in any library of genomes. Such differences can, even for large values of k, result in a k-mer that is present in the library but associated with a species far removed from the true source species. By scoring the various RTL paths in the classification tree, we can compensate for these differences and correctly classify sequences even when a small minority of k-mers in a sequence indicate that the sequence should be assigned an incorrect taxonomic label.
Database creation
Efficient implementation of Kraken’s classification algorithm requires that the mapping of k-mers to taxa is performed by querying a pre-computed database. Kraken creates this database through a multi-step process, beginning with the selection of a library of genomic sequences. Kraken includes a default library, based on completed microbial genomes in the National Center for Biotechnology Information’s (NCBI) RefSeq database, but the library can be customized as needed by individual users [18].
Once the library is chosen, we use the Jellyfish multithreaded k-mer counter [19] to create a database containing every distinct 31-mer in the library. Once the database is complete, the 4-byte spaces Jellyfish used to store the k-mer counts in the database file are instead used by Kraken to store the taxonomic ID numbers of the k-mers’ LCA values. After the database has been created by Jellyfish, the genomic sequences in the library are processed one at a time. For each sequence, the taxon associated with it is used to set the stored LCA values of all k-mers in the sequence. As sequences are processed, if a k-mer from a sequence has had its LCA value previously set, then the LCA of the stored value and the current sequence’s taxon is calculated and that LCA is stored for the k-mer. Taxon information is obtained from the NCBI taxonomy database.
Database structure and search algorithm
Because Kraken very frequently uses a k-mer as a database query immediately after querying an adjacent k-mer, and because adjacent k-mers share a substantial amount of sequence, we utilize the minimizer concept [20] to group similar k-mers together. To explain our application of this concept, we here define the canonical representation of a DNA sequence S as the lexicographically smaller of S and the reverse complement of S. To determine a k-mer’s minimizer of length M, we consider the canonical representation of all M-mers in the k-mer, and select the lexicographically smallest of those M-mers as the k-mer’s minimizer. In practice, adjacent k-mers will often have the same minimizer.
In Kraken’s database, all k-mers with the same minimizer are stored consecutively, and are sorted in lexicographical order of their canonical representations. A query for a k-mer R can then be processed by looking up in an index the positions in the database where the k-mers with R’s minimizer would be stored, and then performing a binary search within that region (Figure 5). Because adjacent k-mers often have the same minimizer, the search range is often the same between two consecutive queries, and the search in the first query can often bring data into the CPU cache that will be used in the second query. By allowing memory accesses in subsequent queries to access data in the CPU cache instead of RAM, this strategy makes subsequent queries much faster than they would otherwise be. The index containing the offsets of each group of k-mers in the database requires 8 × 4M bytes. By default Kraken uses 15-bp minimizers, but the user can modify this value; for example, in creating MiniKraken, we used 13-bp minimizers to ensure the total database size stayed under 4 GB.
In implementing Kraken, we made further optimizations to the structure and search algorithm described above. First, as noted by Roberts et al. [20], a simple lexicographical ordering of M-mers can result in a skewed distribution of minimizers that over-represents low-complexity M-mers. In Kraken, such a bias would create many large search ranges, which would require more time to search. To create a more even distribution of minimizers (and thus speed up searches), we use the exclusive-or (XOR) operation to toggle half of the bits of each M-mer’s canonical representation prior to comparing the M-mers to each other using lexicographical ordering. This XOR operation effectively scrambles the standard ordering, and prevents the large bias toward low-complexity minimizers.
We also take advantage of the fact that the search range is often the same between queries to make Kraken’s queries faster. Rather than compute the minimizer each time we perform a query, we first search the previous range. If our queried k-mer is found in this range, the query can return immediately. If the k-mer is not found, then the minimizer is computed; if the k-mer’s minimizer is the same as the last queried k-mer’s, then the query fails, as the minimizer’s search space has been shown not to have the k-mer. Only if the minimizer has changed does Kraken have to adjust the search range and search again for the k-mer.
Constructing simulated metagenomes
The HiSeq and MiSeq metagenomes were built using 20 sets of bacterial whole-genome shotgun reads. These reads were found either as part of the GAGE-B project [21] or in the NCBI Sequence Read Archive. Each metagenome contains sequences from ten genomes (Additional file 1: Table S1). For both the 10,000 and 10 million read samples of each of these metagenomes, 10% of their sequences were selected from each of the ten component genome data sets (i.e., each genome had equal sequence abundance). All sequences were trimmed to remove low quality bases and adapter sequences.
The composition of these two metagenomes poses certain challenges to our classifiers. For example, Pelosinus fermentans, found in our HiSeq metagenome, cannot be correctly identified at the genus level by Kraken (or any of the other previously described classifiers), because there are no Pelosinus genomes in the RefSeq complete genomes database; however, there are seven such genomes in Kraken-GB’s database, including six strains of P. fermentans. Similarly, in our MiSeq metagenome, Proteus vulgaris is often classified incorrectly at the genus level because the only Proteus genome in Kraken’s database is a single Proteus mirabilis genome. Five more Proteus genomes are present in Kraken-GB’s database, allowing Kraken-GB to classify reads better from that genus. In addition, the MiSeq metagenome contains five genomes from the Enterobacteriaceae family (Citrobacter, Enterobacter, Klebsiella, Proteus and Salmonella). The high sequence similarity between the genera in this family can make distinguishing between genera difficult for any classifier.
The simBA-5 metagenome was created by simulating reads from the set of complete bacterial and archaeal genomes in RefSeq. Replicons from those genomes were used if they were associated with a taxon that had an entry associated with the genus rank, resulting in a set of replicons from 607 genera. We then used the Mason read simulator [22] with its Illumina model to produce 10 million 100-bp reads from these genomes. First we created simulated genomes for each species, using a SNP rate of 0.1% and an indel rate of 0.1% (both default parameters), from which we generated the reads. For the simulated reads, we multiplied the default mismatch and indel rates by five, resulting in an average mismatch rate of 2% (ranging from 1% at the beginning of reads to 6% at the ends) and an indel rate of 1% (0.5% insertion probability and 0.5% deletion probability). For the simBA-5 metagenome, the 10,000 read set was generated from a random sample of the 10 million read set.
Evaluation of accuracy and speed
We elected to measure accuracy primarily at the genus level, which was the lowest level for which we could easily determine the taxonomy information for PhymmBL and NBC’s predictions in an automated fashion. (This is due to the manner in which PhymmBL and NBC report their results). Because some genomes do not have taxonomic entries at all seven ranks (species, genus, family, order, class, phylum and kingdom), we defined genus-level sensitivity as A/B, where A is the number of reads with an assigned genus that were correctly classified at that rank, and B is the total number of reads with any assigned genus. We defined sensitivity similarly for other taxonomic ranks.
Because Kraken may classify a read at levels above the species, measuring its precision requires us to define the effect on precision of assigning the correct genus (for example) while not assigning a species at all. For this reason, we defined rank-level precision as C/(D + E), where C is the number of reads labeled at or below the correct taxon at the measured rank, D is the number of reads labeled at or below the measured rank, and E is the number of reads incorrectly labeled above the measured rank. For example, given a read R that should be labeled as Escherichia coli, a labeling of R as E. coli, E. fergusonii or Escherichia would improve genus-level precision. A label of Enterobacteriaceae (correct family) or Proteobacteria (correct phylum) would have no effect on genus-level precision. A label for R of Bacillus (incorrect genus) or Firmicutes (incorrect phylum) would decrease the genus-level precision.
When evaluating PhymmBL’s accuracy, following its developers’ advice [7], we selected a genus confidence threshold for our comparisons. We selected 3,333 reads from the simulated medium complexity (simMC) [23] data set, covering 31 different genera. To simulate short reads from the Sanger sequence data in the simMC set, we selected the last 100 bp from each of the reads. We then ran PhymmBL against those 100-bp reads, and evaluated the genus-level sensitivity and precision of PhymmBL’s predictions with genus confidence thresholds from 0 to 1, in increments of 0.05. We found that a threshold of 0.65 yielded the highest F-score (the harmonic mean of sensitivity and precision), with 0.60 and 0.70 also having F-scores within 0.5 percentage points of the maximum (Additional file 1: Table S2). We therefore used the 0.65 genus confidence threshold in our comparisons. Although the selection of a threshold depends on a user’s individual needs, and so is to some extent arbitrary, a threshold selected in this manner provides a more proper comparison to a selective classifier such as Kraken than no threshold at all.
The time and accuracy results when using Megablast as a classifier were obtained from the log data produced by PhymmBL, as PhymmBL uses Megablast for its alignment step. When assigning a taxonomic label to a read with Megablast, we used the taxon associated with the first reported alignment. Megablast was run with default options.
Speed was evaluated using the single-threaded operation of each program (except for NBC). PhymmBL was altered so that its call to the blastn program used one thread instead of two. NBC was run with 36 concurrent processes operating on disjoint sets of genomes in its genomic library, and the total time for the classifier was determined by summing the decompression and scoring times for each genome. Wall clock times were recorded for all classifiers. In comparing Kraken to the other classifiers, we used BLAST+ 2.2.27, PhymmBL 4.0, NBC 1.1 and MetaPhlAn 1.7.6. Classifiers were all run on the same computer, with 48 AMD Opteron 6172 2.1 GHz CPUs and 252 GB of RAM, running Red Hat Enterprise Linux 5. The data sets used for speed evaluation had 10,000 reads each for all programs other than Kraken (and its variants) and MetaPhlAn, which used 10,000,000 read data sets. Higher read numbers were used with these faster programs to minimize the effect of the initial and final operations that take place during the programs’ execution.
Although Kraken is the only one of the programs we examined that explicitly performs operations to ensure its data is in physical memory before classification, we wanted to be sure that all programs were evaluated in a similar manner. When evaluating speed, for each program, we read all database files (e.g. IMM files and BLAST databases for PhymmBL, k-mer frequency lists for NBC and the Bowtie index for MetaPhlAn) into memory three times before running the program, in order to place the database content in the operating system cache (which is stored in physical memory).
Reduced database sizes
To generate the 4-GB database for our MiniKraken results, we removed the first 18 of every block of 19 records in the standard Kraken database. A shrinking factor of 19 was selected as it was the smallest integer factor that would reduce the size to less than 4 GB, a size that can easily fit into the memory of many common personal computers. For users that have more RAM available, Kraken allows a smaller shrinking factor to be used, which will give increased sensitivity.
Use of draft genomes
When constructing the Kraken-GB database, we noticed there were several contigs with known adapter sequences at the ends. In subsequent tests, we also found that some sequences in samples with large amounts of human sequence were consistently misclassified by this database, leading us to conclude that contamination was likely present in the draft genomes. In an attempt to counteract this contamination, we removed from the database those k-mers from known adapter sequences, as well as the first and last 20 k-mers from each of the draft contigs. While this did improve classification, it did not eliminate the misclassification problem. For this reason, we believe that if draft genomes are used in a Kraken database, very stringent measures should be used to remove contaminant sequences from the genomic library.
Clade exclusion experiments
When re-analyzing the simBA-5 data set for our clade exclusion experiments, some reads were not used for certain pairs of measured and excluded ranks. If a read’s origin lacked a taxonomic entry at either of the measured or excluded ranks, it was not used for that particular experiment.
In addition, a read was not used in an experiment unless at least two other taxa represented in our database (aside from the excluded clade) at the excluded rank shared the clade of origin’s taxon at the measured rank. For example, a read from genus G would not be used in an experiment measuring accuracy at the class rank and excluding the genus rank unless G’s home class had at least two other genera with genomes in Kraken’s genomic library. Without this filtering step, were a genus excluded when it was the only genus in its class, Kraken could not possibly name the correct class, as all entries in the database from that class would be excluded as well. This is the same approach taken in similar experiments that were used to evaluate PhymmBL [5].
Human microbiome data classification
We classified the Human Microbiome Project data using a Kraken database made from complete RefSeq bacterial, archaeal and viral genomes, along with the GRCh37 human genome. We retrieved the sequences of three accessions (SRS019120, SRS014468 and SRS015055) from the NCBI Sequence Read Archive, and each accession had two runs submitted. All reads were trimmed to remove low quality bases and adapter sequences. Krona [24] was used to generate all taxonomic distribution plots.
Because the sequences were all paired reads, we joined the reads together by concatenating the mates with a sequence of ‘NNNNN’ between them. Kraken ignores k-mers with ambiguous nucleotides, so the k-mers that span these ‘N’ characters do not affect classification. This operation allowed Kraken to classify a pair of reads as a single unit rather than having to classify the mates separately.
Software and data availability
Kraken is written in C++ and Perl, and is available for download at [25] along with the metagenome data used to evaluate the accuracy of the classifiers presented here, and a downloadable 4-GB MiniKraken database similar to the one used here. The source code is also available from GitHub [26].