Fig. 1 | Genome Biology

Fig. 1

From: Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification


Overview of Centrifuger. Left: classification procedure on the forward read. Centrifuger starts at the end of the read and uses backward search to extend the match until it reaches a mismatch. This yields the first exact match, 60 bp long, which hits three sequences {X, Y, Z} in the database. Centrifuger then skips the mismatch and restarts the search, giving a second, 39-bp match that hits two sequences {X, Y}. The same search procedure is applied to the reverse complement of the read. Centrifuger then scores each matched sequence and classifies the read to the sequences with the highest scores; in this example, the read is classified to sequence X with a score of 2601. Right: the structure of Centrifuger’s losslessly compressed FM-index. Centrifuger uses the RBBWT representation for the BWT sequence. In the example of compressing the BWT sequence “AAAAACGTAAAA”, RBBWT represents it as two sequences, “AA” and “ACGT”, when the block size is 4. For the sequence IDs that are sampled on the BWT sequence, Centrifuger compacts their bit representation. In this example, there are four sequences in the database (W, X, Y, Z), so 2 bits are sufficient to represent a sequence ID. Therefore, for the substring of the BWT sequence shown in the example, Centrifuger spends 6 bits to represent the sequence IDs, which are sampled every four positions on the BWT sequence.
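
The three numbers in the caption (the score 2601, the split into “AA” and “ACGT”, and the 6 bits of sequence IDs) can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not Centrifuger's implementation. The scoring rule (squared match length minus 15, summed over matches) is the Centrifuge-style rule and is assumed here because the caption does not state it; the per-block run flags and all function names are hypothetical.

```python
def match_score(match_lengths, min_len=15):
    """Sum (length - min_len)^2 over exact matches longer than min_len.

    ASSUMPTION: Centrifuge-style scoring; the caption gives only the result.
    """
    return sum((n - min_len) ** 2 for n in match_lengths if n > min_len)


def rbbwt_split(bwt, block_size):
    """Split a BWT string into fixed-size blocks, run-block style.

    A full block made of one repeated character is a "run block" and is
    stored as that single character; every other block is stored verbatim.
    ASSUMPTION: the per-block flags stand in for whatever bookkeeping the
    real index uses to tell the two kinds of block apart.
    """
    run_flags, run_chars, plain_blocks = [], [], []
    for start in range(0, len(bwt), block_size):
        block = bwt[start:start + block_size]
        is_run = len(block) == block_size and len(set(block)) == 1
        run_flags.append(is_run)
        if is_run:
            run_chars.append(block[0])
        else:
            plain_blocks.append(block)
    return run_flags, "".join(run_chars), "".join(plain_blocks)


def sampled_id_bits(bwt_length, num_sequences, sample_rate):
    """Bits needed for bit-packed sequence IDs sampled every sample_rate positions.

    ASSUMPTION: one ID is stored per complete group of sample_rate positions.
    """
    bits_per_id = max(1, (num_sequences - 1).bit_length())  # ceil(log2(n))
    num_samples = bwt_length // sample_rate
    return bits_per_id * num_samples


# Sequence X is hit by both the 60-bp and the 39-bp match.
print(match_score([60, 39]))            # 2601 = 45^2 + 24^2
print(rbbwt_split("AAAAACGTAAAA", 4))   # ([True, False, True], 'AA', 'ACGT')
print(sampled_id_bits(12, 4, 4))        # 6 = 3 samples x 2 bits
```

Under these assumptions, the two run blocks “AAAA” collapse to one character each, which is where the compressed sequence “AA” comes from, while the block “ACGT” is kept as is.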
