Fig. 2 | Genome Biology

Fig. 2

From: HmmUFOtu: An HMM and phylogenetic placement based ultra-fast taxonomic assignment and OTU picking tool for microbiome amplicon sequencing studies

HmmUFOtu core algorithms. a Constructing a consensus sequence FM-index (CSFM-index) from an MSA using the Burrows-Wheeler transform (BWT) coupled with wavelet-tree compression. Red: data actually stored in a CSFM-index. b A “plan 7” (p7) HMM architecture specifically designed for 16S rRNA gene and other target gene/marker sequencing, with M (match), I (insertion), D (deletion), N (N′: 5′), C (C′: 3′), B (begin), and E (end) states. Dashed circles and arrows: “wing-retraction” process used to avoid empty alignment paths; red arrows: special transitions used to control the “global” or “local” alignment mode. c Banded-HMM Viterbi algorithm to find the most likely (minimum cost) path given the HMM profile (row), a read sequence (column), and two known “seed” paths obtained by querying the CSFM-index. Only shaded grids are searched by the banded-Viterbi algorithm. The first and last shaded search areas rarely reach the profile ends. d An example of a 16S rRNA phylogenetic tree. In this tree, all directional conditional log-likelihoods (arrows in (e), (f), (g)) of all branches were pre-evaluated. The ancestral sequences of all internal nodes were inferred using maximum likelihood. e For a potential “seed” branch u--v, a small sub-tree containing only nodes u and v is built, and the original conditional log-likelihoods L(u) and L(v) and the original branch length w0 are copied. f To place a new read n on branch u--v, a new internal node r is introduced, the new conditional log-likelihoods L(n) are evaluated, and the initial branch lengths wrv, wur, and wnr are estimated using the observed distance (p-Dist). g For a candidate top estimation, the branch lengths wrv, wur, wnr and L(rv), L(ru), and L(rn) are iteratively and jointly optimized until convergence.
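The seed lookup in panels (a) and (c) rests on FM-index backward search over a BWT. The sketch below illustrates that general technique on a single plain string. It is not the authors' CSFM-index: it omits the consensus-sequence construction from the MSA and the wavelet-tree compression, and it builds the suffix array by naive sorting. The names `build_fm_index` and `find_seeds` are hypothetical and chosen only for this example.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy, uncompressed FM-index for a single string.

    Assumption: '$' does not occur in `text` and sorts before every
    other symbol (true for upper/lower-case letters in ASCII).
    """
    s = text + "$"
    # Naive suffix array: sort suffix start positions lexicographically.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for i == 0).
    bwt = "".join(s[i - 1] for i in sa)

    # C[c] = number of characters in s that are strictly smaller than c.
    counts = Counter(s)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)

    return {"sa": sa, "bwt": bwt, "C": C, "occ": occ}


def find_seeds(index, pattern):
    """Return sorted start positions of exact matches of `pattern`."""
    C, occ = index["C"], index["occ"]
    lo, hi = 0, len(index["bwt"])  # half-open suffix-array interval
    # Backward search: extend the match one character at a time, right to left.
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(index["sa"][lo:hi])


if __name__ == "__main__":
    idx = build_fm_index("ACGTACGTAC")
    print(idx["bwt"])
    print(find_seeds(idx, "ACG"))
```

Each backward-search step costs a constant number of table lookups, so an exact match of a read substring takes time proportional to the substring length rather than to the indexed sequence length. In this sketch the occurrence table is stored in full; a wavelet tree, as named in the caption, is one way to answer the same rank queries in less space.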