Fig. 1 | Genome Biology

Fig. 1

From: Carnelian uncovers hidden functional patterns across diverse study populations from whole metagenome sequencing reads

Comparative functional metagenomics with Carnelian.

Preprocessing. We build a gold-standard reference database by combining two sources: reviewed prokaryotic proteins from UniProtKB/Swiss-Prot that have complete Enzyme Commission (EC) labels and evidence of existence, and curated prokaryotic catalytic residues with complete EC labels from the Catalytic Site Atlas. Carnelian first represents the gold-standard proteins in a compact feature space using low-density, even-coverage, locality-sensitive Opal-Gallager hashing. It then trains a set of one-against-all (OAA) classifiers, implemented with the Vowpal Wabbit framework, on the compact feature representation of those proteins together with negative samples derived from randomly shuffled sequences generated by HMMER.

Functional profiling. To functionally profile reads from a whole metagenomic sequencing (WMS) experiment, Carnelian first performs probabilistic ORF prediction using FragGeneScan. Next, the ORFs are represented in a compact feature space using the same Opal-Gallager hashing technique. The trained OAA classifier ensemble is then used to classify the ORFs into the appropriate EC bins. Abundance estimates of ECs are computed from the raw ORF counts in the EC bins by normalizing against the effective protein length per EC bin and a per-million scaling factor. Pathway profiles (Orange) are computed by grouping the ECs into metabolic pathways and summing their abundance estimates.

Comparative metagenomics. We start from the pathway profiles (Orange) of different populations and conditions. (Blue) Functional relatedness of healthy microbiomes across different populations is assessed by co-abundance pathway analysis. Pathway co-abundance is quantified by Kendall's rank correlation. Co-abundance clusters are determined by Ward-linkage hierarchical clustering, and the PERMANOVA test is used to determine whether the centroids of those clusters differ between populations A and B. (Green) Functional trends analysis across different case-control cohorts of a disease is performed using differential abundance analysis by the Wilcoxon rank-sum test and shared significance analysis by Fisher's combined probability test.
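The abundance and pathway-profile steps in the functional profiling panel can be illustrated with a minimal Python sketch. It assumes a TPM-style normalization (each EC's ORF count divided by an effective protein length, then rescaled so the values sum to one million); the caption does not give the exact formula, so this is one plausible reading and not Carnelian's implementation. The function names, EC labels, counts, lengths, and pathway mapping below are all hypothetical.

```python
from collections import defaultdict


def ec_abundances(orf_counts, effective_lengths):
    """Length-normalized, per-million abundance per EC label.

    Assumption: a TPM-style formula. The caption only says that counts are
    normalized by effective protein length and a per-million scaling factor.
    """
    # Length-normalized rate for each EC bin.
    rates = {ec: orf_counts[ec] / effective_lengths[ec] for ec in orf_counts}
    total = sum(rates.values())
    if total == 0:
        return {ec: 0.0 for ec in rates}
    # Rescale so that all abundances sum to one million.
    return {ec: rate / total * 1e6 for ec, rate in rates.items()}


def pathway_profile(abundances, ec_to_pathways):
    """Sum EC abundances over every pathway each EC is assigned to."""
    profile = defaultdict(float)
    for ec, value in abundances.items():
        for pathway in ec_to_pathways.get(ec, ()):
            profile[pathway] += value
    return dict(profile)


# Hypothetical toy inputs, for illustration only.
counts = {"1.1.1.1": 30, "2.7.1.1": 10, "3.5.1.4": 20}
lengths = {"1.1.1.1": 300.0, "2.7.1.1": 100.0, "3.5.1.4": 400.0}
mapping = {
    "1.1.1.1": ["pathway_A"],
    "2.7.1.1": ["pathway_A"],
    "3.5.1.4": ["pathway_B"],
}

abundances = ec_abundances(counts, lengths)
profile = pathway_profile(abundances, mapping)
for name in sorted(profile):
    print(name, round(profile[name]))
```

In this sketch an EC that maps to several pathways contributes its full abundance to each of them; whether that matches the grouping used for the figure is likewise an assumption.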
