Fig. 2 | Genome Biology

From: Maast: genotyping thousands of microbial strains efficiently

SNP genotyping of 146 human gut bacterial species using tag genomes.
a and b SNP discovery comparison of Maast with all genomes (gray), only tag genomes (green), and a random set of genomes equal in number to the tag genomes (brown) shows that more genomes do not lead to the discovery of more SNPs. Each box in a summarizes the number of SNPs across 146 species. Each point in b represents a species, with black lines connecting the data for the same species. For computational efficiency, only the 1000 highest quality genomes were included for species with > 1000 genomes.
c Comparison of SNPs discovered by Maast with all genomes versus only tag genomes. Each bar represents a species, with the height of a bar showing the number of SNPs discovered exclusively with all genomes (gray), exclusively with tag genomes (green), or by both approaches (beige). Arrows point to eight example species (from left to right): Faecalibacterium prausnitzii_K (species id: 101300), Akkermansia muciniphila (102454), Akkermansia muciniphila_B (102453), Succinivibrio sp000431835 (100412), Sutterella wadsworthensis_B (101361), Phascolarctobacterium faecium (103439), Alistipes shahii (100003), and Anaerotignum sp000436415 (100177). Species label color indicates whether this species has a high (red) or low (blue) level of tag-only SNPs, which is estimated as a fraction of all SNPs that are discovered with tag genomes and not with all genomes.
d SNP sites missing from a small percentage of genome assemblies as they fall below the user-specified prevalence threshold due to being absent in a group of redundant genomes. Connected dots represent example species in c with a high proportion of tag-only SNPs. The proportion of tag-only SNPs drops if the MAF cutoff for calling SNPs with all genomes is lowered from 0.01 (orange) to 0.001 (green). Most of the SNPs only discovered with tag genomes (tag-only SNPs) are due to MAFs below the 1% threshold in all genomes.
e For each of the four species where tag genomes and all genomes called different numbers of SNPs (Faecalibacterium prausnitzii_K, Phascolarctobacterium faecium, Alistipes shahii, and Akkermansia muciniphila_B), we further investigated the source of this discrepancy by defining a set of true population SNPs. To do so, we built a phylogenetic tree and used it to sample genomes so that they covered the species diversity but came from different subpopulations (essentially removing bias from over-sampling of subpopulations with redundant genomes). SNPs called using this set of genomes were defined as the true SNPs for the population. We ran Maast on each species using tag genomes or using all genomes to call SNPs with a 1% MAF threshold. For each species, tag genomes capture a higher number of the true SNPs than do all genomes, suggesting that redundant genomes bias MAF estimation and reduce SNP calling sensitivity.
f The single largest genome cluster is larger for species with many tag-only SNPs compared to those with fewer. The largest genome clusters were compared between species from c with high (red axis tick) and low (blue axis tick) levels of tag-only SNPs. For three of four species with more SNPs discovered in the tag-only analysis, the largest cluster contains more than half of all genomes, implying a high level of genome redundancy that biases MAF estimation and leads to an undercount of SNPs. Height of bars shows the total number of genomes in a species and the proportion colored in purple indicates the size of the single largest genome cluster of that species. Every two adjacent species have a similar total number of genomes.
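
The redundancy effect described in panels c to f can be illustrated with a toy calculation. The sketch below is not Maast's implementation: the cluster assignments, site names, and the rule of taking the first genome in each cluster as its tag are all hypothetical stand-ins, and the denominator of the tag-only fraction (SNPs found by either approach) is an assumption about how "a fraction of all SNPs" is meant. It only shows how one large cluster of redundant genomes can push a site's minor allele frequency (MAF) below a 1% cutoff.

```python
MAF_THRESHOLD = 0.01  # 1% cutoff, as in the caption


def minor_allele_frequency(alleles):
    """MAF of a biallelic site given 0/1 allele calls, one per genome."""
    alt = sum(alleles) / len(alleles)
    return min(alt, 1.0 - alt)


def pick_tags(cluster_ids):
    """Index of the first genome seen in each cluster.

    Hypothetical stand-in for tag genome selection, used only for illustration.
    """
    seen, tags = set(), []
    for i, cluster in enumerate(cluster_ids):
        if cluster not in seen:
            seen.add(cluster)
            tags.append(i)
    return tags


def call_snps(sites, genome_idx, threshold=MAF_THRESHOLD):
    """Names of sites whose MAF among the chosen genomes meets the threshold."""
    return {
        name
        for name, alleles in sites.items()
        if minor_allele_frequency([alleles[i] for i in genome_idx]) >= threshold
    }


# Toy species: 1000 genomes, 600 in one redundant cluster, 400 singletons.
n_genomes = 1000
cluster_ids = [0] * 600 + list(range(1, 401))

# site_rare: minor allele carried by 5 singleton genomes.
# site_common: minor allele carried by 200 singleton genomes.
sites = {
    "site_rare": [0] * 995 + [1] * 5,
    "site_common": [0] * 800 + [1] * 200,
}

all_idx = list(range(n_genomes))
tag_idx = pick_tags(cluster_ids)

snps_all = call_snps(sites, all_idx)
snps_tag = call_snps(sites, tag_idx)

# Tag-only SNPs: called with tag genomes but not with all genomes.
tag_only = snps_tag - snps_all
# Assumed denominator: SNPs discovered by either approach.
tag_only_fraction = len(tag_only) / len(snps_tag | snps_all)
```

In this toy species, `site_rare` has a MAF of 5/1000 (0.5%) across all genomes but 5/401 (about 1.2%) across the 401 tag genomes, so it is called only in the tag-genome analysis and counts as a tag-only SNP.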
