Summary of the gene content determination pipeline. (a) Data selection procedure for the analysis of metagenomic gene content variation. The initial dataset consisted of 252 metagenomic samples and a non-redundant set of reference genomes representing 929 species, defined on the basis of 40 universal single-copy marker genes. Metagenomic reads from each sample were aligned to the reference genome of each species, followed by a multi-step filtering procedure for sample and genome selection. The final dataset comprised samples from 103 individuals that mapped to 11 species. (b) Gene coverage of core and accessory genes of one species (Dialister invisus) in 10 individuals. This species exemplifies the typical variability across individuals in the coverage and genomic location of core and accessory genes. Green denotes core genes, red denotes accessory genes, and white denotes missing genes. The bottom bar shows the cross-sample consensus core-accessory status of each gene, denoting the core, accessory and missing gene regions. (c) Boxplot of the percentage of accessory genes in Dialister invisus, calculated from a subsampling procedure. The median values at different sample sizes were used to fit an exponential regression model. (d) Fitted exponential regression models for the 11 gut bacterial species, obtained with the same approach as in (c).
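The subsampling and curve fitting described for panels (c) and (d) can be illustrated with a minimal sketch. The caption does not give the model's functional form or the exact definition of an accessory gene, so the code below makes assumptions: an accessory gene is taken to be one observed in at least one but not all samples of a subsample, and the model is assumed to be y = a * exp(b * n), fitted by least squares on ln(y). The function names, the number of draws, and the toy data are all hypothetical.

```python
import math
import random
from statistics import median


def accessory_percent(gene_sets):
    """Percent of observed genes that are missing from at least one sample.

    Assumption: 'core' = present in every sample of the subsample,
    'accessory' = present in some but not all of them.
    """
    union = set().union(*gene_sets)
    if not union:
        return 0.0
    core = set.intersection(*gene_sets)
    return 100.0 * (len(union) - len(core)) / len(union)


def subsample_medians(samples, sizes, n_draws=200, seed=0):
    """Median accessory percentage over random subsamples of each size."""
    rng = random.Random(seed)
    medians = {}
    for k in sizes:
        values = [accessory_percent(rng.sample(samples, k))
                  for _ in range(n_draws)]
        medians[k] = median(values)
    return medians


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by linear regression on ln(y).

    Assumed model form; all ys must be strictly positive.
    """
    if any(y <= 0 for y in ys):
        raise ValueError("log-linear fit needs strictly positive y values")
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxl / sxx
    a = math.exp(mean_l - b * mean_x)
    return a, b


if __name__ == "__main__":
    # Toy data (hypothetical): 10 individuals, 80 shared genes plus
    # 20 genes that are each present with probability 0.7.
    rng = random.Random(1)
    toy = [set(range(80)) | {g for g in range(80, 100) if rng.random() < 0.7}
           for _ in range(10)]
    med = subsample_medians(toy, sizes=range(2, 11))
    a, b = fit_exponential(list(med), list(med.values()))
    print(med, a, b)
```

Subsample sizes start at 2 here because a single sample has no accessory genes under this definition, which would make ln(y) undefined. A saturating form of the exponential model may describe such curves better, but it would need a nonlinear fitting routine rather than the log-linear shortcut used in this sketch.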