Species-resolved sequencing of low-biomass or degraded microbiomes using 2bRAD-M

Microbiome samples with low microbial biomass or severe DNA degradation remain challenging for amplicon-based or whole-metagenome sequencing approaches. Here, we introduce 2bRAD-M, a highly reduced and cost-effective strategy which only sequences ~ 1% of metagenome and can simultaneously produce species-level bacterial, archaeal, and fungal profiles. 2bRAD-M can accurately generate species-level taxonomic profiles for otherwise hard-to-sequence samples with merely 1 pg of total DNA, high host DNA contamination, or severely fragmented DNA from degraded samples. Tests of 2bRAD-M on various stool, skin, environmental, and clinical FFPE samples suggest a successful reconstruction of comprehensive, high-resolution microbial profiles. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-021-02576-9.


Species-resolved sequencing of low-biomass microbiomes by 2bRAD-M
This file includes:

Supplemental Methods
Fig S1 to S12.

Supplemental Methods
Supplemental Figures   Fig S1. Distribution of unique 2bRAD tags and theoretically existent 2bRAD tags on different taxonomy levels.          Tables   Table S1. Availability of 2bRAD-M markers for taxonomic profiling at each of the taxonomic levels. Table S2. Expected abundance of bacterial species in simulation data and profiling results by the 2bRAD-M computational pipeline. Table S3. The relative enrichment of 2bRAD reads originated from microbial species versus those originated from host in the high-host-contamination (HoC) group. Table S4. The relative abundance of major taxa identified in the three fecal samples at the species level using 2bRAD-M or WMS or at the genus level using 16S rRNA gene amplicon sequencing. Table S5. The relative abundance of microbial species that are uniquely detected in the WMS or 2bRAD-M data of fecal samples. Table S6. The initial DNA content and metadata for underarm skin, home and car samples. Table S7. The 2bRAD-M profiling results for underarm, home and car samples. Table S8. The relative abundance of bacteria, fungi and archaea in the indoor built-environmental samples. Table S9. Species-level microbial organismal markers for the highly reliable diagnosis of cervical cancer from FFPE samples. Table S10. The adaptors and primers used in 2bRAD-M sequencing (5'-3').

Feasibility of 2bRAD-M for microbiome profiling by additional type IIB restriction enzymes
To test the feasibility of 2bRAD-M for microbiome profiling, two fundamental questions would be addressed in the in silico experiments based on an extensive set of microbial genomes: (i) whether the surveyed 2bRAD-M data is a reliable reduced representation of the microbial genomes; (ii) whether the surveyed 2bRAD-M data harbor phylogenetic markers that can enable the taxonomic profiling of microbial taxa at the species level.
We started by downloading 173,165 microbial genomes from NCBI RefSeq (Oct, 2019), including 15,162 bacterial, archaea and fungal complete genomes. The digital restriction digestion of all these genomes by an IIB restriction enzyme (such as BcgI) resulted in averagely 2930.38±2790.84 2bRAD-M tags per genome. To date, there are totally 16 type IIB restriction enzymes discovered, and they have distinct DNA recognition sites. We thus performed the digital digestion of all microbial genomes using all these 16 type IIB restriction enzymes, which produced multiple and flexible reduced representations of each microbial genome (Fig. S1b). Collectively, we identified and collected the restriction fragments of all microbial genomes using 16 restriction enzymes, which represent the most comprehensive 2bRAD-M reference genome database.
To assess whether the surveyed restriction fragments represent a random subset of a given microbial genome, we compared a number of features of the digitally digested DNA fragments from a given microbial genome to those from the entire genome. We found that the surveyed fragments are typically evenly distributed along a microbial genome, and across genic and non-genic regions.
Likewise, %G + C content (53%) of surveyed 2bRAD-M tags are very similar to the genome-wide averages (Pearson's correlation R=0.992). Furthermore, the number of 2bRAD-M tags is highly correlated with the genome size of a given microbe (Pearson's correlation R=0.976). This suggests that 2bRAD-M fragments can be employed to survey genome-wide features of microbes without requiring sequencing the full genome, regardless of the specific type-2B enzyme used here (Fig. S2).
We next sought to identify universal 2bRAD-M phylogenetic markers from a total set of 173,165 reference microbial genomes. Different strategies have been introduced to determine microbial community compositions and estimate their abundances from metagenomic data. Our approach is to identify taxa-specific DNA markers by analyzing the 2bRAD-M reference genome database and further quantify the read coverage of those markers for taxonomic profiling from 2bRAD-M data.
Therefore, desired DNA markers in our study should be specific to taxa (i.e., species), iso-length (around 33 bp long) and short DNA fragments that only occur once per genome. Overall, the higher taxonomic level, the more 2bRAD-M tags are available (Fig. S1a). At the Kingdom level, almost all 2bRAD-M tags are kingdom-specific, thus there are very few shared 2bRAD-M tags among bacteria, fungi, archaea and human, regardless of the restriction enzymes. This suggested that the abundance ratio between the kingdoms can be readily derived from the 2bRAD-M data (yet can be challenging for WMS). The phylum-specific 2bRAD-M markers accounted for up to 90% ~ 97% of all theoretical 2bRAD-M tags produced from a given restriction enzyme from a given microbial genome. We next explored the 2bRAD-M markers specific to the 26,163 microbial species. Among all 521,289,189 restriction fragments produced by one typical type IIB restriction enzyme (BcgI), 99.21% are singlecopy within a given microbial genome, while averagely 21.86% are specific to species-level taxa ( Table S1). The other restriction enzymes can also generate distinct sets of 2bRAD-M tags from each 6 microbial genome. In fact, 18.81%-25.80% single-copy species-specific markers were identified from the 2bRAD-M genomes digested by the other restriction enzymes. Therefore, in principle, 2bRAD-M data provides a rich and highly flexible source of phylogenetic markers for metagenomic profiling.  Correlation of fragment size (left panel) or GC content (right panel) is shown. For a given genome, the collective size of all DNA tags cleaved by a type IIB restriction enzyme corresponds to a reduction in sequencing for one to two orders of magnitude (depending on genome size).

Fig S3. Comparison of the profiling performance between individual Type IIB restriction
enzymes and a combined set of them. The bar plot shows the taxonomic profiling performance (L2 similarity and Pearson correlation) of the 2bRAD marker set from one of the 16 Type IIB restriction enzymes or a combined set. The X-axis shows Type IIB restriction enzymes and the percentage of original microbial genomic content that corresponding 2bRAD fragments can represent, whereas Yaxis shows the profiling performance. The yellow bars represent the L2 similarity between predicted and ground-truth abundances, and the green bars refers to the Pearson correlation between them. The combined marker set from 16 Type IIB restriction enzymes does not significantly improve performance in abundance estimation as compared to that from individual Type IIB restriction enzymes.

Fig S4. Rarefaction analysis reveals the desirable sequencing depth for 2bRAD-M and WMS
for reliable taxonomic profiling. In each scatter plot, we compared the profiling results of a fecal sample based on a method (either 2bRAD-M or WMS) at deep or shallow (by subsampling) depth of sequencing, via Shannon diversity, beta diversity (on the Bray-Curtis similarity), and species richness.
Based on the Shannon diversity and Bray-Curtis similarity, profiling performance of 2bRAD-M quickly saturates at a shallow sequencing depth (2-3 million reads per sample). In contrast, for the same metrics, WMS-based taxonomic profiles saturate at a far deeper sequencing depth (about 50 million reads per sample), suggesting much higher sequencing costs. The microbial richness (number of species-level taxa detected) based on both methods still grows as the sequencing depth increases, which is consistent with current knowledge on metagenomic diversity analysis. The inset (Venn diagram) shows the overlapping fraction of identified taxa between 16S rRNA and 2bRAD-M profiles. Results for each of the 32 microbiome samples from underarm, car surfaces and home surfaces were presented.      Fig S12. G score provides a higher precision in taxonomic profiling than the relative abundance based on simulated sequencing data. The identified species is ranked by abundance (by Bracken in the upper plot) or G score (by 2bRAD-M in the lower plot), with color indicating whether it is a false positive (FP). Bracken generates far more FPs than 2bRAD-M, as many FPs are in high abundance for Bracken. In contrast, the G score boundary between true positives and FPs are more prominent when using 2bRAD-M.  Tables   Table S1. Availability of 2bRAD-M markers for taxonomic profiling at each of the taxonomic levels. In each row, the value indicates the average percentage of taxa-specific 2bRAD-M tags in all 2bRAD-M tags produced by a given type IIB enzyme. Those 2bRAD-M marker tags are all singlecopy in a microbial genome and specific to a given taxon. Thus for each of the type IIB enzymes and at each of the taxonomic levels, 2bRAD-M markers or taxonomic profiling are abundant.