bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes

Most microbes cannot be easily cultured, and metagenomics provides a means to study them. Current techniques aim to resolve individual genomes from metagenomes, so-called metagenome-assembled genomes (MAGs). Leading approaches depend upon time series or transect studies, the efficacy of which is a function of community complexity, target abundance, and sequencing depth. We describe an unsupervised method that exploits the hierarchical nature of Hi-C interaction rates to resolve MAGs using a single time point. We validate the method and directly compare against a recently announced proprietary service, ProxiMeta. bin3C is an open-source pipeline and makes use of the Infomap clustering algorithm (https://github.com/cerebis/bin3C). Electronic supplementary material The online version of this article (10.1186/s13059-019-1643-1) contains supplementary material, which is available to authorized users.


Background
The number of microbial organisms which can be readily investigated using culture-based techniques is relatively small in proportion to the Earth's apparent total diversity [1,2]. Although concerted efforts have found the individual conditions necessary to cultivate a relatively small number of species in the laboratory [3][4][5], scaling-up this discovery process to the remaining majority is daunting, if not intractable.
Beyond the issue of cultivation, an environmental population can possess at once phenotypic microdiversity and within that group large differences in gene content. With as little as 40% of genes shared within a species [6], this accessory genome is thought to contribute significantly to the dynamics of microbial adaptation in the environment [7][8][9]. Phylogenetic marker surveys (16S amplicon sequencing), while still informative, stand essentially as a proxy for broader discovery processes of the genomic landscape, should they exist. The systematic extraction of entire genomes from an environment will enable a more thorough determination of the constituent species core *Correspondence: matthew.demaere@uts.edu.au The ithree institute, University of Technology Sydney, 15 Broadway, Ultimo, NSW, 2007 Australia and accessory gene content (pangenome). The extracted pangenome and community profile will enable investigation of the functional basis of species fitness and niche partitioning within an environment, and further longitudinal experiments will permit studying the dynamics.
Metagenomics offers a direct culture-independent sampling approach as a means to study the unculturable majority. Recent advances in this field have begun to make possible the systematic resolution of genomes from metagenomes, so-called metagenome-assembled genomes (MAGs) [10]. Tools designed to assess the quality of retrieved MAGs [11,12] have brought with them suggestions for categorical quality rankings (Table 1). Marking an increasing acceptance, the Genomic Standards Consortium (GSC) recently introduced standardized reporting criteria (Table 2) for the submission of MAGs to public archives [13], and as of mid-2018, there are more than 5200 MAGs registered in the Genomes Online Database (GOLD) [14]. As retrieval methodologies improve and new complex environments are studied, the registration rate of new MAGs is expected to eventually exceed that of culture-based studies [13].
Most current approaches to the accurate retrieval of MAGs (also called genome binning or clustering) © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Completeness and contamination are independently ranked and are intended to be used in conjunction, e.g., "nearly complete and low contamination." For simplicity when counting MAGs, we have used equivalent ranks together, but only refer to that of completeness, i.e., 90/5, 70/10, and 50/15 depend on longitudinal or transect data series, operating either directly on WGS sequencing reads (LSA) [15] or on assembly contigs (CONCOCT, GroopM, metaBAT, MaxBin2, Cocacola) [16][17][18][19][20]. The need for multiple samples can, however, pose a barrier both in terms of cost of sequencing and the logistics of obtaining multiple samples as, for instance, with clinical studies. As an alternative single-sample approach, Hi-C (a high throughput sequencing technique which captures in vivo DNA-DNA proximity) can provide significant resolving power from a single time point when combined with conventional shotgun sequencing. The first step of the Hi-C library preparation protocol is to crosslink proteins bound to DNA in vivo using formalin fixation. Next, cells are lysed and the DNA-protein complexes are digested with a restriction enzyme to create free ends in the bound DNA strands. The free ends are then biotin labelled and filled to make blunt ends. Next is the important proximity ligation step, where blunt ends Low-quality draft < 50, < 10 The "finished" rank is left to future advances, while lower ranks are achievable now by Hi-C-based genome binning methods. The additional criterion of rRNA genes makes the "high-quality" rank challenging to achieve with current methods are ligated under dilute conditions. This situation permits ligation to occur preferentially among DNA strands bound in the same protein complex, that is to say, DNA fragments which were in close proximity in vivo at the time of crosslinking. Crosslinking is then reversed, the DNA is purified, and a biotin pull-down step employed to enrich for proximity junction containing products. Lastly, an Illumina-compatible paired-end sequencing library is constructed. After sequencing, each end of a proximity ligation containing read-pair is composed of DNA from two potentially different intra-chromosomal, interchromosomal, or even inter-cellular loci. As a high-throughput sequencing adaptation of the original 3C (chromosome conformation capture) protocol, Hi-C was originally conceived as a means to determine, at once, the 3-dimensional structure of the whole human genome [21]. The richness of information captured in Hi-C experiments is such that the technique has subsequently been applied to a wide range of problems in genomics, such as genome reassembly [22], haplotype reconstruction [23,24], assembly clustering [25], and centromere prediction [26]. The potential of Hi-C (and other 3C methods) as a means to cluster or deconvolute metagenomes into genome bins has been demonstrated on simulated communities [27][28][29] and real microbiomes [30,31].
Most recently, commercial Hi-C products ranging from library preparation kits through to analysis services [31,32] have been announced. These products aim to lessen the experimental challenge in library preparation for non-specialist laboratories while also raising the quality of data produced. In particular, one recently introduced commercial offering is a proprietary metagenome genome binning service called ProxiMeta, which was demonstrated on a real human gut microbiome, yielding state-of-the-art results [31].
Here, we describe a new open software tool bin3C which can retrieve MAGs from metagenomes by combining conventional metagenome shotgun and Hi-C sequencing data. Using a simulated human fecal microbiome, we externally validate the binning performance of bin3C in terms of adjusted mutual information and B 3 precision and recall against a ground truth. Finally, for a real microbiome from human feces, we compare the retrieval performance of bin3C against that published for the ProxiMeta service [31].

Simulated community
To test the performance of our tool on the task of genome binning, we designed a simulated human gut microbiome from 63 high-quality draft or better bacterial genomes randomly chosen from the Genome Taxonomy Database (GTDB) [33]. Candidate genomes were required to possess an isolation source of feces while not specifying a host other than human. To include only higher quality drafts, the associated metadata of each was used to impose the following criteria: contig count ≤ 200, CheckM completeness > 98%, MIMAG quality rank of "high" or better, and lastly a total gap length < 500 bp. For these metadatabased criteria, there were 223 candidate genomes.
In addition to the metadata-based criteria, FastANI (v1.0) [34] was used to calculate pairwise average nucleotide identity (ANI) between the 223 candidate genome sequences. As we desired a diversity of species and mostly unambiguous ground truth, a maximum pairwise ANI of 96% was imposed on the final set of genomes. This constraint controlled for the over-representation of some species within the GTDB. Additionally, when two or more genomes have high sequence identity, the assignment process becomes more difficult and error-prone as it challenges both the assembler [35] and creates ambiguity when assigning assembly contigs back to source genomes.
The resulting 63 selected genomes had an ANI range of 74.8 to 95.8% (median 77.1%) and GC content range of 28.3 to 73.8% (median 44.1%) ( Fig. 1 and Additional file 1: Table S1). A long-tailed community abundance profile was modeled using a generalized Pareto distribution (parameters: shape = 20, scale = 31, location = 0) (Additional file 2: Figure S1), where there was approximately a 50:1 reduction in abundance from most to least abundant.
Lastly, before read simulation, genomes in multiple contigs were converted to a closed circular form by concatenation, thereby simplifying downstream interpretation.

Read-set generation
To explore how increasing depth of coverage affects bin3's ability to correctly retrieve MAGs, Hi-C read-sets were generated over a range of depths while keeping shotgun coverage constant. Hi-C depth was parameterized simply by the total number of pairs generated, while shotgun depth was parameterized by the depth of the most abundant community member.
From this definition, an initial read-set with high depth of coverage was produced with 250× shotgun and 200 million Hi-C pairs. The shotgun dataset at this depth constituted 18.2 M pairs. Shotgun reads were generated using the metagenomic shotgun simulator MetaART which wraps the short-read simulator art_illumina (v2.5.1) [36,37] (options: -M 100 -S 12345 -l 150 -m 350 -s 20 -z 1).
Hi-C reads were generated in two equal parts from two different four-cutter restriction enzymes (NEB names: MluCI and Sau3AI) using sim3C [37] (options: -e ${enzyme} -m hic -r 12345 -l 150 -insert-sd 20 -insert-mean 350 -insert-min 150 -linear -simple-reads). Two enzymes were used to mimic the library construction of the real dataset we also analyzed. Repositories Fig. 1 Taxonomic distribution at the order rank of 63 selected bacterial genomes used in the simulated community. The number of each order is a product of the taxonomic distribution of genomes existing in the GTDB, while the constraint that no two genomes be more similar than 96% ANI restricts the over-representation of deeply sequenced species containing Sim3C and MetaART can be found at https://github.com/cerebis/sim3C and https://github. com/cerebis/meta-sweeper, respectively.
From the initial read-set, a parameter sweep was produced by serially downsampling the initial read-set by factors of 2 using BBTools (v37.25) [38]. The initial Hi-C read-set was reduced 4 times for a total of 5 different depths or 200M, 100M, 50M, 25M, and 12.5M pairs (command: reformat.sh sampleseed=12345 samplerate=${d}). In terms of the community genomes, depth of coverage for the subsampling with the greatest reduction factor ranged from 3.5× to 171× for Hi-C.

Ground truth inference
For the task of the whole-community genome binning, a ground truth was constructed by aligning scaffolds resulting from the SPAdes assembly to the "closed" reference genomes using LAST (v941) [39]. From the LAST alignments, overlapping source assignment was determined using a methodology we have described previously [35] and implemented as the program alignmentToTruth.py (see Availability section). An overlapping (soft) ground truth better reflects the possibility of coassembly of sufficiently similar regions among reference genomes and the tendency that these regions cause breakpoints in assembly algorithms, leading to highly connected assembly fragments which belong equally well to more than one source.

Performance metrics
To validate genome binning, we employed two extrinsic measures: adjusted mutual information (AMI) (sklearn v0.19.2) and weighted Bcubed B 3 . AMI is a normalized variant of mutual information which corrects for the tendency of the number of random chance agreements between clusters to increase as the number of clusters increases for a given problem size [40]. Weighted B 3 is a soft extrinsic measure which, analogous to the F-measure, is the harmonic mean of the B 3 formulation of precision and recall. Here, precision is a measure of cluster homogeneity (like with like), while recall is a measure of the cluster completeness. The B 3 measure handles overlapping (soft) clusters and better satisfies the constraints that an ideal metric should possess, i.e., homogeneity, completeness, rag-bag, and size vs quantity when compared to other metrics. While AMI treats all objects under study as having equal value, weighted B 3 allows the value of objects to vary, and we chose to use contig length as a representation of value for the problem of genome binning [35,[41][42][43]. Therefore, the weighted B 3 results measure the fraction of the genome binned accurately, not just the number of contigs.
In employing two measures, we seek to gain confidence in their agreement while also obtaining the additional insight afforded by the separate facets B 3 precision and recall.

Real microbiome
To demonstrate bin3C on real data and make a direct comparison to the proprietary Hi-C-based genome binning service (ProxiMeta), we obtained the publicly available high-quality combined whole-metagenome shotgun and Hi-C sequencing dataset used in the previous study [31]. The dataset derives from the microbiome of a human gut (BioProject: PRJNA413092, Acc: SRR6131122, SRR6131123, and SRR6131124).
For this dataset, two separate Hi-C libraries (SRR6131122, SRR6131124) were created using two different four-cutter restriction enzymes (MluCI and Sau3AI). In using two enzymes, the recognition sites were chosen to be complementary in terms of GC content. When the libraries were subsequently combined during the generation of the contact map, site complementarity provided a higher and more uniform site density over a wider range of target sequence. We conjecture that for metagenome deconvolution, site complementarity is particularly helpful in obtaining a consistent signal from all community members, while higher site density improves recovery of smaller assembly fragments.
All read-sets were obtained from an Illumina HiSeq X Ten at 150 bp. After cleanup (described below), the shotgun read-set (SRR6131123) consisted of 248.8 million paired-end reads, while the 2 Hi-C libraries consisted of 43.7 million (SRR6131122) and 40.8 million (SRR6131124) paired-end reads.

Initial processing
Read cleanup is occasionally overlooked in the pursuit of completing the early stages of genomic analysis. This initial processing step is however essential for optimal shotgun assembly and particularly for Hi-C read mapping where remnants of adapter sequence, PhiX, or other contaminants can be a significant noise source.

Hi-C read mapping
As bin3C is not aimed at assembly correction, we opted to use assembly scaffolds rather than contigs as the target for genome binning, electing to trust any groupings of contigs into scaffolds done by SPAdes.
Both simulated and real Hi-C reads were mapped to their respective scaffolds using BWA MEM (v0.7.17-r1188) [45]. During mapping with BWA MEM, read pairing and mate-pair rescue functions were disabled and primary alignments forced to be the alignment with lowest read coordinate (5 end) (options: -5SP). This latter option is a recent introduction to BWA at the request of the Hi-C bioinformatics community. The resulting BAM files were subsequently processed using samtools (v1.9) [46] to remove unmapped reads and supplementary and secondary alignments (exclude filter: -F 0x904) then sorted by name and merged.

Contact map generation
The large number of contigs (> 500,000) typically returned from metagenomic shotgun assemblies for nontrivial communities is a potential algorithmic scaling problem. At the same time, biologically important contigs can be on the order of 1000 bp or smaller, challenging the effective analysis of metagenomic datasets from both sides.
A Hi-C analysis, when conducted in the presence of experimental biases, involves the observation of proximity ligation events, which in turn rely on the occurrence of restriction sites. The signal we desire to exploit is therefore not smoothly and uniformly distributed between and across all contigs. As a counting experiment, the shortest contigs can be problematic as they tend to possess a weaker signal with higher variance; as a result, they can have a deleterious effect on normalization and clustering if included. Therefore, bin3C imposes constraints on minimum acceptable length (default, 1000 bp) and minimum acceptable raw signal (default, five non-self observations) for contig inclusion. Any contig which fails to meet these criteria is excluded from the clustering analysis.
With this in mind, bin3C constructs a contact map from the Hi-C read-pairs. As in previous work [27], the bins pertain to whole contigs and capture global interactions, which work effectively to cluster a metagenome into genome bins. In doing so, we make the implicit assumption that assembly contigs contain few misassemblies that would confound or otherwise invalidate the process of partitioning a metagenome into genome bins.
bin3C can also optionally construct a contact map binned on windows of genomic extent. These maps are not used in the analysis per se but can be used to plot visual representation of the result in the form of a heatmap (Additional file 2: Figure S2).

Bias removal
The observed interaction counts within raw Hi-C contact maps contain experimental biases, due in part to the factors such as mappability of reads, enzyme digestion efficiency, in vivo conformational constraints on accessibility, and restriction site density. In order to apply Hi-C data to genome binning, a uniform signal over all DNA molecules would be ideal, free of any bias introduced by the factors mentioned above. Correcting for these biases is an important step in our analysis, which is done using a two-stage process. First, for each enzyme used in library preparation, the number of enzymatic cut sites are tallied for each contig. Next, each pairwise raw Hi-C interaction count c ij between contigs i and j is divided by the product of the number of cut sites found for each contig n i and n j . This first correction is then followed by general bistochastic matrix balancing using the Knight-Ruiz algorithm [47].

Genome binning
After bias removal, the wc-contact map (whole contig) is transformed to a graph where nodes are contigs and edge weights are normalized interaction strength between contigs i and j. It has been shown that DNA-DNA interactions between loci within a single physical cell (intra-cellular proximity interactions) occur an order of magnitude more frequently than interactions between cells (inter-cellular) [27], and in practice, the signal from inter-cellular interactions is on par with experimental noise. The wc-graph derived from a microbial metagenome is then of low density (far from fully connected), being composed of tightly interacting groups (highly modular) representing intra-cellular interactions and against a much weaker background of experimental noise. Graphs with these characteristics are particularly well suited to unsupervised cluster analysis, also known as community detection. Unsupervised clustering of the wc-graph has previously been demonstrated using Markov clustering [27,48] and the Louvain method [29,49]. In a thorough investigation using ground truth validation, we previously found neither method to be sufficiently efficacious in general practice [35]. Despite the high signal to noise from recent advances in library preparation methods, accurate and precise clustering of the wc-graph remains a challenge. This is because resolving all of the structural detail (all of the clusters) becomes an increasingly fine-grained task as a graph grows in size and number of clusters. Clustering algorithms can, in turn, possess a resolution limit if a scale exists below which they cannot recover a finer detail. As it happens, modularity-based methods such as Louvain have been identified as possessing such a limit [50]. For Hi-C-based microbiome studies, the complexity of the community and the experiment is sufficient to introduce significant variance within structural features of the wc-graph. Wide variance in structural features, such as the size of clusters and the weight of intracluster edges relative to the whole graph, makes a complete reconstruction difficult for algorithms with limited resolution.
The state of unsupervised clustering algorithms has however been advancing. Benchmarking standards have made a thorough extrinsic validation of new methods commonplace [51], and comparative studies have demonstrated the capability of available methods [52]. Infomap is another clustering algorithm, which like Markov clustering is based upon flow [53,54]. Rather than considering the connectivity of groups of nodes versus the whole, flow models consider the tendency for random walks to persist in some regions of the graph longer than others. Considering the dynamics rather than the structure of a graph, flow models can be less susceptible to resolution limits as graph size increases [55]. Additionally, the reasonable time complexity and the ability to accurately resolve clusters without parameter tuning makes Infomap well suited to a discovery science where unsupervised learning is required.
We have therefore employed Infomap (v0.19.25) to cluster the wc-graph into genome bins (options: -u -z -i link-list -N 10). Genome bins greater than a usercontrolled minimum extent (measured in base-pairs) are subsequently written out as multi-FASTA in descending cluster size. A per-bin statistics report is generated detailing bin extent, size, GC content, N50, and read depth statistics. By default, a whole sample contact map plot is produced for qualitative assessment.
In the following analyses, we have imposed a 50-kbp minimum extent on genome bins, partly for the sake of figure clarity and as a practical working limit for prokaryotic MAG retrieval. That is to say, being less than half the minimum length of the shortest known bacterial genome [56], it is unlikely that this threshold would exclude a candidate of moderate or better completeness. If a user is in doubt or has another objective in mind, the constraint can be removed.

Simulated community analysis
We validated the quality of bin3C solutions as Hi-C depth of coverage was swept from 12.5 M to 200 M pairs on an assembly (Fig. 2). A sharp gain in AMI, B 3 recall, and B 3 F-score was evident as Hi-C coverage rose from 12.5 M to 100 M pairs, while the gain between 100 M and 200 M pairs was less pronounced. Accompanying the upward trend for these first three measures was an inverse but relatively small change in B 3 precision. In terms of AMI, the highest scoring solution of 0.848 was at the greatest simulated depth of 200 M pairs. Concomitantly, this solution had B 3 precision, recall, and F-scores of 0.909, 0.839, and 0.873, respectively. For this highest depth sample, 22,279 contigs passed the bin3C filtering criteria and represented 95.4% of all assembly contigs over 1000 bp. There were 62 genome bins with an extent greater than 50 kbp, with total extent of 229,473,556 bp. This was 95.6% of the extent of the entire shotgun assembly, which itself was 91.1% of the extent of the set of reference genomes. The remaining small clusters of less than 50 kb extent totalled 1,413,596 bp or 0.6% of the assembly extent (Table 3), while unanalyzed contigs below 1000 bp represented 8,103,486 bp or 3.4%.
As a soft clustering measure, B 3 can consider overlaps both within predicted clusters and the ground truth. Regions of shared sequence within our simulated community meant that for 4.4% of assembly contigs, the assignment in the ground truth was ambiguous, being shared by two or more source genomes. Meanwhile, bin3C solutions are hard clusters placing contigs in only one genome bin. Even without mistakes, this leaves a small but unbridgeable gap between the ground truth and the best possible bin3C solution. Due to this, when overlap exists in the ground truth, the maximum achievable B 3 precision and recall will be less than unity. Conversely, AMI is a hard clustering measure that requires assigning each of these shared contigs in the ground truth to a single source genome through a coin-toss process. It remains, however, that when bin3C selects a bin for such contigs, either source would be equally valid. For this reason, AMI scores are also Despite these technicalities, a quantitative assessment of overall completeness and contamination is robustly inferred using B 3 recall and precision, as they consider contig assignments for the entirety of the metagenomic assembly. This is in contrast to marker gene-based measures of completeness and contamination, where only those contigs containing marker genes contribute to the score. The overall completeness of bin3C solutions, as inferred using B 3 recall, rose monotonically from 0.189 to 0.839 as Hi-C depth of coverage was increased from 12.5 M to 200 M pairs. At the same time, the overall contamination, as inferred using B 3 precision, dropped slightly from 0.977 to 0.909. Thus, bin3C responded positively to increased depth of Hi-C coverage while maintaining an overall low degree of contamination.
We validated our simulation sweep using the marker gene tool CheckM [11]. CheckM estimated that bin3C retrieved 33 nearly complete MAGs using 12.5 M Hi-C pairs, while 39 nearly complete were retrieved using 200 M pairs (Fig. 3). For the deepest run with the most retrieved MAGs, genome bins deemed nearly complete had a total extent which ranged from 1.56 to 6.97 Mbp, shotgun depth of coverage from 3.34× to 161.2×, N50 from 5797 bp to 2.24 Mbp, GC content from 28.0 to 73.9%, and number of contigs from 4 to 787 (Additional file 2: Figure S3 and Additional file 3: Table S2).
Broadening the count to include MAGs of all three ranks: moderate, substantial, and nearly (Table 1), 37 were retrieved at 12.5M Hi-C pairs, which increased to 48 when using 200 M Hi-C pairs. The small increase in the number of retrieved MAGs for the relatively large increase in Hi-C depth of coverage may seem perplexing, particularly in the face of a large change in the extrinsic validation measures AMI, B 3 recall, and F-score. To explain this, we referred to the cluster reports provided by bin3C, where for nearly complete MAGs, we found that the average number of contigs increased from 77 at 12.5 M pairs to 179 at 200 M pairs while the total number of contigs increased from 2550 to 6968. Thus, although marker gene-associated contigs are efficiently found at lower Hi-C depth of coverage, obtaining a more complete representation of each MAG can require significantly more depth.
With respect to the contamination as inferred by marker genes, CheckM estimated a low median contamination rate of 1.08% across all genome bins with completeness greater than 70%. CheckM, however, also identified 4 bins where contamination was estimated to be higher than 10% and for which marker gene counting suggested that 2 genomes had merged into a single bin. We interrogated the ground truth to determine the heritage of these bins and found that each was a composite of 2 source genomes, whose pairwise ANI values ranged from 93.1 to 95.8%. Each pair shared an average of 131 contigs within the ground truth with an average Jaccard index of 0.19, which was significant when compared against the communitywide average Jaccard of 6.5 × 10 −4 . Thus, a few members of the simulated community possessed sufficiently similar or shared sequence to produce coassembled contigs. Although the coassembled contigs were short, with a median length of 2011 bp, the degree of overlap within each pair was enough to produce single clusters for sufficiently deep Hi-C coverage. Reference genomes corresponding to 2 of these merged bins fall within the definition of intraspecies, with pairwise ANI values of 95.80% and 95.85%, respectively. The reference genomes involved with remaining two bins are close to this threshold, with ANI values of 93.1% and 93.5%. From this, we would concede that although bin3C is precise, it is not capable of resolving strains.

Library recommendations
The time, effort, and cost of producing a combined shotgun and Hi-C metagenomic dataset should be rewarded with good results. As bin3C is reliant on both the quality and quantity of data supplied, we felt it important to highlight two factors beyond Hi-C depth of coverage which can influence the results.
Shotgun sequencing data forms the basis on which Hi-C associations are made, and therefore, the more thoroughly a community is sampled, the better. To demonstrate how this affects bin3C, we reduced the shotgun depth of coverage of our simulated community by half (to 125×) and reassembled the metagenome. Basic assembly statistics for this half-depth assembly were N50 6289 bp and L50 4353. There were 43,712 contigs longer than 1000 bp with an extent of 187,388,993 bp, and overall, there were 113,754 contigs with the total extent of 222,522,774 bp. This contrasts to the full-depth (250×) assembly, which had N50 30,402 bp and L50 1105, with 23,364 contigs over 1000 bp with an extent of 232,030,334 bp, and 41,704 total contigs with an extent of 240,133,820 bp. Clearly, the reduction in shotgun depth has resulted in a more fragmented assembly. In particular, the decrease in depth has lead to a 45-Mbp drop in total extent for contigs longer than 1000 bp. This large proportional shift of assembly extent to fragments smaller than 1000 bp is significant as we have found that this length is an effective working limit within bin3C.
We then analyzed the resulting contigs with bin3C over the same range of Hi-C depth of coverage as before. Comparison of the AMI validation scores using the half and full depth assemblies (Fig. 4) shows that, for the more deeply sampled community, bin3C's reconstruction of the community members greatly improved. CheckM estimation of completeness and contamination followed a similar trend (Additional file 2: Figure S4), where the best result at half depth produced 25 nearly, 4 substantially, and 6 moderately complete MAGs, compared against 39 nearly, 4 substantially, and 5 moderately complete at full depth.
A recent trend in the preparation of metagenomic Hi-C libraries involves employing two different restriction enzymes during the digestion step [31]. The enzymes are chosen to have different GC biases at their restriction sites. For a microbial community with a diversity of species and consequently a wide range of GC content, the intent of this strategy is more uniform digestion of the extracted DNA, and therefore coverage of Hi-C reads across the metagenome. With wider and more uniform coverage, so the logic goes, should come improved results when performing Hi-C-based genome binning.
As our work already involved simulating a two-enzyme library, as used in recent real experiments [31], we elected to repurpose this data to ascertain what gain was had in using two enzymes rather than one alone. The two enzymes used in our simulated libraries are Sau3AI and MluCI. While the Sau3AI restriction site^GATC is GC balanced, the^AATT restriction site of MluCI is AT-rich. For our simulated community, source genomes ranged in GC content from 28.3 to 73.8% and their abundances were randomly distributed. For Sau3AI, these extremes of GC content translated to expected cut site frequencies of 1 in every 338 bp at 28.3% and 1 in every 427 bp at 73.8%. For the less balanced MluCI, the expected cut site frequencies were instead 1 in every 61 bp at 28.3% and 1 in every 3396 bp at 73.8%. Thus, relative to a naive four-cutter frequency of 1 in every 256 bp, while the predicted density of sites from Sau3AI is not ideal at either extreme, the site density of MluCI will be very high in the low GC range but very sparse at the high GC range.
For the simulated community full depth assembly, we used bin3C to analyze three Hi-C scenarios: two single enzyme libraries generated using either Sau3AI or Fig. 4 Adjusted mutual information (AMI) scores for bin3C solutions at two different shotgun depths of coverage. For our simulated community, shotgun libraries generated at 125× and 250× coverage demonstrate that although the depth of Hi-C coverage is crucial, so too is the depth of shotgun sequencing MluCI and a two-enzyme library using Sau3AI and MluCI together. The performance of bin3C was then assessed against the libraries at equal Hi-C depth of coverage using our ground truth. In terms of AMI, the performance of bin3C for the single enzyme libraries was less than that of the combined Sau3AI+MluCI library (Fig. 5). Although the gain was small at lower depth, the advantage of a two enzyme model grew as depth increased, where at 100 M Hi-C pairs, the AMI scores were MluCI 0.63, Sau3AI 0.71, and Sau3AI+MluCI 0.78.

Real microbiome analysis
We analyzed the real human gut microbiome (Table 3) with bin3C using the same parameters as with the simulated community along with a randomly generated seed (options: -min-map 60 -min-len 1000 -min-signal 5 -e Sau3AI -e MluCI -seed 9878132). Executed on a 2.6-GHz Intel Xeon E5-2697, contact map generation required 586 MB of memory and 15m26s of CPU time, while the clustering stage required 11.6 GB of memory and 9m06s of CPU time. Of the 95,521 contigs longer than 1000 bp, 29,653 had sufficient signal to be included in the clustering. The total extent of contigs greater than 1000 bp was 517,309,710 bp for the whole assembly, while those with sufficient Hi-C observations totalled 339,181,288 bp or 65.6% of all those in the assembly.
Clustering the contact map into genome bins, bin3C identified 296 genome bins with extents longer than 50 kbp and 2013 longer than 10 kbp. The 296 clusters longer than 50 kbp had a total extent of 290,643,239 bp, representing 40.4% of the total extent of the assembly, while clusters longer than 10 kbp totalled 324,223,887 bp in extent or 45.1% of the assembly. For clusters greater than 50 kb, shotgun depth of coverage ranged from 3.4× to 498×, N50 ranged from 3119 to 297,079 bp, GC content from 28.2 to 65.0%, total extent from 50,315 to 5,460,325 bp, and number of contigs from 1 to 495 (Additional file 4: Table S3).
We analyzed these 296 genome bins using CheckM (Fig. 6) [11]. For the proposed MAG ranking standard based on only measures of completeness and contamination (Table 1), bin3C retrieved 55 nearly, 29 substantially, and 12 moderately complete MAGs. In terms of total extent, MAGs ranked as nearly complete ranged from 1.68 Mbp to 4.97 Mbp, while for the substantially complete ranged from 1.56 to 5.46 Mbp and moderately complete ranged from 1.22 to 3.40 Mbp (Additional file 2: Table S4). In terms of shotgun coverage, MAGs ranked as nearly complete ranged from 5.9× to 447.5×, Fig. 5 For a simulated community whose GC content varied between 28.3 to 73.8%, bin3C retrieval performance improved when simulated reads were generated as if from a library prepared using a two enzyme digestion model (Sau3AI+MluCI), rather than if the library was prepared using either enzyme in isolation Fig. 6 bin3C retrieved MAGs from a real human gut microbiome, ordered by descending estimate of completeness (black circles). Plotted along with completeness is estimated contamination (gold circles). The y-axis grid lines pertain to thresholds used in quality assessment standards: completeness of 50%, 70%, and 90% and contamination of 5%, 10%, and 15%. Although there is a sharp falloff in completeness after roughly 75 MAGs, estimated contamination remains consistently low substantially from 4.3× to 416.4×, and moderately from 3.7× to 83.4×.
Using the more detailed ranking instead from the recently proposed extension to MIxS (Table 2) [13], the bin3C solution represented 17 high quality, 78 medium quality, and 105 low-quality MAGs. For the high-quality MAGs, shotgun coverage ranged from 10.7× to 447.5× and extent from 1.86 to 4.10 Mbp (Additional file 2: Table S5).

Comparison to previous work
The real microbiome we analyzed with bin3C was first described in a previous study to demonstrate a metagenomic Hi-C analysis service called ProxiMeta [31]. ProxiMeta is the only other complete solution for Hi-Cbased metagenome deconvolution with which to compare bin3C. As ProxiMeta is a proprietary service rather than open source software, the comparison was made by reanalysis of the same dataset as used in their work (Bioproject: PRJNA413092). As their study included a comparison to the conventional metagenomic binner MaxBin (v2.2.4) [57], which was one of the best performing MAG retrieval tools evaluated in the first CAMI challenge [58], we have included those results here as well. It should be noted that although MaxBin 2 is capable of multi-sample analysis, all software was run against a single shotgun sequencing sample. We have compared the CheckM validation of bin3C results to the CheckM validation of Prox-iMeta and MaxBin as provided in their supplementary data [59].
Regarding the simple ranking standard (Table 1), it was reported that ProxiMeta retrieved 35 nearly, 29 substantially, and 13 moderately complete MAGs, while MaxBin retrieved 20 nearly, 22 substantially, and 17 moderately complete MAGs. On the same metagenomic Hi-C dataset, we found that bin3C retrieved 55 nearly, 29 substantially, and 12 moderately complete MAGs (Fig. 7a). Against MaxBin, bin3C retrieved fewer moderately complete MAGs but otherwise bettered its performance. Against ProxiMeta, bin3C had equivalent performance for the substantially and moderately complete ranks, while retrieving 20 additional nearly complete genomes, representing an improvement of 57%.
In terms of the more complex MIMAG standard (Table 2), it was reported that ProxiMeta retrieved 10 high-and 65 medium-quality MAGs, while MaxBin retrieved 5 high-and 44 medium-quality MAGs. The bin3C solution retrieved 17 high-and 78 medium-quality MAGs, which against ProxiMeta represents 70% improvement in high-quality MAG retrieval from the same sample (Fig. 7b).
It was demonstrated previously that ProxiMeta possessed a higher binning precision than MaxBin and resulted in a much lower rate of contamination [31]. We have found that the precision of bin3C improves on the mark set by ProxiMeta. bin3C's gains, when retrieving MAGs in the highest quality ranks, are mainly due to the rejection of fewer bins for excessive contamination. For all genome bins over 1 Mbp in extent, bin3C had a median contamination rate of 0.8%, while for In comparison with existing conventional and Hi-C-based single-sample metagenome binning tools, bin3C performs well. When compared by ranking standards, based either on measures of completeness and contamination only (a) [11] or the recent GSC MIMAG reporting standard (b) [13], bin3C retrieves a higher or equivalent number of MAGs in each category. The apparent stringency of the MIMAG high quality is primarily due to the requirement that 5S, 16S, and 23S rRNA genes be present ProxiMeta, median contamination was 3.5% and MaxBin was 9.5%.

Discussion
We have introduced bin3C, an openly implemented and generic algorithm which reproducibly and effectively retrieves MAGs on both simulated and real metagenomic data.
To demonstrate this, we assessed bin3C's retrieval performance on a simulated human gut microbiome, by way of a ground truth and the extrinsic validation measures of AMI, as well as B 3 precision, recall, and F-score (Fig. 2). bin3C proved to be consistently precise over a wide range of Hi-C depth of coverage, while recall and the overall quality of solutions improved substantially as more Hi-C data was included. Although a high shotgun depth of coverage is not necessary to obtain low contamination MAGs, greater depth of shotgun sequencing has a strongly positive influence on the recall and overall completeness of MAG retrieval (Fig. 4).
Hi-C MAGs have a characteristically low rate of contamination by foreign genomic content [31]. On a real human gut microbiome, we have shown that bin3C achieves a lower estimated rate of contamination than both the conventional metagenome binner MaxBin [57] and the recently introduced commercial Hi-C analysis service ProxiMeta [31]. For all bins over 1 Mbp as determined by each approach, bin3C's median contamination rate was 0.8%, while MaxBin was 9.5% and ProxiMeta was 3.5%.
This low contamination rate is the primary reason why bin3C attained the most complete retrieval of MAGs from the real human gut dataset when compared to MaxBin and ProxiMeta (Fig. 6). Retrieving 20 more nearly complete MAGs than ProxiMeta, bin3C achieved a gain of 57% on this previous best result (Fig. 7a). For the stringent GSC MIMAG high-quality ranking, bin3C retrieved 17 MAGs from the gut microbiome, a gain of 70% against the previous best result (Fig. 7b).
For best results, we recommend that Hi-C metagenomic libraries be constructed using a two enzyme digestion model.

Limitations and future work
The ground truth as determined in our work is imperfect and this has an impact on the extrinsic validation metrics. Notably, when a simulated community possesses multiple strains of a single species, parts of a contig can derive from one strain while other parts of the same contig may be a coassembly of multiple strains, and this is not accommodated by the ground truth. The plethora of extrinsic validation measures from which to choose also have their limitations and differences [42,43,52]. Though we chose measures which we felt best suited our problem space, these are not in widespread use. Different measures can have significantly different opinions on the agreement between a ground truth and a given solution. Those with the lowest scoring results are not always the most readily chosen for publication.
The use of non-trivial simulated microbial communities makes determining ground truth and measuring accuracy difficult, and yet, these are a crucial element of the development process if the resulting methods are to be robust in real experimental use. Under such circumstances, we work from the premise that achieving close to unity on strong validation measures is unlikely to be possible. In our work here, bin3C demonstrated a B 3 precision varying between 0.909 and 0.977, while in work pertaining to metagenome binning with multiple samples, precision values as high as 0.998 were reported using a different formulation of the measure [18]. In practical terms by using CheckM as an operational measure of precision, bin3C achieved a much lower rate of MAG contamination on real data than has previously been reported.
Though marker gene-based validation with tools such as CheckM or BUSCO [11,12] are of great value and easily applied to our work, as validators, their perception is limited only to those sequences which contain marker genes. Ideally, metagenome binning approaches should aim to gather together all the sequence fragments pertaining to a given genome and not only those which contained marker genes. The generalizability of an approach is not assured when the validation measure used in development is systematically insensitive to some aspect of the problem. Therefore, we believe refining the ground truth determination process, to be independent of community complexity, is warranted and would be a useful contribution.
Although bin3C can analyze sequences shorter than 1000 bp, it is our experience that allowing them into the analysis does not lead to improvements in MAG retrieval. We believe the weaker signal and higher variance in the raw observations for Hi-C contacts involving shorter sequences are to blame. A weakness here is relying on the final assembly contigs or scaffolds as the subject of read mapping, where the ends of sequences interrupt alignment.
As assembly contigs are treated wholly in the genome binning process, errors due to misassembled contigs that combine sequence from different strain or species genome(s) are propagated into the bin3C output. To compensate for this, we have provided an optional helper tool (split_ref.py) which can be used to uniformly split longer contigs into smaller pieces of a target length. Using this approach on our simulated community with target lengths of 10 kbp and 5 kbp, we saw a marked improvement in B 3 precision and recall. We simultaneously observed a detrimental increase in MAG contamination inferred by CheckM when these fragments became small (5 kbp) and little change in MAG completeness, suggesting that deeper investigation may be warranted. In future work, one might take better advantage of the shotgun sequence data to identify potential locations of strainor species-misassembly and coassembly to target contig splitting, or alternatively, apply Hi-C reads to the assembly graph to deconvolve the graph itself.
Against the simulated community, the performance of bin3C as indicated by the validation scores AMI and B 3 recall, suggests that further gains in retrieval completeness are possible (Fig. 2). In particular, strains of the same species can fail to be resolved into separate bins. Improving the resolving power of bin3C or the addition of a post hoc reconciliation process to separate these merged bins would be worthwhile.

Funding
This research was supported partially by the Australian Government through the Australian Research Council Discovery Projects funding scheme (project DP180101506, http://purl.org/au-research/grants/arc/DP180101506 and project LP150100912, CI: S.P. Djordjevic, http://purl.org/au-research/grants/ arc/LP150100912) Availability of data and materials bin3C has been implemented in Python 2.7 for the Linux operating system and under the GNU Affero General Public License v3. The source code as used in the manuscript has been deposited in archived format at https://doi.org/10. 5281/zenodo.2266522 [60] and the latest code is available at https://github. com/cerebis/bin3C [61]. Supporting tools used in generating the simulated data-sets for the manuscript included the sim3C metagenomic Hi-C reads simulator, the MetaART metagenomic shotgun reads simulator and genome binning ground truth generator alignmentToTruth.py. Source code for sim3C has been deposited in archived format at https://doi.org/10.5281/zenodo.1035049 [62]