metaMIC workflow
metaMIC is implemented in Python3 (Python = 3.7) and runs under Linux with modest running time and memory usage (Additional file 2: Table S5). It requires assembled contigs in FASTA format and paired-end reads in FASTA or FASTQ format as input. Alternatively, the user can provide a BAM file with read pairs mapping to contigs. Given the contigs, metaMIC will first identify the misassemblies by employing a random forest classifier trained on the features extracted from reads and contigs. Next, metaMIC will identify the regions containing misassembly breakpoints in the misassembled contigs based on the anomaly scores, and then recognize the exact positions of the breakpoints in the error regions. Then metaMIC will correct the misassemblies by splitting the contigs at the breakpoints. The details will be given below.
Features extracted from reads and contigs
BWA-MEM (v.0.7.17) [38] is used to map paired-end reads to assemblies, followed by using samtools (v1.9) [39] to filter low-quality mappings and sorting the alignments (see more details in Additional file 1). Then the features will be extracted from the sorted BAM file and can be categorized into four feature types, including read pair consistency, read coverage, nucleotide variants, and k-mer abundance consistency. The features belonging to each feature type are explained in detail below (see full description of features in Additional file 1).
For each paired-end reads with left and right mate reads, the insert size corresponding to the distance between two mates is assumed to follow a normal distribution N(μ, σ) [32]. The expected insert size (μ) was calculated as the median value of all insert sizes of read pairs, whereas the standard deviation (σ) was estimated by the median absolute deviation of insert sizes (Additional file 1). A read is regarded as a proper read if the insert size belongs to [μ − 3σ, μ + 3σ] and the orientation is consistent with its mate, and is a discordant read otherwise. Discordant reads can be further divided into three types: reads with their mates mapped to different contigs, reads with wrong insert size, and reads with orientation not consistent with their mates. A read is regarded as a clipped read if it contains at least 20 unaligned bases at either end of the read, and a read is regarded as a supplementary read if different parts of the read are aligned to different regions of contigs. Given a contig, metaMIC will calculate the proportion of above six types of reads among all reads mapping to the contig as six read features. The same will work for a given region from a contig when identifying error regions.
The coverage-based statistics including read coverage and fragment coverage are calculated at each base of the contig. The read coverage per base represents the number of reads that are mapped over that base, and the fragment coverage is the number of proper paired-end reads spanning that base. Both the read coverage and fragment coverage at each base i of contig c are further standardized as follows.
$$Standardized\_ read\_{coverage}_{ic}=\frac{read\_{coverage}_{ic}}{\left({\sum}_{j=1}^{L_c} read\_{coverage}_{jc}\right)/{L}_c},$$
$$Standardized\_ fragment\_{coverage}_{ic}=\frac{fragment\_{coverage}_{ic}}{\left({\sum}_{j=1}^{L_c} fragment\_{coverage}_{jc}\right)/{L}_c},$$
where Lc is the length of contig c. Then the standardized deviations of coverage and fragment coverage for contig c are used as input of metaMIC as shown below:
$${\sigma}_{read\_{coverage}_c}=\sqrt{\frac{\sum_{j=1}^{L_c}{\left( standardized\_ raad\_{coverage}_{jc}-1\right)}^2}{L_c}},$$
$${\sigma}_{fragment\_{coverage}_c}=\sqrt{\frac{\sum_{j=1}^{L_c}{\left( standardized\_ fragment\_{coverage}_{jc}-1\right)}^2}{L_c}}.$$
metaMIC will calculate the proportion of the bases with unusual coverage as well as the standardized deviations of coverage for each contig, and the same for a given region (see Additional file 1).
The nucleotide variant information is extracted from BAM file with the help of samtools. metaMIC counts the number of discordant, ambiguous, and correct alignments separately at each position. For example, the nucleotide at position j in the assembled contig is A and ten reads are mapped to that position, with six having A and three having T, another one having ambiguous base (N) at position j. Then for that position, the number of discordant, correct, and ambiguous alignments is three, six, and one, respectively. For each type of alignment in a contig, metaMIC will calculate the proportion of the alignment by dividing the number of this type of alignments to the total number of mapped bases across the contig, and the same for a given region.
metaMIC calculates the k-mer abundance difference (KAD) at each base based on the alignment of paired-end reads to contigs. The KAD value, proposed by He et al. [16], measures the consistency between the abundance of a k-mer from short reads and the occurrence of the k-mer in the genome. Here, for a given k-mer and a contig, the copies of the k-mer in the contig is n, and the k-mer abundance in the mapped paired-end reads is c, and the sequencing depth of the contig is m calculated based on the occurrence of single-copy k-mers in the reads. Thus, the KAD value is defined as follows.
$$KAD\ value={\mathit{\log}}_2\frac{\left(c+m\right)}{m\left(n+1\right)}$$
A k-mer with KAD value not belonging to [−0.5, +0.5] will be regarded as an error k-mer, and a base is regarded as an error base if an error k-mer covers that base. For a given contig, metaMIC will count the number of error bases across the contig and divide it by the contig length. The proportion of error bases within a given region from a contig will be calculated in the same way.
In summary, the above these four types of features will be extracted for the whole contig (contig-based features) or a window of 100bp (window-based features). The contig-based features will be used to train a random forest to identify misassembled contigs, while the window-based features will be used as input of isolate forests to recognize the error regions containing breakpoints.
Identification of misassembled contigs
With the above contig-based features, metaMIC trains a random forest [40] with chosen parameters implemented in Scikit-Learn [41] to discriminate misassembled contigs from those correctly assembled ones, where an ensemble of 1000 trees are used. For each contig, a probability score S(c) representing the likelihood that the contig c is misassembled will be output by metaMIC.
Model training and optimization
We split the data into two parts, and 2/3 of the data was used as the training set (20 metagenomes) for optimizing the parameters and 1/3 of the data used as the test set (10 metagenomes) for evaluating the learned models, where the test set will not be seen or used during the training process. The random forest model was trained on a training dataset, whereas the ground truth misassembly label of contigs provided by metaQUAST is used as a target for training the model. We performed 10-fold cross-validations on the training set to select the optimal hyperparameters for the random forest classifier. In more detail, given 20 metagenomes as the training set, we randomly pick 2 of them for validation while the rest 18 metagenomes for training. This procedure will be repeated for 100 times to make sure the robustness of the model trained. To handle the imbalance problem existing between the positive and negative samples, we will perform down-sampling to obtain the same number of correct contigs as that of the misassembled contigs, where there are fewer misassembled contigs as positive samples. Specifically, a subset of negative samples will be randomly selected from the whole negative training set during the down-sampling procedure and the size of the negative subset is the same as the positive training set. This down-sampling will be repeated for 10 times, and a classifier will be trained based on the positive training set and a selected negative subset each time. Thus, an ensemble classifier will be obtained with the average result of the 10 trained classifiers as output, and the ensemble classifier will be used as the final classifier.
We used the AUPRC, corresponding to the area under the precision-recall curve, to evaluate the performance of the trained classifier, and the parameters leading to the highest AUPRC during the cross-validation procedure will be used as the optimal hyperparameters (See Additional file 2: Table S6). In addition to AUPRC, area under the receiver operating characteristic curve (AUC) and out-of-bag (Oob) error rate were also used to evaluate metaMIC. The results of the 10-fold cross-validation experiments were provided in Table S7, where the 10-fold cross-validation results averaged over 100 times where shown. To evaluate metaMIC’s generalization ability, we also assessed its performance on the test set that were not used for training (Additional file 2: Table S7).
Localizing breakpoints in misassembled contigs
After identifying misassembled contigs, metaMIC is able to localize the misassembly breakpoints in those misassembled contigs. Firstly, metaMIC scans each contig with a sliding window of 100bp and calculates an anomaly score for each window by employing isolation forest [21] based on window-based features to localize the error regions containing misassembly breakpoints, where the region with a higher anomaly score may be an error region. Secondly, metaMIC uses the read breakpoint ratio to recognize the exact misassembly breakpoint in an error region. Specifically, for a given predicted misassembled contig, metaMIC identifies a 100-bp region with the highest anomaly score as an error region and then the position with the highest read breakpoint ratio within this window as the misassemly breakpoint. For those error regions without read breakpoints, the central position of the error region is regarded as the misassembly breakpoint. Note that the 300-bp regions at both ends of a contig are not considered here due to the poor mapping quality.
Correcting misassemblies by splitting at breakpoints
For a given misassembled contig identified above, metaMIC is able to correct misassembled contigs by splitting the contig into parts at the breakpoint recognized above. To avoid splitting too many contigs mistakenly, only those misassemblies with score of 0.8 (also the default setting for metaMIC) are corrected, leading to the precision of 70~80%. Moreover, metaMIC only corrects contigs that can be split into fragments no shorter than 1000bp in order to avoid generating too short sequences.
Identifying misassemblies in isolate genomes
When applying metaMIC on isolate genomes, metaMIC detects misassembly breakpoints on all contigs directly without identifying misassembled contigs due to the lower complexity and fewer genomes compared to metagenomics. Specifically, metaMIC first calculates anomaly score for each 100-bp window region across all contigs, and the windows with anomaly scores >0.95 are regarded as error regions, and the positions with read breakpoint count >5 and read breakpoint ratio >0.2 within these error regions are predicted as misassembly breakpoints. The thresholds of anomaly score and read breakpoint ratio were chosen as the ones that lead to the highest F1 score (See Additional file 1: Fig. S24).
Datasets
The training datasets used for training metaMIC contains 30 bacterial metagenomes which were generated in the same way as DeepMAsED [13]. In more detail, 1000 bacterial genomes with high-quality (CheckM estimated completeness >90 and contamination <5) from the Genome Taxonomy Database [42] were used for generating training datasets as DeepMAsED did (Additional file 1: Fig. S25). Then MGSIM [13] was utilized to create 30 replicate metagenomes in which 107 Illumina HiSeq2500 150bp paired-end reads were simulated per metagenome with the ART-defined default error distribution [43]. Abundance of each bacterial genome is sampled based on a lognormal distribution Lognormal(5, 2). Contigs are assembled with MEGAHIT, IDBA_UD, or metaSPAdes [20] to generate assembler-specific training datasets, and the contigs <5000bp were removed (see in Additional file 1 for parameter settings of assemblers).
To evaluate the performance of different tools, another six simulated and four real datasets are used to see the performance of metaMIC and other competing tools in identifying misassemblies. The simulated virome dataset were generated in the same way as the method used to generate the training dataset. It contains 10 replicate virome metagenomes simulated based on the 10,140 complete viral genomes obtained from NCBI RefSeq collection [18], where each replicate contains 1000 viral genomes. The other five simulated datasets are obtained from CAMI challenge [2], including the medium-complexity (132 genomes) and high-complexity (596 genomes) communities from CAMI1, CAMI2 Gastrointestinal (130 genomes), CAMI2 Oral (327 genomes), and CAMI2 Skin (233 genomes) from CAMI2. For all datasets, the synthetic short paired-end reads were assembled using MEGAHIT, IDBA_UD, or metaSPAdes. Only contigs >5000bp are considered here as the majority of misassemblies occurred in contigs are longer than 5000bp (Additional file 1: Fig. S26-27). As the ground truth, the misassemblies detected by MetaQUAST (parameters: --min-contig 1000 -k-mer-stats -extensive-mis-size 100) based on the alignment of assembled contigs against reference genomes are used to evaluate the accuracy of misassembly identification (See Additional file 2: Table S12-13 for more details of these datasets).
Four real datasets from GAGE-B project [30], including Bacillus cereus ATCC 10987 (B. cereus), Mycobacterium abscessus 6G-0125-R (M. abscessus), Rhodobacter sphaeroides 2.4.1 (R. sphaeroides), and Vibrio cholerae CP1032 (V. cholerae), are used to see the performance of metaMIC on isolate genomes. All the trimmed Illumina paired-end reads can be downloaded from http://ccb.jhu.edu/gage_b/genomeAssemblies/index.html. The contigs used in our experiments are assembled by Velvet [31], and all the Velvet assembled contigs can be downloaded from http://bioinf.spbau.ru/en/content/spades-30-gage-b-data-sets. These contigs are further scaffolded with BESST [33] and ScaffMatch [34]. The accuracy of both contigs and scaffolds are evaluated with QUAST [9] as gold standard. The reads are mapped to contigs in all datasets with the help of BWA-MEM (v.0.7.17).
Comparison to competing methods
We compare metaMIC against two popular approaches, i.e., ALE and DeepMAsED, for identifying misassembled contigs in metagenomic datasets. ALE calculates likelihood scores for each individual position in the contig based on the likelihood of observed reads generated from a given assembly, and the score consists of four sub-scores (placement, insert, depth, and k-mer). When applying ALE to identify misassembled contigs, we aggregated the four ALE sub-scores into a contig score similar to Kuhring et al. [14]. In more detail, we select a threshold for each ALE sub-score. For each contig, a position whose sub-score smaller than the threshold is counted as a potential error, then these error counts were summed up and normalized by the contig length and the total number of sub-scores as the final score. The thresholds of four ALE sub-scores were chosen that can lead to its best performance on the training dataset. Specifically, we computed the AUPRC for each metagenome in the training dataset given the input thresholds. We then computed the average AUPRC scores of all metagenomes for each combination of the input thresholds. We selected the thresholds leading to the highest AUPRC score on the training dataset, and the thresholds considered in ALE can be found in Table S8. DeepMAsED is a recently developed method for identifying misassembled contigs in metagenomic assemblies. We re-trained the DeepMAsED model on the training dataset used by metaMIC with recommended parameters, and the re-trained model was used to predict misassemblies in the benchmark datasets (See Additional file 2: Table S11 for the performance of default model provided by DeepMAsED on the benchmark datasets). Given the strong class imbalance problem for each dataset that contains limited number of misassemblies, we used AUPRC to evaluate the performance of these methods. We performed a more thorough comparison with other tools, and compared metaMIC against VALET [15], FRCbam [44], NucBreak [45], and Pilon [46] to identify misassembled contigs. We evaluated these tools with the same BAM files as those used for metaMIC, and all tools were run with their default settings. Note that all these tools described above have never been evaluated on metagenomics datasets. As shown in Table S9, metaMIC outperforms all the other tools significantly with the highest F1 score.
When recognizing misassembly breakpoints in metagenomic contigs, we only compare metaMIC with ALE since DeepMAsED cannot localize the misassembly breakpoints. For ALE, the four sub-scores per base are summed up and the position with a lower ALE score is a likely assembly error. Here, a position with the lowest ALE score across a given misassembled contigs is regarded as a misassembly breakpoint for ALE. Similar to metaMIC, we do not consider the 300-bp region at the both ends of the contigs for ALE.
Evaluation of binning results
When evaluating a set of bins reconstructed from simulated microbial datasets, we use BLASTn to map each bin against the ground truth genomes used for each dataset. A representative genome of each bin is determined based on the genome which can be covered by the highest fraction of nucleotides from that bin. Then for each bin, we define the number of nucleotides in the bin that belong to the representative genome as true positives (TP). The total number of nucleotides from the bin not covered by the representative genome corresponds to the false positives (FP), and the number of nucleotides in the representative genome not covered by any contigs from that bin represents the false negatives (FN). Then the completeness, contamination, and F1 score of each bin can be calculated as follows.
$$completeness=\frac{TP}{TP+ FN}$$
$$purity=\frac{TP}{TP+ FP}$$
$$contamination=1- purity$$
$$F1\ score=\frac{2\ast completeness\ast purity}{completeness+ purity}$$
For the real metagenomics data sets where the ground truth genomes are inaccessible, we employ CheckM [47] to estimate the completeness and contamination of each bin.
Validating misassemblies identified by metaMIC with long-read assemblies
A combined rumen fluid and solid sample sequenced with both short- and long-read DNA sequencing technologies (SRA BioProject numbers PRJNA507739) are used to validate the metaMIC predictions on short-read assemblies. Short-read sequencing approaches are based on Illumina NextSeq 500 sequencing technology, and long-read sequencing is performed on the Sequel systems via PacBio’s single molecule real-time (SMRT) sequencing technology. In total, the sample contains approximately 143 million short reads and 600,000 long reads. The Illumina short reads are assembled into contigs with MEGAHIT while the long-read assemblies are generated with PacBio reads by using metaFlye [48]. Then the short-read assemblies were evaluated through using long-read assemblies as gold standard with MetaQUAST.