Approach and implementation
The Trycycler pipeline consists of multiple steps which are run separately (overview in Fig. 1, more detail in Additional file 1: Fig. S1). At the clustering and reconciliation steps, the user may need to make decisions and intervene. This means that Trycycler is not an automated process appropriate for high-throughput assembly. Trycycler is implemented in Python and uses the NumPy, SciPy, and edlib packages [23,24,25,26].
Before Trycycler is run, the user must generate multiple input assemblies of the same genome (Additional file 1: Fig. S1A). The input assemblies should be complete: one contig per replicon. If complete assemblies are not possible (e.g., due to insufficient read length) or read depth is shallow (e.g., <25×), then Trycycler is not appropriate. We recommend that users generate 12 independent input assemblies, but this value can be adjusted down (to save computational time) or up (to improve robustness). It is desirable to maximize the independence of the input assemblies, as this reduces the chance that the same error will occur in multiple assemblies. One way to achieve such independence is to use multiple assemblers, as different assembly algorithms can lead to different assembly errors [13]. For example, in the tests reported here, we used Flye [10], Miniasm/Minipolish [13], Raven [11], and Redbean [12]. Random read subsampling can provide further independence, where each assembly is generated from a different subsample of the full read set (Trycycler v0.5.0 has a “subsample” command to facilitate this). Deeper long-read sets are therefore desirable, as they allow for more independent subsets.
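To illustrate the subsampling idea, the following is a minimal Python sketch that draws independent random subsets from a long-read set; the file name, target depth, genome size, and helper functions are illustrative assumptions, not Trycycler's actual “subsample” implementation.

```python
import random

def load_fastq(path):
    """Read a FASTQ file into a list of (header, sequence, quality) tuples."""
    reads = []
    with open(path) as f:
        while True:
            header = f.readline().rstrip()
            if not header:
                break
            seq = f.readline().rstrip()
            f.readline()  # '+' separator line
            qual = f.readline().rstrip()
            reads.append((header, seq, qual))
    return reads

def subsample(reads, target_depth, genome_size, seed):
    """Randomly select reads until roughly target_depth * genome_size bases."""
    rng = random.Random(seed)
    shuffled = reads[:]
    rng.shuffle(shuffled)
    subset, total = [], 0
    for read in shuffled:
        if total >= target_depth * genome_size:
            break
        subset.append(read)
        total += len(read[1])  # length of the read sequence
    return subset

# Draw 12 independent subsamples, one per planned input assembly.
reads = load_fastq('reads.fastq')
subsamples = [subsample(reads, target_depth=50, genome_size=5_000_000, seed=i)
              for i in range(12)]
```

Each subsample would then be assembled separately, e.g., alternating among the four assemblers named above.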
The first step in the Trycycler pipeline is contig clustering (Additional file 1: Fig. S1B). This step groups contigs from different input assemblies that represent the same replicon, so that subsequent steps can be carried out on a per-replicon basis. For example, if the genome in question had one chromosome and one plasmid, then Trycycler clustering should produce two clusters: one for the chromosomal contigs and one for the plasmid contigs. To make clusters, Trycycler conducts complete-linkage hierarchical clustering on all pairwise Mash distances between contigs [27]. To aid interpretation, a FastME tree is built from the pairwise distances [28]. After clustering is complete, the user must decide which clusters are valid (i.e., represent completely assembled replicons in the genome) and which are invalid (i.e., represent incomplete, misassembled, or spurious sequences); this is a key point of human judgment in the Trycycler process.
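As a rough illustration of this step, the sketch below runs SciPy's complete-linkage hierarchical clustering on a hypothetical matrix of pairwise Mash distances; the distance values and cut-off threshold are invented for the example, and Trycycler's real implementation differs in its details.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical pairwise Mash distances between five contigs
# (e.g., four chromosomal contigs and one plasmid contig).
distances = np.array([[0.00, 0.01, 0.01, 0.02, 0.60],
                      [0.01, 0.00, 0.01, 0.02, 0.61],
                      [0.01, 0.01, 0.00, 0.01, 0.59],
                      [0.02, 0.02, 0.01, 0.00, 0.60],
                      [0.60, 0.61, 0.59, 0.60, 0.00]])

# Complete-linkage clustering operates on a condensed distance matrix.
condensed = squareform(distances)
tree = linkage(condensed, method='complete')

# Cut the dendrogram at an illustrative distance threshold to assign clusters.
clusters = fcluster(tree, t=0.1, criterion='distance')
print(clusters)  # e.g., [1 1 1 1 2]: contigs 1-4 together, contig 5 alone
```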
The next step is to “reconcile” each cluster’s contig sequences with each other (Additional file 1: Fig. S1C). This involves converting sequences to their reverse complement as necessary to ensure that all sequences in the cluster are in the same orientation. Most bacterial replicons are circular, so Trycycler aligns the start and end of each contig to the other contigs in the cluster to determine whether bases need to be added or removed for clean circularization (this behavior can be disabled for linear replicons with the --linear option). It then rotates each sequence to begin at the same position. Some gene sequences (e.g., dnaA and repA) are often used as starting positions in complete genomes, so Trycycler contains a database of these genes and will preferentially use them as the contig starting position (see the “Methods” section). If no sequence from this database is found (with ≥95% coverage and ≥95% identity), Trycycler will use a randomly chosen unique sequence instead. Cluster reconciliation will fail if a contig cannot be circularized or if any of the pairwise alignments within the cluster have low identity. In such cases, Trycycler will suggest interventions to resolve the issue, but it is up to the user to manually exclude or modify contig sequences as necessary.
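The rotation logic can be pictured with a short sketch: given a circular contig and a preferred starting gene, the sequence is rotated (on whichever strand contains the gene) so that the gene begins at position zero. Exact string matching stands in here for Trycycler's alignment-based search (≥95% coverage and identity), and the sequences are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return ''.join(complement[b] for b in reversed(seq))

def rotate_to_start(circular_seq, start_gene):
    """Rotate a circular sequence so it begins at the start gene.

    Exact search is used only to keep this sketch short; Trycycler locates
    the gene by alignment with coverage and identity thresholds.
    """
    for seq in (circular_seq, reverse_complement(circular_seq)):
        pos = seq.find(start_gene)
        if pos != -1:
            return seq[pos:] + seq[:pos]
    raise ValueError('start gene not found on either strand')

contig = 'GGGTTTATGGATAACCC'   # hypothetical circular contig
print(rotate_to_start(contig, 'ATGGATAAC'))  # 'ATGGATAACCCGGGTTT'
```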
After reconciliation, each cluster’s sequences will have a consistent strand and starting position, making them appropriate for global multiple sequence alignment (Additional file 1: Fig. S1D). To improve computational performance, Trycycler subdivides the sequences into 1-kbp pieces, extending each piece as necessary to ensure that the boundaries between pieces do not fall within repetitive regions. It uses MUSCLE [29] to produce a multiple sequence alignment for each piece and then stitches the pieces together to produce a single multiple sequence alignment of the full cluster sequences. Trycycler then aligns the entire read set to the contig sequences so that each read can be assigned to a particular cluster (Additional file 1: Fig. S1E).
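A simplified sketch of the subdivision step is shown below: sequences are cut into ~1-kbp pieces, and each piece is extended until its end boundary lands on a k-mer that is unique within the sequence, a stand-in for Trycycler's repeat-avoidance logic. The piece size matches the description above, but the boundary test and k-mer length are assumptions for illustration.

```python
PIECE_SIZE = 1000   # target piece size in bp
BOUNDARY_K = 32     # boundary k-mer length (illustrative value)

def is_unique_boundary(seq, pos, k=BOUNDARY_K):
    """Return True if the k-mer starting at pos occurs only once in seq,
    i.e., the boundary does not fall inside a repetitive region."""
    kmer = seq[pos:pos + k]
    return seq.count(kmer) == 1

def split_into_pieces(seq):
    """Split a sequence into ~1-kbp pieces, nudging each boundary forward
    until it lands on a unique (non-repetitive) position."""
    pieces, start = [], 0
    while start < len(seq):
        end = min(start + PIECE_SIZE, len(seq))
        # Extend the piece until its end boundary is unique in the sequence.
        while end < len(seq) and not is_unique_boundary(seq, end):
            end += 1
        pieces.append(seq[start:end])
        start = end
    return pieces

# A perfectly repetitive sequence can never be split at a unique boundary,
# so the sketch extends the first piece to cover the whole sequence:
print([len(p) for p in split_into_pieces('ACGT' * 600)])  # [2400]
```

Each set of corresponding pieces would then be aligned with MUSCLE and the per-piece alignments concatenated into the full multiple sequence alignment.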
The final step in Trycycler’s pipeline is the generation of a consensus sequence for each cluster (Additional file 1: Fig. S1F). It does this by dividing the multiple sequence alignment into regions with and without variation. For each region with variation, Trycycler must choose which variant will go into the consensus. The best variant is defined as the one with the minimum total Hamming distance to the other variants, an approach which favors more common variants. In the event of a tie between two variants, Trycycler aligns the cluster’s reads to each possibility and chooses the one which produces the largest total alignment score, i.e., the variant which is in best agreement with the reads. The final Trycycler consensus sequence for the cluster is produced by taking the best variant for each region of variation in the multiple sequence alignment.
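The variant-selection rule lends itself to a direct sketch: for a region of the multiple sequence alignment where the contigs disagree, pick the variant with the smallest total Hamming distance to the others. The read-alignment tie-breaker is omitted, and the example region is hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two equal-length alignment slices
    (gap characters count as ordinary symbols)."""
    return sum(x != y for x, y in zip(a, b))

def best_variant(variants):
    """Return the variant with the minimum total Hamming distance to all
    other variants, which favours the most common variant."""
    return min(variants, key=lambda v: sum(hamming(v, w) for w in variants))

# Hypothetical region of variation from four aligned contigs:
region = ['ACGT', 'ACGT', 'ACTT', 'ACGT']
print(best_variant(region))  # 'ACGT' (total distance 1, vs 3 for 'ACTT')
```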
After Trycycler finishes, we recommend performing long-read polishing on its consensus sequences (Additional file 1: Fig. S1G). Polishing is not incorporated into Trycycler, as that step can be specific to the long-read sequencing technologies used, e.g., Medaka [30] polishing for ONT assemblies. If short reads are available, short-read polishing (e.g., with Pilon [31]) can also be performed to further improve assembly accuracy.
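As a sketch of this recommended post-processing, the snippet below shells out to Medaka and then Pilon from Python. The file names are placeholders, the short reads are assumed to have been pre-aligned to the assembly as a sorted, indexed BAM, and the flags should be checked against the installed versions of each tool.

```python
import subprocess

# Long-read polishing with Medaka (ONT-specific); verify flags against
# the installed Medaka version.
subprocess.run(['medaka_consensus',
                '-i', 'ont_reads.fastq',   # ONT reads used for polishing
                '-d', 'trycycler.fasta',   # Trycycler consensus assembly
                '-o', 'medaka_out'],
               check=True)

# Optional short-read polishing with Pilon, assuming the Illumina reads
# were already aligned (e.g., with BWA-MEM) into a sorted, indexed BAM.
subprocess.run(['pilon',
                '--genome', 'medaka_out/consensus.fasta',
                '--frags', 'illumina_aln.bam',
                '--output', 'polished'],
               check=True)
```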
The code and documentation for Trycycler v0.3.3 (the version used to generate the assemblies in this manuscript) are available at the DOI 10.5281/zenodo.3966493. The current version of Trycycler (v0.5.0) is available on GitHub (github.com/rrwick/Trycycler).
Performance on simulated reads
In silico read simulation allows for a straightforward test of assembly accuracy against a ground truth: reads are generated from a reference genome, the reads are assembled, and the resulting assembly is compared back to the original reference sequence. For this analysis, we simulated short and long reads from 10 reference genomes belonging to the 10 most common bacterial species in RefSeq (Additional file 2: References). We assembled each genome with long-read-only approaches (Miniasm/Minipolish [13], Raven [11], Flye [10], and Trycycler), long-read-first hybrid approaches (Pilon [31] polishing of each long-read-only assembly), and a short-read-first hybrid approach (Unicycler [21]). We quantified the accuracy of each assembly’s chromosomal contig using two main metrics: mean identity and worst-100-bp identity (the lowest identity observed across all 100-bp sliding windows).
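To make these metrics concrete, the sketch below computes both from a per-base match array (1 for a correct alignment position, 0 for an error), which is a simplification of the alignment parsing. Mean identities are reported as Phred-style qscores, i.e., Q = -10 * log10(1 - identity), so, for example, Q40 corresponds to 99.99% identity.

```python
import math

def mean_identity_q(matches):
    """Phred-style qscore of the mean identity: Q = -10*log10(error rate)."""
    error_rate = 1.0 - sum(matches) / len(matches)
    return float('inf') if error_rate == 0 else -10.0 * math.log10(error_rate)

def worst_100bp_identity(matches, window=100):
    """Lowest identity over any 100-bp sliding window of the alignment."""
    window_sum = sum(matches[:window])
    worst = window_sum / window
    for i in range(window, len(matches)):
        window_sum += matches[i] - matches[i - window]
        worst = min(worst, window_sum / window)
    return worst

# matches[i] is 1 if alignment position i is correct, 0 for any error.
matches = [1] * 10000
matches[5000:5005] = [0] * 5   # a hypothetical 5-bp error cluster
print(round(mean_identity_q(matches), 1))  # 33.0 (5 errors in 10 kbp)
print(worst_100bp_identity(matches))       # 0.95 (95% in the worst window)
```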
Comparing only the long-read assemblers to each other (Flye, Miniasm/Minipolish, and Raven), it was clear that Flye performed best (Additional file 1: Fig. S2). This was true both before Pilon polishing with short reads (mean identity Q41 vs Q38; mean worst-100-bp identity 95.8% vs 50.8–90.9%) and after Pilon polishing (mean identity Q57 vs Q42–Q55; mean worst-100-bp identity 96.1% vs 50.8–95.7%). Our main results therefore exclude Miniasm/Minipolish and Raven, leaving only the best-performing long-read assembler: Flye.
Figure 2 shows the mean assembly identities and worst-100-bp assembly identities from each approach, using 10 simulated read sets. In both metrics, Trycycler reliably produced higher-quality assemblies than Flye (mean identity Q51 vs Q41; mean worst-100-bp identity 99.5% vs 95.8%). This result also held true for long-read-first hybrid assemblies, where Trycycler+Pilon outperformed Flye+Pilon (mean identity Q74 vs Q57; mean worst-100-bp identity 99.9% vs 96.1%). Unicycler’s short-read-first hybrid assemblies performed notably worse than the long-read-first hybrid approaches (mean identity Q25; mean worst-100-bp identity 76.5%).
Performance on real reads
Since simulated reads cannot perfectly emulate real sequencing [32], we also tested assembly methods with real read sets. We chose seven bacterial isolates for this study (Additional file 3: Genomes), each belonging to a different clinically relevant bacterial species. The challenge with real reads is the absence of a clear ground truth against which to compare assemblies. To circumvent this issue, we instead produced two independent sets of long+short (ONT+Illumina) reads for each test organism. In brief, a single DNA extraction from each organism was used to prepare two ONT libraries (one ligation and one rapid) and a single Illumina library (the reads of which were divided into two non-overlapping sets); full details are described in the “Methods” section. For each assembly method, we compared the assembly from read set A to the assembly from read set B, with differences between them indicating assembly errors. While this approach could suffer from false negatives if both assemblies contained the same error, it cannot suffer from false positives: wherever two assemblies of the same genome differ, at least one of the two must be in error.
We tested the same assemblers as were used in the simulated-read tests but added an additional long-read polishing step with Medaka, an ONT-specific polishing tool. We therefore produced unpolished long-read-only assemblies (with Miniasm/Minipolish, Raven, Flye, and Trycycler), polished long-read-only assemblies (the same assemblers plus Medaka), long-read-first hybrid assemblies (the same assemblers plus Medaka and short-read polishing with Pilon), and short-read-first hybrid assemblies (with Unicycler). Each assembly approach was used on both read set A and read set B for each of the test organisms.
Assembly accuracy was quantified using the same metrics as the simulated-read tests: mean identity and worst-100-bp identity. Instead of being based on an assembly-to-reference alignment (as was done for the simulated-read tests), these metrics used an alignment of the read-set-A-assembled chromosome to the read-set-B-assembled chromosome. For the Serratia marcescens genome, read set B failed to produce a complete chromosome with most assembly methods (due to long genomic repeats and a low read N50, see Additional file 3: Reads), so this genome was excluded, leaving six genomes in the analysis. As was the case for the simulated-read tests, Flye assemblies were higher quality than Miniasm/Minipolish and Raven assemblies at all polishing stages (Additional file 1: Fig. S3): unpolished (mean identity Q34 vs Q28–Q32; mean worst-100-bp identity 81.8% vs 20.2–21.8%), Medaka-polished (mean identity Q40 vs Q30–Q35; mean worst-100-bp identity 94.7% vs 28.2–38.0%), and Medaka+Pilon-polished (mean identity Q56 vs Q31–Q37; mean worst-100-bp identity 94.7% vs 28.2–40.0%). Flye was also the only long-read assembler to produce complete chromosomes for both read sets of all six genomes, so Miniasm/Minipolish and Raven were excluded from our main results.
Since the mean identity and worst-100-bp identity metrics could fail to identify all assembly errors in the real-read tests, we also used two other approaches for assessing the quality of de novo assemblies. The first was ALE [33], which uses short-read alignments to the assembled sequence to produce a likelihood score for that assembly (higher scores being better); ALE generates this score from the mapping accuracy, evenness of read depth, and evenness of insert size observed in the short-read alignments, and we normalized the scores for each genome to produce z-scores. The second de novo assessment approach was IDEEL [34, 35], which compares the lengths of predicted proteins in the assembly to a database of known proteins. Indel errors in the assembly cause frameshifts in coding sequences, leading to truncations, so an error-prone assembly will tend to have predicted proteins which are shorter than their best-matching known proteins. We quantified the fraction of predicted proteins in each assembly which were ≥95% the length of their best-matching known protein (higher fractions being better).
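The IDEEL metric reduces to a simple ratio, sketched below from hypothetical (predicted length, best-hit length) pairs; in practice these lengths would come from gene prediction on the assembly and a protein-database search.

```python
def ideel_full_length_fraction(pairs, threshold=0.95):
    """Fraction of predicted proteins that are >=95% the length of their
    best-matching known protein (higher is better; indels truncate ORFs)."""
    full = sum(1 for predicted, best_hit in pairs
               if predicted >= threshold * best_hit)
    return full / len(pairs)

# Hypothetical (predicted length, best-hit length) pairs in amino acids.
pairs = [(300, 302), (120, 250), (451, 450), (88, 90)]
print(ideel_full_length_fraction(pairs))  # 0.75: one truncated protein
```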
Figure 3 shows the real-read results: mean identity, worst-100-bp identity, ALE z-scores, and IDEEL full-length proteins. In the mean identity metric, Trycycler performed better than Flye at all levels of polishing (Q37 vs Q34 before polishing; Q42 vs Q40 after Medaka polishing; Q62 vs Q56 after Medaka+Pilon polishing). This advantage was also apparent in the worst-100-bp identity metric (96.7% vs 81.8% before polishing; 97.0% vs 94.7% after Medaka polishing; 98.3% vs 94.7% after Medaka+Pilon polishing). Both long-read-first hybrid approaches (Flye+Medaka+Pilon and Trycycler+Medaka+Pilon) outperformed Unicycler’s short-read-first hybrid assemblies (mean identity Q34 and worst-100-bp identity 23.5%). The ALE results were consistent with the identity metrics: Trycycler assemblies had higher mean ALE z-scores than Flye assemblies at all polishing levels (–1.031 vs –1.873 before polishing; 0.419 vs 0.235 after Medaka polishing; 0.828 vs 0.806 after Medaka+Pilon polishing), and long-read-first hybrid assemblies were superior to Unicycler assemblies (mean ALE z-score of 0.617). IDEEL results showed the same trend, with Trycycler assemblies having more full-length proteins than Flye assemblies (78.3% vs 72.3% before polishing; 93.8% vs 91.8% after Medaka polishing), but all hybrid assemblies performed equivalently in this metric (97.6%).
While the above results used Medaka polishing after Trycycler (i.e., Trycycler+Medaka), it is also possible to run Medaka on Trycycler’s input assemblies (i.e., Medaka+Trycycler) or on both the input assemblies and the final assembly (i.e., Medaka+Trycycler+Medaka). We tried these alternative approaches on the real-read data and found that while all performed similarly (mean identity Q41–Q42), the best results were achieved when Medaka was the final step in the process (Additional file 3: Medaka order). We therefore recommend the Trycycler+Medaka approach for both its simplicity and its accuracy.
Type and location of errors
Additional file 1: Fig. S4 shows the positions of errors in the assemblies of each of the 16 genomes (10 simulated and six real), with repetitive regions of the genomes indicated. Errors in long-read-only assemblies (Flye, Flye+Medaka, Trycycler, and Trycycler+Medaka) were distributed across the genomes, occurring in both repeat and non-repeat sequences. Long-read-first hybrid assemblies (Flye+Medaka+Pilon and Trycycler+Medaka+Pilon) usually had higher error rates in repeat sequences, and in many cases, there were no errors in the non-repeat sequences of the genome. Short-read-first hybrid assemblies (Unicycler) often had clusters of errors which occurred in both repeat and non-repeat sequences. Indel errors were more common than substitution errors for all assemblers: 44% of total errors were insertions, 47% were deletions, and 9% were substitutions. For the real reads, Flye assemblies sometimes had local spikes in error rates (indicating a more serious error or a cluster of errors) before Medaka polishing, but these spikes were not present after Medaka polishing. Trycycler assemblies did not suffer from this same problem. Flye assemblies often had errors at the position corresponding to the original start/end of the contig.
The Flye errors at the start/end of the contig were caused by imperfect circularization: missing or duplicated bases at the start/end of a circular contig, a phenomenon we described in greater detail in a previous benchmarking study of long-read assemblers [13]. These errors were not corrected by Medaka or Pilon because those tools are not aware of contig circularity, i.e., that the contig’s last base should immediately precede its first base. Since our analysis involved normalizing all assemblies to a consistent starting position (required for global alignment), missing/duplicated bases at the start/end of a contig registered as a middle-of-the-sequence indel error in our tests. These indel errors reduced the mean identity and, if large enough, the worst-100-bp identity as well.
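The effect can be reproduced with a toy example: deleting a base at the junction of a circular contig and then rotating both reference and assembly to a shared starting position turns the junction error into a mid-sequence deletion. The sequences below are hypothetical.

```python
def rotate(seq, start):
    """Rotate a circular sequence so it begins at index start."""
    return seq[start:] + seq[:start]

reference = 'AACCGGTTACGTACGTAGCT'  # hypothetical circular replicon
assembly = reference[:-1]           # 1-bp deletion at the contig junction

# Normalising both to a common starting position (here, halfway around the
# circle) moves the junction, so the missing base now sits mid-sequence.
print(rotate(reference, 10))  # 'GTACGTAGCTAACCGGTTAC'
print(rotate(assembly, 10))   # 'GTACGTAGCAACCGGTTAC' <- internal deletion
```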
To assess the effect of circularization errors on Flye accuracy, we manually fixed the circularization of all Flye assemblies using the original reference sequence (in the simulated-read tests) or the Trycycler+Medaka+Pilon assembly (in the real-read tests). Of the 22 Flye assemblies (10 from simulated reads, 12 from real reads), four had perfect circularization, five had duplicated bases, and 13 had missing bases. The worst Flye circularization error was a 13-bp deletion, and the mean magnitude of Flye circularization errors was 3.7 bp (Additional files 2 and 3: Flye circularization). We then reran our analyses using the fixed-circularization version of Flye assemblies, and the results are shown in Additional file 1: Fig. S5 for simulated reads and Additional file 1: Fig. S6 for real reads. Flye performed better in these results, especially in the worst-100-bp identity metric, indicating that in many cases, the circularization error was the largest single error in the Flye assembly. However, Trycycler still produced more accurate assemblies than Flye at each polishing stage (unpolished, Medaka-polished, and Pilon-polished).
Consistency of Trycycler results
Trycycler is not a fully automated pipeline—it requires human judgment and intervention. This raised the question of how well it performs in the hands of different users. To answer this, we recruited five researchers who were experienced in bioinformatics but not involved in Trycycler development. They were given an ONT read set for each of the six genomes used in the real-read tests and tasked with producing a Trycycler assembly without any assistance from the Trycycler developer (using only the Trycycler documentation to guide them). We then compared the resulting assemblies, looking at both presence/absence of contigs as well as chromosomal sequence identity (Fig. 4).
The main source of variation between different users’ Trycycler assemblies was the inclusion/exclusion of plasmid contigs (Fig. 4A). Small plasmids often pose problems for long-read assemblers, which sometimes led Trycycler users to exclude them. Contaminant plasmid contigs (e.g., from cross-barcode contamination) were sometimes included in Trycycler assemblies. Replicons with a large-scale error or misassembly occurred in many of the single-assembler assemblies (from Miniasm/Minipolish, Raven, and Flye). These errors included fragmented replicons (e.g., one replicon sequence split between two contigs), doubled replicons (e.g., a 6-kbp plasmid assembled into a 12-kbp contig), large-scale circularization problems (e.g., 80 kbp of start/end overlap), and redundant contigs (e.g., five contigs produced for a single replicon). This type of error was very rare in the Trycycler assemblies (present in only one case). Detailed descriptions of all such errors are in Additional file 4: Matrix.
To assess the consistency of assembled sequences, we built a neighbor-joining tree (based on pairwise alignment distances) of the assembled chromosomes for each of the six genomes (Fig. 4B). The developer’s Trycycler+Medaka+Pilon assembly was included as a reference sequence, as the real-read test results (Fig. 3) indicate it to be the most accurate representation of each genome. For each test isolate, the Trycycler assemblies generated by different users were closer to the reference sequence than any of the (automated) single-assembler assemblies (Fig. 4C), and there were comparatively few differences between Trycycler assemblies from different users (Fig. 4D). All differences between Trycycler assemblies generated by different users were small-scale: most were single-bp differences, and the largest was a 4-bp indel in a tandem repeat (Additional file 4: Trycycler vs Trycycler). The most common difference was a 1-bp discrepancy in the length of a homopolymer sequence (accounting for 78.5% of all Trycycler-vs-Trycycler sequence differences).
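For readers who want to reproduce this kind of analysis, a minimal sketch using Biopython's neighbor-joining implementation is shown below; the assembly names and pairwise distances are hypothetical placeholders standing in for the pairwise alignment distances described above.

```python
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

# Hypothetical pairwise alignment distances between three users' Trycycler
# chromosomes and the developer's Trycycler+Medaka+Pilon reference.
names = ['user_1', 'user_2', 'user_3', 'reference']
matrix = [[0.0],
          [0.00001, 0.0],
          [0.00002, 0.00001, 0.0],
          [0.00003, 0.00002, 0.00001, 0.0]]  # lower-triangular, zero diagonal

dm = DistanceMatrix(names, matrix)
tree = DistanceTreeConstructor().nj(dm)  # neighbor-joining tree
Phylo.draw_ascii(tree)
```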