Ariadne: synthetic long read deconvolution using assembly graphs

Synthetic long read sequencing techniques such as UST’s TELL-Seq and Loop Genomics’ LoopSeq combine 3\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$'$$\end{document}′ barcoding with standard short-read sequencing to expand the range of linkage resolution from hundreds to tens of thousands of base-pairs. However, the lack of a 1:1 correspondence between a long fragment and a 3\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$'$$\end{document}′ unique molecular identifier confounds the assignment of linkage between short reads. We introduce Ariadne, a novel assembly graph-based synthetic long read deconvolution algorithm, that can be used to extract single-species read-clouds from synthetic long read datasets to improve the taxonomic classification and de novo assembly of complex populations, such as metagenomes.


Background
Next generation sequencing technologies underpin the large-scale genetic analyses of complex mixtures of isolates, such as microbiomes.However, the simultaneous reconstruction of multiple distinct and discrete genomes, especially in communities where the number of species is unknown, is much more computationally demanding than assembling a single isolate genome.
Though standard Illumina short-read sequencing is the most popular platform for metagenome-wide characterizations due to its sequencing depth to cost ratio, the length of Illumina short-reads limits the linkage information that can be extracted from sequencing libraries.To address this, researchers are increasingly using nucleotide barcode-based, chromatin conformation-based, or long-read sequencing technologies to approximate single-cell resolution from complex mixtures of microorganisms.Longread sequencing technologies, such the range of options from Oxford Nanopore Technologies, are capable of resolving a limited number of high-quality circularized draft genomes from metagenomic sequencing data [1].However, high-quality assemblies of long reads are reliant on large amounts of input DNA to adequately span all of the species in a metagenomics sample, which may not be available depending on the size of the individual genomes and the alpha diversity [2].The assembly graph contains many ambiguous junctions comprised of sequence material from multiple genetically similar species and/or strains, and at low sequencing coverage, real sequences cannot be easily distinguished from misassemblies [1].

Overview of synthetic long read (SLR) technology
Alternative approaches that rely on short-read sequencing while still generating longrange linkage information include Hi-C and synthetic long-read (SLR) technologies [3].While similar ambiguous junctions are inherent in short-read assemblies due to limitations in sequencing insert size, de novo assemblies generated by short-read and synthetic long read (SLR) are inherently larger and more comprehensive representations of the sample composition due to the relative volume of genomic coverage.Since SLR library preparations require less input material than long reads, it is much more suitable for extracting sequencing information from low-volume samples [1,[4][5][6].
SLR technologies, starting with Illumina TruSeq Synthetic Long reads, [7], associate short reads originating from the same extracted genomic fragment with some type of nucleotide-based unique molecular identifier (UMI).UMIs are colloquially referred to as "barcodes" in the context of SLR sequencing.Briefly, input (meta)genomic DNA is sheared into long fragments of 5-100 kbp.After shearing, a UMI (usually 16-20 basepairs long) is ligated to short-reads from the fragments such that short-reads from the same fragment share the same UMI.Linked-read UMIs are unrelated to the 5′ UMI used for sample multiplexing.We refer to the set of reads that share a UMI as a read cloud.Finally, the short-reads are sequenced using standard sequencing technologies (e.g., Illumina HiSeq).SLRs offer additional long-range information over standard short-reads.Reads with matching UMIs are more likely to have emerged from the same long fragment of DNA than two randomly sampled reads, which extends the relative positional information encoded in the read's short 50-250 bp sequence past the standard limitations of a short-read insert, which are typically several hundreds of base-pairs.There is a slight coverage tradeoff due to the size of the UMIs and associated library preparation costs relative to standard short-read sequencing.
SLR techniques are differentiated by the biochemical mechanism that associates UMIs with genomic fragments or short reads.10x linked-read sequencing uses oil-based droplets to encapsulate a few genomic fragments and a UMI, which are subsequently splintered into short reads which are amplified with the UMI [8].While 10x Genomics' linked reads has been discontinued after nearly half a decade of widespread usage, a variety of other UMI-based SLR methods such as TELL-Seq, LoopSeq, and BGI's long fragment reads (stLFR) have been commercialized recently with the promise of (near-)single-molecule resolution along with simplified library preparation procedures and compatibility with standard Illumina sequencing machines.TELL-Seq uses proprietary TELL beads, which both capture genomic fragments and use transposon-based reactions to insert the UMI sequence throughout the fragment [9].TELL fragments are approximately 20-40 kbp long [9], which are similar in length to the (on average) 10 kbp fragments generated by 10x [10].LoopSeq similarly uses an intramolecular enzyme-based distribution method but does not use beads to partition genomic fragments at all [11].One major advantage of both TELL-Seq and LoopSeq over 10x and BGI's long fragment reads is their single-tube reaction chemistry and compatibility with standard Illumina sequencing machines.10x library preparation depended on costly instruments for droplet management [8], and BGI's long fragment reads are incompatible with Illumina sequencing.For detailed explanations of each method's capabilities and library preparation procedures, we direct the reader to [8,9,11,12].In this study, we also provide a first head-tohead comparison of multiple SLR technologies to profile their strengths and weaknesses on metagenomics datasets of similar complexity.
The central drawback of SLR technologies is that the UMI-based linkage information must be efficiently interpreted from the sequencing data to simulate long read resolution and increase the average contiguity (i.e., NA50) of de novo assembly.The need for novel algorithms to leverage this information has been partially met by the proliferation of SLR-based assemblers, such as Supernova, Athena, and cloudSPAdes [13][14][15].However, additional novel algorithmic efforts are needed to achieve the desired contiguity and reduce the number of observed assembly errors [15].

Applications of SLR sequencing in metagenomics
Many taxonomic lineages identified in large-scale studies are not represented in reference sequence databases and are not associated with isolated cultures [16][17][18].Metagenomic analyses that rely on existing reference genomes, such as read-based taxonomic classification, will inherently bias the genomic reconstruction of a mixed population, generally towards the most common and already well-characterized species within the sample [19].Thus, reference-based analysis is not ideal as a first step for samples with large variations in species compositions or samples from substrates that are poorly represented in reference databases, such as under-studied or extreme environments [18,20].
SLR sequencing has been shown to resolve species compositions of metagenomics samples in both 16S-based and shotgun whole-genome-based analyses.Because of its low UMI multiplicity, LoopSeq was able to identify multiple copies of the 16S rRNA gene as belonging to a single strain [21,22].When paired with additional sources of information, such as targeted amplification or longitudinal sequencing, SLRs are capable of resolving metagenomics to the strain level.In conjunction with high-throughput qPCR, LoopSeq has been used to associate antimicrobial resistance genes with specific species in environmental samples [23].The long-read-like linkage information encoded in read clouds has been used to track single nucleotide variants in the human gut microbiome longitudinally, demonstrating that prioritizing depth of coverage over strict read length can be an optimal analysis strategy especially when the number of strains/haplotypes is an unknown variable [10].In another study, SLRs have also been used to identify the presence of structural variation on bacterial chromosomes [24].

Challenges posed by SLR sequencing
Despite the additional linkage information offered by SLR sequencing, there are new computational challenges involved in applying barcoded reads to de novo assembly.Because metagenomic samples are intrinsically multiplexed samples of multiple species, long-range linkage information is confounded by the multiplicity of fragments assigned to 3 ′ UMIs.Existing systems employ on the order of 10 6-7 3 ′ UMIs [25].In previous studies with the 10x Genomics system, it was observed that there were 2-20 long fragments of DNA per 3 ′ UMI [26].The larger the barcoded read cloud, the more likely that reads tended to originate from multiple fragments.Our analyses suggest that at least 97% of read clouds are composed of ≥ 2 fragments, with the exception of a LoopSeq dataset.In the absence of additional information about the sample, it is difficult to distinguish the genomic origin of a random assortment of reads with the same UMI.Furthermore, each fragment of DNA is only fractionally covered by reads (typically 10-20%).Because overall coverage is reduced, SLRs provide long-range information at the expense of shortrange blocks of contiguous sequence.

The barcode deconvolution problem
The barcode deconvolution problem, previously described in [26,27], is defined as the assignment of each read with a given 3 ′ UMI to a subgroup such that every read in the subgroup originates from the same fragment or contiguous genomic segment.A solution to the barcode deconvolution problem for a set of read clouds would be a map from each read cloud to a function which solves the UMI deconvolution problem for that read cloud.Reads with the same 3 ′ UMI that are highly likely to have originated from the same metagenomic fragment are more likely to co-occur in one another's sequence space within the assembly graph than reads from different fragments.
Though the multiplicity of genomic fragments to 3 ′ UMIs has been problematic since the inception of SLR sequencing, the UMI deconvolution problem has only been addressed recently by two computational tools: EMA [27] and Minerva [26].The EMA approach augments read alignment for barcoded reads based on alignment to a reference sequence.In the process of generating probabilistic alignments, reads from a single read cloud are sub-grouped into sets of reads that map close to each other.EMA is particularly applicable for highly repetitive regions where a barcoded read can align to multiple locations within the genome [27].However, the EMA approach relies on the user to supply reference sequence(s) as a priori information about the input sample.We also demonstrate that EMA (as of the date of publication) is unable to recognize > 99.9% of SLR UMIs, even with UMI whitelists tailored for each dataset.Additionally, [28] has proposed an extension of the EMA approach that makes a graph of UMIs using the degree of read connectivity between UMIs as edges to deconvolve read clouds.While the species composition of popular sampling sites such as the human gut are well-characterized, such reference-based methods are not designed for microbial samples from under-studied environments such as urban landscapes [20].Minerva does not require a reference genome, using instead k-mer similarities between read clouds to approximately solve the UMI deconvolution problem for metagenomic samples [26].However, Minerva is memory-inefficient and requires extensive parameter optimization to deconvolve all of the read clouds in a dataset.
In this paper, we present Ariadne as an advanced approach to tackle UMI deconvolution.Instead of using read alignments to reference sequences, which are unknown for poorly characterized environmental microbial samples [20], or relying on computationally expensive string-based graphs, Ariadne leverages the linkage information encoded in the full de Bruijn-based assembly graph generated by a de novo assembly tool such as cloudSPAdes [15] to generate up to 37.5-fold more read clouds containing only reads from a single fragment, improve the summed NA50 by up to 500 kbp, and maintain a proportional rate of misassembly relative to de novo assembly without prior UMI deconvolution.Searching through the pre-made assembly graph for reads in the neighboring sequence space makes the search for co-occurring reads computationally tractable, generalizable, and scalable to large datasets.

Benchmarking datasets
We used real data sets from four microbial mock communities which we refer as MOCK5 10x, MOCK5 LoopSeq, MOCK20 10x, and MOCK20 TELL-Seq throughout this manuscript.See Table 1 for an overview of the mock microbiome datasets and Table 2 for the relative species abundances.Two of the datasets, MOCK5 LoopSeq and MOCK20 TELL-Seq, are analyzed for the first time in this work.Note that while both comprised of 5 species, the MOCK5 10x and MOCK5 LoopSeq datasets share only one species-Escherichia coli-and thus cannot be considered as a comparison of 10x and LoopSeq technology.The original MOCK20 10x dataset is approximately 330 million reads, but for efficiency, a random subset of 100 million reads was used in the subsequent analyses.Similarly, the original MOCK20 TELL-Seq dataset is approximately 210 million reads large but was also subsetted to 100 million reads.Reference genome sequences can be downloaded here.

Gold-standard ("reference") cloud deconvolution
The use of mock communities allowed us to infer the genomic fragment of origin of the barcoded reads.These gold-standard fragment assignments served as a benchmark to compare read clouds and assemblies with and without deconvolution.We mapped the reads to the reference sequences of the species that were known to form the mock communities using Bowtie2 [30] (version 2.3.4.1).Using the read mapping positions along the reference genome, we further subsetted the reads into gold-standard read clouds such that the left-and right-most starting positions of reads in the same cloud were no further than 200 kbp apart, in case multiple fragments from the same genome were present in the same read cloud.We termed this method 'reference deconvolution' as it represents the database/reference-sequence-based inference of the genomic fragments that originated reads tagged with the same 3 ′ UMI.reads that have likely originated from different SLR fragments (Table 3).Larger clouds are more likely to contain reads of multiple species origins and thus lower purity.The exception to this trend is the MOCK5 LoopSeq dataset (Fig. 1 top row).Since the goal of linked reads is to approximate the linkage range of long fragments, having mixedorigin clouds as a result of fragment-to-UMI multiplicity confounds downstream applications such as taxonomic classification.Ariadne deconvolution reduces the proportion of multi-origin read clouds by at least 2-fold in the MOCK5 LoopSeq dataset, up to 7.5fold in the MOCK20 10x dataset (Table 3 column 7).Since Ariadne deconvolution generally decreases this trend (except in the case of LoopSeq), it is unlikely that our results have been inflated by a large number of small and trivially pure clouds.Furthermore, reference-based deconvolution generates a similar number of clouds of a similar size to Ariadne, indicating that search-distance based subgrouping models genomic fragment boundaries sufficiently.The exception to this trend is the MOCK20 10x dataset, where Ariadne generated twice the number of read clouds that are slightly less than half the size of clouds in the reference-deconvolved dataset (Table 3).In general, Ariadne produces enhanced read clouds that are smaller than reference-deconvolved read clouds, indicating that, on average, reads that should be included in the cloud are "missing" and that the search-distance-based method does not exhaustively cluster all reads from the same fragment.Ariadne at least doubles the proportion of completely deconvolved read clouds, which represent the entirety of the reads from a single inferred genomic fragment, except in the case of the MOCK5 LoopSeq dataset (Table 3 column 8).In comparison, under-deconvolved clouds are clouds comprised of reads that originate from multiple fragments, whereas over-deconvolved clouds are comprised of single-origin reads that are a subset of all of the reads from that fragment.The proportion of single-origin read clouds, the sum of complete and over-deconvolved read clouds, has increased between 2-and 37.5-fold.
While the 10x and MOCK20 TELL-Seq size-to-purity distributions are similar, the MOCK5 LoopSeq distribution peaks at P = 0.96 , which explains why Ariadne deconvo- lution has minimal effect on read cloud quality.For the rest of the datasets, with respect to size, the relative purity of Ariadne-enhanced read clouds is significantly larger than that of the original read clouds (Fig. 1 bottom row).The ideal deconvolution based on reference-mapped positions is shown in the Fig. 1 middle row, where the vast majority of clouds have P = 1 .This is also represented by the proportion of complete clouds or deconvolved clouds that contain the maximal set of reads from the same original cloud that map to a single reference genome.For example, in the MOCK5 10x dataset, 91% of reference-deconvolved read clouds are complete (Table 3 row 2).Even with deconvolution, there are still large read clouds ( ≥ 100 bp) in the deconvolved set, indicating that Ariadne is capable of maintaining the integrity of existing single-origin read clouds through a limited search of the assembly graph (Fig. 1).Overall, applying Ariadne deconvolution to SLR datasets generates enhanced read clouds that closely resemble the size-to-purity distribution of the reference deconvolution, thereby approximating the ideal deconvolution without a priori information about the microbial composition of the originating data.
We have additionally demonstrated the effect of increased search distance on overall dataset quality metrics (Additional file 1: Table S1).The number of read clouds and the average entropy increase as the search distance increases, with minimal increases in the read cloud purity along with the proportion of single-origin (complete and overdeconvolved read clouds).Given with the large increases in computational runtime, only results with the deconvolution search distance of 5 kbp are included going forward in the main text.Similar results-smaller and on average fewer-origin read clouds-can be observed with the Shannon entropy measure (Additional file 1: Fig. S1).

Enhanced read clouds improve metagenomic assembly
Since Ariadne was able to deconvolve the original SLR dataset into high-quality and non-trivial enhanced read clouds, we applied Ariadne to the full 97M-read MOCK5 10x dataset, the full 75M-read MOCK5 LoopSeq dataset, and 100M randomly sampled reads from each of the MOCK20 10x and MOCK20 TELL-Seq datasets to generate enhanced read clouds as input to cloudSPAdes [15] in metagenomics mode.The assembly quality of the resulting scaffolds with reference and Ariadne deconvolution was compared to that without prior deconvolution.cloudSPAdes generates de novo assemblies a wide variety of sequencing data into contigs and scaffolds, and has been benchmarked on the MOCK5 10x and MOCK20 10x datasets previously [15].As such, we have used similar metaQUAST metrics to evaluate and compare the quality of the assemblies [31].
The largest improvements are in the overall assembly contiguity and the largest alignment, demonstrating that enhanced read clouds with increased fragment specificity generate higher-quality assemblies (Fig. 2).NA50 is another measure of assembly contiguity, reporting the length of the aligned block such that using longer or equal-length contigs produces half of the bases of the assembly.de novo assemblies generated from reference-or Ariadne-deconvolved read clouds are significantly more contiguous than those generated from the original, non-deconvolved read clouds (Fig. 2).To calculate the overall performance improvements when assembling each dataset, the assembly statistic for each species was summed and the relative difference between reference-or Ariadneenhanced scaffolds and no-deconvolution scaffolds calculated.For NA50 and largest alignment, this means the no-deconvolution summed NA50 were subtracted from that of the reference-or Ariadne-enhanced scaffolds.For the relative misassembly rate, the number of misassembled bases was divided by the total assembly length.
While the differences between the deconvolution methods and no deconvolution were minimal in the MOCK5 10x dataset (Fig. 2 first column), there were positive differences in NA50 and largest alignment in all other datasets.While Ariadne generally produces shorter NA50 and largest alignments than the ideal genomic deconvolution, Ariadne assemblies significantly outperform no-deconvolution assemblies in terms of contiguity statistics (Fig. 2 top and middle).However, in the case of the two MOCK5 datasets, the Ariadne assembly NA50 gains were larger than the reference Fig. 2 Read cloud deconvolution improves metagenomic assembly compared to raw SLR data.We compared assemblies built from raw linked reads (no deconvolution) to assemblies built from reads deconvolved using two methods: reference deconvolution, which maps reads to reference genomes, and deconvolution using Ariadne.Top row: The NA50 of assemblies for each species in each sample between deconvolved reads and raw reads.Larger numbers indicate better performance.Middle row: The largest alignment of assemblies for each species in each sample between deconvolved reads and raw reads.Larger numbers indicate better performance.Bottom row: The proportion of misassembled bases p miss is the number of bases in misassembled contigs over the total number of assembled bases deconvolution assembly gains, suggesting that the most ideal read cloud partition does not always generate the most contiguous assemblies (Fig. 2

top first and second panels).
In terms of misassembly rate, Ariadne assemblies largely match no-deconvolution assemblies and significantly outperform reference deconvolution.With the MOCK20 10x dataset, the third quartile of misassembly rate by reference deconvolution is 8-fold larger than that of no deconvolution, whereas Ariadne assemblies contained up to 2-fold more at maximum (Fig. 2 bottom third panel).Ariadne underperforms assembly without deconvolution in terms of largest alignment and misassembly rate in MOCK20 TELL-Seq (Fig. 2 bottom last panel).However, both Ariadne-and reference-enhanced assembly obtain NA50 for 4 species that the no-deconvolution assembly failed to capture, as well as some substantial differences between Ariadne and reference-based deconvolution with respect to no deconvolution (Table 4).In other datasets, there may be species that are more easily reconstructed with read cloud deconvolution that cloudSPAdes would not otherwise find long, contiguous reference subpaths for through the assembly graph.Outlier values of NA50, largest alignment, and misassembly rates were omitted from Fig. 2 for visual clarity were all from the reference deconvolution scaffolds and can be found in Additional file 1: Table S2.There were no major changes in the fraction of reference bases that were reconstructed, the total aligned length, the mean number of mismatches, and the number of contigs.

Comparison of SLR sequencing technologies
While both comprised of 5 species, the only species common to the MOCK5 10x and MOCK5 LoopSeq datasets is Escherichia coli.As such, while they are comparable in terms of input community complexity, they cannot be treated as a comparison of SLR Table 4 Both Ariadne and reference deconvolution increase the summed NA50 of de novo assembled metagenomes.The second-to-rightmost column shows the difference between the summed NA50 of assemblies obtained from deconvolved reads and the summed NA50 of nondeconvolved assembly.The asterisk (*) indicates that the assembly did not have sufficient sequence material that was alignable to the reference sequence for an NA50 to be calculated technologies with respect to de novo assembly.However, the two MOCK20 datasets were generated from the same 20-species mock community product from Zymo, and as such, it is possible to compare assembly quality metrics (Table 5).
The difference between the accurately recovered fraction of each species' genome is small (10x is on average 8% larger).There were some species (e.g., Bacteroides vulga- tus, approx.abundance is 0.03% ) that both no-deconvolution and reference-decon- volved 10x and TELL-Seq assemblies struggled to reconstruct because of their low abundances.It is noteworthy that the Ariadne-deconvolved 10x assembly managed to reconstruct a small contig from Schaalia odontolytica, which both no-deconvolution and reference-deconvolved assemblies completely missed.The difference between the total alignable lengths of the assemblies is larger ( 13.3% ), with all 10x contig-to-ref- erence genome alignments roughly the same or significantly longer than TELL-Seq alignments.At the species level, however, there was some variability in terms of the SLR technology that reconstructed the larger alignment.This variation was probably in part due to fluctuations in coverage as well as downsampling the full sequencing read dataset for efficient comparison.For example, in the case of Bacillus cereus, the longest alignable 10x contig was 166,772 bp and the longest alignable TELL-Seq contig was 2,079,803 bp.However, the recovered genome fraction for both technologies was 97.03% and 98.92% respectively, indicating that while the 10x assembly of B. cereus was more fragmented, it was still by and large complete and gaps were likely due to coverage variability.Similar reversals can be found where the TELL-Seq assemblies of a species are more fragmented but similarly complete.Nonetheless, the 10x assembly contains far fewer contigs than the TELL-Seq assembly-2481 to 7729-which is also reflected in its smaller number of larger read clouds (Table 3).
As expected with larger assemblies, the 10x assembly has significantly more misassemblies and misassembled content than the TELL-Seq data, which is probably due to the fact that more assembled bases are contained in fewer contigs.The amount of unalignable sequence content was quite small in both, and comprised < 0.2% of both assemblies.

Table 5
Comparison of 10x vs. TELL-Seq de novo assembly statistics as calculated by MetaQUAST.MOCK20 datasets were made using 10x and TELL-seq library preparation and sequencing protocols, reference-deconvolved, and de novo assembled using cloudSPAdes

Titrating the maximum fragment length for reference-based deconvolution
To explore whether our estimate of the maximum fragment length affects assembly quality, we reference-deconvolved and assembled two datasets made from different linked-read technologies-MOCK5 LoopSeq and MOCK20 TELL-Seq-with the parameter set to 100 kbp and 400 kbp (Fig. 3, Additional file 1: Table S4, Additional file 1: Fig. S2).In both cases, the purity, entropy, cloud size, and assembly summary statistics are extremely similar to the 200 kbp reference deconvolution.In fact, there seems to be an upper limit for fragment size estimates in terms of deconvolution performance.
Increasing the maximum fragment length past 200 kbp slightly decreases the average purity of MOCK20 TELL-Seq read clouds (Additional file 1: Table S4 row 6).
While the median relative misassembly rate of the MOCK5 LoopSeq 100 kbp-deconvolved dataset was much lower than that of the 200 kbp-deconvolved dataset (Fig. 3 rightmost panel), there was no other notable differences.It is possible that the range of values that we have explored does not generate meaningfully different enhanced cloud read compositions, and if the maximum fragment length were set lower (ex.50 kbp), we would observe more distinctions in both the cloud properties and assembly statistics.However, the specific setting would depend on the user's familiarity with the linked-read fragmentation protocol, while our focus was to include the maximal number of reads from a single reference sequence.

Barcode deconvolution on simulated datasets
To evaluate the utility of barcode deconvolution on higher-complexity datasets, we simulated 100 million 10x linked reads for sets of 20, 50, or 100 species using LRSim [32].The 50-and 100-species' reference genomes were randomly selected from the United Human Gut Genome [17], and the 20-species reference genomes are the same as the Zymo MOCK20 mock microbiome stock.Afterwards, we conducted reference-and Ariadne-based deconvolution (search distance of 5 kbp) on the simulated datasets, then Fig. 3 Halving and doubling the maximum fragment length does not meaningfully change the quality of de novo assembly using reference-deconvolved linked-reads.Shown here are the NA50, largest alignments, and relative misassembly rate of the MOCK5 LoopSeq reference-deconvolved assembly.As before, we compared assemblies built from raw linked reads (no deconvolution) to assemblies built from reads deconvolved using two methods: reference deconvolution with maximum fragment lengths set to 100 kbp (Ref_100), 200 kbp (Ref_200), and 400 kbp (Ref_400), and deconvolution using Ariadne assembled the non-deconvolved, reference-deconvolved, and Ariadne-deconvolved datasets using cloudSPAdes.As previously, there is a sizable increase in the average purity of Ariadne-deconvolved clouds relative to non-deconvolved reads in the 50-species dataset (Additional file 1: Table S5).As expected, with a larger number of species, the non-deconvolved read clouds are less pure and higher in entropy than both MOCK20 datasets, and the number of initial read clouds is approximately the same as the 10x dataset but not the TELL-Seq one.While Ariadne deconvolution does improve the overall purity and reduce the read cloud size, with the exception of the 20-species dataset, the differences are not as stark as with the mock microbiome datasets (compare Additional file 1: Table S5 to Additional file 1: Table S2).Instead of the trend observed above, where Ariadne-deconvolved read clouds are smaller than the reference-deconvolved clouds, here, they are larger instead (see columns called "Avg.Cloud Size").This may be indicative of an upper bound as to search distance-based deconvolution in terms of being able to specifically cluster sequences originating from a large number of species.When comparing read cloud summary statistics between the real to the simulated 20-species datasets (Table 3 vs.Additional file 1: Table S5), we can see that there are fewer read clouds in the real 10x dataset (503,205 vs. 1,720,220), the read clouds are generally larger (186 vs. 58.13reads on average per cloud) and slightly more pure (0.38 vs. 0.24).These differences are probably due to mismatches between the default LRSim parameters, which are estimates of wet laboratory library prep outcomes, and the real 10x sequencing library's properties.
In terms of de novo assembly, there were minimal differences between the NA50, largest assembly, and relative misassembly rates of non-, reference-, and Ariadne-deconvolved simulated datasets for 50 and 100 species (Additional file 1: Fig. S3), which was unusual given the trends in the real datasets.This could possibly be due to the average coverage of the species in the simulated datasets, which is very high-with 100 million reads to cover 160,092,218 bp and 334,517,571 bp in the 50-and 100-species datasets respectively, the depth per base is approximately 94X and 45X respectively.This hypothesis is supported by the fact that the assembly results of the simulated 20-species dataset (Additional file 1: Fig. S3 A, B, and C) are similar to the real dataset (Fig. 2 column 3).Thus, the effects of barcode deconvolution will likely be dependent on the number of isolates in the original sample (and by extension, the fragment coverage of the metagenome).Furthermore, simulated datasets may not be very representative of actual linkedread sequencing data, where missing genomic information due to physical or chemical wet-lab protocols are less random and less correctable through assembly.Real datasets that are not generated from mock communities may share similar missing genomic information issues as the mock communities at a larger scale.

Using synthetic long reads to improve taxonomic classification
Read cloud deconvolution improves short-read taxonomic classification by using, if available, the majority consensus classification of the reads in a read cloud to "promote" poorly classified reads or reclassify them at a lower-ranked taxon.Classification improvements were previously demonstrated on a test-sized dataset using Minervadeconvolved read clouds [26] but can now be observed at the scale of full synthetic long read datasets because of Ariadne's improvements in deconvolution runtime.The reference-deconvolved dataset is present only as a comparison; if the species composition were already known, the classification step is largely unnecessary, except in the case of strain discovery.
Whereas the non-deconvolved MOCK5 LoopSeq dataset only promoted 873 reads from root to species level, the Ariadne-deconvolved dataset promoted 20,214 reads from essentially being unclassified to the species level, which is over 22 times more than no deconvolution (Fig. 4A, Table 6).Similar results can be observed with the MOCK20 TELL-Seq dataset (Fig. 4B, Additional file 1: Table S6).The difference between the number of reads that the Ariadne-deconvolved vs. non-deconvolved datasets are able to promote decreases as we consider lower initial ranks such as order, family, etc., despite the absolute number of reads promoted using linkage information increasing with or without Ariadne's assistance (compare rows from top to bottom in Table 6, Additional file 1: Table S6).
Ariadne deconvolution creates smaller purer clouds (Fig. 1, bottom row), which are more likely to have originated from a constrained genomic region of a single species.Since the taxonomic classifications of Ariadne-deconvolved read clouds are more constrained than large mixed-origin non-deconvolved read clouds, poorly classified reads are much more likely to be promoted to a lower consensus rank, even if there is disagreement between the reads at the species level (Table 6 rows where the promoted-to rank is not S, or species, Additional file 1: Table S6).There are some cases where the non-deconvolved dataset promoted more reads from rank i to rank j, where j is lower in rank than i (for example, R → D and F → G in tab6 and Additional file 1: Table S6).In these cases, the reads that would have been promoted to rank j were promoted to lower ranks instead.For example, compare the number of reads promoted from family to genus (F→ G, non-deconvolved 1,096,859 vs. Ariadne 1,094,013) in tab6 and Additional file 1: Table S6 to the number promoted to species (F→ S, non-deconvolved 1,354,793 vs. Ariadne 1,365,901).For a species-specific example of this trend, see Additional file 1: Table S7 to observe the tendency of enhanced read clouds to promote reads to lower ranks in Rhodobacter sphaeroides.Fig. 4 Read cloud deconvolution improves the specificity of short-read taxonomic classification, especially from high ranks such as root (R), kingdom/domain (D), and phylum (P) to low ranks such as species (S).A MOCK5 LoopSeq.B MOCK20 TELL-Seq, for which the y-axis has been truncated at 120,000 promoted reads for display purposes.There were 683,635 domain-to-species promotions with reference-based deconvolution

Runtime and performance
In most cases, Ariadne consumes approximately the same amount of RAM and takes less time than de novo assembly, with the exceptions of some search distances in combination with MOCK5 LoopSeq and MOCK20 TELL-Seq (Table 7).In general, increasing the search distance increases the runtime by a few minutes to an additional hour, likely due to the fact that more neighboring vertices are considered per read with larger search distances, and a larger amount of time is taken to find intersections between larger sets of neighboring vertices.The runtime's dependency on assembly graph connectivity is illustrated in by difference in runtime between the MOCK20 10x and TELL-Seq datasets.While they both have the same number of barcoded and paired-end reads (100 million), the TELL-Seq deconvolution takes nearly 5 times longer to complete than the 10x deconvolution.Crucially, in all cases (except for the Table 6 Read cloud deconvolution specifically promotes reads to low taxonomic ranks such as genus and species in the MOCK5 LoopSeq dataset.The column "Promotion" indicates the promotion of a paired read i from initial rank X to rank Y as "X→ Y" using deconvolved read cloud information.The abbreviations are as follows: root (R), kingdom/domain (D), phylum (P), class (C), order (O), family (F), genus (G), species (S).The columns "Prop.Improvement" are calculated by taking the difference between the number of reads promoted using the enhanced read clouds and the number of reads promoted using the original read clouds, divided by the latter

EMA, Longranger and Lariat, and Minerva
EMA takes SLR UMIs into account to align linked reads to a reference sequence(s) using a latent variable model, thereby serving as an alternative to Bowtie2-based read alignment to generate gold-standard read cloud deconvolution.However, the first step of the EMA pipeline (last downloaded: May 7, 2021), ema count, which counts UMIs and partitions the original FastQ file into a number of deconvolution bins, is unable to recognize most to all of the UMIs, even when custom whitelists with the exact list of UMIs in the dataset are directly provided (Additional file 1: Table S8).In one case, ema count failed to detect any UMIs in the MOCK5 LoopSeq dataset altogether.In the best-case scenario with the MOCK20 10x dataset, ema count recognized 199,488 barcoded reads.The core deconvolution module in the pipeline, ema align, would have been able to deconvolve at maximum 0.002% of a 94-million read dataset (Additional file 1: Table 7 Read cloud deconvolution specifically promotes reads to low taxonomic ranks such as genus and species in the MOCK5 LoopSeq dataset.The column "Promotion" indicates the promotion of a paired read i from initial rank X to rank Y as "X→ Y" using deconvolved read cloud information.The abbreviations are as follows: root (R), kingdom/domain (D), phylum (P), class (C), order (O), family (F), genus (G), species (S).The columns "Prop.Improvement" are calculated by taking the difference between the number of reads promoted using the enhanced read clouds and the number of reads promoted using the original read clouds, divided by the latter Table S8).To deconvolve reads that are not recognized as barcoded, the EMA pipeline uses the same procedure as our reference deconvolution pipeline, with bwa as the read aligner instead of Bowtie2.Due to the paucity of aligned reads and the subsequent lack of deconvolution, EMA-enhanced read clouds were not featured in this analysis.For ease of use, we decided to use our own reference-based deconvolution pipeline, which automates all of the read alignment, read-subgrouping, and FastQ generation steps with a single submission command.Similar UMI recognition issues were encountered with the Longranger align pipeline and the Lariat aligner it was based on.Similar to EMA, Lariat incorporates UMI information to align linked reads to a reference sequence(s).However, the available FastQ files for all four of the datasets were not comprised of raw Illumina BCL files or the raw output of longranger mkfastq.As such, it was not possible to apply Longranger align or Lariat to the four datasets.Minerva was similarly unable to deconvolve a sufficient number of read clouds ( 0.002% at best with the MOCK5 LoopSeq dataset), and it too was not featured in this full analysis.However, both Ariadne and Minerva completed a deconvolution test-run of 20-million reads from the MOCK5 10x dataset.In summary, while Minerva generated nearly 100% single-species read clouds, it was only able to deconvolve 4% of the reads in total (Additional file 1: Fig. S3 and S4, Additional file 1: Table S9).Performance-wise, the Minerva runs were 4 times as long and consumed 3 times as much memory.The results are explained in greater detail in the Supplementary Materials (pg.8 and 9).

Discussion
Ariadne deconvolution generates enhanced read clouds that are up to 37.5-fold more likely to be single-origin, which will improve downstream applications that depend on approximating the long-range linkage information from a single species, such as taxonomic classification and de novo assembly.In terms of classification, Ariadne-deconvolved read clouds are by and large (with the exception of the LoopSeq dataset) much smaller and more specific in terms of read origins than non-deconvolved read clouds.This allows unclassified short reads from a cloud where other short reads have been successfully classified to be assigned to the appropriate low-ranked taxon.For the MOCK5 LoopSeq dataset, there was a 22-fold, 15-fold, and 6-fold improvement of short reads promoted from root, kingdom/domain, and phylum respectively to the appropriate species classification.Indeed, de novo assemblies of mock metagenome communities are significantly more contiguous with Ariadne-enhanced read clouds, without the outsized increase in misassembly rate as observed with the ideal deconvolution strategy.In terms of the dataset-specific results, increasing the number of species did not change the shape of the size-to-purity distribution greatly, reflecting the inter-technology similarities in 3 ′ UMI-to-fragment multiplicity.As such, Ariadne is capable of improving assembly results and read cloud composition across all of the SLR strategies and, unlike EMA or the Longranger pipeline, easily facilitates re-analyses of existing SLR datasets for higher-resolution de novo assembly and taxonomic classification without pre-existing knowledge of the originating species.As with all other assembly-based algorithms, the degree of assembly contiguity will large depend on (i) the true sample composition, which determines the intrinsic genetic heterogeneity to be resolved, and (ii) the amount of raw sequencing data generated, which is a function of DNA extraction efficiency (i.e., genomic fragment size) and sequencing coverage [33].Not only that-there seems to be significant variability between sequencing runs too even in simulation studies, as observed in with the 50-and 100-species simulated 10x linked read datasets.Though Ariadne relies on cloudSPAdes parameters to generate the assembly graph (e.g., iterative k-mer sizes), the program by itself only has two: search distance and size cutoff.The maximum search distance determines the maximum path length of the Dijkstra graphs surrounding the focal read.Since each read is modeled as the center of a genomic fragment, the search distance can be thought of as the width of the fragment.As such, it should be set as the mean estimated fragment length, as determined using other means such as a priori knowledge of shearing duration and intensity or mapping reads with the same UMI to known reference genomes.While the estimated mean length of a metagenomic fragment according to [15] is approximately 40 kbp, the most balanced results were obtained with a significantly shorter search distance.Of the distances tried in this analysis, to balance the highest-quality assembly results, single-origin read clouds, and computational efficiency, we recommend that the user set a search distance 5 kbp or shorter to generate their own enhanced read clouds.If the user has additional information that their SLR data is comprised of larger read clouds that are nearly single-origin, such as in the MOCK5 LoopSeq dataset, then larger size cutoffs should be tried.Similarly, if the goal is to generate as much alignable genetic material as possible without as much concern for synteny or the relative spacing of genes along the genome, then larger search distances can be tried.Reference-based deconvolution, which directly approximates genomic fragments by mapping reads to the known reference sequence composition of the sample, is better for large contiguous aligned blocks.However, the high misassembly rate is detrimental to the overall integrity of the assembly and should be carefully applied even in cases where the species composition of the sample is known.For some SLR technologies, the average genomic fragment size must be estimated prior to library preparation to ensure the correct balance of DNA and reagent molarity [8,9,11].This average fragment size serves as a useful upper bound of an appropriate search distance.
This study is the first to compare the performances of multiple SLR technologies on well-characterized mock microbiomes.While 10x and TELL-Seq assemblies on a 20-species community are comparable in terms of the fraction of recovered species genomes, the 10x assembly was consistently larger, more accurate, and more contiguous than the TELL-Seq assembly while only slightly more prone to misassembly.LoopSeq is an interesting case where the overall fraction of recovered species genomes was low but the overall read cloud purity was extremely high, leading to smaller but extremely contiguous (i.e., large alignable contigs) assemblies.While 10x is a well-characterized and thoroughly validated all-around choice, it may be advantageous to use LoopSeq on microbiomes composed of a few bacteria with small genomes.As with setting search distances, we advise researchers comparing SLR methods to consider the known complexity of their samples and to choose accordingly.Across all technologies, however, a single 3 ′ UMI/barcode can be associated with short reads from 2.3 to 6.7 fragments.We arrived at this estimate by dividing the average size of the non-deconvolved read clouds by the reference-deconvolved read clouds.Even with only 5 species, the odds of a single read cloud containing reads from ≥ 1 species is at least 1 − (0.2) 1.3 = 0.880.
There have been other recent developments in the SLR space, intended to extract as much linkage information from barcoded reads as possible in order to maximize recovery of input genomic information.Instead of tackling the UMI deconvolution problem, Guo et al. [34] and Weng et al. [35] innovate downstream of the de novo assembly problem.By using SLRs in combination with graph-or k-mer-based methods, SLRsuperscaffolder and IterCluster attempt to generate longer and higher-quality assemblies.Either of these, paired with the largely single-origin read clouds generated by Ariadne, could potentially improve the NA50 and average alignment size generated by cloudSPAdes.Hybrid sequencing and analysis strategies in the future may still take advantage of the coverage and linkage depth of SLR datasets, even when used in combination with long reads [36,37].

Conclusion
We have developed Ariadne, a novel SLR deconvolution algorithm based on assembly graphs that addresses the 3 ′ UMI deconvolution problem for metagenomics and enables the complete usage of the linkage information in SLR technology.Ariadne deconvolution has the largest impact when the input microbial community is large and complex, especially for taxonomic classification, which is ideal for environmental microbial samples with minimal prior characterization [20].Further algorithmic improvements to maximize the correspondence between 3 ′ UMIs and fragments may bring about significant improvements in assembly quality, and increase its scalability to other large-scale SLR problems, such as haplotype phasing.

Algorithm overview
We have developed Ariadne, an algorithm that approximately solves the UMI deconvolution problem for SLR sequencing datasets.Ariadne deconvolves read clouds by positioning each read on a de Bruijn-based assembly graph, and grouping reads within a read cloud that are located on nearby edges of the assembly graph.The grouped reads are termed enhanced read clouds.Currently, Ariadne is implemented as a module of cloudSPAdes version 3.12.0.

The usage of UMI information in de novo assembly
The following is a summary of cloudSPAdes' usage of UMIs to identify contigs from assembly graphs.For a more detailed description, see the Materials and Methods section of [15].cloudSPAdes constructs and iteratively simplifies a de Bruijn assembly graph using sequencing reads and several sizes of k-mers (by default, 21, 33, and 55 bp).The goal is to recover genomic cycles through the assembly graph that correspond to whole chromosomes.Due to read error, incomplete sequencing coverage, limited sequencing depth, and biological phenomena such as repetitive regions, de novo assemblers employ a number of heuristics to find the optimal unbranching paths through the graph [13][14][15].In the case of SLRs and cloudSPAdes, UMIs provide one mechanism of identifying edges that are likely part of the same genomic cycle [15].If read i carrying UMI b aligns to edge i, then edge i is said to be associated with UMI b.The UMI similarity between two edges is the proportion of associated UMIs in common.By using UMI similarities to identify edges that likely originated from the same genomic region, cloudSPAdes connects edges within the assembly graph to approximate chromosomes.
Due to fragment-to-UMI multiplicity, barcoded metagenomic read clouds are likely comprised of reads from a few species.Since these reads are all carrying the same UMI b, the long edges associated with UMI b may be erroneously connected to form contigs with genomic material from multiple species.Chimeric contigs are hazardous for downstream analyses, since as the binning of contigs into metagenomic-assembled genomes [6,38].

The assembly graph conception of a fragment
Searching through the cloudSPAdes-generated assembly graph for reads in the neighboring sequence space is equivalent to approximating the sequence content of a genomic fragment given the entirety of the sequence content in the input read dataset.By necessity, reads originating from the same fragment must (i) have the same 3 ′ UMI and (ii) be no more further apart than the total length of the fragment.The user can specify the search distance according to a priori knowledge about the median size of genomic fragments generated in the first step of SLR sequencing.
This process is diagrammed in Fig. 5. Reads can be mapped to a de Bruijn graph.A read (Fig. 5A blue and red bars) is said to coincide with an edge of the assembly graph (Fig. 5B) if the read's sequence aligns to the edge's sequence.The read can also be said to coincide with the vertices bordering the edge it coincides with.For example, in Fig. 5C, read i, which consists of string 'TGA CTG C' coincides with an edge i that also contains the string "TGA CTG C. "

The Dijkstra graph conception of a fragment
The Dijkstra graphs (Fig. 5D) comprising the nearby assembly graph for each read represent the potential sequence space that the read-originating fragment occupies on that assembly graph.A Dijkstra graph contains the shortest paths from the source node to all vertices in the given graph.The Dijkstra graphs for a read are comprised of the vertices bordering assembly graph edges that are reachable within the maximal search distance.The maximum search distance is a user-provided parameter limiting the size of the Dijkstra graph, or the search space to be considered in the assembly graph.The maximum search distance reflects the user's a priori knowledge of the size of a fragment.

The Ariadne algorithm
Ariadne requires a prior step to locate reads within the (meta)genome of the sample.Where EMA requires an alignment step using the bwa read mapper to generate initial mappings for its barcoded reads [27,39], Ariadne uses the raw assembly graph generated by any of the SPAdes family of de novo assemblers to identify the locations of reads relative to one another in the sample's sequence space.Though Ariadne is intended to be a standalone tool, in the future, it will be possible to integrate Ariadne as an intermediary step in the de novo assembly procedure to improve the assembly graph in situ.
The assembly graph is first generated by applying cloudSPAdes to the raw SLR metagenomics reads as described in [40].For a more detailed explanation of the process of forming the assembly graph, we refer the reader to [40,41].Subsequently, the Ariadne deconvolution algorithm is applied to the raw reads, with the cloudSPAdes-generated assembly graph supplying potential long-range linkage connections derived from the dataset.

Step 1: Extract the assembly graph from a cloudSPAdes run.
A simple representation of an assembly graph is depicted in (Fig. 5B).In the cloudSPAdes assembly procedure, the assembly graph is obtained from the raw assembly graph by condensing each non-branching path into a single edge and closing gaps, removing loops, bulges, and redundant contigs.In the process of constructing the assembly graph, cloudS-PAdes also maps each read to the assembly graph, generating the mapping path of a read, or the edges in the assembly graph either partially or fully spanned by the read (Fig. 5C read i).The mapping path P i , or finite walk, of read i is comprised of the set of edges e from graph G that read i covers (Eq.1).
(1) P i = {e 1 , e 2 , . . ., e n } The mapping path can equivalently be described as a vertex sequence V i , which is com- posed of the set of vertices that border each of the edges in P i (Eq.2).
Importantly, the vertices and edges in the assembly graph are unique numeric indices replacing their sequence content.Instead of having to compare the sequences comprising the assembly graph edges, reads sharing edges can be identified by the indices in their mapping paths.As such, for the purposes of UMI deconvolution, reads do not need to be re-mapped along the assembly graph.Instead, the index-based mapping paths and vertex sequences of each read are used to locate the read.This represents a considerable speed-up over the Minerva procedure, which relies on hashing string-based sequence comparisons between reads and read clouds.Steps 2 and 3 are trivially parallelized such that the deconvolution procedure processes as many read clouds as there are threads available.
Step 2: Generating connected read-sets for each read i.
If the number of reads in the read cloud is greater than the user-set size cutoff, the read cloud (i.e., tagged with the same 3 ′ UMI) is loaded into memory (Fig. 5A).For each read i, the following steps are conducted to identify other reads with the same 3 ′ UMI that potentially originated from the same fragment.
Step 2A: Generating forward and reverse Dijkstra graphs.In Fig. 5, read i aligns to an edge in the de Bruijn assembly graph.Assuming that read i is oriented in the 5 ′ → 3 ′ direction, the 3 ′ -terminal vertex is the 3 ′ -most sequence that read i is contiguous with.By finding edges reachable within search distance d of the 3 ′ -terminal vertex, we will be able to all reads with the same UMI that are likely to originate from the same fragment.Reads that are oriented in the 3 ′ → 5 ′ direction can be reverse-complemented to apply this same procedure.
To facilitate this search, a forward Dijkstra graph (Fig. 5D) is constructed starting at read i's 3 ′ -terminal vertex.The forward Dijkstra graph D f ,i comprises the set of vertices v k , which are all vertices in the assembly graph reachable within the maximum search distance d from the 3 ′ -terminal vertex v j (Eq.3).
The process of constructing the Dijkstra graph, which represents the nearby sequence space of read i, is as follows.The tentative distances to all downstream vertices in the assembly graph are set to the maximum search distance d.From the 3 ′ -terminal vertex, the tentative distances to these vertices are calculated from lengths of the edges connecting the vertices.This value is compared to the current distance value, and the smaller of the two is assigned as the actual distance.The vertex with the smallest actual distance from the current node is selected as the next node from which to find minimal paths, and this process is repeated.This process concludes for any path from v j when the small- est tentative distance to vertices v k is the maximum search distance.
Correspondingly, a reverse Dijkstra graph is constructed with the goal node set as read i's 5′-terminal vertex, the start-vertex of the edge that the start of the read coincides (2) with.For the reverse Dijkstra D r,i , the 5′-terminal vertex is treated as the terminal node of Dijkstra graph construction, and the set of distances in D r,i is comprised of the verti- ces v g such that the minimal edge-length distance between the 5′-terminal vertex and v g is less than the maximum search distance.
Step 2B: Identifying other reads potentially originating from the same (meta)genomic fragment.Due to long-range linkage, reads from the same genomic fragment should occur in both the nearby assembly graph and each others' Dijkstra graphs.As such, all other reads j in the same read cloud are evaluated to see if they map to edges that are covered by the forward and reverse Dijkstra graphs of read i.In other words, for every other read j in the read cloud, if any vertex in the vertex sequence of read j, V j (or its reverse complement sequence) is found in read i's forward or reverse Dijkstra graphs, then read j is added to the connected read-set of read i (Fig. 5D).
If read j's vertices cannot be found in the forward or reverse Dijkstra graphs of read i, the read is likely to have originated from a different fragment than read i (Fig. 5E) despite being tagged with the same 3 ′ UMI, and thus no connecting component is built.
The complexity of the overall read cloud deconvolution, which consists of Dijkstra graph construction and binary searches through sets of vertices is where G is the Djikstra subgraph induced on the overall assembly graph with read i as the focal read, d as the maximum search distance, E is the number of edges in the Djikstra subgraph G, and V is the number of vertices in the Djikstra subgraph G.The number of vertices searched for connectivity (Step 2B) depends on technical properties intrinsic to the dataset-such as the number of reads, the read coverage, error rate, number of repeats in the metagenome, and number of chromosomes-and the user-set search distance parameter.Ariadne thus deconvolves read clouds by dynamically identifying the maximal sets of connected reads within each cloud.

Step 3: Output maximal sets of connected reads, or enhanced read clouds
Each enhanced read cloud is the set of reads from the same read cloud that are part of the same search-distance-limited connected subgraph, as identified in step 2B.These enhanced read clouds are output as solutions to the UMI deconvolution problem (Fig. 5F).Each original read cloud is either subdivided into two or more enhanced read clouds or kept intact if all of the reads were found within the search distance d of each other.To avoid a preponderance of trivial-sized enhanced read clouds, if there is an enhanced read cloud generated that is composed solely of a pair of reads, the two reads are instead added to a separate set of "disconnected" reads that includes read-pairs that are not connected components of any other reads or do not map to the assembly graph.

Selection of search distances
The maximum search distance was selected two quantities: i) the expected size of the physical genomic fragment that generates the sequencing reads and the ii) the likelihood of observing reads some N base-pairs (bp) away from a focal read, where N is an integer.Tolstoganov et al. [15] had previously estimated several dataset parameters to accurately model metagenomic fragments as paths through the assembly graph.To estimate the average fragment length, a method termed single linkage clustering was used to partition reads into clusters corresponding to alignments of likely fragments to the known reference genomes of the species comprising the sample.For example, the expected fragment size of the MOCK5 10x dataset was estimated to be 39,139 bp.For a complete analysis of the genomic fragment length estimates, the number of genomic fragments per 3 ′ UMI, and overall coverage of MOCK5 10x, see the Supplementary Materials of [15].
Setting a maximum search distance of d = 5 kbp seemed to model linked-read genomic fragments with an expected length of 40 kbp reasonably.To limit the occurrence of under-deconvolution, the search distance is set conservatively, as a fraction of the expected genomic fragment length.Each read i is used as a focal read to search the assembly graph.If read j truly originated from the same genomic fragment of read i, it is significantly more likely for read j to occur within d of read i than a full 40 kbp away.Several search distances smaller than the estimated fragment length-5, 10, 15, and 20 kbp-were tried for this study, and it was found that search distance d = 5 kbp provided the best balance of deconvolution accuracy, assembly quality, and computational efficiency (Additional file 1: Table S1).

Mathematical justification of fragment dissimilarity
Danko et al. [26] provides a short summary of the mathematical model that justifies the modeling of fragments as search-distance limited connected subgraphs within a de Bruijn assembly graph.The details of the mathematical model for drawing fragments of DNA from a SLR sequencing sample are fully described in [26], the article describing Minerva, the previous iteration of Ariadne.In brief, random fragments of genomic sequence on the scale of kilo-base-pairs from a large number of species with genomes that are many mega-base-pairs in length are unlikely to be similar in terms of sequence to overlap on the assembly graph, even with k-mers as small as 22 bp.For metagenomic datasets, at least 99% of read clouds consist of fragments that do not overlap in this model [26].This is important because it means that overlapping fragments, and thus spurious connections, are uncommon and will not hinder deconvolution in most read clouds.The algorithm is theoretically capable of uniquely deconvolving 99% of read clouds by searching the assembly graph surrounding the read for connecting reads.The only drawback of this model is that it does not account for the fact that individual fragments may have similar sequences, such as those resulting from repetitive regions or highly conserved genes.

UMI-based promotion of taxonomic classification
Each sequencing read is taxonomically classified using Kraken2 [42].If the read is classifiable, at each canonical taxonomic rank, the read is assigned to taxon R n , where R ∈ {k, p, o, f , g, s} and k stands for the rank of kingdom, p for phylum, c for class, and so on.Each subscript n indicates the taxon that a read i was classified to at rank R. For example, let us consider a read cloud containing 7 reads with the following classifications: For each read cloud, a taxonomic tree is built from all read classifications.The taxonomic tree of this 7-read cloud, as represented by nested dictionaries in the Python script, would be: Since all reads from the same read cloud should, theoretically, originate from a single genomic fragment from a single isolate genome, they should all have the same taxonomic classification.Thus, the taxonomic classification shared by a majority of the reads in the cloud should shared by all reads.Read that are classified at higher ranks by Kraken2 can be "promoted" to that lower-rank, more specific taxon.In this example, the lowest rank where the taxonomic classification is consistent across all reads in the read cloud is at the family rank, where a majority of the reads in this read cloud are classified as f 0 .There are classifications at lower ranks but there are disa- greements as to the taxon of the cloud at those ranks (ex. 2 different species identified-s 0 and s 1 which are both from genus g 0 , and 2 different genera identified-g 0 and g 1 which are both from family f 0 ).
Thanks to the majority classification at the family taxon f 0 , 2/7 reads in the read cloud-reads 6 and 7 classified as o 0 and k 0 respectively-can be "promoted" to a classification of f 0 .The other 5/7 reads, having already been classified at ranks equal to or lower than that of family, are not promoted.Unclassified reads in the same read cloud are not considered as part of the taxonomic tree (i.e., unclassified is not its own "kingdom, " so to speak) and cannot be promoted.

Fig. 1
Fig. 1 The size-weighted purity of SLR read clouds increases after applying deconvolution methods.All graphs were generated from 40,000 randomly sampled clouds from the dataset.Top row: No deconvolution.Middle row: Reference deconvolution based on read alignment to species and then grouping reads in 200 kbp regions.Bottom row: Ariadne deconvolution with a search distance of 5 kbp and a minimum cloud size cutoff of 6

Fig. 5
Fig. 5 Graphical description of the Ariadne deconvolution process.A Reads with the same 3 UMI are in a read cloud.Blue and red reads originate from different fragments.The B de Bruijn assembly graph is generated by cloudSPAdes, and C a focal read is mapped to one of its edges.From a read's 3 -terminal vertex, D a Djikstra graph (indicated by a large black circle) is created from all edges and vertices within the maximum search distance from the 3 -terminal vertex.These vertices and edges (within the black circle) comprise read i's search-distance-limited connected subgraph within the whole assembly graph.Reads aligning to edges in this connected subgraph are added to read i's connected set.E Reads originating from different fragments likely coincide with non-included vertices.F Connected read-sets with at least one intersection (i.e., one read in common) are output together as an enhanced read cloud G i,d log(V G i,d ))

Table 1
Overview of mock microbiome linked-read datasets.Asterisk ( * ) symbol indicates that MOCK20 10x and TELL-Seq were generated from the same mock microbiome community product

Table 2
[29]tive species abundance in each mock microbiome community as calculated by the tool CoverM[29].Abundance based on read coverage of reference genome sequence adjusted for genome size

Table 3
Read cloud summary statistics.
For Ariadne deconvolution, we used a search distance of 5 kbp.The 3-5th columns contain the average and standard deviations of read cloud statistics.Prop.Under, Comp., and Over refer to the proportion of total original or deconvolved read clouds that were over-or under-deconvolved, or completely and exactly comprised of all of the reads from a single inferred genomic fragment