LeafGo: Leaf to Genome, a quick workflow to produce high-quality de novo plant genomes using long-read sequencing technology

Currently, different sequencing platforms are used to generate plant genomes and no workflow has been properly developed to optimize time, cost, and assembly quality. We present LeafGo, a complete de novo plant genome workflow, that starts from tissue and produces genomes with modest laboratory and bioinformatic resources in approximately 7 days and using one long-read sequencing technology. LeafGo is optimized with ten different plant species, three of which are used to generate high-quality chromosome-level assemblies without any scaffolding technologies. Finally, we report the diploid genomes of Eucalyptus rudis and E. camaldulensis and the allotetraploid genome of Arachis hypogaea. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-021-02475-z.

: Plants used in this study 9 Table S2: Oxford Nanopore Technology Sequencing Results for the two Eucalyptus species 10 Table S3: Impact of two size selection methods on Oxford Nanopore Sequencing 11 Table S4: PacBio sequencing results for eight different plant species 12 Table S5: Correlation between library loading and throughput and N50 14 Table S6: Total (not purged) assembly stats comparison between a selection of HiFi-enabled assemblers 15 Table S7: Comparison of computational resources utilization for the assemblers 16 Table S8: Haplotype-separated assembly stats and BUSCO scores 17 Supplementary Figures S1-S12 18 Figure S1: Size selection of ONT libraries 18 Figure S2: Capillary electrophoresis of the HMW DNA from ten plants 19 Figure S3: Pulse field gel electrophoresis of extracted plant HMW DNA 21 Figure S4: Capillary electrophoresis and Long-read Sequencing output of CLR libraries 22 Figure S5 Correlation N50 vs P0% 23 Figure S6: Capillary electrophoresis of prepared HiFi libraries 24 Figure S7: LongQC plots of HiFi data for the two Eucalyptus species and Arachis 25 Figure S8: LongQC plots of CLR data for two Eucalyptus species 26 Figure S9: Genome profiling of the two Eucalyptus species and A. hypogaea based on HiFi data 27 1 Figure S10: In silico Taxonomic classification of the two Eucalyptus species 28 Figure S11: Timeline from sample extraction to assembly for A. hypogaea 29 Figure S12: Summarised workflow for the sequencing data: from platform to purged haploid assembly 30

References 31
Supplementary Results

Oxford Nanopore Technology Sequencing
We sequenced the two Eucalyptus species with four GridION ONT flow cells. Results demonstrated the suitability of our HMW extraction protocol for Oxford Nanopore sequencing ( Fig. S1 and Table S2). We called a total of 881,946 passed reads constituting a total of 12

Genome profiling on unassembled data
LeafGo used the software package GenomeScope 2.0 [4] to infer the size and the heterozygosity level of the two Eucalyptus and the Arachis genomes from unassembled HiFi sequenced data. The rationale is that if the genome is heterozygous, then the k-mer profile will exhibit a characteristic bimodal distribution, as shown in HiFi genome sequencing data (Fig. S9).
We observed two peaks that were centred at coverage values of 20x and 40x in the case of E. rudis and 25x and 50x in the case of E. camaldulensis (Fig. S9A For the allotetraploid peanut genome, two distinct ancestral species underwent a hybridization event and four k-mers topologies are expected (a:a:b:b) [4]. The heterozygosity of Arachis hypogaea (expressed as not aaaa) is 10.4% with a k-mer genome size of about 635 Mb (Fig. S9C).

Phenotypic and in silico identification of the Eucalyptus species
Phenotypic identification of the two Eucalyptus species indicated that the two sequenced trees belonged to Eucalyptus camaldulensis subspecies obtusa and Eucalyptus rudis subspecies rudis. The species were identified by examination of leaves, flower buds, seed pods, seeds, bark and tree morphology. The phenotypic identification was then confirmed to species level by in silico DNA analysis. In silico DNA metabarcoding based on 336 complete ITS sequences from different Eucalyptus genera indicated that the best hits of our query sequences were E. camaldulensis and E. rudis for each dataset (Fig. S10).
In order to carry out in silico identification we obtained 335 complete ITS sequences representing 304 different Eucalyptus genera and 31 unknowns from the PLANiTS dataset version 29-03-2020 [5]. As the dataset did not include the ITS sequence for E. rudis we downloaded it separately from NCBI Nucleotide (accession: KT631323.1) bringing the total number of our Eucalyptus ITS sequences to 336. We also replaced the ITS sequence for E. camaldulensis with the longer version deposited in the NCBI nucleotide database (accession: AF190363.1) which, like the ones for E. rudis, also includes partial 18S, full 5.8S and partial 26S rRNA genes.
Once assembled we aligned both E. camaldulensis and E. rudis genomes against the ITS dataset using blastn v2.7.1 [6] with an E-value cut-off set to 1e-5. We assessed the top hits taking into account percentage identity, length of alignment, mismatches and gaps.

Estimated assembly ploidy
The total assembly size in any genome assembling exercise depends on the ploidy of the genome, level of heterozygosity and how well the assembly tool is able to disentangle the different haplotypes. In an ideal assembly scenario where ploidy is P, we should expect a total assembly size (AS) as AS = Genome Size x P. The purging step, whether part of the assembler or a subsequent analysis, is designed to split the haplotypes as best as possible. In an ideal scenario, primary haploid assembly size should equal that of the alternative set. For homozygous genomes, the assemblers will struggle to distinguish between the homologous contigs and the resulting assembly will be the collapsed haploids with an assembly size equating the genome size. For heterozygous genomes, the purging step is able to recover the alternate haplotypes. Where this is possible, the assembly ploidy is an additional metric that, along with other quality metrics such as BUSCO scores, can determine the completeness of an assembly.
We estimated the assembly ploidy (AP) of the assembly using the following basic formulae: . The smaller and bigger haploids refer to either = ℎ / ℎ the primary or the alternative set of contigs from the haplotigs purging step. This remains an estimate and should be taken with caution as we naively assume that the larger set of haplotigs constitutes a complete haploid n.

Evaluation of four long-read assemblers on PacBio HiFi data
We report a comparison of four long-read sequencing assemblers for performance applied on HiFi PacBio data on three different plant species, E. rudis, E. camaldulensis and A. hypogaea.
For each plant species we de novo assembled the genome using hifiasm v0.8 [7], HiCanu v2 [8], Flye v2.8.1 [9] and Wtdbg2 v2.5 [10]. Parameters used by each assembler are listed at the bottom of Tables S6 and S7. The assemblies were mainly compared for contiguity, and computational requirements in terms of time and memory consumption. Results are shown in Tables S6 and S7 revealing that hifiasm outperforms other assemblers.

Genome assembly: computational resources
To help the reader understand the different stages where significant compute resources are needed, we refer to the diagram in Fig. S12. First, we will explain the main resource-intensive steps in the CLR mode of sequencing then compare them with the HiFi ones. Compute resources used by the different HiFi assemblers are also listed in Table S7.

CLR mode Data Transfer
Data transfer from the sequencer to the compute resource will depend on 1) file sizes (usually 1-2 TB) and 2) the speed of the underlying network, if onsite, and/or the internet speed, if data analysis will be carried out in the cloud. In our case, with an onsite computer cluster a network connectivity of 40 Gb/s & 10Gb/s ports, a typical SMRT cell data transfer takes 3-5 hrs.

Assembly
The biggest bottleneck while assembling CLR data is the error correction and the overlap/trimming step. This step can require significant computational resources especially when coverage is high. With large genomes, this can become intractable. For E. rudis (~600 Mb), with a sequencing coverage of 50x, the Canu assembler took about 20 hours to finish in cluster mode with a total of 3,444 CPU hours and a maximum RSS memory request peaking at 87 Gb. For E. camaldulensis (~600 Mb), with a sequencing coverage of 230x, Canu, in cluster mode, took 548 hours (1116 hours with queue wait and debugging) with over 72,491 CPU hours (error correction: 15,280, consensus: 50979, assembling: 6231) and maximum RSS request peaking at 69 Gb.

HiFi mode Data Transfer
This step is similar to the respective section under CLR as Sequel II generates files of comparable sizes in HiFi and CLR mode.

CCS/HiFi generation
For HiFi, the consensus and error correction are carried out before the assembly step. The improved chemistry and the resulting long polymerase reads have simplified the consensus generation and error correction significantly. This is carried out solely at the level of the ZMW read without needing to overlap reads from other ZMWs. An added advantage is the ability to split/chunk the raw file and distribute the workload on as many compute nodes as physically available without the need for complex mpi coding.
For E. rudis, with a single SMRT cell, HiFi generation required 4.5 hours in cluster mode consuming 4,532 CPU hours over 10,305 cores. For E. camaldulensis, also with a single SMRT cell, the HiFi step took 5 hours over 11,458 total cores using 7,164 CPU hours. For A. hypogaea, with 8 SMRT cells, the HiFi step consumed a cumulative total of 40,425 CPU hours (average~5000 CPU hr per SMRT cell) over a total of 138,632 cores (compute cluster). However, CCS calling was done while sequencing was taking place and overall, all HiFi reads from the eight SMRT cells were ready 5 hours after the last cell was transferred (Fig. S11).

Assembly
Assembling the HiFi data with hifiasm needed 1.20 hours (80 minutes) for E. rudis (40x coverage) and 2 hours (120 minutes) for the E. camaldulensis (51x coverage). We ran hifiasm on a single 40 core node where it consumed 53 and 81 CPU hours for E. rudis and E. camaldulensis, respectively. In terms of memory, the former peaked at 52 Gb and the latter at 64 Gb.
For A. hypogaea (genome size 2.6 Gb; 74x coverage), hifiasm assembled all 8 SMRT cells in 29.8 hours using identical hardware as the two eucalypts. Hifiasm consumed a total of 1,081 CPU hours with maximum memory peaking at 317GB. Refer to Table S7 for a detailed breakdown of HiFi assembly, compute requirements and hardware specifications.

Time & computational resources: E. rudis & E. camaldulensis vs E. pauciflora assemblies
A hybrid assembly strategy based on Oxford Nanopore long reads and Illumina short reads sequencing was used to recently assemble the E. pauciflora genome [11]. This genome for size and complexity is comparable to the two Eucalyptus genomes presented in this study (E. pauciflora: 595 Mb; E. rudis: 549 Mb; E. camaldulensis: 532 Mb).
The two Eucalyptus genomes generated in this study were generated consuming about hundred less computational resources than the one produced by the hybrid assembly. For E. rudis and E. camaldulensis were used complessively~2,200 CPU hours (error correction by CCS) and 53/81 CPU hours (hifiasm) respectively. The E. pauciflora genome needed~200,000 CPU hours (error correction by Canu) plus 21,000 CPU hours (MaSuRCA assembler).

LeafGo: cost estimates
For a plant with a one Gb genome, we estimate that within approximately seven days a high quality draft genome assembly can be produced from plant tissue for an estimated consumables and compute cost of US$3,000-US$4,000 [12]. For bigger genome sizes, more libraries and SMRT cells will be required, with an additional estimated consumables cost of at least US$1,300-US$1,800 per SMRT cell.
ø HiFi assembly was purged using purge_dups as the hifiasm purging module did not produce the best results.
⋐ Duplicate BUSCOs in the purged A. hypogaea assembly are high. This is expected and is in support of a high quality assembly and purging step. The source of the duplication is the two subgenomes.

Figure S5 Correlation N50 vs P0%
Correlation between ZMW occupancy or library underloading (P0%) and subread N50 of CLR libraries sequenced with Sequel I. Spearman's rho = 0.42, p = 0.0499.    k-mer profiles, fitted models, and estimated parameters for the diploid genomes of (A) E. rudis (heterozygosity = 1.5%, repeat = 37.4%),and (B) E. camaldulensis (heterozygosity = 2.1%, repeat = 38.3%), and (C) A. hypogaea (heterozygosity = 12.5%, repeat = 82.9%). It should be noted that Genomescope2 reports the genome size of a single chromosome set. So the estimated genome size is len multiplied by 4 giving 2.54Gb. Legend: len is the estimated genome size. Uniq is the size of the non-repeat content. aa,ab: homo and heterozygosity level for a diploid genome. aaaa, aabb, aabc, abcd refer to heterozygosity for an allotetraploid genome. Kcov is estimated coverage for heterozygous k-mers. Err: read error rate. Dup: PCR duplication. P: genome ploidy.   Major differences between CLR and HiFi modes are: 1) the "CCS Generation" step in the HiFi mode (right) which is essentially error correction and consensus calling steps on produced sequencing data (often performed during assembly).
2) The assembly stage in the HiFi mode (right) is more simplified with a very fast error correction step as the total HiFi bp size is smaller than CLR subread size. The haplotig purging step is often a separate step from assembly except for hifiasm.