Gapless assembly of maize chromosomes using long read technologies

Creating gapless telomere-to-telomere assemblies of complex genomes is one of the ultimate challenges in genomics. We used long read technologies and an optical map based approach to produce a maize genome assembly composed of only 63 contigs. The B73-Ab10 genome includes gapless assemblies of chromosome 3 (236 Mb) and chromosome 9 (162 Mb), multiple highly repetitive centromeres and heterochromatic knobs, and 53 Mb of the Ab10 meiotic drive haplotype.


3
A new maize inbred, B73-Ab10, was created by backcrossing a line containing Ab10 to the B73 inbred six times and selfing it an additional five times (BC 6 F 5 ). We used DNA from this line to prepare an optical map with the Bionano Saphyr system and sequenced it to high coverage using both PacBio and Nanopore technologies. We then implemented a genome assembly workflow based around the optical map (Suppl. Figure 1). Briefly, the PacBio data were assembled using Canu 13 , the Nanopore data assembled using miniasm 14 and the two independent assemblies merged with miniasm and integrated with the optical map as hybrid scaffolds. Hybrid scaffolds were then used to guide further gap closing and create a pseudomolecule assembly ( Figure 1A) where the superior PacBio assembly provides the core and Nanopore contigs help to seal the gaps. Our approach of one-step contig merging and error correction using optical maps as a reference differs from other methods that rely on local assemblies to fill gaps and correct errors 15,16 and performs well in difficult repeat-rich regions ( Figure 1B). Alignments of the optical map to the independent assemblies 17 and standard genome completeness measures demonstrate that the approach is highly accurate (Suppl. Table 1 and 2).
The final assembly has a contig N50 of 162 Mb (Table 1), which far exceeds the contiguity of any prior maize genome assembly 2,5 . Of particular note is the complete 236 Mb assembly of chromosome 3, which was assembled gaplessly without manual intervention -a first for any chromosome from a large complex genome. While the human X-chromosome was also assembled gaplessly 18 , this outcome required extensive manual inspection and correction.
The entire B73-Ab10 genome is represented by 63 contigs where 90% are longer than 20.4 Mb (the N90). In addition to the expected gaps in repeat arrays, there were two gaps associated with residual heterozygosity on chromosome 9. Regions of heterozygosity reduce effective coverage and lead to assembly chimeras that are broken during hybrid scaffolding. We filled these heterozygosity-associated breaks by choosing the dominant Bionano path and performing local assemblies over the gaps. Nanopore reads were also used to span a gap within a CentC array to complete the chromosome 9 telomere-to-telomere assembly. Aside from these manual interventions, some efforts to manually improve within-knob assemblies, and a correction to the Kindr gene complex region of Ab10, the assembly was automated.
Seven of the ten functional centromeres as defined by ChIP-seq of CENP-A/CENH3 7 were assembled without gaps (Suppl. Table 3). These include centromere 3, where CENH3 localizes over a 771-kb CentC array interspersed with transposons. Alignment of partial BACbased assemblies of B73 centromeres showed excellent agreement overall (Suppl. Figure 2).
Prior maize assemblies have succeeded in obtaining only small fragments of knob repeat 4 arrays. In contrast, a knob180-rich knob on chromosome 9 (850 kb), a TR-1-rich knob on chromosome 4 (1.3 Mb) and three TR-1-rich knobs (4.2 Mb, 2.6 Mb, and 2.1 Mb) on Ab10 were fully assembled in the B73-Ab10 assembly. The internal structure reveals that knobs, like centromeres 2,7 , often contain more transposons than tandem repeats ( Figure 1C). Centromeric Retrotransposons target areas with CENP-A/CENH3 2,7 and occupy on average 31.9% of functional centromeres, including within CentC arrays ( Figure 1A and Suppl. Table 3 and 5).
The new knob assemblies reveal that the Cinful-Zeon family of Gypsy elements 19 preferentially target knobs. Cinful-Zeon elements occupy 27.0% of the assembled TR-1-rich knobs and 8.2% of the knob180-rich knobs, but only 3.8% percent of CentC arrays ( Figure 1A and Suppl. Table   5 and 6). While Centromeric Retrotransposons are centromere-specific, Cinful-Zeon elements are also abundant in other heterochromatic regions throughout the genome ( Figure 1A).
In addition to revealing the internal structure of knobs, the data provide the first complete view of the Ab10 haplotype that provides the selective force for the accumulation and maintenance of knobs 10 . The meiotic drive haplotype on Ab10 contains three fully assembled TR-1 knobs, a much larger knob180 knob that is not assembled, and two large inversions (4.4 and 8.3 Mb) that are homologous to normal chromosome 10 ( Figure 1C). These major structural differences help to explain why recombination between the Ab10 haplotype and normal chromosome 10 is suppressed 20 . Ab10 also contains 22.4 Mb of novel sequence with no synteny to other regions of the maize genome or related grass genomes. Within this domain is the complete cluster of nine Kindr genes that are integral components of the drive system 10 , as well as hundreds of other expressed genes, many of which have only one exon or overlap with transposons and are likely non-functional (Suppl. Table 7). Additional meiotic drive functions associated with the movement of knobs at meiosis and their delivery to egg cells 21 remain to be identified in this newly discovered sequence.
Gapless genome assemblies remove all uncertainty about the order, spacing and orientation of genes and their regulators. We have shown that this can be achieved using long reads and well-known assembly algorithms, with significant improvements in contiguity obtained by integrating independent assemblies around an optical map scaffold. Given that most contigs end in telomeres, centromeres or knobs, we presume that virtually all of the genes and associated regulatory information are represented in this genome assembly. The assembly merging pipeline also revealed the internal structure of repetitive domains that were previously known only by cytological techniques, thereby opening these regions to annotation and future epigenomic profiling. Similar results should be achievable other complex genomes, although 5 higher sequence coverage, longer reads, and/or additional scaffolding information may be needed for species with polyploidy or higher levels of heterozygosity. Figure 1. Assembly of the B73-Ab10 genome. A) Whole-genome view. For each chromosome, the top to bottom tracks are: gene density, cinful-zeon retrotransposon density, Gypsy superfamily retrotransposon density in 10 Kb sliding windows, repeat location (knob180 in blue, TR-1 in red, 45S rDNA in teal, CentC in magenta), and the distribution of gapless contigs. CENH3 ChIP-seq peaks identifying centromeres are marked by orange rectangles. The inset shows the centromere on chromosome 3, TR-1-rich knob on chromosome 4, and knob180-rich knob on chromosome 7. The five most common retroelement families are shown for each panel, along with Centromeric Retrotransposons (CRM) for the centromere. B) The impact of assembly merging over a CentC-rich region on chromosome 9. Seven contigs (orange, above) from the PacBio assembly were originally misassembled, as can be seen in the alignment to the Bionano map (connecting lines show matching sites). CentC tracts and gaps are annotated. Assembly merging corrected the output, leaving an 11 kb gap that was filled with nanopore reads. C) Sequence alignment between normal chromosome 10 from B73 (N10) (140Mb-152Mb) and Ab10 (140Mb-195Mb) from B73-Ab10. Annotation is as in A, with Kindr genes marked with black bars in the top track. Links show homologous regions larger than 500bp.

Germplasm, data, and code availability
The B73-Ab10 inbred can be obtained as PI 690316 at the Germplasm Resources Information Network (GRIN), Ames, Iowa. All genomic sequence and Bionano data can be obtained at the NCBI SRA under Bioproject PRJEB35367. The RNA-seq data is deposited in EBI (Accession number E-MTAB-8641). The code used in this study is available at the GitHub repository https://github.com/dawelab/Ab10-Assembly.

PacBio assembly
High molecular weight DNA was extracted from young leaves using the protocol of Illumina sequence by first aligning the reads to the Arrow polished assembly using minimap2 4 , followed by running the assembly tool Pilon 5 (v1.22) to correct individual base errors and small indels using the following parameters: --fix bases --minmq 30.

Nanopore assembly
Two different DNA extraction methods were used to generate high molecular weight (HMW) DNA for Oxford Nanopore (ONT) sequencing. CTAB DNA was prepared as described above for the PacBio assembly. Nuclear DNA was prepared using the protocol of Luo and The data were filtered for reads >10 Kb using seqtk (https://github.com/lh3/seqtk), resulting in an estimated 30x coverage of the maize genome. The resulting reads were aligned (overlap) with minimap2 (v2.13;-x ava-ont -t 64) 4 and an assembly graph (layout) was generated with miniasm (v0.3; -f <reads> <overlaps>) 7 . The resulting graph was inspected using Bandage 8 . A consensus genome assembly was generated by mapping reads >10 Kb to the assembly with minimap2, and then running racon (v1.3.1) 9 ; the consensus process was repeated three times.
The contig assembly was further polished using 73X PE150 Illumina sequence by first aligning the reads to the consensus assembly using minimap2 7 followed by running the assembly tool pilon (v1.18) 5 two times using 73X PE150 Illumina sequence.

Optical map assembly
Ultra high molecular weight DNA was isolated from maize seedlings using a modified

Assembly merging and gap closing
We developed a pipeline to integrate independent contig assemblies and curate assembly errors using Bionano maps as an anchor. The pipeline consists of five steps: 1) conflict resolution, 2) assembly error curation, 3) contig merging, 4) hybrid assembly and contig overlap removal, and 5) manual curation and gap filling (Suppl. Figure 1). The first four steps were automated. A gapless chromosome 3 was generated upon contig merging in the third step, and the complete assembly of chromosome 9 required manual curation. While contig merging with miniasm can be applied to any two sequence assemblies, the availability of de novo assembled Bionano maps is necessary to perform conflict-cutting in step 1, contig error correction in step 2, and hybrid scaffolding in step 4 of the pipeline.

12
Step 1: Conflicts between the optical map and DNA sequence assemblies were resolved using Bionano Solve software (https://bionanogenomics.com/support-page/data-analysisdocumentation/). Sequence assembly can occasionally connect two regions that share a repetitive sequence but do not belong together (making a chimeric contig). These appear as conflicts between bionano maps and sequence assemblies when they are aligned. Optical maps were aligned to in silico digested representations of the DNA sequence assemblies using RefAligner (v3.4.0) and conflicts identified with the AssignAlignType.pl script. Conflicts with chimeric a quality score higher than the default threshold were split using cut_conflicts.pl (using default parameters from optArguments_nonhaplotype_noES_DLE1_saphyr.xml) and a sequence file was produced with custom script cut_conflict_NGS.py. Removing chimeric joins increases the chance of complementary contig merging in Step 3.
Step 2: Assembly errors in the conflict-resolved PacBio contigs were identified and automatically curated with ONT contigs. In this step, PacBio and ONT contigs were aligned to rescaled optical maps and structural discrepancies detected using the structural variant calling pipeline from BionanoSolve (v3.4.0). Homozygous insertions and deletions with a confidence of at least 0.1 and size larger than 1 Kb were classified as true assembly errors in the PacBio contigs. On the condition that no structural discrepancies were found in the corresponding ONT contigs, the ONT contigs were used to replace the erroneous sequences in PacBio contigs using custom script SV_fix.py.

13
Step 4: Bionano maps were integrated with the sequence contigs by hybrid scaffolding using the hybridScaffold.pl script from BionanoSolve (v3.4.0) with default parameters from optArguments_nonhaplotype_noES_DLE1_saphyr.xml. This step orders and orients sequence contigs and facilitates the resolution of remaining overlaps between contigs. As the optical maps are aligned and rescaled with the sequence maps repeatedly during hybrid scaffolding, more accurate overlaps between contigs are identified and annotated as 13N gaps. These overlaps were removed through contig merging with miniasm (v0.3), as described in Step 3. Due to the extreme repetitiveness in the 45S rDNA repeat region on chromosome 6, both the contig assemblies and hybrid scaffolding in this area are erroneous. Therefore, we left the contigs in the NOR un-merged and marked the incorrectness with 13N gaps.
Step 5: Manual curation was performed to correct assembly errors, close gaps in repetitive and heterozygous regions, and assemble telomeres.
Repeat assembly manual curation. In highly repetitive regions, erroneous read joins at the tips of contigs were not detected as conflicts or assembly errors in Steps 1 or 2 due to the limited resolution of Bionano alignment. In these regions, we trimmed and removed the 14 Kindr complex manual curation. The assembly over the ~1 Mb tandem array of Kindr genes (each within an ~100 Kb repeat) was erroneous due to collapsing in the PacBio sequence contig and improper scaffolding. We manually selected the most contiguous ONT contig over this region, carried out hybrid scaffolding for the scaffold containing Kindr, placed an excluded contig in the correct area, and removed an overlap region through contig merging.
The final scaffolds were polished with PacBio subreads using tools from pb-assembly 2 .

AGP construction
The pseudomolecules were constructed from the hybrid scaffolds using ALLMAPS (v0.8.12) 14 . Both pan-genome anchor markers 15  uniquely mapped marker sequences. The mapped markers were merged with the predicted distance information to generate a CSV input file for ALLMAPS. Only scaffolds with more than 20 uniquely mapped markers, with a maximum of 100 markers per scaffold, were used for pseudomolecule construction. The IBM genetic markers were downloaded from MaizeGDB (https://www.maizegdb.org/complete_map?id=887740) 18 and were processed to generate a bed file similar to pan-genome markers. For the markers with coordinates, 50 bp flanking regions were extracted from the B73 v4 genome. For markers without coordinates, marker sequences were used as-is, and those missing both coordinates and sequences were discarded. Mapping of the markers was done similar to the method described above for the pan-genome anchor markers, with all uniquely mapped markers retained. The genetic distance information for these markers was converted to a CSV file before use in ALLMAPS. ALLMAPS was run with default options, and the pseudomolecules were finalized after inspecting the marker placement plot and the scaffold directions. Synteny dotplots were generated using the scaffolds as well as pseudomolecule assemblies against the B73 genome by following the ISUgenomics Bioinformatics Workbook (https://bioinformaticsworkbook.org/dataWrangling/genomedotplots.html) . Briefly, the repeats were masked using RepeatMasker (v4.0.9) 19 and the Maize TE Consortium (MTEC) curated library (https://github.com/oushujun/MTEC) 20 . RepeatMasker was configured to use the NCBI engine (rmblastn) 21 with a quick search option (-q) and GFF as a preferred output. The repeat-masked genomes were then aligned using minimap2 4 (v2.2) and set to break at 5% divergence (-x asm5). The paf files were filtered to eliminate alignments less than 1 Kb and dotplots were generated using the R package dotPlotly (https://github.com/tpoorten/dotPlotly).

RNA-seq
Ten tissues were sampled throughout development for evidence-based gene annotation including: primary root (1) and coleoptile (2)  Total RNA samples were assayed by Bioanalyzer to determine RNA integrity and normalized in 25uL of nuclease-free water prior to library preparation. Sequencing libraries were prepared using KAPA's Stranded mRNA-seq kit (#KK4821) according to the manufacturer's instructions. The mRNA was enriched using oligo-dT beads, fragmented, and converted to 16 double stranded cDNA using random hexamer priming and amplification. Libraries were pooled at equimolar ratios and sequenced on NextSeq 500 instruments using the PE75 protocol.

Gene annotation
For evidence-based predictions, genome-guided transcript assemblies were generated from five different assemblers viz., Trinity (v2.6.6) 21 33 . PASA was run with default options, with a first step of aligning transcript evidence to the masked B73-Ab10 genome using GMAP (v.2018-07-04) 34 and Blat (v.36) 35 . The full length cDNA and Iso-seq transcript ID's were passed in a text file (-f FL.acc.list) during the PASA alignment step. Valid near perfect alignments with 95% identity were clustered based on genome mapping location and assembled into gene structures that included the maximal number of compatible transcript alignments. PASA assemblies were then compared with B73-Ab10 Mikado transcript models using default parameters. PASA updated the models, providing UTR extensions, novel and additional alternative isoforms. PASA generated models were passed through the MAKER-P (v3.0) 36 annotation pipeline as model_gff along with all the transcript and protein sequences to obtain Annotation Edit Distance (AED) 37 scores to assess the quality of annotations. Transposon element (TE) related genes were filtered using the TEsorter tool 21,38 , which uses the REXdb (viridiplantae_v3.0 + metazoa_v3) database of TEs.
Finally the gene annotations were verified for translation errors using the EnsemblCompara pipeline 39 .

BUSCO assessment
The gene space completeness of the B73-Ab10 genome assembly was assessed using the GenomeQC 40 tool, which provides a summary of the number of complete, fragmented and missing Benchmarking Universal Single-Copy Orthologs (BUSCO) in the assembly. The Embryophyta database (embryophyta_odb9; consisting of 1440 conserved, single-copy plant genes) and the genome assembly in the fasta file format were provided as input to the tool to calculate the BUSCO metrics.

TE annotation
The manually curated transposable element library (maizeTE11222019) derived from the Maize TE Consortium (MTEC; https://github.com/oushujun/MTEC) was used as the base TE library. Novel TEs of the maize Ab10 genome not included in the MTEC library were structurally identified using the EDTA pipeline (v1.6.5) 41 with parameters "-species maize -curatedlib maizeTE11222019". The MTEC library augmented with Ab10 specific TEs was used to annotate TE fragments using RepeatMasker. Coding sequences of the maize B73 v4 assembly were