Chromosome-scale comparative sequence analysis unravels molecular mechanisms of genome evolution between two wheat cultivars

Background Recent improvements in DNA sequencing and genome scaffolding have paved the way to generate high-quality de novo assemblies of pseudomolecules representing complete chromosomes of wheat and its wild relatives. These assemblies form the basis to compare the evolutionary dynamics of wheat genomes on a megabase-scale. Results Here, we provide a comparative sequence analysis of the ~700-megabase chromosome 2D between two bread wheat genotypes – the old landrace Chinese Spring and the elite Swiss spring wheat line ‘CH Campala Lr22a’. There was a high degree of sequence conservation between the two chromosomes. Analysis of large structural variations revealed four large insertions/deletions (InDels) of >100 kb. Based on the molecular signatures at the breakpoints, unequal crossing over and double-strand break repair were identified as the evolutionary mechanisms that caused these InDels. Three of the large InDels affected copy number of NLRs, a gene family involved in plant immunity. Analysis of single nucleotide polymorphism (SNP) density revealed three haploblocks of ~8 Mb, ~9 Mb and ~48 Mb with a 35-fold increased SNP density compared to the rest of the chromosome. Conclusions This comparative analysis of two high-quality chromosome assemblies enabled a comprehensive assessment of large structural variations. The insight obtained from this analysis will form the basis of future wheat pan-genome studies.


Background
Bread wheat (Triticum aestivum) was the most widely grown cereal crop in 2016. It serves as a staple food for over 30% of the world's population and provides ∼20% of the globally consumed calories [1]. Wheat is a young allopolyploid species with a genome size of 15.4-15.8 Gb, of which more than 85% is made up of highly repetitive sequences [2]. The allopolyploid genome arose through two recent, natural polyploidization events that involved three diploid grass species. The first hybridization event occurred 0.58 to 0.82 million years ago [3] between the A-genome donor wild einkorn (T. urartu) and a yet unidentified Bgenome donor that was a close relative of Aegilops speltoides. This hybridization created wild tetraploid emmer wheat (Triticum turgidum ssp. dicoccoides; AABB genome) [4]. A second natural hybridization between domesticated emmer and wild goatgrass (Ae. tauschii; DD genome) resulted in the formation of hexaploid bread wheat (AABBDD genome) around 10,000 years ago [5]. The domestication of tetraploid emmer and the limited number of hybridization events with Ae. tauschii represent bottlenecks that resulted in a significant reduction of genetic diversity within the bread wheat gene pool. Natural gene flow between bread wheat and its wild and domesticated relatives as well as artificial hybridizations with diverse grass species partially compensated for this loss in diversity [3,6].
The size, repeat content and polyploidy of the bread wheat genome have represented major challenges for the generation of a high-quality reference assembly. The first 'early' whole genome assemblies of hexaploid wheat and its diploid wild relatives were based on short-read sequencing approaches. These assemblies provided an insight into the gene space of wheat, but they were highly-fragmented and incomplete [7][8][9][10]. The first notable high-quality sequence assembly of wheat was produced from the 1-gigabase chromosome 3B of the hexaploid wheat landrace Chinese Spring. For this, 8,452 ordered bacterial artificial chromosomes (BACs) were sequenced and assembled, which resulted in a highly contiguous assembly (N50 = 892 kb) [11,12]. More recent whole-genome shotgun assemblies had improved contiguousness compared to the 'early' assemblies (N50 = 25 -232 kb) [13][14][15], but they still did not allow to compare the structure of wheat chromosomes on a megabasescale.
Several recent technological and computational improvements however provided a basis to generate de novo assemblies of complex plant genomes with massively improved scaffold lengths and completeness. These advancements included (i) the integration of whole-genome shotgun libraries of various insert-sizes [16] or the use of long-read sequencing technologies such as single-molecule real-time sequencing (SMRT) [17] or nanopore sequencing [18], (ii) the improvement of scaffolding by using chromosome conformation capture technologies [19][20][21][22][23] or optical maps [24] and (iii) the improvement of assembly algorithms [4]. Chinese Spring is an old landrace that was selected for sequencing because it was used in a number of cytogenetic studies, which has resulted in the generation of many important genetic resources from this wheat line, including chromosome deletion lines [26] and aneuploid lines [27].
The understanding of the genetic variation will provide an insight into wheat genome evolution and its impact on agronomically important traits. The continuum of genetic variation ranges from single nucleotide polymorphism (SNPs) to megabase-sized rearrangements that can affect the structure of entire chromosomes [28]. Due to the absence of high-quality wheat genome assemblies, previous comparative analyses were limited in the size of structural rearrangements that could be assessed and typically, structural variants of a few base pairs up to several kb were analysed [29,30]. Consequently, a comprehensive assessment of the extent of large structural rearrangements and their underlying molecular mechanisms is still lacking.
Here, we report on a chromosome-wide comparative analysis of the ∼700 Mb chromosome 2D between the two hexaploid wheat lines Chinese Spring and 'CH Campala Lr22a'. 'CH 5 Campala Lr22a' is a backcross line that was generated to introgress Lr22a, a gene that provides resistance against the fungal leaf rust disease, into the genetic background of the elite Swiss spring wheat cultivar 'CH Campala' [31]. We previously generated a high-quality de novo assembly from isolated chromosome 2D of 'CH Campala Lr22a' by using short-read sequencing in combination with Chicago long-range scaffolding [32]. The resulting assembly had a scaffold N50 of 9.76 Mb. In particular, the focus of our study was on the identification and quantification of large structural variations (SVs). The comparative analysis of the 2D chromosome showed a high degree of collinearity along most of the chromosome, but also revealed SVs such as InDels and copy number variation (CNV). In addition, we found haploblocks with greatly increased SNP densities. We analysed these SVs and gene presence/absence polymorphisms in detail and manually validated them to distinguish true SVs from artefacts that were due to mis-assembly or annotation problems.

Results
Two-way comparison of Chinese Spring and 'CH Campala Lr22a' allows identification of large structural variations Previously, 10,344 sequence scaffolds were produced from isolated chromosome 2D of 'CH Campala Lr22a' by using Chicago long-range linkage [21,32]. In the resulting 'CH Campala Lr22a' pseudomolecule, 7,617 scaffolds were anchored, of which 7,314 were smaller than 5 kb and 90 scaffolds were larger than 1 Mb in size. The pseudomolecule had a scaffold N50 of 8.78 Mb (N90 of 1.89 Mb) and represented 98.92% of the total length of the initial assembly.
To identify large InDels, we compared the Chinese Spring and 'CH Campala Lr22a' pseudomolecules in windows of 10 Mb and performed dot plots. Here, we focused only on InDels larger than 100 kb because such SVs could not be identified with previous wholegenome assemblies. In total, we found 26 putative InDels which were manually validated by evaluating the upstream and downstream sequences for the presence of 'Ns' at the breakpoints. If 'Ns' were found exactly at the breakpoints on both sides of an InDel, we considered it a false positive that was most likely due to the incorrect placement of a scaffolds in either of the pseudomolecules. Based on this criterion, we discarded 22 of the 26 candidate InDels. Three of the remaining four InDels showed good sequence quality and had clear breakpoints at both ends with no 'Ns'. These true InDels were 285 kb, 494 kb and 765 kb in size. An additional 677 kb InDel had a clear break only at one end and 'Ns' on the other end. Interestingly, three of the four large InDels showed CNV for nucleotide binding siteleucine-rich repeat (NLR) genes.
Various molecular mechanisms have been described that lead to SVs. For example, unequal crossing over can occur in regions with extensive sequence similarity. On the other hand, non-homologous end-joining (NHEJ) is associated with DNA repair in regions with no or low sequence similarity. Other causes of SVs include double-strand break (DSB) repair via single-strand annealing or synthesis-dependent strand annealing mechanisms, transposable element (TEs)-mediated mechanisms and replication-error mechanisms [33][34][35][36]. These mechanisms have been well studied in humans, but in plants our understanding of the molecular causes of SVs is limited [33]. To decipher the mechanistic bases of the observed SVs, the sequence of the SV as well as their flanking regions were analyzed to identify signature sequence motifs that could point to the underlying molecular mechanism (e.g. DNA repair, recombination or replication associated mechanisms).
Unequal crossing over is the likely cause of a 285 kb deletion in Chinese Spring The terminal 10 Mb of the short chromosome arm revealed an InDel of 285 kb (Fig. 1a). We extracted and checked the sequences 5 kb upstream and downstream of the breakpoints for the presence of TEs or genes (or any kind of repeated sequence) that could have served as a template for unequal crossing over. Unequal crossing over occurs frequently at repeated sequences that are in the same orientation, leading to duplications or deletions of the region between the two repeats [37]. Indeed, the breakpoints of the InDel contained two NLR genes that shared 96-98% nucleotide identity in 'CH Campala Lr22a'. In contrast, Chinese Spring only carried a single NLR copy (Fig. 1). Thus, it is possible that an unequal crossing over between the two genes occurred in an ancestor of Chinese Spring, leading to the loss of the 285 kb segment between the two NLRs.
In order to test this hypothesis, we further analysed the NLRs that were present at the breakpoint of 'CH Campala Lr22a' and Chinese Spring. Interestingly, the 5' region of the Chinese Spring gene showed greater sequence similarity to NLR1 of 'CH Campala Lr22a', whereas the 3' region was more similar to NLR2 (Fig. 1b). This suggests that these NLRs (NLR1 and NLR2) were indeed the template for an unequal crossing over in an ancestor of Chinese Spring (Fig. 1c). The corresponding 285 kb segment in 'CH Campala Lr22a' only contained repetitive sequences and did not carry any genes.

Double-strand break repair likely mediated a large 494 kb deletion
The second SV was located on a 'CH Campala Lr22a' scaffold of 6.6 Mb in size (Fig. 2a).
We could precisely identify the breakpoints based on the sequence alignment of the two wheat lines. Unlike the case described above, the upstream and downstream sequences contained no obvious sequence template or a typical TE insertion or excision pattern [34] that could have led to a large deletion by unequal crossing over. However, the breakpoints of the InDel contained typical signatures of DSB repair. In 'CH Campala Lr22a' the nucleotide triplet 'CGA' was repeated at both ends of the breakpoint whereas Chinese Spring had only one copy of the 'CGA' triplet (Fig. 2b). The proposed model for this 494 kb deletion is that it was caused through a DSB that was repaired by the single-strand annealing pathway ( Fig.   8 2c). After the DSB that could have occurred anywhere on the 494 kb segment in Chinese Spring, 3' overhangs were produced by exonucleases. Various studies in yeast have shown that these overhangs can be several kb in size [38][39][40] and due to high conservation of DSB repair pathways [41], it is expected that plants would have a similar DSB repair mechanism.
In the case described here, we propose that exonucleases produced overhangs of 200-250 kb, which were then repaired by non-conservative homologous recombination repair (HRR). For this, the generated 3' overhangs annealed in a place of complementary micro-homology, which are typically a few bp in size ('CGA' triplet in this case) [42]. After annealing of the matching motifs, second strand synthesis took place and the overhangs were removed, leading to the observed deletion of the 494 kb sequence in Chinese Spring (Fig. 2c). This 494 kb segment in 'CH Campala Lr22a' contained eight genes coding for an NLR, a serine/threonine protein kinase, a zinc finger-containing protein, a transferase, two cytochrome P450s and two proteins of unknown function. BLAST analysis of these eight genes against all the Chinese Spring chromosmes revealed that the homoeologous segments on the A and B genomes were retained. In other words, the deletion of these eight genes might not have led to a deleterious effect because the homoeologous gene copies on the other two sub-genomes compensate for the D-genome deletion. It has been reported that polyploid species show a higher plasticity compared to diploid species and that they are able to buffer large insertions and deletions on one particular sub-genome [43].

Large diverse haploblocks indicate recurrent gene flow from distant relatives
Comparison of SNP density across the chromosome revealed three large regions (haploblocks a, b and c) with increased SNP density compared to the rest of the chromosome (Fig 3a).
Two of the regions were located on the short arm of the chromosome whereas the largest diverse haploblock of ∼48 Mb was located towards the telomeric end of the long chromosome arm. While the SNP density along most of the chromosome was in the range ∼27 SNPs/Mb (Fig. 3a) the three diverse haploblocks had SNP densities of 2,500 -4,500 SNPs/Mb. The actual number of polymorphisms might be even higher because SNP calling might not have been possible in many parts of the haploblocks because of the high sequence divergence.
The first haploblock (haploblock a) at the distal end of the short chromosome arm contains the Lr22a leaf rust resistance gene that was introduced into hexaploid wheat through an artificial hybridization between a tetraploid wheat line and an Ae. tauschii accession [44].
There are two genetically distant lineages of Ae. tauschii. The D-genome of hexaploid wheat was most likely contributed by an Ae. tauschii population belonging to lineage 2 [45], whereas the donor of Lr22a (Ae. tauschii accession RL 5271) belongs to the genetically diverse lineage 1 [46]. The size of the Lr22a introgression was subsequently reduced through several rounds of backcrossing with hexaploid wheat and the remaining Lr22a-containing segment was bred into elite wheat lines including 'CH Campala Lr22a' to increase resistance against the fungal leaf rust disease [31]. Based on the SNP density, we were able to estimate the size of the remaining, introgressed Ae. tauschii segment to ∼8 Mb. The original donor of the other two haploblocks (haploblocks b and c) could not be traced back and they might be the result of natural gene flow or artificial hybridization. Mapping of independently generated short-read sequences from 'CH Campala', the recurrent parent that was used to produce the near isogenic line 'CH Campala Lr22a', showed that the same haploblocks were also present in 'CH Campala' (Fig. 3a), indicating that these segments were not co-introduced along with the Lr22a segment from RL 5271. In particular, the presence of the large continuous haploblock c on the long chromosome arm was intriguing. Dot plots allowed us to identify the exact breakpoints of the haploblock (Fig. 3b). While there was high sequence homology in both flanking regions, sequence identity in the intergenic regions broke down inside the 1 0 haploblock (Fig. 3b). In contrast, dot plots with haploblocks a and b revealed a good level of collinearity between Chinese Spring and 'CH Campala Lr22a' in intergenic regions despite the increased SNP density (Additional file 1: Figure S1), indicating that haploblock c is the most diverse. Comparison to the recently generated high-quality genome assembly of Ae.
tauschii accession AL8/78 [47], an accession that is closely related to the wheat D-genome and that belongs to lineage 2, suggests that haploblock c represents an interstitial introgression into 'CH Campala Lr22a' (Additional file 2: Figure S2). in the two wheat genotypes showed a high tendency of clustering and they were mostly located in the telomeric regions (Fig. 4a), as it is typically found for this gene class [25].
For 'CH Campala Lr22a', we found that 62 NLR genes resided in seven gene clusters which comprise 38.5% of the total annotated NLRs. The largest cluster contained 19 NLR genes. In Chinese Spring, we found that 71 NLR genes resided in ten clusters which comprise 44.9% of the total annotated NLRs and the largest cluster contained 21 NLRs. A phylogenetic tree revealed that most NLR genes from Chinese Spring had one ortholog in 'CH Campala Lr22a' (Fig. 4b).
On the other hand, we also observed copy number variation for certain regions.
Two regions, CNV1 and CNV2, were of particular interest because there was an extensive variation in the NLR copy number between Chinese Spring and 'CH Campala Lr22a' (Fig.   4b). In the CNV1 region, 'CH Campala Lr22a' had sixteen NLR genes annotated in a 786 kb region. The corresponding region in Chinese Spring contained only two NLRs in a 21 kb interval (Fig. 5a). There was a high degree of gene collinearity flanking the NLR cluster (Fig.   5a). The two NLR copies in Chinese Spring (NLR46 and NLR47) showed 44% sequence identity at the protein level, indicating that they might have arisen from a very ancient gene duplication. The low sequence identity of NLR46 and NLR47 allowed to assign each of the 'CH Campala Lr22a' NLRs to one of the two Chinese Spring copies. This revealed a random pattern, which might be explained by complex duplication and rearrangement events (Fig.   5a). The CNV1 region locates to the diverse haploblock c, which might explain the extent of the CNV found in this region.
The CNV2 region affected a segment of ten paralogous NLR genes situated in a 716 kb region in Chinese Spring. In 'CH Campala Lr22a' there was a 677 kb deletion that affected all but two of the NLRs. For this CNV region we could identify a clear breakpoint at one end whereas the other end had a sequence gap ( Fig. 5b and 5c).

Molecular mechanisms of structural variations
Different genotypes within a plant species can show tremendous genetic diversity. Beside SNPs, SVs have been identified as a major contributor to phenotypic variation in plants, which is why an understanding of large SVs is of importance for breeding [50].  [52]. While short-read sequencing allowed a comprehensive assessment of genome-wide SNPs distributions in cereals [53,54] the identification of SVs, particularly large InDels, has been challenging due to technical limitations. In wheat, the lack of highquality chromosome assemblies from multiple genotypes has prevented such comparisons so far. Even for other cereal crop species like rice, maize, barley and sorghum there are no or only very few high-quality de novo assemblies available beside the reference genotypes [22,[55][56][57]. Here, we compared two high-quality sequence assemblies of bread wheat chromosome 2D that were highly contiguous over megabases, which allowed us to focus on InDels of several hundred kb in size. In total, we found that around 0.3% of the chromosome was affected by the four large InDels. Based on these numbers, we estimate that a comparison of any two wheat genotypes would reveal around 30 large InDels affecting ~15 Mb accross the entire D sub-genome. Not surprisingly, the number of small InDels is much higher than larger structural rearrangements. For example, a comparison of the B73 maize reference assembly to optical maps generated from the two maize inbred lines Ki11 and W22 1 3

revealed around 3,400 insertions and deletions between two maize lines with an average
InDel size of 20 kb [17]. A re-sequencing study in rice revealed a total of 13,045 insertions and 15,151 deletions in the size range of 10-1,000 bp [58]. Large InDels affected multiple genes and can therefore have a deleterious effect, particularly in diploid species.
Unequal crossing over and DSB repair were identified as the molecular mechanisms responsible for large InDels in our study. Analyses in Brachypodium revealed that DSB repair is the most common mechanism for structural rearrangements [34,59]. where they flanked small InDels ranging from 5 bp to 175 bp [60]. Apart from DSB repair, another frequently observed mechanism for SV is unequal crossing over. We found a 285 kb deletion in Chinese Spring where the deletion was a result of an improper alignment of two highly similar NLR genes that served as a template for unequal crossing over. Unequal crossing over has been shown to be one of the main driving forces for genome evolution and has been reported to occur in various disease resistance gene families where they result in novel specificities and haplotypes [37]. For example, unequal crossing over between homologs in the maize rust resistance locus Rp1 led to the formation of recombinant genes with diverse resistance specificities [61,62]. In soybean, unequal crossing over at the RPS locus was associated with loss of resistance to Phytophthora due to the deletion of a NLR-like (NBSRps4/6) sequence [63].

Identification of diverse haploblocks -implications for wheat D-genome evolution
In addition to SVs, the chromosome-scale assemblies also allowed us to assess SNP density across the entire chromosome and to identify large contiguous blocks with strong variation from the average SNP density. This revealed the presence of three large haploblocks that showed a much higher SNP density compared to the rest of the chromosome. One of these haploblocks (haploblock a) could be traced back to an artificial introgression that carries the adult plant leaf rust resistance gene Lr22a [32,64]. Lr22a was introgressed into hexaploid wheat by artificially hybridizing the tetraploid wheat line tetra-Canthatch with the diploid Ae.
tauschii accession RL 5271 [44]. The crossing of tetraploid wheat with diverse Ae. tauschii accessions results in so called synthetic wheat. This is a widely explored strategy in breeding to compensate for the loss of diversity in hexaploid wheat that went along with domestication and modern breeding [65][66][67]. After this initial cross, the resulting synthetic hexaploid wheat the size of haploblocks is expected to be negatively correlated with recombination rates [69].
Since the haploblock c located to the highly-recombining telomeric end of the chromosome, we would expect that its size decreases over time. One explanation for conservation of this haploblock could be that its presence suppresses recombination in this area. In contrast to haploblocks a and b, we observed a breakdown of sequence homology in intergenic regions in haploblock c. On the other hand, the gene order was largely collinear in haploblock c, which should be sufficient for recombination in this chromosome segment. A second explanation is that this haploblock c might be widely present in the wheat gene pool or in particular breeding programs. For example, PCR analysis revealed that the haploblock c was present in multiple CIMMYT wheat lines. This would allow recombination in the haploblock without decreasing its size. In summary, a considerable fraction of the chromosome (10%) was made up of haploblocks with a much greater diversity than the rest of the chromosome.
This highlights the importance of natural gene flow and artificial hybridization as sources for diversity in cereal breeding.

Conclusions
This study provides the first comparison of two pseudomolecules based on high-quality de novo chromosome assemblies. The megabase-sized scaffolds allowed us to focus particularly on InDels of several hundred kb in size. Our analysis revealed that around 0.3% of the chromosome was affected by large InDels between the two wheat lines. Our study also revealed that careful manual validation is required in order not to overestimate the frequency of InDels. In particular, 84% of the InDels that were initially identified were removed after manual curation because they were most likely due to assembly and annotation artefacts. It is conceivable that previous comparative analyses in wheat that were based on short-read resequencing alone could not account for these problems. We therefore highlight the importance of manual data validation in future wheat pan-genome projects.

'CH Campala Lr22a' pseudomolecule assembly
The initial sequence assembly provided by Dovetail Genomics consisted of 10,344 sequence scaffolds (hereafter referred to as Dovetail scaffolds) with and average size of 54.8 kb and an N50 of 9.758 Mb [32]. To anchor these scaffolds, segments of the scaffolds were used in were located in Chinese Spring 2D. Based on this information, Dovetail scaffolds were ordered.
After sequence scaffolds were assembled into a first version of a pseudomolecule, we searched for large-scale breaks in gene collinearity when compared to Chinese Spring chromosome 2D. Here, we focused on blocks of BLASTN hits that mapped to completely different regions of the genome. If the end of a non-collinear block coincided with the end of a Dovetail scaffold, this was interpreted as an assembly artefact. The approximate location of the mis-assembly was identified and the respective Dovetail scaffold was then split into segments. We identified ten putatively chimeric Dovetail scaffolds with assembly errors.
These were split into 24 segments (some Dovetail scaffolds contained multiple misassemblies) which were then anchored individually to Chinese Spring chromosome 2D.  [76]. For each of the InDels observed, we analysed the sequence alignments to identify the region where the sequence similarity broke down and this region was called breakpoint. We spliced out 5 kb sequence upstream and downstream of these breakpoints and performed BLASTN search [77] against the repeat database to identify transposable elements and also against the Brachypodium distachyon coding sequence database [78] to identify genes in the flanking regions to understand the molecular mechanism underlying the observed SVs.
To identify NLR CNV, we compared the NLR clusters in Chinese Spring and 'CH Campala Lr22a' and identified the breakpoints as described above. The sequences upstream and downstream of breakpoints were used to identify the collinear genes using BLAST search against the annotated 'CH Campala Lr22a' and Chinese Spring genes. Putative start and stop codons of the annotated NLRs were identified based on the orthologs of these NLRs in Brachypodium distachyon. The coding sequence of these Brachypodium distachyon NLRs was taken from the Brachypodium distachyon coding sequence database [78] and was used for the dot plot alignment to identify the coding sequence of the Chinese Spring and 'CH Campala Lr22a' NLRs. Pseudogenes were predicted on the basis of frameshift mutations, premature stop codon or insertion of a transposable element resulting in a pseudogene.

9
Haploblock analysis and validation For the identification of the haploblock region, we mapped previously generated Illumina reads of 'CH Campala Lr22a' and 'CH Campala' [32]   An unequal crossing over event involving two NLR genes (shown in blue and orange) led to the formation of the recombinant NLR in Chinese Spring which shares sequence homology with NLR1 (blue) and NLR2 (yellow) and a deletion of the intervening 285 kb sequence.