Multiple reference genome sequences of hot pepper reveal the massive evolution of plant disease resistance genes by retroduplication

Transposable elements (TEs) provide major evolutionary forces leading to new genome structure and species diversification. However, the role of TEs in the expansion of disease resistance gene families has been unexplored in plants. Here, we report high-quality de novo genomes for two peppers (Capsicum baccatum and C. chinense) and an improved reference genome (C. annuum). Dynamic genome rearrangements involving translocations among chromosome 3, 5 and 9 were detected in comparison between C. baccatum and the two other peppers. The amplification of athila LTR-retrotransposons, members of the gypsy superfamily, led to genome expansion in C. baccatum. In-depth genome-wide comparison of genes and repeats unveiled that the copy numbers of NLRs were greatly increased by LTR-retrotransposon-mediated retroduplication. Moreover, retroduplicated NLRs exhibited great abundance across the angiosperms, with most cases lineage-specific and thus recent events. Our study revealed that retroduplication has played key roles in the emergence of new disease-resistance genes in plants.

Tables 11-13). Hence, a large proportion (~18%) of the NLRs were amplified by LTR-Rs, with the 6 structures indicating that their retroduplicated origin is still intact. Most of these NLRs (~71%) were 7 in the CNL-G2 category, indicating that the copy number of specific NLR subfamilies was 8 particularly expanded (Fig. 3a). Furthermore, most of the retroduplicated NLRs (~71% of the total 9 and ~66% of the CNL-G2 type) were inside non-autonomous LTR-Rs that contained no gag or pol 10 protein coding potential (Supplementary Table 14). This suggests that all steps for the retroduplication, 11 presumably including the initial sequence capture process, had to be provided in trans. When we 12 compared the retroduplicated and other NLRs in the CNL-G2 category, the number and length of 13 exons were significantly fewer and longer in the retroduplicated NLRs, but not all of these were 14 single-exon genes (Fig. 3b-c). In total, ~44% of the retroduplicated NLRs in each species had multiple 15 exons but all of those had a reduced number of introns compared with their predicted parental 16 sequences (Supplementary Tables 13 and 15). 17 Earlier analyses of the tomato, potato and rice genomes indicated that retroduplication is a general 18 feature of genome evolution in the plant kingdom (Supplementary Table 11). We found that 22, 103 19 and 30 (8, 23, and 6%) of NLRs were inside LTR-Rs in tomato, potato and rice, respectively. Of these, 20 we identified parental sequences with multiple exons for 18, 88 and 22 of the NLRs inside LTR-Rs in 21 tomato, potato and rice, respectively, thus confirming their origin by retroduplication (Supplementary Table 16). Similar to the peppers, NLRs in particular subgroups were primarily retroduplicated in 23 potato but the duplicated subgroups were distinct in each species (Supplementary Table 11). These 24 results indicate that LTR-Rs played a key role in the expansion of NLRs by retroduplication 25 throughout the plant kingdom, and that the detected events are both recent and lineage-specific.
(Supplementary Table 17). In total, a range from 1,398 genes in rice to 3,898 genes in potato genomes 1 were found to be inside LTR-Rs, suggesting that 4 to 10% of all the genes in these plant species 2 emerged by LTR-R-driven retroduplication. On average, ~45% of them had functional domains 3 including highly amplified families such as MADS-box TFs, cytochrome P450s, and protein kinases, 4 and ~42% of those genes were expressed in one or more investigated tissues by RNA-Seq analysis 5 (Supplementary Table 17). 6 7 Evolutionary mechanisms for the emergence of disease resistance genes in Solanaceae 8 The L genes encoded by the NLRs are known to render resistance in peppers against 9 Tobamoviruses and they belong to the CNL-G4 category, along with I2 in tomato that provides 10 resistance to race 2 of Fusarium oxysporum f. sp. lycopersici and R3a in potato that provides 11 resistance to the late blight pathogen, Phytophthora infestans 32-34 . Each of those is a single exon gene 12 encoding a peptide of ~1,300 amino acids. Synteny analysis and sequence comparison among pepper, 13 potato and tomato genomes suggested L, I2, and R3a are orthologous genes and the genomic regions 14 containing L, I2 and R3a were tightly linked (Extended Data Fig. 7a and Supplementary Table 18). 15 These results suggest the possibility that the genes originated by an early retroduplication, and then 16 underwent divergent evolution in each lineage. 17 We examined the evolutionary history of L genes with their parents in the pepper genomes (Fig 4,  18 Extended Data Fig. 7b and alignment coverage to L genes. All candidate parental sequences contained multiple exons. When 21 candidate parental sequence P1 was compared with L in Annuum, the results suggested that L was 22 derived from retroduplication in the ancestral lineage of Capsicum spp. ~8.9 MYA (Fig. 4). Because L 23 has 6.7 kb coding exons, with only an intron in the 3' UTR, and the presence of both flanking direct 24 repeat sequences and a poly(A) "tail" encoded in the DNA, our analysis suggests that L emerged 25 through capture and reverse transcription by a long interspersed nuclear element (LINE)-driven genomes and L4 in C. chacoense revealed that the L genes were diversified by accumulation of 1 lineage-specific sequence mutations after speciation within Capsicum (Fig. 4 and Supplementary  2   Table 21). Consequently, our results suggest that the ancestor of the L genes was derived from 3 retroduplication and that subsequent divergent evolution has led to race-specific resistance against 4 diverse strains of Tobamovirus in each species of Capsicum after speciation (Fig. 4). 5 To analyse the evolutionary processes acting on R3a of potato, we first performed a genome-wide 6 search for the R3a as well as for candidate parental sequences. Because R3a is absent in the current 7 potato reference genome 35 , we could not carry out accurate comparisons of R3a and their homologues. 8 However, R3a and its clustered genes originated from wild species, Solanum demissum 36 were 9 available in a public database. So, we compared these sequences with their closest homologs in the 10 reference potato genome 33 . Our analyses revealed that intronless sequences of the ancestral potato R3a  pepper resistance against C. capsici was located on chromosome 9 39 , however, we found that QTL is located on chromosome 3 due to translocation in Baccatum and Annuum (Fig. 1c). We obtained 35 1 Baccatum-specific NLRs (27 in CNL-G2, 5 in CNL-G10 and 3 in CNL-G10) from the 64 NLRs by 2 sequence comparison among the three pepper genomes (Fig. 5). Considering the gene duplication 3 history, 15 of the 35 genes appear to have emerged after generation of the Baccatum lineage and all of 4 them belong to the CNL-G2 category. Transcriptome evidence indicated that 10 of those 15 genes are 5 expressed in one or more tissues ( Fig. 5 and Supplementary Table 23). Furthermore, half of the 15 6 genes appear to have emerged by retroduplication (Fig. 5). Consequently, our results suggest that 7 retroduplication along with tandem and segmental duplications, has played a major role in the 8 emergence of anthracnose resistance genes in the Baccatum lineage. 9 10

Conclusions 11
In this study, we generated new and improved genome resources for three Capsicum species. Our 12 data provide an accurate and updated gene model of the pre-existing reference pepper genome based 13 on annotation with accumulated knowledge, highlighting the importance of genome improvement 14 after the completion of sequencing project. High-quality pseudo-chromosomes constructed from three 15 pepper genomes enabled precise comparisons of genome structures and evolutionary analyses, 16 providing new insights into interspecies diversification via genome rearrangements and lineage-17 specific evolution of LTR-Rs and genes in the genus Capsicum. Furthermore, we found evidences of 18 the massive evolution of NLRs by LTR-mediated retroduplication in dicot Solanaceae and monocot 19 rice, suggesting that such phenomena are ubiquitous in angiosperms. Our results suggested that at 20 least 6 to 23% of plant NLRs were emerged by LTR-R-driven duplication (Supplementary Table 11). 21 A notable feature of this retroduplication is that distinct subfamilies of NLRs were highly 22 retroduplicated in different plant lineages. We unveiled the emergence of functional disease resistance 23 genes in the Solanaceae family by retroduplication and the subsequent neofunctionalisation of those 24 genes by dynamic evolutionary processes including lineage-specific sequence mutation and tandem 25 duplication. Our data suggest that a large proportion of all genes (~4 to 10%) in plant species might 26 have emerged by LTR-R-driven retroduplication. We observed the lineage-specific amplification of specific gene families by LTR-Rs in various plant species, including such genes as those encoding 1 cytochrome P450s in potato and MADS-box TFs in Baccatum (Supplementary Table 17). Taken 2 together, our results provide new insights into the evolution of functional plant disease-resistance 3 genes that belong to the NLR family as well as other high copy number gene families that are present 4 in the plants.  Table 1). To remove unnecessary sequences for genome assembly, 4 preprocessing analysis was performed as previously described 25 (Supplementary Table 2). The de novo genome 5 assembly of each species was performed with SOAPdenovo2 40 using the filtered raw sequences with parameters 6 K=77 and K=81 for Chinense and Baccatum, respectively (Supplementary Table 3 To estimate the gene duplication times of the annotated genes in the pepper genomes, we constructed a 1 computational pipeline by modifying a previously described method 53 . We first performed gene clustering 2 analysis using OrthoMCL 54 to classify the gene family. We assumed that the genes in the same clusters were in 3 the same family and performed all-by-all alignments of the coding sequences within the clusters in each species 4 using PRANK 55 . For each alignment result, the Ks values were calculated using KaKs Calculator, 56 and single-5 linkage clustering for the Ks values was performed using the hclust function in the R package. The molecular 6 clock rate (r) was calculated to be 6.96 × 10 −9 substitutions per synonymous site per year 57 . The duplication time 7 was estimated using the formula, Ks value / 2r.

36
To predict whole genes inside LTR-Rs in the six plant genomes, we performed genome-wide identification of 37 possible structure of LTR-Rs using LTRHarvest, taking into account rapid sequence change between the LTRs (-38 -similar 75%, minltrlen 100). Like annotation of the NLRs inside LTR-Rs, we extracted genes within directly repeated LTR regions as putatively retroduplicated genes. For the genes inside LTR-Rs, the number of expressed 1 genes in one or more tissues was counted using RNA-Seq data, as described in Supplementary Table 12 and in   2   previous analyses 25 Tables 19-20). All of the closest homologs in each species were found to contain multiple exons and a gene that 9 we named P1 was identified as the most likely parental sequence. By comparison of the sequence divergence 10 between P1 and its closest homologs in the other genomes, we confirmed that P1 was Annuum-specific 11 (Extended Data Fig. 7c). The 5' and 3' UTRs of L1a annotated based on RNA-Seq evidence were also compared 12 to the UTRs of P1 (Fig. 4).

13
For R3a in the potato, we aligned the coding sequences of R3a and genes within its cluster downloaded from