Transposable elements (TEs) are major components of large plant genomes and main drivers of genome evolution. The most recent assembly of hexaploid bread wheat recovered the highly repetitive TE space in an almost complete chromosomal context and enabled a detailed view into the dynamics of TEs in the A, B, and D subgenomes.
The overall TE content is very similar between the A, B, and D subgenomes, although we find no evidence for bursts of TE amplification after the polyploidization events. Despite the near-complete turnover of TEs since the subgenome lineages diverged from a common ancestor, 76% of TE families are still present in similar proportions in each subgenome. Moreover, spacing between syntenic genes is also conserved, even though syntenic TEs have been replaced by new insertions over time, suggesting that distances between genes, but not sequences, are under evolutionary constraints. The TE composition of the immediate gene vicinity differs from the core intergenic regions. We find the same TE families to be enriched or depleted near genes in all three subgenomes. Evaluations at the subfamily level of timed long terminal repeat-retrotransposon insertions highlight the independent evolution of the diploid A, B, and D lineages before polyploidization and cases of concerted proliferation in the AB tetraploid.
Even though the intergenic space is changed by the TE turnover, an unexpected preservation is observed between the A, B, and D subgenomes for features like TE family proportions, gene spacing, and TE enrichment near genes.
Transposable elements (TEs) are ubiquitous components of genomes and one of the major forces driving genome evolution . They are classified into two classes: retrotransposons (class 1), transposing via reverse transcription of their messenger RNA (mRNA), and DNA transposons (class 2), representing all other types of elements . TEs are small genetic units with the ability to make copies of themselves or move around in the genome. They do not encode a function that would allow them to be maintained by selection across generations; rather, their strategy relies on their autonomous or non-autonomous amplification. TEs are subject to rapid turnover, are the main contributors of intraspecific genomic diversity, and are the main factor explaining genome size variations. Thus, TEs represent the dynamic reservoir of the genomes. They are epigenetically silenced , preventing them from long-term massive amplification that could be detrimental. The dynamics of TEs in genomes remains unclear, and it was supposed that they may escape silencing and experience bursts of amplification followed by rapid silencing. Their impact on gene expression has also been documented in many species (for a review, see ). In addition, they play a role at the structural level, as essential components of centromeric chromatin in plants [3, 5]. Plant genomes are generally dominated by a small number of highly repeated families, especially class I Gypsy and Copia long terminal repeat retrotransposons (LTR-RTs) [6,7,8,9,10]. Most of our knowledge about TE dynamics and their impact on gene expression in complex plant genomes comes from maize [10,11,12,13,14]. At the whole genome level, Makarevitch et al. have shown that four to nine maize TE families, including all major class I superfamilies (Gypsy, Copia, long interspersed nuclear elements (LINEs)), and DNA transposons, are enriched (more than twofold) in promoters of genes being up-regulated in response to different abiotic stresses . This study also suggested that TEs are a major source of allelic variations explaining differential response to stress between accessions.
The genome of bread wheat (Triticum aestivum L.), one of the most important crop species, has also undergone massive TE amplification with more than 85% of it being derived from such repeat elements. It is an allohexaploid comprising three subgenomes (termed A, B, and D) that have diverged from a common ancestor around 2–3 million years ago (Mya) (according to molecular dating of chloroplast DNA ) and hybridized within the last half million years. This led to the formation of a complex, redundant, and allohexaploid genome. These characteristics make the wheat genome by far the largest and most complex genome that has been sequenced and assembled into near-complete chromosomes so far. They, however, also make wheat a unique system in which to study the impact of TE activity on genome structure, function, and organization.
Previously only one reference sequence quality wheat chromosome was available, which we annotated using our automated TE annotation pipeline (CLARITE) [17, 18]. However, it was unknown whether the TE content of chromosome 3B was typical of all wheat chromosomes and how TE content varied between the A, B, and D subgenomes. Therefore, in this study, we address the contribution of TEs to wheat genome evolution on a chromosome-wide scale. We report on the comparison of the three A-B-D subgenomes in terms of TE content and proliferation dynamics. We show that, although rounds of TE insertions/deletions have completely modified the TE space since A-B-D diverged, the proportion of each TE family remained stable between subgenomes. In addition, the specific TE landscape in the direct vicinity of genes is very similar between the three subgenomes. Our results strongly suggest that TEs play a role at the structural level likely under selection pressure. We also identified TE families that are over-represented in promoters compared to the rest of the genome but did not reveal a strong association between particular TE families and nearby gene expression pattern or a strong stress-response association.
Results and discussion
TE content and distribution along the 21 bread wheat chromosomes
Building from a decade-long effort from the wheat genomics community, we used the accumulated knowledge about TEs to precisely delineate the TE repertoire of the 21 chromosomes based on a similarity search with a high-quality TE databank: ClariTeRep  which includes TREP . This represents 3050 manually annotated and curated TEs carried by the three subgenomes and mainly identified on bacterial artificial chromosome (BAC) sequences obtained during map-based cloning or survey sequencing projects, especially on chromosome 3B . CLARITE was used to model TEs in the sequence and their nested insertions when possible . This led to the identification of 3,968,974 TE copies, belonging to 505 families, and representing 85% of RefSeq_v1.0. Overall, the TE proportion is very similar in the A, B, and D subgenomes, as they represented 86%, 85%, and 83% of the sequence, respectively. However, the sizes of the subgenomes differ: with 5.18 Gb, the B subgenome has the largest assembly size, followed by the A subgenome (4.93 Gb) and the smaller D subgenome (3.95 Gb). The repetitive fraction is mostly dominated by TEs of the class I Gypsy and Copia and class II CACTA superfamilies; other superfamilies contribute very little to overall genome size (Table 1, Fig. 1a).
At the superfamily level, the A, B, and D subgenomes have similar TE compositions (Fig. 1a). The smaller size of the D subgenome (~ 1 Gb smaller than A and B) is mainly due to a smaller amount of Gypsy (~ 800 Mb less; Fig. 1a). The A and B subgenomes differ in size by only 245 Mb (~ 5%), and nearly half of this (106 Mb) is not due to known TEs but rather to low copy sequences. Since the amount of coding DNA is very conserved (43, 46, and 44 Mb, respectively), this difference is mainly due to parts of the genome that remained un-annotated so far. This un-annotated portion of the genome may contain degenerated and unknown weakly repeated elements.
Similar to other complex genomes, only six highly abundant TE families represent more than half of the TE content: RLC_famc1 (Angela), DTC_famc2 (Jorge), RLG_famc2 (Sabrina), RLG_famc1 (Fatima), RLG_famc7 (Sumana/Sumaya), and RLG_famc5 (WHAM), while 486 families out of 505 (96%) each account for less than 1% of the TE fraction. In terms of copy number, 50% (253) of the families are repeated in fewer than 1000 copies at the whole genome level, while more than 100,000 copies were detected for each of the seven most repeated families (up to 420,639 Jorge copies).
Local variations of the TE density were observed following a pattern common to all chromosomes: the TE proportion is lower (on average 73%) in the distal regions than in the proximal and interstitial regions (on average 89%). However, much stronger local variations were observed when distributions of individual TE families were studied. Figure 1b shows TE distributions using chromosome 1A as a representative example. Distributions for selected TE families on all chromosomes are shown in Additional file 1: Figures S1–S11. The most abundant TE family, RLC_famc1 (Angela) was enriched towards telomeres and depleted in proximal regions. In contrast, highly abundant Gypsy retrotransposons RLG_famc2 (Sabrina, Fig. 1b) and RLG_famc5 (WHAM, not shown) were enriched in central parts of chromosome arms and less abundant in distal regions. CACTA TEs also showed a variety of distribution patterns. They can be grouped into distinct clades depending on their distribution pattern, as suggested earlier based on chromosome 3B TE analyses . Families of the Caspar clade  are highly enriched in telomeric regions, as is shown for the example of the DTC_famc1 (Caspar) whereas DTC_famc2 (Jorge) showed the opposite pattern (Fig. 1b).
Centromeres have a specific TE content. Previous studies on barley and wheat reported that the Gypsy family RLG_famc8.3 (Cereba) is enriched in centromeres [22, 23]. It was speculated that Cereba integrase can target centromere-specific heterochromatin due to the presence of a chromodomain that binds specifically to centromeric histones . We found that wheat Cereba elements are concentrated in centromeric regions but absent from the rest of the genome (Fig. 1b, Additional file 1: Figure S8), as are their closely related subfamilies RLG_famc8.1 and RLG_famc8.2 (Quinta). We identified new TE families that are also highly enriched in centromeres. The family RLG_famc39 (Abia) is a relative of Cereba, although there is very little sequence DNA conservation between the two. However, at the protein level, Cereba is its closest homolog. Abia and Cereba have an extremely similar distribution (Fig. 1b, Additional file 1: Figures S8 and S9). Interestingly, on chromosome 6A Cereba is more abundant, while on 3B, Abia is more abundant, suggesting that the two TE families are competing for the centromeric niche. Abia seems to be a wheat-specific TE family, as it was not present in the recently published barley genome . A recent study on the barley genome reported on a novel centromeric Gypsy family called Abiba . We identified a homolog in wheat: RLG_famc40 (Abiba), with two distinct subfamilies RLG_famc40.1 and RLG_famc40.2, corresponding to the putatively autonomous and non-autonomous variants. Abiba is enriched in central parts of chromosomes but with a broader spreading compared to Abia and Cereba (Additional file 1: Figures S10 and S11). At a higher resolution, we identified large tandem arrays of Cereba and Abia elements that correspond to the high k-mer frequencies observed at the centromeres (Fig. 2d), which might be the signature of functional centromeres (Additional file 1: Figure S12).
Similarity and variability of the TE content between the A, B, and D subgenomes
A genome-wide comparative analysis of the 107,891 high-confidence genes predicted along the A, B, and D subgenomes (35,345, 35,643, and 34,212, respectively) was described in detail in . It revealed that 74% of the genes are homeologs, with the vast majority being syntenic. Thus, gene-based comparisons of A-B-D highlighted a strong conservation and collinearity of the genes between the three genomes. However, outside the genes and their immediate surrounding regions, we found almost no sequence conservation in the TE portions of the intergenic regions (Fig. 2a). This is due to the “TE turnover” , which means that intergenic sequences (i.e., sequences that are not under selection pressure) evolve through rounds of TE insertions and deletions in a continuing process: DNA is produced by TE insertions into intergenic regions and removed by unequal crossovers or deletions that occur during double-strand repair . Previous studies showed that this process occurs at a pace implying that intergenic sequences are completely turned over within a few million years [27, 28]. Consequently, we found practically no conserved TEs (i.e., TEs that were inserted in the common ancestor of the A, B, and D genome donors). Thus, although the repetitive fraction in A, B, and D genomes is mostly composed of the same TE families (see below), their individual insertion sites and nesting patterns are completely different.
Analysis of the k-mer content of RefSeq_v1.0 showed that 20-mers occurring 100× or more cover around 40% of the wheat genome sequence (Fig. 2c). For 60-mers, this value decreases to only 10%. This pattern was strongly similar between subgenomes, although a slight difference was observed: repeated k-mers covered a larger proportion of the subgenome D > A > B. This lower proportion of repeats in the B subgenome is also obvious using a heat-map of 20-mer frequencies (Fig. 2d), showing that the B genome contains a smaller proportion of high copy number perfect repeats.
We then compared the A, B, and D subgenomes at the TE family level. We did not find any TE families (accounting > 10 kb) that are specific for a single subgenome or completely absent in one subgenome (only two cases of subgenome-specific tandem repeats were found: XXX_famc46/c47). More surprisingly, the abundance of most TE families is similar in the A, B, and D subgenomes. Indeed, among the 165 families which represent at least 1 Mb of DNA each, 125 (76%) are present in similar proportions in the three subgenomes; i.e., we found less than a twofold change of the proportion between subgenomes. Figure 2b represents the proportions of the 20 most abundant families in the three subgenomes which account for 84% of the whole TE fraction. Their proportion is close to the relative sizes of the three subgenomes: 35%, 37%, 28% for A, B, D, respectively. This highlighted the fact that not only are the three subgenomes shaped by the same TE families, but also that these families are present in proportions that are conserved. Consistent with this, we identified only 11 TE families (7%) that show a strong difference (i.e., more than a threefold change in abundance) between two subgenomes, representing only 2% of the overall TE fraction.
Thus, despite the near-complete TE turnover that has occurred independently in the A-B-D diploid lineages (Fig. 2a), and although TEs have transposed and proliferated very little since polyploidization (0.5 Mya, see below), the TE families that currently shape the three subgenomes are the same, and more strikingly, their abundance remained very similar. We conclude that almost all families ancestrally present in the A-B-D common ancestor have been active at some point and their amplification has compensated their loss by deletion, thus suggesting a dynamic in which families are maintained at equilibrium in the genome for millions of years. This evolutionary scenario differs from the model where TEs evolve by massive bursts of a few families leading to rapid diversification . For example, Piegu et al. showed that an amplification burst of a single retrotransposon family led to a near doubling of the genome size in Oryza australiensis . In wheat, by contrast, many TE families contribute to the genome diversification, as suggested for plants with very large genomes (> 30 Gb) .
Strong differences in abundance between the A, B, and D genomes were observed at the subfamily level (Fig. 3). For example, the highly abundant RLC_famc1 (Fatima) family has diverged into at least five subfamilies (1.1 to 1.5). Only RLC_famc1.1 contains potentially functional reverse transcriptase (RT) and integrase (INT) genes, while RLC_famc1.4 and RLC_famc1.5 contain gag and protease open reading frames (ORFs). RLC_famc1.2 and RLC_famc1.3 appear to be non-autonomous, as they do not contain any intact ORFs. We suggest that RLC_famc1.1 provides functional RT and INT proteins, while protease and GAG are provided by other subfamilies. Their contrasted abundance revealed that RLC_famc1.4 and RLC_famc1.5 proliferated specifically in the B and A lineages, respectively (Fig. 3a).
In total, we identified 18 different subfamilies (belonging to 11 different families) which show subgenome-specific over- or under-representation (Table 2). Here, we only considered TE families that contribute more than 0.1% to the total genome and are at least threefold over- or under-represented in one of the subgenomes. This illustrated that these 11 highly abundant families did not show a bias between A-B-D at the family level, but are composed of several subfamilies that were differentially amplified in the three diploid lineages. The CACTA family DTC_famc10.3 (Pavel) is much more abundant in the D subgenome than in the A and B subgenomes (Additional file 1: Figure S1). Interestingly, the Pavel subfamily also seems to have evolved a preference for inserting close to centromeres in the D subgenome, while this tendency is not obvious in the A and B subgenomes (Fig. 3b). Generally, subfamilies were enriched in a single genome (Table 2). In only four cases, a subfamily was depleted in one subgenome while abundant at similar levels in the other two. Three of these cases were found in the D subgenome. This is consistent with the smaller D subgenome size, and differences in highly abundant elements contribute to this difference.
Dynamics of LTR retrotransposons from the diploid ancestors to the hexaploid
The largest portion of plant genomes with size over 1 Gb consists of LTR-RTs. Intact full-length elements represent recently inserted copies, whereas old elements have experienced truncations, nested insertions, and mutations that finally lead to degenerated sequences until they become unrecognizable. Full-length LTR-RTs (flLTR-RTs) are bordered by two LTRs that are identical at the time of insertion and subsequently diverge by random mutations, a characteristic that is used to determine the age of transposition events . In previous genome assemblies, terminal repeats tended to collapse, which resulted in very low numbers of correctly reconstructed flLTR-RTs (triangles in Additional file 1: Figure S13). We found 112,744 flLTR-RTs in RefSeq_v1.0 (Additional file 1: Table S1, Figure S13), which was in line with the expectations and confirmed the linear relationship between flLTR-RTs and genome size within the Poaceae. This is two times higher than the number of flLTR-RTs assembled in TGAC_v1 , while almost no flLTR-RTs were assembled in the 2014 gene-centric draft assembly .
We exploited this unique dataset to gain insights into the evolutionary history of hexaploid wheat from a transposon perspective. flLTR-RTs are evenly distributed among the subgenomes, with on average 8 elements per Mb (Additional file 1: Table S1). Among them, there were two times more Copia (RLC) than Gypsy (RLG) elements, although Gypsy elements account for 2.8× more DNA. This means that the proportion of young intact elements is higher for the Copia superfamily than for the Gypsy superfamily. Indeed, the median insertion ages for Copia, Gypsy, and RLX (unclassified LTR-RTs) are 0.95, 1.30, and 1.66 million years (Myr). RLXs lack a protein domain, preventing a straightforward classification into Gypsy or Copia. The missing domains can most likely be accounted for by their older age and, thus, their higher degree of degeneration. RLX elements are probably unable to transpose on their own, but the occurrence of such very recently transposed elements suggests that they are non-autonomous, as described for the Fatima subfamilies (Fig. 3a). Between the A and B subgenomes, all flLTR-RT metrics are very similar, whereas the D subgenome stands out with younger insertions. In any case, age distributions of flLTR-RTs show that most of the identified full-length elements inserted after the divergence of the three subgenomes, thereby reflecting the genomic turnover that has removed practically all TEs that were present in the A-B-D ancestor (see above).
We analyzed the chromosomal distributions of the flLTR-RTs (Additional file 1: Figure S14). The whole set of elements is relatively evenly scattered along the chromosomes with high density spots in the distal gene-rich compartments. The most recent transpositions (i.e., copies with two identical LTRs) involved 457 elements: 257 Copia, 144 Gypsy, and 56 RLXs. They are homogeneously distributed along the chromosomes (Additional file 1: Figure S14B), confirming previous hypotheses stating that TEs insert at the same rate all along the chromosome but are deleted faster in the terminal regions, leading to gene-rich and TE-depleted chromosome extremities .
The current flLTR-RT content is the outcome of two opposing forces: insertion and removal. Therefore, we calculated a persistence rate, giving the number of elements per 10,000 years that have remained intact over time, for the 112,744 flLTR-RTs (Fig. 4a). It revealed broad peaks for each superfamily, with maxima ranging from 0.6 Mya (for Copia in the D subgenome) to 1.5 Mya (for RLX in the A and B subgenomes). The D subgenome contained on average younger flLTR-RTs compared to A and B, with a shift of activity by 0.5 Myr. Such peaks of age distributions are commonly interpreted in the literature as transposon amplification bursts. We find the “burst” analogy misleading, because the actual values are very low. For wheat, it represents a maximal rate of only 600 copies per 10,000 years. A more suiting analogy would be the formation of mountain ranges, where small net increases over very long time periods add up to very large systems. In the most recent time (< 10,000 years), after the hexaploidization event, we did not see any evidence in our data for the popular “genomic shock” hypothesis, postulating immediate drastic increases of transposon insertions [34,35,36]. For the A and B subgenomes, a shoulder in the persistence curves around 0.5 Mya (Fig. 4a), the time point of tetraploidization, was observed. We suggest that counter-selection of harmful TE insertions was relaxed in the tetraploid genome; i.e., the polyploid could tolerate insertions which otherwise would have been removed by selection in a diploid.
To elucidate the TE amplification patterns that have occurred before and after polyploidization, we clustered the 112,744 flLTR-RTs based on their sequence identity. The family level was previously defined at 80% identity over 80% sequence coverage (80/80 clusters) . We also clustered the flLTR-RTs using a more stringent cutoff of 90/90 and 95/95 to enable classification at the subfamily level (Fig. 4b). The 80/80 clusters were large and contained members of all three subgenomes. In contrast, the 90/90 and 95/95 clusters were smaller, and a higher proportion of them are specific to one subgenome. To trace the polyploidization events, we defined lifespans for each individual LTR-RT subfamily as the interval between the oldest and youngest insertion (Fig. 4c). Subfamilies specific to either the A or B subgenome amplified until about 0.4 Myr, which is consistent with the estimated time of the tetraploidization. Some of the D subgenome-specific subfamilies inserted more recently, again consistent with the very recent hexaploidization.
These results confirmed that the three subgenomes were shaped by common families present in the A-B-D common ancestor that have amplified independently in the diploid lineages. They evolved to give birth to different subfamilies that, generally, did not massively amplify after polyploidization and, thus, are specific to one subgenome. To confirm this hypothesis, we explored the phylogenetic trees of the three largest 90/90 clusters color-coded by subgenome (Fig. 5 and Additional file 1: Figures S15–S17 for more details). The trees show older subgenome-specific TE lineages which have proliferated in the diploid ancestors (2–0.5 Mya). However, the youngest elements (< 0.5 Mya) were found in clades interweaving elements of the A and B subgenomes, corresponding to amplifications in the tetraploid. Such cases involving the D subgenome were not observed, showing that flLTR-RTs from D have not yet transposed in large amounts across the subgenomes since the birth of hexaploid wheat 8000–10,000 years ago. We further noticed several incidences in the trees where D lineages were derived from older B or A lineages, but not the reverse. This may be explained by the origin of the D subgenome through homoploid hybridization between A and B .
There are two proposed models of propagation of TEs: the “master copy” model and the “transposon” model . The “master copy” model gives rise to highly unbalanced trees (i.e., with long successive row patterns) where one active copy is serially replaced by another, whereas the “transposon” model produces balanced trees where all branches duplicate with the same rate . To better discern the tree topologies, we plotted trees with equal branch length and revealed that the three largest trees (comprising 15% of flLTR-RTs) are highly unbalanced (Additional file 1: Figure S18), while the smaller trees are either balanced or unbalanced (Additional file 1: Figure S19). Taken together, both types of tree topologies exist in the proliferation of flLTR-RTs, but there is a bias towards unbalanced trees for younger elements, suggesting that TE proliferation followed the “master copy” model.
In summary, our findings give a timed TE atlas depicting detailed TE proliferation patterns of hexaploid wheat. They also show that polyploidization did not trigger bursts of TE activity. This dataset of well-defined transposon lineages now provides the basis to further explore the factors controlling transposon dynamics. Founder elements may help us obtain better insights into common patterns which could explain how and why amplification starts.
A stable genome structure despite the near-complete TE turnover in the intergenic sequences
As described above, intergenic sequences show almost no conservation between homeologous loci. That means they contain practically no TEs that have inserted already in the common ancestor of the subgenomes. Instead, ancestral sequences were removed over time and replaced by TEs that have inserted more recently. Despite this near-complete turnover of the TE space (Fig. 2a), the gene order along the homeologous chromosomes is well conserved between the subgenomes and is even conserved with the related grass genomes (sharing a common ancestor 60 Mya ). Most interestingly and strikingly, not only gene order but also distances between neighboring homeologs tend to be conserved between subgenomes (Fig. 6). Indeed, we found that the ratio of distances between neighboring homeologs has a strong peak at 1 (or 0 in log scale on Fig. 6), meaning that distances separating genes tend to be conserved between the three subgenomes despite the TE turnover. This effect is non-random, as ratio distribution curves are significantly flatter (p = 1.10− 5) when gene positions along chromosomes are randomized. These findings suggest that distances between genes are likely under selection pressure.
We found this constrained distribution irrespective of the chromosome compartments, i.e., distal, interstitial, and proximal, exhibiting contrasted features at the structural (gene density) and functional (recombination rate, gene expression breadth) levels [25, 26]. However, constraints applied on intergenic distances seem relaxed (broader peak in Fig. 6) in proximal regions where the meiotic recombination rate is extremely low. At this point, we can only speculate about the possible impact of meiotic recombination as a driving force towards maintaining a stable chromosome organization. Previous studies have shown that recombination in highly repetitive genomes occurs mainly in or near genes . We hypothesize that spacing of genes is preserved for proper expression regulation or proper pairing during meiosis. Previous studies on introgressions of divergent haplotypes in large-genome grasses support this hypothesis. For instance, highly divergent haplotypes which still preserve the spacing of genes have been maintained in wheats of different ploidy levels at the wheat Lr10 locus .
Enrichment of TE families in gene promoters is conserved between the A, B, and D subgenomes
The sequences flanking genes have a very distinct TE composition compared to the overall TE space. Indeed, while intergenic regions are dominated by large TEs such as LTR-RTs and CACTAs, sequences surrounding genes are enriched in small TEs that are usually just a few hundred base pairs in size (Fig. 7). Immediately upstream and downstream of genes (within 2 kb), we identified mostly small non-autonomous DNA transposons of the Harbinger and Mariner superfamilies, referred to as Tourist and Stowaway miniature inverted-repeat transposable elements (MITEs), respectively , SINEs, and Mutators (Fig. 7). At the superfamily level, the A, B, and D subgenomes exhibit the same biased composition in gene surrounding regions (Additional file 1: Figure S20). We then computed, independently for each subgenome, the enrichment ratio of each TE family that was present in the promoter of protein-coding genes (2 kb upstream of the transcription start site (TSS)) compared to their overall proportion (in copy number, considering the 315 TE families with at least 500 copies). The majority (242, 77%) showed a bias (i.e., at least a twofold difference in abundance) in gene promoters compared to their subgenome average, confirming that the direct physical environment of genes contrasts with the rest of the intergenic space. Considering a strong bias, i.e., at least a threefold over- or under-representation in promoters, we found 105 (33%) and 38 (12%) families, respectively, that met this threshold in at least one subgenome. While it was previously known that MITEs were enriched in promoters of genes, here we show that this bias is not restricted to MITEs but rather involves many other families. Again, although TEs that shaped the direct gene environment have inserted independently in the A, B, and D diploid lineages, their evolution converged to three subgenomes showing very similar TE composition. To go further, we showed that the tendency of TE families to be enriched in, or excluded from, promoters was extremely conserved between the A, B, and D subgenomes (Fig. 8), although TEs are not conserved between homeologous promoters (inserted after A-B-D divergence), except for a few cases of retained TEs (see below). In other words, when a family is over- or under-represented in the promoter regions of one subgenome, it is also true for the two other subgenomes. We did not find any family that was enriched in a gene promoter in one subgenome while under-represented in gene promoters of another subgenome.
Superfamily is generally but not always a good indicator of the enrichment of TEs in genic regions (Fig. 8). For instance, 83% (25/30) of the LINE families are over-represented in the promoter regions, while none of them is under-represented (considering a twofold change). We confirmed that class 2 DNA transposons (especially MITEs) are enriched in promoters, while Gypsy retrotransposons tend to be excluded from the close vicinity of genes. Indeed, among the 105 families strongly enriched in promoters (threefold change), 53% (56) are from class 2 and 21% (22) are LINEs, and only 5% (5) are LTR-RTs. Contrary to Gypsy, Mutator, Mariner, and Harbinger, families belonging to CACTA and Copia superfamilies do not share a common enrichment pattern: some TE families can be either over- or under-represented in promoters (Fig. 8). This confirmed previous results about CACTAs annotated along the 3B chromosome , revealing that a part of the CACTA families is associated with genes while the other follows the distribution of Gypsy. Our results showed that this is also true for Copia.
Thus, the TE turnover did not changed the highly organized genome structure. Given that not only proportions, but also enrichment patterns, remained similar for almost all TE families after A-B-D divergence, we suggest that TEs tend to be at the equilibrium in the genome, with amplification compensating their deletion (as described in ), and with families enriched around genes having remained the same.
No strong association between gene expression and particular TE families in promoters
We investigated the influence of neighboring TEs on gene expression. Indeed, TEs are so abundant in the wheat genome, that genes are almost systematically flanked by a TE in the direct vicinity. The median distance between the gene TSS and the closest upstream TE is 1.52 kb, and the median distance between the transcription termination site (TTS) and the closest downstream TE is 1.55 kb, while the average gene length (between TSS and TTS) is 3.44 kb. The density as well as the diversity of TEs in the vicinity of genes allow us to speculate on potential relationships between TEs and gene expression regulation. We used the gene expression network built by  based on an exhaustive set of wheat RNA-seq data. Genes were clustered into 39 expression modules sharing a common expression profile across all samples. We also grouped unexpressed genes to study the potential influence of TEs on neighbor gene silencing. For each gene, the closest TE upstream was retrieved, and we investigated potential correlations through an enrichment analysis (each module was compared to the full gene set). Despite the close association between genes and TEs, no strong enrichment for a specific family was observed for any module or for the unexpressed genes.
We then studied the TE landscape upstream of wheat homeolog triplets, focusing on 19,393 triplets (58,179 genes) with a 1:1:1 orthologous relationship between A, B, and D subgenomes. For each triplet, we retrieved the closest TE flanking the TSS and investigated the level of conservation of flanking TEs between homeologs. For 75% of the triplets, the three flanking TEs belong to three different families, revealing that, even in the close vicinity of genes, TEs are in majority not conserved between homeologs due to rapid turnover. This suggests that most TEs present upstream of triplets were not selected for by the presence of common regulatory elements across homeologs. However, for 736 triplets (4%), the three homeologs are flanked by the same element, constituting a conserved noncoding sequence (CNS), suggesting that part of this element is involved in the regulation of gene expression. These TE-derived CNSs are on average 459 bp, which is three times smaller than the average size of gene-flanking TE fragments (on average 1355 bp), suggesting that only a portion of the ancestrally inserted TEs are under selection pressure. They represent a wide range (149 different families) of diverse elements belonging to all the different superfamilies.
The majority of homeolog triplets have relatively similar expression patterns [26, 44], contrary to what was found for older polyploid species like maize . In synthetic polyploid wheat, it was shown that repression of D subgenome homeologs was related to silencing of neighbor TEs . Thus, we focused on triplets for which two copies are coexpressed while the third is silenced. However, enrichment analysis did not reveal any significant enrichment of specific TE families in promoters of the silenced homeologs. We also examined transcriptionally dynamic triplets across tissues . Again, no TE enrichment in promoters was observed. These results suggest that recent changes in gene expression are not due to specific families recently inserted in the close vicinity of genes.
The chromosome-scale assembly of the wheat genome provided an unprecedented genome-wide view of the organization and impact of TEs in such a complex genome. Since they diverged, the A, B, and D subgenomes have experienced a near-complete TE turnover, although polyploidization did not massively reactivate TEs. This turnover contrasted drastically with the high level of gene synteny. Apart from genes, there was no conservation of the TE space between homeologous loci. But surprisingly, TE families that have shaped the A, B, and D subgenomes are the same, and unexpectedly, their proportions and intrinsic properties (gene-prone or not) are quite similar despite their independent evolution in the diploid lineages. Thus, TE families are somehow at equilibrium in the genome since the A-B-D common ancestor. These novel insights contradict the previous model of evolution with amplification bursts followed by rapid silencing. Our results suggest a role of TEs at the structural level. TEs are not just “junk DNA”; our findings open new perspectives to elucidate their role in high-order chromatin arrangement, chromosome territories, and gene regulation.
TE modeling using CLARITE
The Triticum aestivum cv. Chinese Spring genome sequence was annotated as described in . Briefly, two gene prediction pipelines were used (TriAnnot: developed at GDEC Institute [INRA-UCA Clermont-Ferrand] and the pipeline developed at Helmholtz Center Munich [PGSB]), and the two annotations were integrated (pipeline established at Earlham Institute ) to achieve a single high-quality gene set. TE modeling was achieved through a similarity search approach based on the ClariTeRep curated databank of repeated elements , developed specifically for the wheat genome, and with the CLARITE program that was developed to model TEs and reconstruct their nested structure . ClariTeRep contains sequences present in TREP, i.e., a curated library of Triticeae TEs from all three subgenomes (originating from BACs sequenced during map-based cloning or survey sequencing projects) and TEs manually annotated in a previous pilot study of chromosome 3B . For the annotation, we used the ClariTeRep naming system, which assigns simple numbers to individual families and subfamilies; e.g., RLG_famc1.1 and RLG_famc1.2 are subfamilies of RLG_famc1. Since many TE families have been previously named, we provided this previous name in parentheses.
Detection and characterization of full-length LTR retrotransposons
Identification of flLTR-RTs was based on LTRharvest . For RefSeq_v1.0, LTRharvest reported 501,358 non-overlapping flLTR-RT candidates under the following parameter settings: “overlaps best -seed 30 -minlenltr 100 -maxlenltr 2000 -mindistltr 3000 -maxdistltr 25000 -similar 85 -mintsd 4 -maxtsd 20 -motif tgca -motifmis 1 -vic 60 -xdrop 5 -mat 2 -mis -2 -ins -3 -del -3”. All candidates where annotated for PfamA domains with hmmer3  and stringently filtered for canonical elements by the following criteria: (1) presence of at least one typical retrotransposon domain (RT, RH, INT, GAG); (2) removal of mis-predictions based on inconsistent domains, e.g., RT-RH-INT-RT-RH; (3) Absence of gene-related Pfam domains; (4) strand consistency between domains and primer binding site; (5) tandem repeat content below 25%; (6) long terminal repeat size <= 25% of the element size; (7) N content < 5%. This resulted in a final set of 112,744 high-quality flLTR-RTs. The Copia and Gypsy superfamilies were defined by their internal domain ordering: INT-RT-RH for RLC and RH-RT-INT for RLG . When this was not possible, the prediction was classified as RLX. The 112,744 flLTR-RTs were clustered with vmatch dbcluster  at three different stringencies: 95/95 (95% identity over 95% mutual length coverage), 90/90, and 80/80, as follows: vmatch “-dbcluster 95 95 -identity 95 -exdrop 3 -seedlength 20 -d”, “-dbcluster 90 90 -identity 90 -exdrop 4 -seedlength 20 -d” and “-dbcluster 80 80 -identity 80 -exdrop 5 -seedlength 15 -d”. Subgenome specificity of clusters was defined by the following decision tree: (1) assignment of the respective subgenome if > = 90% of the members were located on this subgenome; (2) assignment to two subgenomes if members from one subgenome < 10%, e.g., AB-specific if D members < 10%; (3) Assignment of the remaining clusters as ABD common. Muscle was used for multiple alignments of each cluster  in a fast mode (-maxiters 2 -diags1). To build phylogenetic trees, we used tree2 from the muscle output which was created in the second iteration with a Kimura distance matrix, and trees were visualized with ete3 toolkit . The date of flLTR-RT insertions was based on the divergence between the 5′ and 3′ LTRs calculated with emboss distmat, applying the Kimura 2-parameter correction. The age was estimated using the formula: age = distance/(2*mutation rate) with a mutation rate of 1.3*10–8 . The lifespan of an individual LTR-RT subfamily was defined as the 5th to 95th percentile interval between the oldest and youngest insertions. The densities for the chromosomal heat-maps were calculated using a sliding window of 4 Mb with a step of 0.8 Mb.
Comparative analysis of distances separating neighbor genes between homeologous chromosomes
For the comparison of distances separating neighbor genes, homeologous triplets located in the three chromosomal compartments (distal, interstitial, and proximal; Additional file 1: Table S2) were treated separately. This was done because gene density is lower in interstitial and proximal regions, and because the latter show a lack of genetic recombination. Furthermore, we considered only triplets where all three homeologous genes are found on the homeologous chromosomes. Comparison of homeologous gene pairs from distal regions was done in two ways, both of which yielded virtually identical results. Distances were measured from one gene to the one that follows downstream. However, there were many small local inversions between the different subgenomes. Thus, if a gene on the B or D subgenome was oriented in the opposite direction compared to its homeologous copy in the A subgenome, it was assumed that that gene is part of a local inversion. Therefore, the distance to the preceding gene on the chromosome was calculated. The second approach was more stringent, based only on triplets for which all three homeologs are in the same orientation in the three subgenomes. The results obtained from the two approaches were extremely similar, and we presented only the results from the second, more stringent, approach. For the control dataset, we picked a number of random positions along the chromosomes that is equal to the number of homeologs for that chromosome group. Then, homeologous gene identifiers were assigned to these positions from top to bottom (to preserve the order of genes but randomize the distances between them). This was done once for all three chromosomal compartments. Histograms of the distributions of the distance ratios between homeologs were produced with rstudio (rstudio.com). The significance of the differences between the largest group of actual and randomized gene positions (peak of the histogram) was established with a chi-square test.
Analyses of TEs in the vicinity of genes and enrichment analyses
We developed a Perl script (gffGetClosestTe.pl ) to retrieve gene-flanking TEs from the feature coordinates in the GFF file. It was used to extract the closest TE on each side of every predicted gene (considering “gene” features that include untranslated regions). It was also used to extract all predicted TE copies entirely or partially present within 2 kb upstream of the “gene” start position, i.e., the TSS. Enrichment analyses were then automated using R scripts.
Enrichment of TE families in gene promoters (2 kb upstream)
Independently for the three subgenomes, we retrieved all TE copies present within 2 kb upstream of the TSSs of all gene models and calculated the percentage of the number of copies assigned to every family (%famXpromoter). We also calculated the percentage of the number of copies of each family at the whole subgenome level (%famXwhole_subgenome). One enrichment log2 ratio was calculated for each A, B, and D subgenome using the formula log2(%famXpromoter/%famXwhole_subgenome). Only families accounting for 500 copies or more in the whole genome were considered.
TE families and expression modules
Here, we retrieved the closest TE present in 5′ of the TSS for all genes and calculated the percentage of each TE family for each expression module and the unexpressed genes (considered as a module), and compared them to the percentage observed for the whole gene set using the formula log2(%famXgenes_moduleX/%famXall_genes). The log2 ratio was calculated only for expression modules representing at least 1000 coexpressed genes, and we considered only log2 ratio values for families accounting for 500 copies or more. A similar approach was taken for the 10% stable, 80% middle, and 10% dynamic genes as defined by .
Comparison of TE families in the promoter of homeologs
Here, we also retrieved the closest TE in 5′ of every gene and identified homeologous triplets for which the closest element in 5′ belongs to the same family for the three copies. For that, we developed a Perl script (getTeHomeologs.pl ) in order to integrate the information of homeologous genes and the data of the closest TE in 5′ of genes. Only “1–1-1” homeologs were considered.
Conserved non-coding sequence
Full-length long terminal repeat retrotransposon
Long interspersed nuclear element
Long terminal repeat
Miniature inverted-repeat transposable element
Open reading frame
Short interspersed nuclear element
Transcription start site
Transcription termination site
Feschotte C, Jiang N, Wessler SR. Plant transposable elements: where genetics meets genomics. Nat Rev Genet. 2002;3:329–41.
Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leroy P, Morgante M, Panaud O, et al. A unified classification system for eukaryotic transposable elements. Nat Rev Genet. 2007;8:973–82.
International Barley Genome Sequencing Consortium, Mayer KF, Waugh R, Brown JW, Schulman A, Langridge P, Platzer M, Fincher GB, Muehlbauer GJ, Sato K, et al. A physical, genetic and functional sequence assembly of the barley genome. Nature. 2012;491:711–6.
Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, et al. The Sorghum bicolor genome and the diversification of grasses. Nature. 2009;457:551–6.
Baucom RS, Estill JC, Chaparro C, Upshaw N, Jogi A, Deragon JM, Westerman RP, Sanmiguel PJ, Bennetzen JL. Exceptional diversity, non-random distribution, and rapid evolution of retroelements in the B73 maize genome. PLoS Genet. 2009;5:e1000732.
Makarevitch I, Waters AJ, West PT, Stitzer M, Hirsch CN, Ross-Ibarra J, Springer NM. Transposable elements contribute to activation of maize genes in response to abiotic stress. PLoS Genet. 2015;11:e1004915.
Middleton CP, Senerchia N, Stein N, Akhunov ED, Keller B, Wicker T, Kilian B. Sequencing of chloroplast genomes from wheat, barley, rye and their relatives provides a detailed insight into the evolution of the Triticeae tribe. PLoS One. 2014;9:e85761.
Daron J, Glover N, Pingault L, Theil S, Jamilloux V, Paux E, Barbe V, Mangenot S, Alberti A, Wincker P, et al. Organization and evolution of transposable elements along the bread wheat chromosome 3B. Genome Biol. 2014;15:546.
Choulet F, Alberti A, Theil S, Glover N, Barbe V, Daron J, Pingault L, Sourdille P, Couloux A, Paux E, et al. Structural and functional partitioning of bread wheat chromosome 3B. Science. 2014;345:1249721.
Mascher M, Gundlach H, Himmelbach A, Beier S, Twardziok SO, Wicker T, Radchuk V, Dockter C, Hedley PE, Russell J, et al. A chromosome conformation capture ordered sequence of the barley genome. Nature. 2017;544:427–33.
Piegu B, Guyot R, Picault N, Roulin A, Sanyal A, Kim H, Collura K, Brar DS, Jackson S, Wing RA, Panaud O. Doubling genome size without polyploidization: dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, a wild relative of rice. Genome Res. 2006;16:1262–9.
Kelly LJ, Renny-Byfield S, Pellicer J, Macas J, Novak P, Neumann P, Lysak MA, Day PD, Berger M, Fay MF, et al. Analysis of the giant genomes of Fritillaria (Liliaceae) indicates that a lack of DNA removal characterizes extreme expansions in genome size. New Phytol. 2015;208:596–607.
Yaakov B, Meyer K, Ben-David S, Kashkush K. Copy number variation of transposable elements in Triticum-Aegilops genus suggests evolutionary and revolutionary dynamics following allopolyploidization. Plant Cell Rep. 2013;32:1615–24.
Isidore E, Scherrer B, Chalhoub B, Feuillet C, Keller B. Ancient haplotypes resulting from extensive molecular rearrangements in the wheat A genome have been maintained in species of three different ploidy levels. Genome Res. 2005;15:526–36.
Ramírez-González R, Borrill P, Lang D, Harrington SA, Brinton J, Venturini L, Davey M, Jacobs J, Van Ex F, Pasha A, et al. The transcriptional landscape of polyploid wheat. Science. 2018. https://doi.org/10.1126/science.aar6089.
Pophaly SD, Tellier A. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 2015;32:3226–35.
Li A, Liu D, Wu J, Zhao X, Hao M, Geng S, Yan J, Jiang X, Zhang L, Wu J, et al. mRNA and small RNA transcriptomes reveal insights into dynamic homoeolog regulation of allopolyploid heterosis in nascent hexaploid wheat. Plant Cell. 2014;26:1878–900.
The research leading to these results has received funding from the French Government managed by the Research National Agency (ANR) under the Investment for the Future programme (BreedWheat project ANR-10-BTBR-03), from FranceAgriMer (2011–0971 and 2013–0544). This work was also supported by the UK Biotechnology and Biological Sciences Research Council (BBSRC) through grants (BB/P016855/1, BB/P013511/1, and BB/M014045/1). TW was funded by the University of Zurich. KFXM, HG, and MS would like to acknowledge funding from the German Federal Ministry of Food and Agriculture (2819103915) and the German Ministry of Education and Research (de.NBI 031A536).
Availability of data and materials
In terms of data availability, IWGSC RefSeq_v1.0 assembly , gene and TE annotation , as well as related data are available on the IWGSC Data Repository hosted at URGI  and the source code of perl scripts developed for this study . The raw sequencing data is in the Sequence Read Archive under accession number SRP114784 .
The reviewer reports with the author’s response to the reviewers are available as Additional file 2.
Thomas Wicker and Heidrun Gundlach contributed equally to this work.
Authors and Affiliations
Department of Plant and Microbial Biology, University of Zurich, Zurich, Switzerland
PGSB Plant Genome and Systems Biology, Helmholtz Center Munich, German Research Center for Environmental Health, Neuherberg, Germany
Heidrun Gundlach, Manuel Spannagl & Klaus F. X. Mayer
Department of Crop Genetics, John Innes Centre, Norwich Research Park, Colney, Norwich, NR4 7UH, UK
Cristobal Uauy, Philippa Borrill & Ricardo H. Ramírez-González
GDEC, INRA, UCA (Université Clermont Auvergne), Clermont-Ferrand, France
Romain De Oliveira, Etienne Paux & Frédéric Choulet
School of Life Sciences, Technical University Munich, Munich, Germany
TW, HG, and FC conceived and designed the study. FC coordinated the study. TW performed analyses of TE distribution along chromosomes, comparisons of intergenic distances between subgenomes, centromeric TE analyses, and also analyzed the TE landscape around genes. HG carried out the k-mer analyses and the flLTR-RT identification, dating, clustering, and phylogenetic analyses. FC carried out the A-B-D comparisons of TE families, enrichment analyses in gene promoters, comparisons between homeologs, and association with expression profiles. CU, RHRG, and PB built the RNA-seq-based gene expression network. TW, HG, and FC wrote the manuscript. RDO contributed text and figures for the manuscript. MS, CU, RHRG, PB, KFXM, RDO, and EP provided comments and corrections on the manuscript. All authors read and approved the final manuscript.
Reviewer reports and Author’s response to reviewers. (DOCX 30 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.