SMRT sequence and assembly of the T. sinense genome
To analyze the genome of T. sinense, we first generated 82 G Illumina paired-end reads (150 bp). The genome size of the sequenced individual was estimated to be ~ 1.12 Gb, and its heterozygosity rates were estimated to be 0.22%, based on k-mer analysis (k-mer frequency distributions) (Additional file 1: Fig. S1). We then used a combination of single-molecule real-time (SMRT) sequencing technology from Pacific Biosciences (PacBio) and high-throughput chromosome conformation capture (Hi-C) to produce the final sequenced and assembled genome of T. sinense. In total, we generated 14.8 M PacBio single-molecule long reads (average read length 8 kb, longest read 84 kb), for 118 Gb total sequence, giving ~ 100× coverage of the assembled genome. De novo assembly yielded 3389 contigs, with a contig N50 length of 1.99 Mb, which was further improved to 2.8 Mb with Gapcloser software [33]. The total assembly size was 1.17 Gb, which is consistent with the genome size estimated by k-mer analysis (Additional file 1: Fig. S1).
To further refine the T. sinense assembly, Hi-C libraries were constructed and sequenced. The Hi-C read pairs were mapped onto the draft assembly (Additional file 1: Fig. S2), yielding an assembly with a scaffold N50 of 45 Mb. The final reference assembly comprised 24 chromosome-scale pseudomolecules hereafter referred to as chromosomes (Additional file 1: Fig. S2b), with maximum and minimum lengths of 65 Mb and 33 Mb, respectively. The total length of the chromosomes accounted for 92.24% (1.07 Gb) of the assembled genome size of 1.17 Gb. The assembled chromosome number is equal to previous estimates of the haploid chromosome number of T. sinense (2n = 48). BUSCO [34] assessment showed that 93.4% complete genes were captured.
Repeat and gene annotation
We identified 787 Mb of repetitive sequences in the genome of T. sinense, accounting for 67.8% of the genomic assembly. Transposable elements were the predominant component (65.9%), with the long terminal repeat (LTR) family being the largest part (32.2%) of these transposons. Within the LTR family, the Gypsy subfamily was the most abundant, making up 20.5% of the genome, followed by the Copia subfamily (11%) (Additional file 1: Table S1).
Based on the remaining repeat-masked T. sinense genome, we annotated 32,690 protein-coding gene models by combining ab initio gene predictions and evidence-based gene annotations using the MAKER pipeline. These gene models had an average transcript length of 1492 bp and an average coding-sequence length of 1304 bp. The gene models contained an average of 5.4 exons per gene with average exon and intron lengths of 153 bp and 283 bp, respectively. In addition, we identified 166 rRNAs, 811 tRNAs, and 1062 non-coding RNAs (ncRNAs). A BUSCO assessment based on conserved plant gene models identified 93.8% complete gene models.
Phylogenomic and gene family evolution analyses
To ascertain the phylogenetic position of T. sinense and the Trochodendrales, we performed phylogenomic analysis using 214 single-copy gene families (orthogroups) from 19 sequenced angiosperm genomes identified by OrthoFinder v2.3.1 [35]. These orthologous groups with single-copy genes were concatenated to construct a maximum likelihood (ML) phylogenetic tree using IQ-TREE. As shown in the ML tree (Fig. 1), the major lineages of angiosperms (monocots, magnoliids, and eudicots) were all recovered as monophyletic groups with 100% support. In eudicots, the core eudicot species formed a strongly supported monophyletic clade (100% bootstrap support). The Ranunculales, Proteales, and Trochodendrales formed successive sister clades to the core eudicots with 100% support, consistent with previous analyses [6, 7, 36].
We further estimated the divergence times of eudicot lineages using the MCMCTree [37] and BEAST 2 [38] programs; these two methods yielded consistent divergence times with only small differences (Fig. 1; Additional file 1: Fig. S3). According to the MCMCTree analysis (Fig. 1), crown-group eudicots arose 136 million years ago (MYA; 129–140 MYA, 95% highest posterior density). Apart from the Ranunculales, the earliest divergences within the eudicots based on our analyses were the split of the Proteales and Trochodendrales from the rest of the eudicots at ~ 131 (123–137) MYA and ~ 125 (117–132) MYA, respectively. In our analysis, clades corresponding to the core eudicots originated at ~ 113 (105–123) MYA. Within the Trochodendrales, the divergence of its two species was estimated to have occurred at 31 (11–84) MYA.
We also investigated gene family evolution during the early diversification of eudicots. According to a CAFE analysis [39], five times as many gene families expanded in the lineage leading to all eudicots (551), as compared to core eudicots (110). However, twice as many gene families contracted in the lineage leading to core eudicots compared with the lineage leading to all eudicots (Fig. 1; Additional file 1: Fig. S4), consistent with a number of characteristics shared by eudicots. We found that in the lineages leading to eudicots, core eudicots, and the most recent ancestor of Trochodendrales, some defense gene families evolved rapidly. In the lineage leading to core eudicots, the TM3 MAD-box gene family that includes the important flowering time gene SUPPRESSOR OF OVEREXPRESSION OF CONSTANS1 has expanded (Additional file 1: Fig. S5a).
In the lineage leading to the common ancestor of the Trochodendrales, there was much more gene family expansion (3524) compared to gene family contraction (938). To the genome of T. sinense, 2229 and 1784 gene families had undergone expansion and contraction, respectively. Among these families, 168 evolved rapidly. In contrast, more gene families expanded (3058) and fewer gene families (1153) contracted in T. trochodendron. The rapidly evolving gene families of T. sinense include many genes enriched in functions related to defense, toxin catabolic process, response to salicylic acid, and oxidative stress (Table S2). These genes mainly belong to gene families, such as the defense-related NBS-LRR (nucleotide-binding site leucine-rich repeat), lectin RLK (receptor-like kinase), and GST (glutathione transferase) gene families (Additional file 1: Fig. S5b). The gene families involved in terpenoid biosynthesis also evolved rapidly (Additional file 1: Table S2). Furthermore, the ALG24 MAD-box genes expanded rapidly in T. sinense (Additional file 1: Fig. S5c). In agreement with the enrichment of GO terms, the KEGG enrichment suggested that genes involved in plant–pathogen interaction, glutathione metabolism, and terpenoid biosynthesis evolved rapidly in the T. sinense genome (Additional file 1: Fig. S6).
Whole-genome duplications
Whole-genome duplications (WGDs, polyploidization) have played a major role in the evolutionary history of the angiosperms. Intragenomic syntenic analysis of T. sinense showed strong collinearity between large segments of chromosomes, such as 2, 5, 6, and 24; 4, 9, 15, and 19; 4, 10, 18, and 23; 7, 11, 12, and 13; and 8, 14, 16, and 20 (Fig. 2a). The widespread occurrence of one-versus-three syntenic blocks suggested that two WGD events (named α and β) have occurred in the genome of T. sinense.
T. aralioides is a sister of T. sinense. To investigate whether T. aralioides also experienced WGDs, we performed an intragenomic analysis of this species. Indeed, two WGDs also occurred in T. aralioides since intragenome comparisons showed a 1:3 syntenic depth ratio (Additional file 1: Fig. S7a). To explore if the WGDs are common or lineage-specific, we performed intergenomic syntenic analyses between T. sinense and T. aralioides. The results showed a 1:1 syntenic relationship between T. sinense and T. aralioides (Additional file 1: Fig. S7b), suggesting that the two WGDs are common to the Trochodendrales.
Furthermore, we assessed the synteny between T. sinense or T. aralioides and Amborella trichopoda, which is the only living representative of the sister lineage to all other extant flowering plants and did not experience lineage-specific WGD events [40]. We identified a 1:4 syntenic depth ratio in the A. trichopoda–T. sinense (Fig. 2b; Additional file 1: Fig. S8a) and A. trichopoda–T. aralioides (Additional file 1: Fig. S8b) comparisons. We also performed intergenomic comparisons between T. sinense/T. aralioides and several species including the basal angiosperm Nymphaea colorata, the magnoliid Cinnamomum kanehirae, two early diverging eudicots (Aquilegia coerulea and Nelumbo nucifera), and the core eudicot Vitis vinifera. Genomic comparisons of T. sinense/T. aralioides and N. colorata showed a 4:2 syntenic pattern, and comparisons of T. sinense/T. aralioides and C. kanehirae showed a 4:4 syntenic pattern (Fig. 2b, c; Additional file 1: Fig. S8c-f). Given the known lineage-specific WGD in N. colorata and two lineage-specific WGDs in C. kanehirae [41, 42], our results suggested that the two WGDs in Trochodendrales occurred after the divergence of eudicots and magnoliids.
In addition, the synteny relationship between N. nucifera and N. colorata and between N. nucifera and C. kanehirae is 2:2 and 2:4, respectively (Additional file 1: Fig. S8g, h), indicating one WGD in N. nucifera after the divergence from N. colorata and C. kanehirae. Several earlier studies showed that the WGD in N. nucifera is lineage-specific rather than eudicot-wide, rejecting the possibility that any WGD occurred in the common ancestor of all extant eudicots [13, 43, 44]. The collinear relationships between T. sinense/T. aralioides and A. coerulea, N. nucifera, and V. vinifera are 4: 2, 4:2, and 4:3 (Fig. 2d; Additional file 1: Fig. S9), respectively, consistent with independent WGDs in the respective lineages of the Trochodendrales, A. coerulea, N. nucifera, and V. vinifera.
Distributions of synonymous substitutions per synonymous site (Ks) were further analyzed to confirm the WGDs in Trochodendrales and other early diverging lineages of eudicots. The Ks distributions of T. sinense paralogs and T. aralioides paralogs in syntenic regions showed two overlapping peaks supporting two WGDs in a common ancestor of T. sinense and T. aralioides (Fig. 2e). As shown in Fig. 2e, the peaks of Ks distribution of T. sinense/T. aralioides paralogs (retained from two WGD events) are much smaller than those of orthologs comparing T. sinense/T. aralioides to A. coerulea and to N. nucifera, strongly suggesting the two WGDs in T. sinense/T. aralioides occurred after their divergence from A. coerulea and N. nucifera. In addition, the N. nucifera paralogs displayed lower peak Ks value than orthologs comparing N. nucifera to T. sinense/T. aralioides and to A. coerulea, further confirming that the WGD in N. nucifera is also lineage-specific.
According to the synteny analysis, we investigated the Ks distributions for paralogous gene pairs that originated from the two WGD events (Fig. 2f). Using divergence time and median Ks values of syntenic blocks between T. sinense and T. aralioides, we estimated the synonymous substitutions per site per year as 3.25e−9 for the Trochodendrales (Additional file 1). Using this rate, we estimated that the α and β WGD events of Trochodendrales occurred at around 82 MYA and 59 MYA according to the median Ks of T. sinense paralogs, and 78 MYA and 54 MYA according to the median Ks of T. aralioides paralogs. This dating again suggests that the two WGDs occurred after the divergence of Trochodendrales from Proteales (N. nucifera).
Lastly, we performed integrated syntenic and phylogenomic analyses to confirm the timing of these WGDs in early diverging eudicots. We obtained anchor genes from the inter-genomic synteny blocks with ratio of 1:2:4 in A. trichopoda, N. nucifera, and T. sinense. All genes in the same sets of syntenic block were concatenated and used to construct phylogenetic trees. In total, we obtained 31 groups of concatenated genes (a total of 94 genes) and 87% of the trees based on the concatenated genes (Additional file 1: Fig. S10) supported the hypothesis that the WGDs in N. nucifera and T. sinense are lineage-specific.
Occurrence of vessel elements in secondary xylem
Under a light microscope, we found there was a distinct growth ring delineated by a band of latewood in the cross-section. Earlywood cells were larger and thin-walled, while those in latewood were smaller and thick-walled as reported by the previous study [17]. Occasionally, we noticed some peculiar cells occurred through the growth ring boundaries with much larger diameter (56.0 ± 0.82 μm ±, n = 35) than normal cells (44.7 ± 0.98 μm, n = 35) (Fig. 3a). In the radial section, the peculiar cells had oblique end-walls (Fig. 3b). In the tangential section, we can clearly observe there are two types of tracheary elements. Most of the cells appeared as fibrous with 3.0–4.8 mm in length while the peculiar cells were much shorter ranging from 308 to 597 μm and present as fusiform cells with many pits (Fig. 3c). Under scanning electron microscope (SEM), we observed the normal cells often have scalariform bordered pits on radial walls, sometimes with pit membrane remnants that are like a fibrillar mesh as reported previously (Fig. 3d, e). Using X-ray computed microtomography (microCT), we made a 3-dimensional reconstruction of different cells based on 1032 serial slice images and found the short and fusiform peculiar cells are sporadically present in T. sinense wood (Fig. 3f–j). Therefore, we considered the larger and shorter peculiar cells as vessel elements, supporting the idea that vessel elements are present in T. sinense, although its vessel elements are primitive [19, 20].
The evolution of VNS, LBD, R2R3-MYB, CesA, and PLCP gene families which are involved in the formation of vessel elements
AtVND7, AtLBD18/AtLBD30, AtMYB46/AtMYB83, AtCESA4 and AtCESA8, and AtXCP1/AtXCP2, belonging to the VNS, LBD, R2R3-MYB, CesA, and PLCP gene families [25, 45,46,47,48], respectively, are key genes involved in the development of vessel elements in A. thaliana [21, 49]. To identify the T. sinense genes orthologous to these genes, we explored the evolutionary history of VNS, LBD, R2R3-MYB, CesA, and PLCP gene families.
The VNS gene family contains VND, NST, and SMB subfamilies [45]. As shown in the phylogenetic tree, genes homologous to VND, NST, and SMB each fell into different clades with high bootstrap values and each clade did not contain sequences from Physcomitrella patens (also known as Physcomitrium patens) and Selaginella moellendorffii, suggesting that the origin of these subfamilies occurred after the divergence of lycophytes. In the phylogenetic tree (Fig. 4a; Additional file 1: Fig. S11a), genes from the VND group clustered into two large clades: one including A. thaliana VND4–VND6 and the other including VND1–VND3 and VND7. Each of these clades contains sequences from gymnosperms (in black in the tree, Fig. 4a) and angiosperms, suggesting that the duplication leading to the ancestors of VND4–VND6, and VND1–VND3 and VND7 occurred before the divergence of gymnosperms and angiosperms. In the clade containing VND1–VND3 and VND7, genes homologous to VND1–VND3 and VND7 each formed a subclade and each subclade only contained sequences from angiosperms, suggesting that the duplication leading to the origin of the VND7 ancestor occurred after the divergence of angiosperms and gymnosperms. Sequences from the VND7 clade were aligned and two diagnostic domains of VND7 subfamily were found in all these sequences (Additional file 1: Fig. S11b), suggesting that they are bona fide VND7 genes.
The transcription factors ASL19/LBD30 and ASL20/LBD18 are involved in a positive feedback loop to regulate VND7 expression [27]. We identified 27 LBD18/LBD30 genes in angiosperms. However, no LBD18/LBD30 genes were identified in genomes of three gymnosperm species, one lycophyte, and one moss, suggesting that the origin of the LBD18 subfamily may have occurred after the divergence of angiosperms and gymnosperms. In the ML tree (Additional file 1: Fig. S12a), the two genes (AtLBD18 and AtLBD30) from A. thaliana formed one clade, suggesting that they originated from lineage-specific duplications. The two copies in T. sinense (Tesin01G0033700 and Tesin21G0059700) are putative orthologs to A. thaliana AtLBD18 and AtLBD30.
In total, 87 R2R3-MYB genes were identified in the T. sinense genome. These sequences together with sequences from A. thaliana and O. sativa were subjected to phylogenetic analysis. As shown in the ML tree (Additional file 1: Fig. S12b), sequences from the three species fell into different clades. All subgroups (except for 2 subgroups only including sequences from A. thaliana) contain sequences from A. thaliana/T. sinense and O. sativa, suggesting these subgroups originated before the splitting of monocots and dicots. In subgroup MYB46 and MYB83, there are three T. sinense sequences (Tesin13G0114100, Tesin11G0137400, Tesin07G144300).
In the ML tree (Additional file 1: Fig. S12c), the CesA sequences of P. patens formed their own clade. CesAs in seed plants fell into six major clades, which were designated as six subfamilies: CesA1/10, CesA3, CesA6, CesA4, CesA7, and CesA8. Each subfamily included sequences from angiosperms and gymnosperms, indicating the occurrence of ancient gene duplication events before the divergence of angiosperms and gymnosperms. In the phylogenetic tree, subfamilies CesA4, CesA7, and CesA8 clustered together and the remaining subfamilies clustered together. CesA sequences of S. moellendorffii fell into each of these two clades. All subfamilies, including subfamilies CesA4 and CesA8, also contained only one T. sinense gene except for subfamily CesA1/10, which contained three T. sinense genes.
Congruent with previous studies [48], most PLCP genes fell into nine subfamilies (SAG12, THI, CEP, XCP, RD21, XBCP3, RD19, ALP, and CTB). SAG12, THI, CEP, XCP, RD21, and XBCP3 have been classified as L-like cathepsins, and RD19, ALP, and CTB have been classified as F, H, and B-like cathepsins [48], respectively. In the phylogenetic tree (Additional file 1: Fig. S12d, e), all subfamilies of L-like cathepsins clustered together while F, H, and B-like cathepsins were relatively distant from them. L, F, H, and B-like proteins appeared in the moss and lycophyte, suggesting that ancient duplications led to the origin of these subfamilies. As shown in the ML tree, XCP genes originated before the divergence of gymnosperms from S. moellendorffii.
Nuclear localization and transcriptional regulatory network of VND genes
To investigate the intrinsic function of VNDs in T. sinense, we first analyzed their subcellular localization by transient expression. Like Arabidopsis VND6 and VND7, TsVND7.2-GFP, TsVND6.1-GFP, and TsVND6.2-GFP predominantly localize to the nucleus, as seen by co-localization assays with GPF fluorescence merged with a bright-field image (Fig. 4b; Additional file 1: Fig. S13a).
Yeast one-hybrid (Y1H) analysis was subsequently performed to determine the interaction between TsVND7.1 and TsVND7.2 and other genes. The yeast-carrying Bait-BD-pHIS2 vectors expressing the promoter of TsVND7.1, TsVND7.2, TsMYB, TsCesA4, TsCesA8, TsXCPa, or TsXCPc grew normally on selective medium without 3-AT but failed to grow with 3-AT, showing that no self-activation occurred (Additional file 1: Fig. S13b). The following tests were carried out by promoters fused to Bait-BD-pHIS2 co-transformed with transcription factors fused to Prey-AD-pGADT7. The results showed that TsVND7.2 interacted with the promoters of TsMYB, TsCesA4 and TsXCPa, and TsVND7.2 can be regulated by TsVND6.1 and TsLBD30a (Fig. 4c). Conversely, TsVND7.1 did not interact with the TsMYB, TsCesA4, TsCesA8, TsXCPa, TsXCPb, or TsXCPc promoters and was not regulated by TsLBD30a. TsVND7.2 also did not bind to the promoters of TsCesA8, TsXCPb, and TsXCPc (Additional file 1: Fig. S13c).
The interaction between TsVND7.1/TsVND7.2 and other genes was further assessed using luciferase activity assays (Fig. 4d; Additional file 1: Fig. S13d). In these assays, TsVND7.2 interacted with promoters of TsMYB, TsCesA4, TsCesA8, TsXCPa and TsXCPc, and TsVND7.2 can be regulated by TsVND6.1, TsLBD30a, and TsLBD30a. TsVND7.1 also showed similar interactions with other genes except TsLBD30a, TsCesA4, and TsXCPc.
Demographic history analysis of T. sinense
To explore the historical demographic fluctuations and present-day genetic diversity within this endemic species, we resequenced 55 individuals of T. sinense from a range-wide sampling from China (Fig. 5a). ADMIXTURE analysis [50] of Chinese samples revealed that the deepest splits occurred among the southwestern population (population YN), the populations near the Sichuan Basin in central China (represented by populations CQ from the Jinfo Mountains and SC from the Emei Mountains), and the populations from the North and East parts of central China (represented by populations SNJ from the Shennong Mountains and populations SX and TS from the Qingling Mountains) (Fig. 5b; Additional file 1: Fig. S14). When K = 4, the TS population from the Qinling Mountains in north-central China diverged from the third clusters and four ancestries of T. sinense occurred in China (Fig. 5b). Principal component analysis (PCA) and phylogenetic analysis also revealed four major clusters (Fig. 5c, d; Additional file 1: Fig. S14b, c).
T. sinense exhibited higher within-species genetic diversity than within-population level genetic diversity. The within-population genetic diversity of T. sinense (6.30–11.58e−3 for Watterson’s estimator (θw) and 6.48–11.25e−3 for nucleotide diversity (π), Additional file 1: Table S3) was obviously low compared to the species level diversity. The genetic differentiation statistics (fixation index; Fst) further showed that genetic differentiation between YN population and all other populations (0.293–0.382) was higher than those between others (most < 0.293), and the genetic differentiation between SX and SNJ population (0.102) was the lowest (Additional file 1: Table S4).
The Mantel test revealed a significant positive correlation between the genetic and geographic distance (r = 0.8417, p = 0.004, Fig. 5e). To estimate gene flow between populations, we performed Treemix v1.13. analysis [51], which suggested that the TS population from the Qinling Mountains in north-central China is the result of admixture between the SC population the near Sichuan Basin and the SX population from the Qinling Mountains in the north and east parts of central China (Fig. 5f). Gene flow is also observed in population SC and CQ. It is consistent with that they were in the same group according to phylogenetic, PCA, and structure analysis.
An analysis of demographic history using stairway plots revealed bottlenecks the populations may have undergone (Fig. 6), including one ranging 0.3–0.67 MYA, and a recent bottleneck around 0.03 MYA. For three populations (SC, SX, and YN), their effective population size is declining since they were affected by the Last Glacial Maximum and their extinction is underway.