The X. laevis L and S subgenomes show a bias in chromatin state and gene expression
To study the evolution of gene regulation in the context of WGD, we generated transcriptomic and epigenomic profiles in X. laevis early gastrula embryos (Nieuwkoop-Faber stage 10.5; Additional file 1). We performed RNA-seq and obtained epigenomic profiles using ChIP followed by deep sequencing (ChIP-seq). We generated ChIP-seq profiles for H3K4me3, associated with promoters of active genes, H3K36me3, associated with actively transcribed genes, the Polr2a subunit of RNA Polymerase II (RNAPII), and the transcription coactivator p300. In addition, we performed WGBS to obtain DNA methylation profiles [15]. The sequencing results and details are summarized in Additional file 1.
We created whole-genome alignments (see “Methods”) to establish a framework for analysis of the epigenetic modifications in the two X. laevis subgenomes and in the X. tropicalis genome. Of the X. laevis L and S non-repetitive sequence, 61% and 59%, respectively, can be aligned with the orthologous X. tropicalis sequence. This allows for comparisons of the activity of genes and regulatory elements between homeologous regions. Figure 1 shows a region on X. tropicalis chromosome 8 containing four genes, together with the corresponding aligning sequences on chr8L and chr8S in X. laevis. The epigenomic profiles (H3K4me3, p300, RNAPII, and H3K36me3) of both X. laevis and X. tropicalis [16] are shown and the sequence conservation obtained from the whole gene alignment is illustrated by gray lines in the center of the plot. Regions that are conserved at both the sequence level and at the functional level (as measured by ChIP-seq) are highlighted. The anp32e gene is an example of a conserved gene that is expressed from all three genomes, as evidenced by H3K4me3 at the promoter and H3K36me3 and elongating RNAPII in the gene body. In contrast, expression of the plekho1 gene has been lost from S. The gene is still present, but it is not active. There is no evidence of expression and both the H3K4me3 and the p300 signal are lost. Finally, the vps45 gene is an example of a gene that is completely lost from L.
Next, we quantified gene expression patterns in the X. laevis subgenomes. Of the 17,303 genes expressed at stage 10.5, 9230 can be assigned to the L subgenome and 6685 to S. Of those expressed genes, 4972 are singletons located on L and 2646 on S. As reported previously [8], when both genes of a homeologous pair have detectable expression (3545 genes), the expression level is correlated (Pearson R = 0.60, p < 1e-300; Fig. 2a) and a minor but significant expression bias is detected (median expression difference of L compared to S = 5.7%; p < 1e-4; Wilcoxon signed-rank test). However, for many homeologs the expression bias is quite high, such that for one copy hardly any expression can be detected. Such non-expressed homeologs are located on both L and S, but occur more frequently on S (L: 494, S: 713; p = 6.0e-11, Fisher’s exact test).
We examined whether the expression differences between the L and S homeologs could be explained by differential transcription regulation. We used the epigenomic profiles to assay the promoter state (H3K4me3, DNA methylation), enhancer activity (p300), and active expression (RNAPII, H3K36me3). The L subgenome has 38% more annotated genes than the S subgenome [8]. We observe the same trend for the regulatory elements. The number of H3K4me3 peaks, DNA-methylation free regions (see “Methods”), and p300 peaks is higher on L (28, 23, and 35%, respectively; Additional file 2). The overall effect is that there is no significant difference between the numbers of regulatory elements per gene for the two subgenomes.
To analyze the conservation of regulatory elements, we compared the H3K4me3 and p300 data to similar ChIP-seq profiles from X. tropicalis obtained at the equivalent developmental stage [16]. In general promoters are much more conserved than enhancers (Fig. 2b). From all H3K4me3 peaks in X. tropicalis, ~ 40% are conserved in X. laevis, while for the p300 peaks the conservation is only ~ 13% (p < 1e-4; Chi-squared test). This is congruent with the finding in mammals that enhancers evolve much more rapidly than promoters [17]. Whereas the number of conserved regulatory elements is lower in S than in L, the elements that can be aligned differ relatively little at the sequence level and show over ~ 60% sequence identity (Fig. 2c).
These analyses show that the L and S subgenomes have evolved differently with respect to gene content [8] and regulatory elements. Many more genes from S are lower expressed than their homeologs in L than vice versa. The number of functional regulatory elements, as identified by H3K4me3 and p300 ChIP-seq, is proportional to a more profound loss of homeologous genes from the S subgenome. Next, we set out to determine the origin of this differential loss.
Large deletions are prominent in the S subgenome
The chromosomes of the X. laevis S subgenome are substantially shorter than the L chromosomes. The average size difference is 17.3% based on the assembled sequence [8] and 13.2% based on the karyotype [18]. To investigate the cause of these differences, we analyzed the pattern of deletions on both subgenomes. We called deleted regions based on the absence of conservation between the X. laevis subgenomes if they were at least partly conserved between one X. laevis subgenome and X. tropicalis. In addition, to be able to measure the size of the deletions, we required that the putative deleted regions were flanked on both sides by conserved sequences on both X. laevis subgenomes (Additional file 3: Figure S1). This resulted in a set of 19,109 deletions, of which 13,066 (68%) were deleted from S (LΔS) and 6043 (32%) were deleted from L (SΔL). There is a clear deletion bias towards S, which increases with the size of the deletion (Fig. 3a). These deletions affect genes and their regulatory sequences, as for example in the glrx2 locus where the promoter and most of the exons have been lost from the S subgenome (Fig. 3b). We asked to what extent functional sequences in the L and S subgenomes are preserved (i.e. subject to fewer deletions) relative to the subgenome-specific deletion rates. To do that, we randomly redistributed the deletions per chromosome and compared the effect on various annotated and experimentally derived features. As we cannot assess these features before their deletion, we used the annotation and experimental data of the homeologous feature from the other subgenome as a proxy for the state in the genome from which that feature was deleted. The fold difference between the observed number of deleted basepairs and the expected number (mean of 1000 randomizations) is visualized in Fig. 3c. As expected, the frequency of deletions in intergenic regions and introns is similar relative to a uniform chromosomal distribution of deletions. The observed loss of exons on L is significantly lower than this randomized distribution (p = 1.8e-20; Fig. 3c). The fraction of exonic sequence that has disappeared is approximately fourfold less than intronic or intergenic sequence (Additional file 3: Figure S2). This is likely the result of negative selection against loss. By contrast, for subgenome S the fraction of exonic sequence that has been deleted is similar to the rest of S (Fig. 3c) and exonic sequences in S appear not to be under selection against deletion. To obtain more direct evidence of functional sequences, we examined the loss of genomic elements that are decorated with RNAPII and the active transcription histone mark H3K36me3 (IntronicTx, ExonicTx, see “Methods”), with the enhancer coactivator p300, or with the active promoter mark H3K4me3. There appears to be strong selection on both S and L against deletion of actively transcribed exons (Fig. 3c, middle panel; p = 2.4e-4 and p = 2.3e-7, respectively) but not of transcribed introns. Furthermore, active enhancers and promoters in S and in L have significantly fewer deletions compared to the uniform chromosomal distribution (Fig. 3c; p = 8.4e-7, p = 8.4e-8, p = 1.4e-5, and p = 2.9e-12, respectively) and therewith appear to be under selection against loss. There is a large difference in the number of deletions between L and S (Fig. 3a); however, this in itself is not necessarily the result of selection as it mostly affects non-functional sequences (Fig. S2a). We asked if, on top of this difference in absolute number, there is evidence for more selection against deletions in L than in S. We therefore compared the reduction in the loss of transcribed exons, promoters, and p300 elements relative to background loss between L and S. For all three the reduction in L appears to be larger than in S (Fig. 3c). For p300-bound enhancers and for H3K4me3-decorated promoters this difference in the reduction between L and S is significant (p = 0.003 and p = 0.001, respectively). This suggests that, aside from a higher deletion rate in S, there is also less selection against deletion of functional genetic elements in S than in L.
One of the possible sources of the loss of genomic DNA in the L and S subgenomes is non-allelic homologous recombination (NAHR), which is known to occur between long repetitive elements on the same chromosome [19]. To test whether this phenomenon could be responsible for the genomic losses detected, we examined the length distribution of repetitive elements in retained regions, i.e. the homeologous regions of the sequences that were lost in one of the subgenomes (Fig. 3d). Indeed, we observe that repetitive elements are on average 3.7 times longer (p < 1e-52; Mann–Whitney U test) compared to random genomic sequences (Fig. 3d). Furthermore, the flanks of the retained regions (L for LΔS and S for SΔL, respectively) tend to be more similar to each other than random genomic sequences (p < 1e-83; Mann–Whitney U test; Fig. 3e). Nevertheless, the current density of repetitive elements is similar in the L and S subgenomes (Additional file 3: Figure S3), indicating that repeat density alone does not cause biased sequence loss on S chromosomes. These observations suggest that NAHR of ancient repeats has played a significant role in the deletions of regions from both subgenomes; the overall sequence loss is much more prevalent on the S chromosomes (Fig. 3a). To estimate when in the evolution these deletions and other types of mutations occurred, we dated the origin of the pseudogenes that they caused.
High levels of pseudogenization started after hybridization and continue to the present
To date the pseudogenes, we aligned them with the protein-coding regions in L, S, and the outgroup X. tropicalis (see the “Search and alignment of orthologs and evolution rates” section in “Methods”). The coding regions in S are generally less conserved than in L, especially regarding synonymous substitutions (Ks, Fig. 4a, p < 2.2e-16; Wilcoxon signed-rank test). However, the ratio between non-synonymous and synonymous substitutions (Ka/Ks) is only slightly higher in S compared to L (Fig. 4b, p < 2.2e-16; Wilcoxon signed-rank test). The difference in Ks between the L and S subgenomes shows that S has been subject to moderately higher mutation rates than L. In order to examine whether the relatively high level of mutations in the S genome persists to this day, we examined the level of SNPs separating the published inbred genome [8] and the progeny of two outbred individuals (see the “SNP calling” section in “Methods”). We observe that the level of SNPs in the S genome is 3% higher than in the L genome in intergenic (p = 5e-136; Chi-squared test) and intronic regions (p = 8e-101; Chi-squared test). A similar difference is observed in fourfold degenerate (4D) positions of coding DNA (also assumed to be under relaxed constraint) but this is not statistically significant (Additional file 4). The 4D positions exhibit a SNP density higher than in non-coding DNA; this correlates with an overrepresentation of CpGs in coding DNA (Additional file 3: Figure S4) and has been observed before in human genomes [20].
Given that the hybridization event occurred 17 Mya [8], the higher SNP density in S relative to L (Additional file 4) cannot be a relic from the time before the hybridization (Additional file 5) and it suggests that the relatively high rate of genome degradation in S continues to this day. To examine the continuity of this genome degradation, we dated unitary pseudogenes [21] caused by point mutations and/or deletion-related events (Fig. 5a). We distinguish four, non-exclusive types of pseudogenes: genes that contain a premature stop codon; genes of which the coding sequence is at least 50% shorter than their homeolog and their ortholog in X. tropicalis; genes that have lost at least the 75% of their promoter relative to their homeologs that do have a promoter decorated with H3K4me3 in embryos; and genes that contain a frameshift. We furthermore required for each class that the pseudogene candidate is expressed at least tenfold lower than its homeolog. In all cases, we do observe that the rate of pseudogenization has increased dramatically around 18 Mya, i.e. close to the inferred date of the hybridization, and that that rate is ~ 2.3-fold higher in S than in L (Fig. 5a). Furthermore, this rate continues to be high until this day for every class considered (Fig. 5b). We obtained very similar results when we included one-to-one orthologs from additional species in the dating of the pseudogenes and bootstrapped the results per gene to obtain confidence intervals (see the “Bootstrapping pseudogene dates” section in “Methods”) (Additional file 3: Figure S5). When we separate the pseudogenes into non-overlapping classes we observe that deletions are a prevalent cause of pseudogenization (39% and 44% on L and S, respectively) and, as expected, the older pseudogenes are affected by more than one type of damage (Additional file 3: Figure S6). Pseudogenization after genome duplication has been observed to affect certain classes of protein functions more than others, with metabolic functions often being the first ones to be lost relative to regulatory proteins [6]. Indeed, when we date the loss of genes in the function categories associated with the loss, we find an overrepresentation of various metabolic processes, with the pseudogenes belonging to those categories dating often shortly after the WGD event (Additional file 3: Figure S7). We found no evidence for the preferential loss of complete complexes rather than partial complexes, e.g. for dimers the fraction of cases where of both genes only a single copy was left (17.6%), was not higher than the expected percentage if we assumed the losses of the genes from complexes to be independent from each other (18.0%) (see “Methods”). To test for the influence of a potential dosage effect on gene loss, we compared the predicted genome-wide haploinsufficiency score (GHIS) [22] of the human ortholog of X. laevis homoeolog and singleton genes (Additional file 3: Figure S8). Singletons indeed have a significantly lower GHIS score than homeologs (p = 1.1e-17; Mann–Whitney U test), although the difference is minor (3.0%).
To find independent evidence that the rate of pseudogenization in X. laevis remains high until the present, we examined genes that appeared to be polymorphic with respect to their pseudogene state, i.e. we searched for protein truncating variants (PTVs) (variants which potentially disrupt protein-coding genes) in the progeny of two of our outbred genomes (see the “SNP calling” section in “Methods”) relative to the published inbred genome [8]. Among all possible PTVs, we limited the analysis to SNPs that introduce a premature stop codon (nonsense mutations), as they can be called relatively reliably [23]. As a reference, we compared the nonsense SNP density with the one we measured in X. tropicalis using the same type of data and settings to call the SNPs, i.e. the progeny of two outbred genomes. In the 23,667 annotated genes in L and 16,939 in S, we detect 528 (2.23%) and 367 (2.17%) genes with at least one loss of function (LOF) variant. In contrast, in the 26,550 genes of X. tropicalis, we detect only 388 (1.46%) LOF variants (Fig. 5c, left). When normalizing the nonsense variants by the total number of SNPs in coding regions per (sub) genome, the fraction of premature stop variants in S (5.9e-3) is slightly higher than that in L (5.7e-3) while both are substantially and significantly higher than in X. tropicalis (4.5e-3; p < 0.001 for both comparisons; Chi-squared test; Fig. 5c, right). To substantiate that the selected PTVs are indeed hallmarks of incipient pseudogenes, we compared their expression with the expression of the other genes in their respective (sub)genome and found that genes with a SNP introducing a premature stop codon have a significantly lower expression (Fig. 5d). Second, we used the equation for dating of unitary pseudogenes to estimate the time of loss of selection in the PTV containing genes. We found that genes with this type of variants present in the population show evidence of loss of selection when compared to the set of genes that are not pseudogenes (p = 1e-5; Student’s t-test; Fig. 5e) and that this loss of selection is more recent than for pseudogenes with only a single feature for pseudogenization that is fixed in the population (p = 5.6e-7; Student’s t-test; Fig. 5e). That we find a higher level of SNPs in S than in L cannot be a relic from the time before the hybridization in which the S species may have had a higher SNP density than L, given that the hybridization occurred 17 Mya (Supplemental note). Altogether, these results suggest that, in addition to deletions, a higher mutation rate and a more relaxed selection pressure in S has contributed to the differences that the subgenomes present nowadays, including differential gene loss. This gene loss continues to be at a higher rate than in a closely related diploid species.
Transposons have contributed subgenome-specific enhancer elements
The results described above document the pervasive loss and ongoing decay of coding and regulatory sequences after interspecific hybridization genome duplication. We next asked to what extent regulatory innovations have contributed to genomic evolution of this species. At many loci, the profile of p300 recruitment is remarkably different between L and S loci, with differences in both p300 peak intensity and number of peak regions across homeologous loci, for example in the slc2a2 locus (Fig. 6a). We identified 2451 subgenome-specific p300 peaks lacking any conservation with either the other subgenome or X. tropicalis (colloquially referred to as “new” enhancers). There are similar numbers of these non-conserved subgenome-specific p300-bound elements in the L subgenome (n = 1214) and the S subgenome (n = 1237).
Because new sequences can be acquired by transposition, we examined the overlap of subgenome-specific enhancers with annotated repeats and found that 87% (2143 of 2451; overlap > 50%) are associated with annotated repeats, compared to 24% (5557 of 23,017) of all enhancers (p < 1e-308; hypergeometric test). Three repeats (designated REM1, Kolobok-T2, and family-131) were particularly enriched; individually they overlap with 37–53% of the subgenome-specific p300 peaks, compared to 3–9% at other p300 peaks (Fig. 6b). Together these three annotations account for 1338 (54%) of new enhancers, 862 of which have all three annotations overlapping at the same location. They form a 650-bp sequence with an almost perfect 195-bp terminal inverted repeat (TIR), the most terminal 65 bp of which shows 83–90% similarity with the TIRs of a Kolobok-family DNA transposon present in X. tropicalis (Additional file 3: Figure S9). This specific Kolobok DNA transposon carries the REM1 interspersed repeat and is present almost exclusively in X. laevis (8833 and 8802 copies in L and S, respectively, vs. four copies in X. tropicalis), suggesting that it is a relatively young TE that proliferated after the split with X. tropicalis. It carries several transcription factor (TF) motifs, including the Eomes T-box motif and the Six3/Six6 homeobox motif (Fig. 6c).
We examined the correlation of the new Kolobok enhancers with gene expression and found that genes with a transcription start site within 5 kb of these subgenome-specific Kolobok enhancers are more highly expressed than other genes in that subgenome (p = 1e-4 for L and p = 8e-5; Mann–Whitney U test) (Additional file 3: Figure S10), suggesting that the new enhancers are inserted close to active genes and/or promote the expression of these genes.
Regulatory remodeling by transposons in X. tropicalis × X. laevis hybrids
The gene expression (Fig. 2) and p300 recruitment (Fig. 6) differences between the L and S subgenomes may have been caused by regulatory incompatibilities affecting enhancer activity or DNA methylation, which could act immediately upon interspecific hybridization. Alternatively, these differences may represent the long-term effects of genomic co-evolution of the two subgenomes. To examine whether the differences between the two subgenomes were caused by the hybridization event itself, we determined the immediate effect of hybridization on DNA methylation and the patterns of H3K4me3 and p300 enrichment at regulatory regions. We generated embryos obtained by fertilization of X. laevis eggs (LE) with X. tropicalis sperm (TS). The resulting LETS hybrid embryos were compared to normal laevis (LELS) and tropicalis (TETS) embryos. The reverse hybrid (TELS) was not viable, as previously described [24].
To examine the early potential changes in DNA methylation, we performed WGBS on the DNA of LETS, LELS, and TETS embryos. The overall methylation in hybrid and normal embryos is almost identical at 92%. We identified a total of 709 differentially methylated regions (DMR) (false discovery rate [FDR] = 0.05); 181 and 72 hypermethylated and 384 and 72 hypomethylated regions in respectively the X. laevis and X. tropicalis genomes. This reflects both gain and loss of DNA methylation in the subgenomes of LETS hybrid embryos (Fig. 7f, g). There is no evidence in the underlying DNA sequence signatures for these regions being related to gene-regulatory regions (Additional file 3: Figure S11a–d). They are also not in close proximity of genes and may represent regions with inherently unstable DNA methylation. The global pattern of H3K4 trimethylation at promoters is also quite similar in hybrids and normal embryos; less than ten peaks changed in hybrid embryos relative to normal embryos (Additional file 3: Figure S11e).
Recruitment sites of p300, however, are specifically gained and lost at several subsets of X. tropicalis genomic loci in hybrid embryos (Fig. 7a); 629 p300 recruitment sites were gained (a 2.6% increase relative to normal X. tropicalis embryos), whereas just 67 p300-bound regions were lost (adjusted p value cutoff 1e-5). In the X. laevis part of the hybrid genome, none were lost or gained (Fig. 7a), indicating that the changes in the hybrid are biased towards the paternal tropicalis genome. To assess the epigenetic state of the gained and lost p300-binding regions, we used our epigenome reference maps of histone modifications in X. tropicalis [16]. Among all the marks tested, only H3K9me3 was significantly enriched, specifically at sites of gained p300 recruitment (Fig. 7b), suggesting that these regions are heterochromatic in normal (TETS) embryos but can recruit the p300 co-activator in LETS hybrid embryos.
While examining the p300 hybrid-specific recruitment sites, we noticed that transposable elements were present at many locations (Fig. 7c, d); 82% of the hybrid-specific p300 peaks overlapped more than 50% with annotated repeats. We therefore examined the occurrence of specific repeats at gained p300 sites and found that three repeat annotations (family - 451, 203, and 189) were strongly enriched (p = 1e-5; hypergeometric test), each accounting for 20–37% of all newly gained p300 peaks, whereas they only overlap with < 1% of other p300 peaks (Fig. 7c, lower panel). The three repeat annotations strongly co-occur and form a 1.3-kb sequence with a 200-bp imperfect TIR, which shows ~ 80% similarity with those of known PiggyBac-N2A DNA transposons (Additional file 3: Figure S12). We recently found that DNA transposons that are heterochromatinized by H3K9me3 in X. tropicalis embryos are relatively young relative to other TEs [25]. Indeed, the piggyBac DNA transposons that gain p300 binding in hybrids are much less abundant in X. laevis than in X. tropicalis, suggesting that these relatively young transposons get derepressed in the X. laevis egg which has had little prior exposure to this transposon. These elements also carry transcription factor binding sites. Nine motifs are enriched (p = 1e-5; hypergeometric test) and are present in 10–35% of gained p300 recruitment sites, compared to a 1–3% prevalence of these motifs in other p300 peaks (Fig. 7e). These DNA-binding motifs represent binding sites of Homeodomain and T-box binding factors, which are abundantly expressed during early embryogenesis.
These results document DNA transposon-associated p300 recruitment and DNA methylation instability in experimental interspecific hybrids.