Lineage-specific rediploidization is a mechanism to explain time-lags between genome duplication and evolutionary diversification

Background The functional divergence of duplicate genes (ohnologues) retained from whole genome duplication (WGD) is thought to promote evolutionary diversification. However, species radiation and phenotypic diversification are often temporally separated from WGD. Salmonid fish, whose ancestor underwent WGD by autotetraploidization ~95 million years ago, fit such a ‘time-lag’ model of post-WGD radiation, which occurred alongside a major delay in the rediploidization process. Here we propose a model, ‘lineage-specific ohnologue resolution’ (LORe), to address the consequences of delayed rediploidization. Under LORe, speciation precedes rediploidization, allowing independent ohnologue divergence in sister lineages sharing an ancestral WGD event. Results Using cross-species sequence capture, phylogenomics and genome-wide analyses of ohnologue expression divergence, we demonstrate the major impact of LORe on salmonid evolution. One-quarter of each salmonid genome, harbouring at least 4550 ohnologues, has evolved under LORe, with rediploidization and functional divergence occurring on multiple independent occasions >50 million years post-WGD. We demonstrate the existence and regulatory divergence of many LORe ohnologues with functions in lineage-specific physiological adaptations that potentially facilitated salmonid species radiation. We show that LORe ohnologues are enriched for different functions than ‘older’ ohnologues that began diverging in the salmonid ancestor. Conclusions LORe has unappreciated significance as a nested component of post-WGD divergence that impacts the functional properties of genes, whilst providing ohnologues available solely for lineage-specific adaptation. Under LORe, which is predicted following many WGD events, the functional outcomes of WGD need not appear ‘explosively’, but can arise gradually over tens of millions of years, promoting lineage-specific diversification regimes under prevailing ecological pressures. 
Electronic supplementary material The online version of this article (doi:10.1186/s13059-017-1241-z) contains supplementary material, which is available to authorized users.

The Atlantic salmon Hox clusters include six ohnologue pairs retained from the salmonid-specific ('Ss4R') WGD along with one singleton cluster [49]. The nomenclature used with respect to ohnologues retained from the teleost-specific ('Ts3R') and Ss4R WGD events is a/b and α/β, respectively [49]. There was a consistent phylogenetic signal in support of AORe model predictions for multiple salmonid ohnologues of HoxAa (Fig. S2) and HoxDa (Fig. S8). All the individual trees produced for these clusters included two separate salmonid clades, each represented once by each salmonid species and including one of the two salmonid-specific ohnologues from Atlantic salmon, where genomic organization has been established [49]. Relationships within each clade were consistent with robust molecular phylogenies [e.g. 32, 41], suggesting a strong signal of orthology across the captured salmonid ohnologues.
For the salmonid HoxAa, HoxCa and HoxCb duplicate clusters, combining the phylogenetic signal within different sampled alignments provided maximal statistical support (posterior probability 1.0) for the tree root representing the split of the northern pike from salmonids (Fig. S9), even though one fifth of the individual trees placed this species as a sister group to one of the salmonid-specific ohnologue clades, evidently at random along each cluster and with weak support (Fig. S2, S6 and S7). Considering the dominant signal indicating the expected branching of pike, these individual trees likely represent artefacts linked to the short length of individual alignments leading to a violation of the molecular clock. However, for the HoxDa analysis, northern pike was a sister to the salmonid clade containing the Atlantic salmon HoxDaα ohnologue in every individual phylogenetic analysis (n=7) (Fig. S8), which was recaptured in the combined analysis (Fig. S9). This raises the possibility that HoxDaα and HoxDaβ arose before the split of Salmonidae and Esociformes. However, this interpretation requires additional assumptions, including the loss of an entire HoxDa cluster in northern pike and cannot easily explain the detectable absence of salmonid-specific ohnologues for HoxDaα and HoxDaβ. Therefore, the consistent branching of pike with HoxDaα may represent an artefact linked to regional genomic differences in the pattern of evolution between the HoxDa clusters, again causing a violation of the relaxed molecular clock model.

LORe supported for two Hox cluster pairs
Phylogenetic analyses including 10 Atlantic salmon ohnologue pairs from the HoxBa cluster produced radically different topologies from those fitting the AORe model (Fig. S4). In an analysis combining the phylogenetic signal of each of these sampled alignments, the three salmonid subfamilies were monophyletic and independently split into two sister clades, each represented by the breadth of study species (posterior probability: 1.0; see Fig. 4A). Moreover, northern pike was maximally supported as the sister branch to all salmonids (Fig. 4A). This topology matches the predictions of LORe, as described in Fig. 1 and Fig. S1, assuming that permanent ohnologue divergence started independently within the basal evolution of each of the three salmonid subfamilies. An alternative scenario enforcing the AORe model requires the loss of a salmonid-specific Hox cluster in the ancestor of salmonids, followed by a minimum of three independent small-scale duplication events, one in each subfamily (assuming doubling of entire clusters; small-scale duplications of single Hox genes would require a much larger number of events) (see Fig. S1). This scenario is extremely unlikely for vertebrate Hox genes, where local duplication events within a cluster are yet to be observed and all expansions have occurred via WGD.
Phylogenetic analyses including five Atlantic salmon genes spanning the 'singleton' HoxAb cluster led to trees consistent with predictions of LORe (Fig. S3 and S9). In this case, we repeatedly identified a single orthologue of each Atlantic salmon HoxAb gene in all other members of Salmoninae and exactly two unique sequences in species from Coregoninae and Thymallinae. The presence of a single HoxAb cluster in Salmoninae can be explained by the loss of one ohnologous cluster in the common subfamily ancestor. Alternatively, it may reflect a region where rediploidization has yet to be resolved, or was resolved so recently that little ohnologue divergence has evolved, leading the assembly process to collapse both copies into single contigs [38]. There was evidence that LORe of HoxAb occurred twice within Coregoninae, separately within the Prosopium and Coregonus lineages (Fig. S3; Fig. S9), a pattern that was commonly observed in our genome-wide phylogenetic analyses (Fig. 5).
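The contrast between the two predicted topologies can be illustrated with a minimal sketch. It assumes rooted toy trees written as nested tuples, with hypothetical leaf labels of the form "Subfamily|Species|copy"; the function names, labels and tree shapes are illustrative assumptions and do not reproduce the Bayesian phylogenetic analyses used in the study. A tree is called LORe-like when all sequences of each subfamily form their own clade, and AORe-like when two clades each contain every salmonid species exactly once.

```python
# Toy rooted trees as nested tuples; leaves are "Subfamily|Species|copy" strings.
# All labels and tree shapes below are hypothetical illustrations.

def leaves(tree):
    """Return the leaf labels below a node."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def clades(tree):
    """Yield the leaf set of every node (tips included) in a rooted tree."""
    yield frozenset(leaves(tree))
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)

def classify(tree, outgroup="Esociformes"):
    """Label a tree 'LORe', 'AORe' or 'ambiguous' from its salmonid clades."""
    ingroup = [l for l in leaves(tree) if l.split("|")[0] != outgroup]
    clade_sets = set(clades(tree))
    subfamilies = {l.split("|")[0] for l in ingroup}
    species = sorted({l.split("|")[1] for l in ingroup})

    # LORe-like: all sequences of each subfamily form their own clade.
    if all(frozenset(l for l in ingroup if l.split("|")[0] == s) in clade_sets
           for s in subfamilies):
        return "LORe"

    # AORe-like: two clades that each hold every salmonid species exactly once.
    full = [c for c in clade_sets
            if sorted(l.split("|")[1] for l in c) == species]
    return "AORe" if len(full) == 2 else "ambiguous"

def duplicated(subfamily, *species):
    """Two sister clades within one subfamily, one per ohnologue copy."""
    def copy(n):
        return tuple(f"{subfamily}|{sp}|{n}" for sp in species)
    return (copy(1), copy(2))

def one_copy(n):
    """One ohnologue clade spanning all three subfamilies."""
    return (
        (f"Salmoninae|Ssal|{n}", f"Salmoninae|Omyk|{n}"),
        ((f"Coregoninae|Clav|{n}", f"Coregoninae|Pcyl|{n}"), f"Thymallinae|Tthy|{n}"),
    )

pike = "Esociformes|Eluc|1"
lore_tree = (pike, (duplicated("Salmoninae", "Ssal", "Omyk"),
                    (duplicated("Coregoninae", "Clav", "Pcyl"),
                     duplicated("Thymallinae", "Tthy"))))
aore_tree = (pike, (one_copy(1), one_copy(2)))

print(classify(lore_tree))  # LORe
print(classify(aore_tree))  # AORe
```

The check is deliberately coarse: it ignores branch support and does not verify that duplicates are present, so it should be read as an illustration of the expected clade structure rather than a test of either model.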
Text S2. Ambiguous trees in the genome-wide LORe analysis

As evidenced in Fig. 3 and Additional file 1, our genome-wide sampling of phylogenetic trees was almost always accompanied by a strong phylogenetic signal along verified duplicated collinear blocks of the genome, with only 13 out of 383 trees having an ambiguous topology outside the predictions of the LORe or AORe model. Interestingly, these trees were not randomly distributed, but were concentrated within a single duplicated block maintaining collinearity across chromosomes 9 and 20 (or '9qc-20qb' using Atlantic salmon nomenclature [38]). We sampled 23 trees from the 9qc-20qb region, of which 3 and 7 fit predictions of LORe and the AORe model, respectively, the remaining 13 being the ambiguous trees (Additional file 1). This is notable, as 9qc-20qb is the only region in the genome where we observed AORe and LORe trees physically interspersed within a single duplicated collinear block. Inspection of the ambiguous trees failed to reveal consistent branching patterns to explain why 9qc-20qb is an outlier in our analysis. Instead, the branching patterns included a range of paraphyletic groupings, involving different subfamilies and their ohnologues, that could not be reconciled with either LORe or the AORe model. 9qc-20qb is known to be unusual in maintaining an average level of similarity mid-way between regions of the duplicated genome that unambiguously match predictions of LORe vs. the AORe model (Fig. 3) [38]. However, we cannot currently explain our findings without unwarranted speculation. Nonetheless, this result points to unique rediploidization dynamics underlying the divergence of 9qc-20qb compared to the remaining genome. A high-resolution comparative analysis of salmonid genomes will be needed to further address this puzzle.
Text S3. Further details on the sequence capture study

Design of capture baits
Additional file 5 includes the sequences and accession numbers for 1,293 unique capture probes used in our study. The probes represented cDNAs mainly encoding complete protein sequences. The probes were from several salmonid species, predominantly Atlantic salmon Salmo salar (1,024 probes), rainbow trout Oncorhynchus mykiss (160 probes) and coho salmon Oncorhynchus kisutch (99 probes). Approximately 40% (514) of the probes were pre-selected to cover functional pathways of prior interest. These genes were extracted by BLASTn, either against NCBI or transcriptome databases for O. mykiss [Supplemental ref 1] and O. kisutch [Supplemental ref 2]. In Additional file 6, sequences obtained from transcriptome databases have been assigned an accession number for a closely related sequence (>99% identity) from S. salar or O. mykiss. 60% (776) of the probes were randomly selected S. salar genes. 69% (893) of the probes represented 'singleton' genes, where the sequences of any potential gene duplicates were absent from the probe set, even when such duplicates existed. The remaining 400 sequences (31% of probes) represented putative salmonid-specific ohnologue pairs/groups defined from past work or via BLAST analyses (see section below).

Efficiency of sequence capture
The efficiency of sequence capture across different levels of probe-to-target sequence divergence was calculated by mapping raw reads captured from northern pike (Esox lucius) back to the pike genome [51] (Fig. S14). This was done only for sequences with a 1:1 relationship between the salmonid probe and target pike gene. Therefore, the analysis was restricted to the top BLASTn hits in the pike genome (>80% sequence identity cut-off) for each of the 893 singleton probes. This removed any confounding effects arising from the presence of two ohnologue probes within the sequence capture mix. The analysis compared the probe-to-target percentage sequence identity with the mean mapped coverage of captured reads. This approach confirmed that sequence capture worked efficiently across large genetic distances (Fig. S14), spanning probe-to-target nucleotide identities of 72 to 97% (average: 88.3% identity; SD: 3.7%). The returned coverage across different probes ranged from 9x to 2,333x (average: 374x coverage; SD: 264x). There was a significant, but weak, predictive effect of probe-to-target nucleotide identity on the efficiency of the sequence capture (R² = 7.1%, P < 0.0001). Thus, using salmonid probes to capture pike genes was highly effective across a large sample of different genes and divergence levels.
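The relationship between probe-to-target identity and capture coverage can be summarized as sketched below. The sketch assumes that the reported statistic is the coefficient of determination from a simple linear regression of mean mapped coverage on percentage identity; the function name and the input values are hypothetical and are not data from the study.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # For one predictor, R^2 equals the squared Pearson correlation.
    return sxy ** 2 / (sxx * syy)

# Hypothetical (identity %, mean coverage) values, not data from the study.
identity = [72.0, 80.0, 85.0, 90.0, 97.0]
coverage = [150.0, 420.0, 260.0, 510.0, 380.0]
print(f"R^2 = {r_squared(identity, coverage):.1%}")
```

A low value, as reported above, indicates that identity explains only a small share of the variation in coverage among probes, even when the association itself is statistically significant.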

Assessing the capture of salmonid ohnologues
Species-specific assemblies of the captured reads were used in reciprocal BLAST searches against the singleton probes in order to estimate the proportion of putative gene duplicates captured by single probe sequences for all salmonid species (Fig. S15). BLAST searches were conducted using local BLAST v2.4.0+ [68]. The top 5 hits in each assembly with sequence similarity >85% to the probe sequence across at least 100bp (1e-0.20 cut-off) were assessed to determine if the probe sequence matched one or two unique sequences across the length of captured regions corresponding to the probe sequence. When more than one sequence was recovered by a singleton probe, the sequences were defined as unique if they shared <98% identity, which is the upper end of similarity between salmonid-specific ohnologues [38]. Moreover, a lower cut-off of 85% sequence identity was selected, as this is the lower end of sequence identity between salmonid-specific ohnologue pairs within salmonid genomes [38]. Of the 893 singleton probes queried, on average 99.5% (904 probes, SD=0.2%) returned contigs representative of at least one gene copy over the 15 salmonid species (Fig. S15). Around half of all genes are retained as ohnologues from the Ss4R WGD [e.g. 38, 45] and our sequence capture data fall in line with these expectations, as 45.1% (SD = 3.3%) of the BLAST searches of singleton probes captured two paralogous genes (Fig. S15). Finally, 100% of assembled contigs for one randomly selected species (Brachymystax lenok) had a significant BLAST hit against the original probes (1e-0.20 cut-off), indicating that the assemblies were also highly specific to the original probes. Taken with our past work [46] and data published within this paper, it is clear that the Agilent SureSelect platform provides a highly repeatable approach to obtain recently duplicated genes from any salmonid species, including salmonid-specific ohnologues, even when only one gene duplicate is present in the capture probe set.
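The filtering logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: the hit tuples, the pairwise contig identity table and all names are hypothetical, hits are assumed to be supplied in BLAST rank order, and the E-value cut-off is left out of the sketch. Only the 85% identity, 100 bp, top 5 and 98% thresholds are taken from the text.

```python
MIN_PROBE_IDENTITY = 85.0   # % identity to the probe (strictly greater than)
MIN_ALIGNMENT_BP = 100      # minimum aligned length in bp
DISTINCT_BELOW = 98.0       # contigs sharing < 98% identity count as distinct
TOP_HITS = 5                # number of passing hits inspected per probe

def count_captured_copies(hits, contig_identity):
    """Count distinct gene copies among one probe's BLAST hits.

    hits: (contig_id, pct_identity_to_probe, alignment_bp), best hit first.
    contig_identity: maps frozenset({id_a, id_b}) to % identity between contigs.
    """
    passing = [h for h in hits
               if h[1] > MIN_PROBE_IDENTITY and h[2] >= MIN_ALIGNMENT_BP][:TOP_HITS]
    distinct = []
    for contig_id, _, _ in passing:
        # Keep a contig only if it differs enough from every contig already kept.
        if all(contig_identity[frozenset({contig_id, kept})] < DISTINCT_BELOW
               for kept in distinct):
            distinct.append(contig_id)
    return len(distinct)

# Hypothetical hits for one singleton probe in one species assembly.
hits = [("c1", 99.0, 800), ("c2", 92.0, 750), ("c3", 91.5, 300),
        ("c4", 80.0, 500), ("c5", 95.0, 60)]
contig_identity = {
    frozenset({"c1", "c2"}): 91.0,
    frozenset({"c1", "c3"}): 91.2,
    frozenset({"c2", "c3"}): 99.5,
}
print(count_captured_copies(hits, contig_identity))  # 2
```

In this toy case c4 fails the identity threshold, c5 fails the length threshold, and c3 is treated as the same copy as c2, so the probe is scored as having captured two putative gene duplicates.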
Topology expected if LORe occurred within the evolution of each salmonid subfamily. The inclusion of multiple lineages within each salmonid subfamily enables more precise inference of the point where ohnologue divergence (i.e. rediploidization) started. F) Hypothetical evolutionary scenario where the part E tree is expected under the AORe model. This scenario requires the loss of one salmonid-specific ohnologue ancestrally followed by multiple independent local duplications in different salmonid lineages. Importantly, in our genome-wide analyses (Fig. 3), we always included verified salmonid-specific ohnologues defined independently by their location in collinear blocks retained from the Ss4R WGD [38]. This step adds confidence that the part E topology is not a product of small-scale duplication.

Fig. S14. Global efficiency of our sequence capture study. The plot shows the relationship between the mean number of northern pike reads captured (y-axis) in relation to the percentage identity shared between 'singleton' salmonid probe sequences (i.e. where only one salmonid ohnologue was present in the probe mix, even if more existed) and the equivalent target sequence in pike.

Indolalkylamine biosynthetic process Ssa02-05 (n=1); Ssa03-06 (n=2); Ssa07-17 (n=2); Ssa11-26 (n=1) 0.50
GO:0006586 Indolalkylamine metabolic process Ssa02-12 (n=1); Ssa03-06 (n=3); Ssa04-08 (n=1); Ssa07-17 (n=2); Ssa02-05 (n=1); Ssa16-17 (n=1); Ssa11-26 (n=1) 0.20
GO:0090403 Oxidative stress-induced premature senescence