Gene co-expression is tissue-specific and highlights functional evolutionary trajectories
We applied the Arboretum [9] algorithm to RNA-seq data of six tissues in five species and identified 10 modules of 12,051–14,735 co-expressed genes (1205–1474 genes per module per species) represented across 18,799 orthogroups (Fig. 1a). Modules of co-expressed genes across the five species show varying expression levels in specific tissues, e.g., module 1 is eye specific, while module 3 is heart, kidney, and muscle specific (Fig. 1a). Consistent with the phylogeny and divergence times, there are more (13,171/18,799) orthologous genes that are conserved in module assignment (orthologous modules) in the three closely related haplochromines (Pundamilia nyererei, Maylandia zebra, and Astatotilapia burtoni) and Neolamprologus brichardi, than with Oreochromis niloticus (11,212/18,799 orthologous genes). Examples of modules where orthologs are not conserved in module assignment (non-orthologous modules) include modules 2, 4, and 6 (Additional file 1: Fig. S1a, blue off-diagonal elements) and are representative of gene expression divergence across the species. Between the haplochromines alone, 4179/18,799 orthologous genes are distributed in either one of two modules, e.g., 0 or 8 (Additional file 1: Fig. S1a, blue off-diagonal elements in haplochromines), indicative of gene expression divergence along the phylogeny.
The assignment of co-expressed gene modules by Arboretum [9] is inferred using a probabilistic framework starting from the last common ancestor (LCA) in the phylogeny. This allows us to model the evolutionary trajectory of orthologous genes and their co-expression along the species tree [9]. Orthologous genes of each species can be assigned to non-orthologous modules (Fig. S-R1a), indicative of co-expression divergence and potential transcriptional rewiring from the LCA; this is referred to as “state changes” in module assignment. In total, 7587/18,799 (40%) orthologous genes exhibit state changes in module assignment across branches. To ensure orthologous genes of all branches are included in subsequent analysis, we focused on state changes of 6844 1-to-1 orthologous genes to assess convergent and unique state changes along the phylogeny (Fig. 1b). We identified convergent state changes of 732 genes along all ancestral nodes versus Anc4 (Additional file 1: Fig. S2). This is made up of 772 genes in Anc3 and Anc2, 734 genes in Anc3 and Anc1, and 996 genes in Anc2 and Anc1 (Additional file 1: Fig. S2), including a few TFs (46 TFs—Anc3-2-1; 49 TFs—Anc3-2; 46 TFs—Anc3-1; 66 TFs—Anc2-1) such as tbx20, nkx3-1, and hoxd10. We identified unique state changes and expression divergence of 655 genes along ancestral nodes (Fig. 1b), including several cellular and developmental TFs (51 TFs—Anc4/3; 20 TFs—Anc3/2; 34 TFs—Anc2/1) such as foxo1, hoxa11 and lbx1. Several of these state changed regulatory TFs are also enriched (fold enrichment 1.1–1.7; false discovery rate, FDR < 0.05) in gene promoters of relevant tissue-specific modules; for example, promoters of module 1 genes (eye-specific expression) are significantly enriched (fold enrichment 1.1–1.6; FDR < 0.05) for TF motifs involved in retina- and lens-related development/functions, e.g., CRX, PITX3, and OTX1 [21] (Additional file 1: Fig. S3, Additional file 2: Fig. S2). Further examination identifies that there are differences in the levels of TF motif enrichment across species genes, including that of retina/lens-related TFs, e.g., RARα/β/γ and RXRα/β/γ [22] of module 1 gene promoters in all species except N. brichardi (Additional file 1: Fig. S3, Additional file 2: Fig. S2). Such differences in motif enrichment could be associated with changes in the level of TF expression, where state changes (Fig. 1b) reflect shifted domains of tissue expression and imply differential regulatory control of target genes across tissues and along the phylogeny. We tested this by taking (1) the log expression ratio (as used for Arboretum input), for all 337 expressed TFs in each species tissue; (2) the corresponding 2064 TF motif enrichment scores (−log q-value, FDR < 0.05) calculated across 12,051–14,735 promoters regions of all species genes in the 10 modules; and (3) calculating the cross-species Pearson correlation coefficient (r) between the motif enrichment score and expression value of each TF and in each tissue (Additional file 2: Fig. S3-S8) using the n = 5 species. We note different patterns of correlation between cross-species TF motif enrichment and tissue-specific expression; in total, 102–119/337 TFs had no correlation (0 ≤ r ≤ 0.01, n = 5) and included many TFs that had large shifts in motif enrichment and/or expression in several species, representative of several phylogenetic state changes, e.g., Kidney-Module2-FOXO1 (r = 0.01, n = 5) (Extended Data S-R1F). On the other hand, there is positive correlation ranging from small (0.1 ≤ r ≤ 0.3, n = 5) for 161–197 TFs, medium (0.3 < r ≤ 0.5) for 161–186 TFs, and large values (0.5 < r ≤ 1) for 226–262 TFs. The largely correlated TFs (0.5 < r ≤ 1) includes cases where there is comparable motif enrichment scores across species, as calculated by the variance distribution (see “Methods”), and either no shifts (no TF state changes), e.g., Brain-Module9-FOXA2 (r = 0.97, n = 5, p value < 0.05) or focused shifts (TF state change in one or subsets of species), e.g., Eye-Module2-CDX1 (r = 0.98, n = 5, p value < 0.05) in TF tissue expression (Additional file 1: Fig. S5, Additional file 2: Fig. S3-S8). Such patterns of focused shifts in expression are also observed in TFs of selected modules like, for example, module 1 which contains eye-expressed genes. We find that retinal TFs that are known to modulate opsin expression, e.g., CRX [23], have variable motif enrichment (fold enrichment 1.2–1.4) in eye-expressed genes, and are associated (r = 0.85, n = 5, p value < 0.1) with a concurrent change (increase in four species or decrease in N. brichardi) in TF eye expression along the phylogeny (Additional file 1: Fig. S6; see Additional file 1text). For most TFs (226–262/337 TFs) and tissues, motif enrichment is largely correlated (0.5 < r ≤ 1) with TF expression. After calculating the variance of each TF motif enrichment and categorizing the tails into either similar or dissimilar levels of TF motif enrichment (see “Methods”), we note that similar motif enrichment (across species) is associated with either expression conservation (across all species) or subtle expression changes (in one or subsets of species) and is more stable (in expression differences) than TFs with dissimilar/variable motif enrichment along the phylogeny (Additional file 2: Fig. S3-S8). Gene co-expression differences and convergence between species could therefore be driven by differences in TF motif levels in gene promoter regions.
Fine scale nucleotide variation at TF binding sites drives regulatory divergence in cichlids through GRN rewiring
Cis-regulatory elements, including promoters and enhancers, are central to gene expression regulation, largely acting through the binding of TFs to multiple transcription factor binding sites (TFBSs). Therefore, mutations within TFBSs can alter target gene transcription without affecting the expression pattern of other genes co-regulated by the same TF, thus driving GRN evolution. In the five cichlid genomes however, there is no significant increase in evolutionary rate at promoter regions compared to fourfold degenerate sites (Additional file 1: Fig. S7). However, we identify a few outlier genes with significantly higher evolutionary rate at promoter regions at ancestral nodes (12–351 genes, Additional file 1: Fig. S7b) and within species (29–352 genes, Additional file 1: Fig. S7d), indicative of small-scale changes in promoter regions (see Additional file 1 text). Concurrently, of all the identified pairwise species variation (8 to 32 million variants), a large proportion (13–28%) overlap predicted TFBSs in promoter regions, and this is higher than (8–9%) of variants that are present in flanking gene promoter regions of the same length (Additional file 1: Table S2, Additional file 1: Fig. S8). GO enrichment analysis of co-expressed genes with variation in their regulatory regions, against a background of all genes in each genome, highlights associations with key molecular processes, e.g., signal transduction-promoter TFBSs (Additional file 1: Fig. S9).
To further investigate patterns of divergent regulatory programs that could be associated with discrete nucleotide variation at regulatory binding sites, we developed and applied a computational framework (see “Methods,” Additional file 1: Fig. S20) to comparatively study regulatory interactions/relationships across the five cichlids. This involved the reconstruction of species-specific GRNs through the integration of different genomic datasets (Additional file 1: Table S3). We focused on regulatory interactions/relationships of trans-acting factors (TFs) and DNA (gene promoter regions); this involved integrating an expression-based network with in silico predictions of TF binding to target gene (TG) promoters using our cichlid-specific and vertebrate-wide TF motif scanning pipeline (see “Methods,” Additional file 1: Fig. S20). We first used species- and module-specific gene expression levels to infer an expression-based network [24] (see “Methods,” Additional file 1: Fig. S20), generating 3180–4099 transcription factor-target gene (TF-TG) edges across the five species (FDR < 0.05, Additional file 1: Table S3). Next, based on our in silico TFBS motif prediction pipeline, we predicted TFBS motifs up to 20 kb upstream of a gene transcription start site (TSS), and using sliding window analysis of 100 nucleotides (nt), we retained TF motifs in the gene promoter region, defined as up to 5 kb upstream of a gene TSS (see “Methods,” Additional file 1: Fig. S22). Each statistically significant TFBS motif (FDR < 0.05) was associated to its proximal target gene (TG) and represented as two nodes and one TF-TG edge. Based on the integrated approach (see “Methods,” Additional file 1: Fig. S20), we predicted a total of 3,295,212–5,900,174 TF-TG edges (FDR < 0.05) across the five species that could be encoded into a matrix of 1,131,812 predicted TF-TG edges (FDR < 0.05), where each edge is present in at least two species. To ensure accurate analysis of GRN rewiring and to retain relevant TF-TG interactions, all collated edges were then further pruned to a total of 843,168 TF-TG edges (FDR < 0.05) where (1) the edge is present in at least two species; (2) edges are not absent in any species due to node loss or mis-annotation; and (3) edges are based on the presence of nodes in modules of co-expression genes (see “Methods”).
We used three metrics to study large-scale TF-TG network rewiring between species that included: (1) state changes in module assignment; (2) DyNet [25] network rewiring scores; and (3) TF rate of edge gain and loss in networks. The first metric compares TF-TG edges of a single “focal” species versus the other species in the context of gene co-expression, while the second and third metric compute a likelihood score for the overall extent of edge changes (across all species) associated with single nodes of interest. We first focused on 6844 1-to-1 orthologous genes represented in 215,810 TF-TG interactions, termed “TF-TG 1-to-1 edges,” along the five cichlid tree. Using a background set of all module genes (18,799 orthogroups), the TF-TG 1-to-1 edges are associated with morphogenesis and cichlid traits under selection, e.g., eye and brain development (FDR < 0.05, Additional file 1: Fig. S10a). There are 379 TFs represented in the TF-TG 1-to-1 edges, and we focus on their interactions/relationships to determine whether TFs with (state) changes in module assignment have altered regulatory edges. In the first metric, rewiring is characterized as a unique TF-TG edge present in only one “focal” species, where the TF node is (1) state changed in module assignment and (2) present as a node in different TF-TG edges in any/all of the other species. Using this metric, 50–70 out of the 379 TFs (13–18%) are rewired (spanning 4060–9423/215,810 edges, FDR < 0.05, Fig. 2a; see Additional file 1 text) and change module assignment across the five species (in one focal vs all four other species). The gene nodes connected by the rewired edges are associated with signalling pathways and processes such as cell differentiation and embryonic development (FDR < 0.05, background of all module genes, Fig. 2b). Further examination of rewiring rates in the networks of 6844 1-to-1 orthologous genes (in 215,810 TF-TG interactions) using the DyNet [25] degree-corrected rewiring (Dn) score (Fig. 2c, Additional file 3: Table S1) identifies rewired networks of nine teleost and cichlid trait genes associated with morphogenesis from previous studies (Fig. 2c, Additional file 3: Table S2). These genes have a few standard deviations higher degree-corrected rewiring (Dn) score than the mean (0.17 ± 0.03 SD), and their rewiring scores are comparatively higher (Kolmogorov–Smirnov KS-test p value = 6 × 10− 4) than all 1-to-1 orthologs (Fig. 2c, left violin plot, orange dots; Additional file 3: Table S3; see Additional file 1 text). Examples of these rewired 1-to-1 genes include gdf10b associated with axonal outgrowth and fast evolving in cichlids [18] and the visual opsin gene, rh2 (Fig. 2c, left violin plot; Additional file 3: Table S3 S-R3C). To enable a genome-wide study of network rewiring, we extend our analyses beyond the 6844 1-to-1 orthologs only, by including an additional 7746 many-to-many orthogroups (see “Methods”) resulting in a set of 843,168 “TF-TG all edges” across the five species. Using a background set of all module genes (18,799 orthogroups), the gene nodes in the 843,168 TF-TG all edges are associated with morphogenesis, e.g., retina development (FDR < 0.05, Fig. SR3aB). These edges include interactions of 783 TFs of which 13–18% (100–140 TFs) are predicted to be rewired (in 20,716-37,590/843,168 edges, FDR < 0.05, Fig. 2d) and change module assignment across the five species (in one focal vs all four other species), indicating their associated transcriptional programs (FDR < 0.05, background of all module genes) are also altered (Fig. 2e). By examining the network rewiring rates of 14,590 orthogroups (in 843,168 TF-TG interactions, Additional file 3: Table S4) using DyNet [25], we identify 60 candidate teleost and cichlid trait genes associated with phenotypic diversity from previous studies (Fig. 2c, right violin plot; Additional file 3: Table S5). These genes have a few standard deviations higher degree-corrected rewiring (Dn) score than the mean (0.23 ± 0.007 SD) of all orthologs, and their rewiring score is comparatively higher (KS-test p value = 6 × 10− 14) (Fig. 2c, right violin plot, orange dots; Additional file 3: Table S4). These genes include those associated with craniofacial development, e.g., dlx1a and nkx2-5 [21], telencephalon diversity, e.g., foxg1 [26], tooth morphogenesis, e.g., notch1 [27], and strikingly, most visual opsins, e.g., rho, sws2, and sws1, as well as genes associated with photoreceptor cell differentiation, actr1b [28], and eye development, pax6a [21] (Fig. 2c, right violin plot; Additional file 3: Table S5). We then focus on the gain and loss rates of 186/783 TFs with > 25 TF-TG edges along the five cichlid tree (see “Methods”). Out of the 186 TFs, 133 (72%) are predicted to have a higher rate of edge gain than loss, e.g., DLX5 and NEUROD2, possibly acting as recruited regulators of gene expression in each branch from their last common ancestor (LCA) (Additional file 3: Table S6), whereas 53/186 TFs (28%) have a higher loss of edges than gains, e.g., OLIG2 and NR2C2, implying loss of gene expression regulatory activity from their LCA (Additional file 3: Table S6). In general, TFs and their binding sites are evolving towards gaining, rather than losing regulatory edges from their LCA.
To further characterize the role of the observed changes in cis-regulatory elements and their potential association with cichlid traits, we extended our analyses to include several radiating cichlid species. We screened all predicted TFBS (see “Methods”) variants between M. zebra (a Lake Malawi species) and the other four cichlids, with their corresponding positions in 73 phenotypically distinct Lake Malawi species [20], to identify between-species variation at regulatory sites along the phylogeny (Additional file 1: Fig. S11). As expected, the majority of variation at regulatory sites is identified between M. zebra and distantly related Lake Malawi species clades, e.g., NKX2.1 TFBS in sws1 gene promoter, whereas shared ancestral sites are found with mainly same/closely related Lake Malawi clades, e.g., EGR2 TFBS in cntn4 gene promoter. Genes that are associated with traits under selection, e.g., visual systems [29] (sws1) and morphogenesis [18] (cntn4), harbor between species regulatory variants that segregate according to phylogeny and ecology of radiating lake species.
Cis-regulatory changes lead to GRN alterations that segregate according to phylogeny and ecology of radiating cichlids
Through our comparative approach, we can examine the regulatory network topology of several genes that are important for cichlid diversification [30, 31] and represented by our six tissues. As a case study, we focus on the cichlid visual system; the evolution of cichlid GRNs and diverse palettes of co-expressed opsins can induce large shifts in adaptive spectral sensitivity of adult cichlids [29], and thus, we hypothesize that opsin expression diversity is the result of rapid adaptive GRN evolution in cichlids. Indeed, by focusing on species utilizing the same wavelength visual palette and opsin genes, we note that several visual opsin genes (rh2b, sws1, sws2a, and rho) have considerably rewired regulatory networks (Additional file 3: Table S6). Across the predicted transcriptional networks of cichlid visual opsins, there are several visual-system-associated regulators (TFs) of opsin genes (sws2a, rh2b, and rho) that are either common, e.g., STAT1A, CRX, and GATA2, or unique to each species, e.g., IRF1, MAFA, and GATA2A (Additional file 1: Fig. S12–14). These patterns of TF regulatory divergence could therefore contribute to differential opsin expression.
Sws1 (ultraviolet) opsin is utilized as part of the short-wavelength sensitive palette in N. brichardi and M. zebra. While there are common regulators associated with retinal ganglion cell patterning in both species networks, e.g., SATB1 [32], there are also several unique regulators associated with nuclear receptor signalling, e.g., RXRB and NR2C2 [33], and retinal neuron synaptic activity, e.g., ATRX [34] (Fig. 3a). Overall, using a significance threshold of FDR < 0.05 for predicted TF-TG edges, there are more predicted unique TF regulators of sws1 in M. zebra (38 TFs) as compared to N. brichardi (6 TFs) (Fig. 3a, bottom right). Furthermore, we identify that a candidate regulatory variant has likely broken the M. zebra NR2C2/RXRB shared motif that is otherwise predicted 2 kb upstream of the N. brichardi sws1 TSS (Fig. 3b). Functional validation via EMSA confirms that NR2C2 and not RXRB binds to the predicted motif in the N. brichardi sws1 promoter, forming a complex, and the variant has likely disrupted binding, and possibly regulation of M. zebra sws1 (Fig. 3c, d). This is further supported by correlating expression values of these regulators and sws1, where NR2C2 is better associated with sws1 than RXRB, particularly when focusing on eye tissue (Additional file 1: Fig. S16a on right; Additional file 1: Fig. S16b; see Additional file 1text).
In another example, rhodopsin (rho), associated with dim-light vision, is predicted to be regulated by GATA2 in O. niloticus, A. burtoni, and M. zebra but not its duplicate gene, GATA2A only in M. zebra (Additional file 1: Fig. S14). We identify a candidate variant (red arrow, Fig. 4a) that has likely broken the M. zebra GATA2A motif that is otherwise predicted 1.8 kb and 1.9 kb upstream of the O. niloticus and A. burtoni rho TSS (Fig. 4a). Functional validation via EMSA confirms that GATA2A binds to the predicted motif in the O. niloticus and A. burtoni rho promoter, and the variant is likely to have disrupted binding, and possibly regulation of M. zebra rho (Fig. 4b). Species-specific expression correlations with the rho target gene are supportive of GATA2’s possible conserved role in all three species (O. niloticus r = 0.89; A. burtoni r = 0.39; M. zebra r = 0.28, n = 6 Additional file 1: Fig. S17c), while a more divergent role of GATA2A (O. niloticus r = 0.79 and A. burtoni r = 0.21, n = 6) and negative correlation in M. zebra (r = − 0.18, n = 6) is supportive (Additional file 1: Fig. S17c) of the EMSA validation (Fig. 4). This further supports the notion that discrete point mutations in TFBSs could be driving GRN evolution and rewiring events in traits that are under selection in radiating cichlids.
Finally, we studied GRN rewiring as a result of between species TFBS variation in the context of phylogeny and ecology of lake species. Owing to the variability and importance of spectral tuning of visual systems to the foraging habits of all cichlid species, we focused on variants at regulatory sites of rewired visual opsin genes in the Lake Malawi species, M. zebra, as a reference to compare GRN rewiring (through TFBS variation) that could be associated with the ecology of sequenced Lake Malawi species [20]. If indeed the TFBSs are likely functional, we hypothesize that radiating species with similar foraging habits would share conserved regulatory genotypes, to possibly regulate and tune similar spectral sensitivities, whereas distally related species with dissimilar foraging habits would segregate at the corresponding regulatory site. For this, we started with 157,232 sites that (1) have identified variation between the five cichlid species and (2) are located in TFBSs of M. zebra candidate gene promoters. We identified 5710/157,232 sites with between species variation across 73 Lake Malawi species (Additional file 1: Fig. S11) that also exhibited flanking sequence conservation, representative of shared ancestral variation. The homozygous variant (T|T) that breaks binding of NR2C2 to M. zebra sws1 promoter (Fig. 3 and Fig. 5blue arrow) is (1) conserved with the fellow algae eater, Tropheops tropheops, that also utilizes the same short-wavelength palette; (2) heterozygous segregating (Petrotilapia genalutea—C|T and Iodotropheus sprengerae—T|C) in closely related Mbuna species; and (3) homozygous segregated (C|C) in distantly related Mbuna species (Cynotilapia afra, Corydoras axelrodi, and Genyochromis mento) and most other Lake Malawi species of which some utilize the same short-wavelength palette and are algae eaters, e.g., Hemitilapia oxyrhynchus (Fig. 5). This suggests that in species closely related to M. zebra, and with a similar diet and more importantly, habitat, sws1 may not be regulated by NR2C2, whereas in other species it could be, similar to N. brichardi (Fig. 3 and Fig. 5red arrow). In another example, regulation of rho by GATA2, and not its duplicate, GATA2A (Fig. 4), could be sufficient for regulating dim-light vision response in rock dweller species (M. zebra and possibly Petrotilapia genulatea, Tropheops tropheops and Iodotropheus sprengerae), but both gata2 copies could be required to regulate rho in many other Lake Malawi species (79% with C|C genotype that otherwise predicts the GATA2A TFBS in rho gene promoter), as well as A. burtoni and O. niloticus (Additional file 1: Fig. S14–15). This highlights the potential differential usage of a duplicate TF in dim-light vision regulation. Phylogenetic independent contrast analysis [37] of the NR2C2-sws1 (Additional file 1: Fig. S18a-f) and GATA2A-rho (Additional file 1: Fig. S19a-f) genotypes against visual traits and ecology of each of the 73 Lake Malawi species highlights very little change in correlation once the phylogeny is taken into account and a regression model fitted. Based on these examples of TFBS variants that segregate according to phylogeny and ecology of lake species, GRN rewiring through TFBS variation could be a key contributing mechanism of evolutionary innovation, especially visual systems, in East African cichlid radiations.