Genome and transcriptome analysis of the Mesoamerican common bean and the role of gene duplications in establishing tissue and temporal specialization of genes

Background Legumes are the third largest family of angiosperms and the second most important crop class. Legume genomes have been shaped by extensive large-scale gene duplications, including an approximately 58 million year old whole genome duplication shared by most crop legumes. Results We report the genome and the transcription atlas of coding and non-coding genes of a Mesoamerican genotype of common bean (Phaseolus vulgaris L., BAT93). Using a comprehensive phylogenomics analysis, we assessed the past and recent evolution of common bean, and traced the diversification of patterns of gene expression following duplication. We find that successive rounds of gene duplications in legumes have shaped tissue and developmental expression, leading to increased levels of specialization in larger gene families. We also find that many long non-coding RNAs are preferentially expressed in germ-line-related tissues (pods and seeds), suggesting that they play a significant role in fruit development. Our results also suggest that most bean-specific gene family expansions, including resistance gene clusters, predate the split of the Mesoamerican and Andean gene pools. Conclusions The genome and transcriptome data herein generated for a Mesoamerican genotype represent a counterpart to the genomic resources already available for the Andean gene pool. Altogether, this information will allow the genetic dissection of the characters involved in the domestication and adaptation of the crop, and their further implementation in breeding strategies for this important crop. Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-0883-6) contains supplementary material, which is available to authorized users.

include 28,601 isotigs with an average length of 1,047 bp; form these assembled transcripts, 24% showed splicing variants (Table S12). The comparison of the transcriptome sequences and the genomic scaffolds of BAT93, revealed that 99.6% of the assembled isotigs could successfully map a genomic region.
Additionally, the isotigs were blasted against various plant databases and NCBI-NR, showing a large number of hits with other legumes such as Glycine max, Medicago truncatula and Lotus japonicus. Only 773 isogroups had no significant hit with any predicted gene or protein.
Genome assembly

P. vulgaris BAT93 assembly
We assembled a reference genome sequence from the P. vulgaris genotype BAT93 based on Roche/454, SOLiD and Sanger reads using Newbler v2.6 (Roche) ( Table S1). Roche/454 sequencing reads underwent trimming for the emulsion PCR primer sequence 5´-CAAGGCACACAGGGGATAGG-3'. Redundant MP and paired-end data were removed based on their map positions in a pre-assembly. SOLiD MPs were only used for the scaffolding step. SOLiD MPs were also removed based on mapping coordinates as above.

Assembly correction
Correction of consensus errors in the final Newbler assembly was performed using BAT93 genomic Illumina reads (45-fold genome coverage of raw data). Reads were filtered according to Minoche et al., [8] and Dohm et al., [9] Filtered reads were aligned to the assembled genome with BWA v.0.6.2 [10] allowing 3 mismatches. Variant calling was performed with SAMtools mpileup v.0.1.18 [11], and only uniquely mapped reads were considered. We corrected homopolymer consensus errors with length 3 nt or larger, covered by 10 or more reads (both in forward and reverse directions), and at least 60% of reads supporting an alternative correct sequence. Intra-scaffold gaps were closed with GapCloser (SOAP v.1.3 package) [12] on the consensus-corrected assembly using Illumina BAT93 reads (Tables S2 and S3).
Additional assembly improvement and verification was done by utilized genotyping-by-sequencing data, generated on the Illumina sequencing platform from 60 progeny of an F5 advanced intercross of (BAT93/ Jalo EEP558). Each line was sequenced to an average coverage of 6.7-fold. We first utilized 35 lines to assign genotype profiles to scaffolds and to detect misassemblies ( Figures S1 and 1a) The entire set of 60 lines was used for scaffold ordering. Genomic paired-end reads were filtered as above, and aligned onto the consensus-corrected, gap-closed reference assembly using BWA v0.5.8 (3 mismatches tolerated).
Only uniquely matching read pairs of correct distance and orientation were considered further. Picard v1.48 tools (http://picard.sourceforge.net) were used to merge, sort and remove duplicate reads; indels were re-aligned using GATK v1.6.5 [13]; we performed variant calling with SAMtools mpileup v. 0. 1.16 and GATK v1.6.5. The resulting variants were filtered to exclude false-positives calls by applying the "HARD_TO_VALIDATE" filter; we also adjusted the read depth thresholds depending on the sample coverage. Variant clusters within flanking windows of 20 bp were excluded. Positions identified as variants between BAT93 Illumina reads and the BAT93 reference assembly were not taken into account.
We used a sliding window of 10kb for scaffolds with length >=100kbp and a sliding window of 1kbp for scaffolds below 100 kb; genotype scores were assigned to all scaffolds larger than 20 kb. Then, a genotype profile was constructed by aggregation of all genotypes from all lines for each scaffold.
Unambiguous profiles were assigned to 1,908 scaffolds. Scaffolds with a profile change at the same position in more than 20 lines were considered as misassembly candidates, and 48 scaffolds were edited following visual inspection. Broken scaffolds were newly connected using SSPACE v.2.0 [14] if supported by read pairs and GBS data.

Chromosomal anchoring
Chromosomal anchoring of the assembly was performed using 827 publically available markers, which had been located on an integrated P. vulgaris genetic map (http://phaseolusgenes.bioinformatics.ucdavis.edu/), together with the genotype profiles obtained by using 60 GBS lines as described above. The markers were aligned against the BAT93 assembly with Blastn (e-value cutoff 1e-15); 688 markers aligned to unique positions within 399 scaffolds. These scaffolds were used as seeds, and their profiles were compared to all other profiles, tolerating at most three mismatches. Scaffolds with identical or very similar patterns were clustered and were assigned to the same linkage group as the respective seed scaffolds. The genotype profiles were then converted into MapMaker format, and genetic maps were constructed using AntMap v.1.2 [15] (Table S4).

Repeat masking
Transposable element (TE) annotation was performed on the assembly v.9.0 by a combination of software tools and refined by manual inspection. For the de novo predictions, based on the identification of repetitive signatures dispersed across the genome, the REPET [16] pipeline was used. This pipeline initially compares the genome sequence against itself, with similar segments being clustered and used to generate multiple alignments and deriving respective consensus regions. Following this procedure a repeat database was built and used to rescan the genome to refine the discovery process and to categorize the bona fide TE copies according to a recommended hierarchical classification framework [17].
The predicted LTR retrotransposon family was further refined using the programs LTRharvest [18] and LTRdigest [19]. These predicted LTR retrotransposons based on characteristic structural features, such as long terminal repeats and distinctive protein domains. The final prediction for LTR retrotransposons is the union of this procedure and REPET-based predictions. The collection of TE elements from the assembly v.9.0 was mapped into the assembly v.10.0 using the BLAT [22] and Blastn [23] alignment tools with the low complexity filter turned off. All families from the repeat database built on the v9.0 assembly, as described above, were successfully identified and localized in the v.10.0 assembly, with 33% of the P. vulgaris genome being spanned by transposon elements (Table   S6). Additionally, we ran RepeatMasker v3.2.8 against plant-specific repeat families and G. max repeat library from RepBase to identify interspersed repeats.

Protein-coding gene annotation
First, RNA-Seq reads from 34 tissues (Table S8) were aligned with GEM [24] to the reference genome.

Functional annotation
Functional annotation was performed by using in-house developed pipeline ( Figure S2). The pipeline performs an electronic inference of function that is based in the sequence similarity between the bean predicted proteins and known proteins in different public repositories: InterPro [32], KEGG [33], Reactome [34], PhylomeDB [35] and Blast2GO [36] Additionally, SignalP [37] was used to predict the presence and location of signal peptide cleavage sites. In total 62,713 (94.12%) transcripts and 26,635 (87.35 %) genes had some type of functional annotation (Tables S16-S17, Figure S3).

Small non-coding RNAs annotation
To detect small structured non-coding RNAs in the bean genome, we used the CMsearch tool from the Infernal package (version 1.1rc2) [39]. We scanned the genome looking at every RNA model stored in the Rfam database (version 11) [40]. Overlapping hits were removed by selecting the hit with the lowest E-value. Setting an E-value cut-off of 0.01 allows to detect 2,717 non-overlapping hits; of these 353 are in contigs and 2,364 in scaffolds. Small RNA sequencing libraries were made after a size selection step.
Reads from each small RNA libraries were independently aligned to the assembly v.10 using Bowtie2 [41]. Resulting mappings and de-novo predicted small RNAs were used as an input to htseq-count, HTSeq v.0.6.1 [42] to quantify small RNA features. We checked for the presence of sequences similar to rRNA by using the riboPicker tool [43]. As shown on Figure S5, the majority of the small RNAs are microRNAs, tRNAs and snoRNAs. In the Table S22 we have summarized number of the small RNAs annotated in the A.thaliana and legume plants in the Rfam database.

Prediction of plant disease resistance genes
An automated pipeline, the Disease Resistance Analysis and Gene Ontology (DRAGO) pipeline [44], was used to predict resistance genes (R-genes). This pipeline combines an homology search against a set of curated R-genes having experimental verification, followed by the search of the canonical domain combinations conferring the disease resistance function. The search for sequence homology consists of running BLASTP (with an e-value cutoff of 1e-10) against a manually curated set of 112 R-genes and performing the domain analysis using InterProScan v.5. Finally, proteins are assigned to each of the particular resistance classes depending on the domain combination they have.
The high number of predictions belonging to the RLK and RLP transmembrane classes indicates some type of overprediction by our pipeline. This can be due to the fact that both classes contain a general Kinase domain that is not specific for the disease resistance activity. Additionally, PTO-like R-genes were not predicted by our automated pipeline because this class contains only a general kinase domain. This class was initially identified in the tomato and interacts with the avrPto gene product of the bacterial pathogen Pseudomonas syringae and launches a disease sensitive response. In order to identify the PTO-like genes we used the available experimental evidence supporting the presence of this PTO-like family in Bean by doing a BLASTP (e-value cutoff 1e-10) search of the proteins identified in the cited experimental work against our set of proteins.
Additionally, 120 R-genes were identified following a similar protocol by Schmutz et al [46]. Briefly, InterProScan v.5 results were scanned to identify sequences containing the NB-ARC domain (PF00931).
After filtering results by the "trusted cutoff" established by PFAM (e-value equivalent to 1e-5), 174 genes were identified as containing NBS domains. Additional domains such as TIR (PF01582) and LRR (PF00560, PF07723, PF07725, PF12799, PF13306, PF13516, PF13504 and PF13855) were identified for this set of R-genes applying the same cutoff. Finally, those genes which were not previously identified were added to the R-genome of P. vulgaris BAT93 ( Figure S4, Table S21).
Long non coding RNA analysis Homology based lncRNAs were predicted using the strategy reported in [45], taking A. thaliana lncRNA transcripts taken from [47] as templates. These were blasted against the bean assembly previously masked for interspersed repeats using RepeatMasker [20] and the hits were then used as anchor points to re-align the A. thaliana queries with surrounding genomic regions using exonerate as a split aligner. Gene models thus recovered were then filtered for their coverage of the original query (>70%) and their content in plant specific ancestral repeats (<20%) as well as proximity to protein coding genes (>1kb). Final conservation was estimated on T-Coffee [48] pairwise re-alignments between the query and its predicted spliced model (excluding introns).
Ab-intio lncRNA models were predicted using Cufflinks [25] to build transcript models on all RNA-Seq samples, processed one at a time. Models containing only one exon or overlapping with protein-coding genes were filtered out. We then used Cuffmerge (from the Cufflinks suite) to combine transcript models from all samples into a single set of consensus models. These were then filtered by excluding models having high ancestral repeat coverage (>20%), coding potential (CPC [49] with default settings), any overlap with GeneID [27] predicted ORFs onto the genome, or ORF covering more than 25% of their length. These cut-offs were selected so as to retain the trusted predictions of non-coding genes obtained by homology. The full set was further filtered by eliminating all transcripts having both an RPKM lower than 0.1 in any of the original 27 RNAseq libraries and no homology support in any of the 12 plant genomes considered here. Sets of overlapping transcripts (>=1nt) were clustered into gene models and further filtered by removing all genes containing one or more transcripts within 1kb of protein coding genes.
Phylogenomic profiling was carried out by applying the homology-based strategy described above using the complete set as queries and the 12 genomes as targets. The horizontal clustering ( Figure 3, main text) was obtained by imposing the 12 species phylogenetic tree topology. Gene level conservation was estimated by considering, within each bean gene, the transcript conserved in the largest number of species. LncRNA transcript expressions were obtained using the Flux Capacitor (http://flux.sammeth.net/capacitor.html). Analysis was then done at the gene level, summing up the expression of transcripts belonging to the same loci. We only kept for further analysis genes with an expression of at least 0.1 RPKM in at least one condition. Housekeeping lncRNA genes were defined as the overlap between genes expressed with at least 0.1 RPKM in each of the organs with the set of lncRNA genes conserved in at least 7 species ( Figure S14).

1.Read mapping and quantification
Prior to mapping RNA-Seq reads were inspected for the presence of adapter sequences and trimmed if found; no other filtering was performed. Then reads from each sample were independently aligned to the reference P. vulgaris assembly v.10 using the gemtools RNA-Seq pipeline v.1.6.2 (https://github.com/gemtools/gemtools). This pipeline uses GEM as a mapper [24], which performs full paired-end, quality-aware alignment and does not require preliminary filtering by quality. At the same time GEM is split-aware, and finds gapped matches, that is, it can correctly align reads into the genome originating from exon junctions. Gemtools maps reads independently to the genome, transcriptome and de novo-created splice junctions and then merges results of this three mappings into a single output file.
Additional filtering steps were performed allowing less than 2 mismatches and less than 10 multimappings. On average, 89±5% of the reads were mapped across samples, 69±10% of the reads mapping uniquely (Dataset S5). applied "non-parametric irreproducible discovery rate" criteria [51]. The npIDR values were calculated for each genomic element independently (genes, transcripts, exons and junctions). The elements are assumed to be reproducible if they had in both replicates >0 reads mapped and its npIDR score was <0.1.
We consider a gene to be expressed if it has RPKM value >=1 in at least one sample. In total we found expression in 21,472 (70%) genes across samples, in average 16042±1117 per sample (Table S30; Figure   S10, Datasets S5-S7). Ubiquitous expression in all samples was found for 9,881(32%) genes.
For the differential expression analysis and co-expression network construction we have normalized read counts into counts per million (CPM) [52].
The libraries were classified into organ groups by its phenotype: root, stem, leaf, flower, axial meristems, pods and seeds. Also we grouped the libraries by developmental stage of the plant -5 vegetative stages: (Maturation) (see Table S8).

2.Housekeeping genes (ubiquitous expression)
Coefficient of variation(CV) value was used to identify putative housekeeping genes -we have calculated CV for each gene and have selected top 10% of the genes with lowest CV. In total, we assign 2811 genes into housekeeping category(Dataset S8).
Previously published data sets -1000 soybean and 2500 common bean, variety Negro Jamapa(PvGEA), housekeeping gene set were downloaded from the supplementary materials of the corresponded papers -Severin et al., Additional file 8 [53] and O'Rourke et al., Additional file 22 [54]. Ortology one-2-one relationships from the phylomeDB was used to compare selected gene sets. Beside the orthologs we can identify 92% of PvGEA genes and 78% soybean genes. In common between three data sets were 195 genes; and 1279 genes were in common between PvGEA and our data sets ( Figure S13).

Specific gene expression across organs and stages
Organ and stages specific expression was estimated with applying a threshhold of ≥0.1 RPKM to average gene expression in a given organ/stage, compare to the other organs/stages. GO terms enrichment analysis was done with R bioConductor package topGO using whole set of genes as a background.
In total we assigned 937 protein-coding and 171 long non-coding RNA genes to be organ specific. As stages-specific we identified 624 protein-coding and 130 long non-coding genes. (Table S30, Figures S14,S17)

Differential expression
Differential expression was estimated with the edgeR package (R version: 3.0.1, edgeR v. 3.2.4); using statistical methods based on generalized linear models (GLMs) [55,56]. To estimate the common dispersion in the absence of biological replicates we choose pseudo-biological replicas: SP_R7 / MP_R9, NR_V3 / R_V3, AM_V4b / AM_R5 FTL_V2 / TL_V4b. The common dispersion calculated (0.12887) was use for the rest of the analysis of differential expression. Genes that exhibited an FDR ≤ 0.05 and a log2 Fold Change ≥ 1 (log2FC) were determined to have differential expression (Datasets S12-S22).

Co-expression network
Gene co-expression networks provide a high level view on the relations between the expression of individual genes and help to identify sets of gene sharing expression profiles across multiple conditions.
In a co-expression network two genes are connected if their transcriptional profiles have significant correlation. For the construction of the bean co-expression network we used all protein-coding and long ncRNA genes that have >3CPM at least in one sample. From this set, we removed outlier genes that capture ~30% of expression in three tissues ( Figure S12) since they could introduce noise into network; 21,560 genes remained. The resulting data matrix was standardized both along genes (rows) and samples (columns). Standardization guarantees that only the profile and not the magnitude of gene expression contribute to covariance estimation. Standardization was done with the R-function 'scale'. The Pearson correlation matrix was calculated from the scaled data, and graphical lasso [57] with penalty parameter RHO =0.9 was applied to this matrix to reconstruct the network. The resulting network has 8,884 genes (vertex) connected with one or more edges (max is 260 connections per one vertex, mean is 18; density 0.002; transitivity 0.45) and 12,676 singletons. Downstream analyses were performed on the sub-network with vertex having one or more edges.
A random network with the same node degree distribution was generated using erdos.renyi.game command from the igraph R package. The resulting network had a maximum of 35±3 connections, and a density and transitivity 0.002.
A Fast-greedy community algorithm [58] was used to divide network into the modules. Modules are the set of nodes with a higher number of connections within module members than between members from different modules In total 738 modules were identified with 2-1271 genes in each module; 11 modules had more than 100 genes ( Figure 5, main text).
The GO terms enrichment analysis with the R package topGO was performed for the 11 largest modules (FDR ≤ 0.05, Datasets S23,S24).

Analysis of duplicated and paralogous genes expression.
For each pair of duplicated genes we have calculated pearson correlation coefficient and TEC score as indicator of their expression similarity.
Pearson correlation coefficient was calculated on normalized RPKM values across all samples.
To compute tissue expression complementarity (TEC) score we have transformed vector of the expression values among all samples for a given gene into set composed of 1 and 0; where 1 or 0 is corresponded to the expression above or below threshold of selected gene in a given sample (presence/absence of expression). Here we used 1RPKM as a threshold.
TEC score for two genes i and j was calculated by the formula suggested by the Huerta-Cepas et al. [59]: gene j is not expressed) and ti is the total number of samples in which gene i is expressed.
For each group of paralogous genes we have calculated its coefficient of variation (CV) as standard deviation normalized by mean expression level of the gene, for the genes having ≥ 1 RPKM in at least one sample. The CV value was used as a level of fluctuation of the gene expression across all samples. In additional we have performed few rounds of simulations in which genes were assigned to paralogous groups randomly ( Figure 6C, main text).

Publicly available transcriptomic data
RNAseq data from the Bellucci et al [60] and Rourke et al [54] studies were downloaded from the SRA archive [61]. The reads from each sample were independently aligned to the P. vulgaris assembly by using the gemtools RNAseq pipeline and gene expression was quantified by using Flux Capacitor, as was In order to perform the phylome reconstruction, a Smith-Waterman [62] search was used to retrieve homologs using an e-value cut-off of 1e-5, and considering only sequences that aligned with a continuous region representing more than 50% of the query sequence. Then, selected homologous sequences were aligned using three different programs: MUSCLE v3.8 [63], MAFFT v6.712b [64], and KAlign v2.08 [65]. Alignments were performed in forward and reverse direction (i.e. using the Head or Tail approach [66], and the six resulting alignments were combined using M-Coffee [67]. The resulting combined alignment was subsequently trimmed with trimAl v1.4 [68], using a consistency score cutoff of 0.1667 and a gap score cutoff of 0.1, to remove poorly aligned regions. Phylogenetic trees based on the Maximum Likelihood (ML) approach were inferred from these alignments. ML trees were reconstructed using the two best-fitting evolutionary models. The selection of the evolutionary models best fitting each protein family was performed as follows: A phylogenetic tree was reconstructed using a Neighbour Joining (NJ) approach as implemented in BioNJ [69]. The likelihood of this topology was computed, allowing branch-length optimization, using nine different models (JTT, LG, WAG, Blosum62, MtREV, CpREV, VT, DCMut and Dayhoff), as implemented in PhyML v3 [70]. The two evolutionary models best fitting the data were determined by comparing the likelihood of the used models according to the AIC criterion [71]. Then, the two ML trees were derived using these models with the default tree topology search method NNI (Nearest Neighbor Interchange). A similar approach based on NJ topologies to select the best-fitting model for a subsequent ML analysis has been shown previously to be highly accurate. Branch support was computed using an aLRT (approximate likelihood ratio test) parametric test based on a chi-square distribution, as implemented in PhyML. In all cases, a discrete gamma-distribution with four rate categories plus invariant positions was used, estimating the gamma parameter and the fraction of invariant positions from the data.

Orthology/paralogy predictions
Orthology and paralogy relationships among P. vulgaris genes and those encoded by the other considered genomes were inferred using a phylogenetic approach [72] (Tables S25 and S26). In brief, a species-overlap algorithm, as implemented in ETE v2 [73], was used to label each node in the phylogenetic tree as duplication or speciation depending on whether the descendant partitions have, at least one, common species or not (i.e. using a Species Overlap Score of 0). The resulting orthology and paralogy predictions can be accessed through phylomeDB.org [35]. These predictions have been used in subsequent analyses such as orthology-based functional annotation, identification of gene expansions, or duplication dating.

Functional gene annotation using orthology.
To complement the functional annotation based on proteins signatures, we functionally annotated BAT93 genes using orthology relationships. 13,522 one-to-one orthology relationships among P. vulgaris BAT93 genes and genes from other 5 species used in this project were found. Using these pairs, 43,028 GO terms were transferred from P. vulgaris BAT93 counterparts genes to them. Taking into account the annotation based in proteins signatures, we were able to add 20,696 new GO terms to 5,211 genes of which 365 genes were functionally annotated for the first time.

Gene mapping between P. vulgaris BAT93 and G19833 accessions.
Phylomes containing both bean accessions were used to assess to what extend the gene-set predicted for P. vulgaris BAT93 overlapped with the one for P. vulgaris G19833. 21,600 (~69.8%) genes in BAT93 were mapped to 21,604 (~79.5%) ones in G19833 (Table S18). Single gene-trees from both phylomes were scanned to predict one-to-one orthology relationships among bean sequences. Reciprocal one-to-one predictions constitute the biggest proportion of the mapping (~93.9%). One-to-one predictions detected from only one accession constitute the second source of mapping. Protein sequences were also scanned looking for identical sequences. Cases without orthology predictions but with identical sequence in both genomes were included as part of the mapping. Gene mapping was complemented by analyzing the gene-order conservation between bean accessions. Genes mapping to the same pseudo-chromosome/linkage group surrounded by genes already mapped were added to the final set.

Conservation level of orthologous PCGs of P. vulgaris BAT93 and G19833 acessions.
We aligned all orthologous pairs to assess their conservation in terms of sequence identity using MAFFT (parameter --auto) [64]. Identity was estimated using trimAl [68] before and after removing columns with gaps (parameter -nogaps). Orthologous pairs with sequence identities up to 0.95 were grouped into different bins (Supplementary table S19). Enrichment analyses for over-represented GO terms in these proteins compared with the whole set of annotated P. vulgaris BAT93 and G19833 PCGs was performed using FatiGO as implemented in the Babelomics webserver [74]. A Fisher exact test looking for overrepresented terms in specific sets of proteins against the whole annotated genome was used with a e-value cutoff of 0.001. Supplementary table S20 contains enriched GO terms at different sequence identity thresholds.

Species tree reconstruction
A phylogeny for the species included in the phylome was inferred using two complementary approaches, which rendered identical topologies. Firstly, a super-tree was inferred from all the trees in the three

Comparative genomics, in terms of homologous relationships, among plant species
A comparative analysis, in terms of homology relationships, was performed among the 14 species included in two of the three phylomes. To carry out such analysis, a BLAST search of all species against all species was performed to retrieve homologous sequences with a cut-off e-value of 1e-5 and a minimum coverage of 50% between query and target sequences. Then, results showing different patterns of homology, i.e. genes present in all species or specifically restricted to legumes, were computed and plotted in the species tree inferred (see Figure 3, main text).

Legume-specific proteins detected in P. vulgaris BAT93
P. vulgaris BAT93 proteins with detectable homologous sequences at legume species level were analyzed looking for any functional enrichment. Two patterns of homology were selected: 1) proteins with homologs in all 5 legumes species, and 2) proteins with homologs in at least one of the legume species included in this study but not BAT93. Enrichment analyses for over-represented GO terms for legume-specific proteins as compared with the whole set of annotated P. vulgaris BAT93 proteins were performed using FatiGO as implemented in Babelomics webserver [74]. A Fisher exact test looking for overrepresented terms in specific sets of proteins against the whole annotated genome was used with e-value cutoff of 0.001. Table S27 shows over-represented terms grouped by homology profiles and ontologies.

Accession specific expansions in P. vulgaris BAT93
We focused on accession specific expansions of the BAT93 genome for which 4,163 genes (~13.65%) were mapped to such events. Then, these genes were grouped into clusters with at least 75% of shared genes and a minimum size of 10 members. 974 genes (~23%) were assigned to an unique cluster. A broader selection of genes were filtered out because either they mapped to lineage-specific expansion in P. vulgaris or at least 25% of the cluster members had homologs in the G19833 genome. Fig. S6 shows the clusters distribution in terms of clusters size.
Those clusters were analysed looking for any statistically significant functional enrichment. Functional enrichment is provided for all clusters with at least one significant enriched term. Enrichment analyses of over-represented GO terms were performed by using FatiGo as implemented in Babelomics webserver [74] using the Fisher exact test for genome comparison and e-value cutoff of 0.001. Then, GO terms redundancy was reduced using REViGO webserver [75] setting a similarity threshold of 0.7, using the ratio of odd log as metric, as reference database the annotation the whole UniProt and as semantic similarity algorithm SimRel. Dataset S2, Additional file 2 contains the functionally enriched terms grouped by clusters.

Lineage specific expansions in P. vulgaris
We scanned single-gene trees from phylomes containing both bean genomes to detect genes family expansions predating the split of both accessions. We detected 661 BAT93 and 800 G19833 genes mapping to such events. Then, those genes were grouped into clusters with at least 75% of shared genes and a minimum size of 10 genes. 547 genes (~37%) were assigned to an unique cluster. Figure S7 shows the distribution of those clusters in terms of clusters size.
These clusters were analyzed looking for any statistically significant functional enrichment. We put together the functional annotation available for both genomes in order to carry out this analysis.
Functional enrichment is provided for all clusters with at least one significant enriched term. Enrichment analyses of over-represented GO terms were performed by using FatiGo as implemented in Babelomics webserver using the Fisher exact test for genome comparison and e-value cutoff of 0.001. Then, GO terms redundancy was reduced using REViGO webserver [75] setting a similarity threshold of 0.7, using the ratio of odd log as metric, as reference the whole UniProt database and as semantic similarity algorithm SimRel. Dataset S3 contains the functionally enriched terms grouped by clusters. There is a label to indicate which phylome was used to detect such cluster. Identical clusters (11) were labeled as phylome_BAT93.

Dating of duplications for P. vulgaris BAT93
We scanned the BAT93 phylome to detect and date duplication events, using a previously-described algorithm [76]. We focused on events assigned to four different relative evolutionary periods: basal to P.
vulgaris, basal to legumes, basal to rosids, and basal to the split of rosids and asterids. Individual trees were scanned and all duplication events that involved the seed protein and others bean proteins were dated. All these pairs together with their relative age were used to analyze the correlation, if any, between the age of the events and the expression patterns among paralogous proteins. Additionally, we recorded for each seed protein how many paralogous sequences were found. This information will be used to detect variations in terms of expression regarding the number of detected paralogous.
The ratio given by the number of detected events per age divided by the total number of single-gene trees reconstructed in the phylome informs about massive gene duplications, including Whole Genome Duplication (WGD) events, at specific relative ages in evolution. As shown in Table S28, it is possible to trace back the ancestral polyploidization events predating the split of rosids and asterids [77] where proteins almost duplicated twice. However, we did not detect similar duplication patterns at any of the other 3 relative ages. This can be easily understood taking into account the duplications patterns shown in Figure S8 where two clear patterns are present in terms of 1) number of duplicated proteins, and 2) how many times each protein got duplicated at each relative age.
P. vulgaris BAT93 proteins duplicated at different relative ages were analyzed looking for any functional enrichment. Enrichment analyses for over-represented GO terms were performed using FatiGO as implemented in Babelomics webserver [74]. A Fisher exact test looking for overrepresented terms in specific sets of proteins against the whole annotated genome was used with an e-value cutoff of 0.001.
Then, redundancy was reduced using REViGO webserver [75] setting a similarity threshold of 0.7, using the ratio of odd log as metric, as reference database the whole UniProt database and as semantic similarity algorithm SimRel. Dataset S4 contains the resulting functionally enriched terms grouped by relative age.

Assigning relative ages to P. vulgaris BAT93 proteins.
We scanned the BAT93 phylome to date speciation events, using a previously-described algorithm [76].
We only considered the furthest orthologous sequence, in terms of relative ages, for each seed protein. In this way, we were able to date 24,098 genes (~79%) of the BAT93 genome. For the remaining genes, we analyzed the BLAST output, after removing the limit of 150 sequences, to detect the most distant homologous sequence and date those genes according to the relative age of those homologs.    R-genes containing at least one kinase domain are represented as a different category since those genes were not predicted for the G19833 genome. Given the repetitive nature of R-genes, G19833 genes were mapped into BAT93 genes using bidirectional BLAST hits rather than the more accurate and finer methods based on orthology predictions.     Housekeeping genes were obtained from this study (P.vulgaris BAT93) variety, from O'Rourke et al [54] (P.vulgaris variety Negro Jamapa) and from Severin et al. [53] (Glycine max.)     The histogram shows the absolute number of genes divided into the ones with at least one paralog and the ones without any. Genes are grouped according to the number of connections in the co-expression network. Of note, only protein-coding genes were considered for this analysis.

Supplementary Figures
Supplementary Tables   Table S1. Assembly input data. Paired data sets were filtered for redundancy. Column "Coverage" lists the sequence coverage for single reads and the library coverage for paired data.  Supplementary Table S32. Proportion of genes in the largest modules from the co-expression network at different relative evolutionary ages. Proportion of genes from the 11 modules in the co-expression network with 100 or more connections with respect to all genes used to build the network.
Proportions are given at different relative evolutionary ages. Statistical significance of under/over-representation of a genes assigned to a given age as compared to the whole genome was computed using a Fisher's exact test. Of note only protein-coding genes were taken into consideration since the evolutionary dating was only available for this type of sequences. Protein-coding genes represent more than 97% of the total genes present in the network.

SUPPLEMENTARY DATASETS
Dataset S1. BAT93/Jalo EEP558 F5 samples used for GBS analysis. In green -samples used for initial genotyping and misassembly correction.
Dataset S2. GO terms enrichment for the strain-specific expanded gene families in the BAT93 genome.
Dataset S3. GO terms enrichment for the lineage-specific expanded gene families of P. vulgaris.
Dataset S4. GO terms enrichment for proteins duplicated at the four relative ages proposed in this study. Dataset S23. Co-expression modules with >100 genes Dataset S24. GO terms enrichment for the co-expression modules with >100 genes