Massive gene presence-absence variation shapes an open pan-genome in the Mediterranean mussel
Genome Biology volume 21, Article number: 275 (2020)
The Mediterranean mussel Mytilus galloprovincialis is an ecologically and economically relevant edible marine bivalve, highly invasive and resilient to biotic and abiotic stressors causing recurrent massive mortalities in other bivalves. Although these traits have been recently linked with the maintenance of a high genetic variation within natural populations, the factors underlying the evolutionary success of this species remain unclear.
Here, after the assembly of a 1.28-Gb reference genome and the resequencing of 14 individuals from two independent populations, we reveal a complex pan-genomic architecture in M. galloprovincialis, with a core set of 45,000 genes plus a strikingly high number of dispensable genes (20,000) subject to presence-absence variation, which may be entirely missing in several individuals. We show that dispensable genes are associated with hemizygous genomic regions affected by structural variants, which overall account for nearly 580 Mb of DNA sequence not included in the reference genome assembly. As such, this is the first study to report the widespread occurrence of gene presence-absence variation at a whole-genome scale in the animal kingdom.
Dispensable genes usually belong to young and recently expanded gene families enriched in survival functions, which might be the key to explain the resilience and invasiveness of this species. This unique pan-genome architecture is characterized by dispensable genes in accessory genomic regions that exceed by orders of magnitude those observed in other metazoans, including humans, and closely mirror the open pan-genomes found in prokaryotes and in a few non-metazoan eukaryotes.
The Mediterranean mussel Mytilus galloprovincialis Lamarck, 1819 (Bivalvia, Mytilida), a member of the M. edulis species complex, is an edible cosmopolitan bivalve mollusk and an important seafood product in Europe and China. This shellfish has been consumed by humans since 6000 BC, and its global production currently exceeds 400 thousand tons per year . Due to its invasive nature, this species has spread far beyond its native range, and it is considered a worldwide threat for autochthonous bivalve populations . Like many other marine invertebrates, mussels have separate sexes and reproduce by broadcast spawning. Upon the release of gametes into the open water, and after fertilization, larvae can travel long distances carried by the oceanic currents . Metamorphosis takes place during planktonic life, which ends after 1 to 2 months—depending on water temperature and food availability—with the settlement of juveniles and the start of the sessile adult life. Mussel beds are therefore usually composed by genetically heterogeneous individuals derived from large, randomly mating populations of different geographical origin. While genetic introgression in mussels has been broadly documented , a number of natural and genetic barriers concur in maintaining the genetic discontinuities observed both between different species belonging to the M. edulis species complex and between different lineages belonging to the same species .
Due to their filter-feeding habits, mussels are constantly exposed to a wide range of potentially pathogenic microorganisms, biotoxins, and anthropogenic pollutants. However, they display a remarkable resilience to stress and infections, can evolve novel traits in response to predation within a few generations , and have the ability to rapidly adapt to environmental changes, such as ocean acidification . Moreover, mussels are capable of significant bioaccumulation , without experiencing the massive mortalities often seen in other farmed bivalves [9, 10].
Although M. galloprovincialis displays a morphologically conserved karyotype compared with other mussels and has not undergone known whole-genome duplication or allopolyploidization events , it shares with other bivalves a relatively large and complex genome, characterized by high heterozygosity and numerous mobile elements [12,13,14,15,16,17]. These factors posed a serious challenge to previous assembly efforts, which resulted in extremely fragmented genome sequences for this species [18, 19] which, unlike the congeneric Mytilus coruscus  and a few other mussel species , still lacks a highly contiguous reference genome assembly.
The remarkable level of intraspecific sequence diversity which characterizes several bivalve immune gene families , together with the recent implication of high standing genetic variation within natural populations in the extraordinary capability of environmental adaptation of M. galloprovincialis , stimulate further investigation about the role played by the genomic complexity of this species in explaining its invasiveness and resilience . While a small but growing number of studies connected gene presence-absence variation (PAV) to the generation of this molecular diversity in a few gene families encoding antimicrobial peptides (AMPs) [22,23,24,25], it is presently unclear whether and to what extent this phenomenon is widespread in bivalve genomes.
Gene PAV is intimately linked to the pan-genome concept, which can be defined as a genome that includes a set of core genes found in all individuals and dispensable genes that are entirely missing in some individuals within a population . Pan-genomes have been extensively studied in prokaryotes and viruses, where a simple genome architecture and frequent lateral gene transfer events facilitate the acquisition of accessory genomic sequence that may provide an evolutionary advantage in the colonization of new ecological niches or in the interaction with the host [27,28,29]. In Eukaryotes, pan-genomes have been occasionally reported in plants, fungi, and microalgae, where they have been often associated with the development of phenotypic traits linked with environmental adaptation, resistance to diseases, and intraspecific differentiation [30,31,32,33,34,35,36]. Although a few studies have recently extended the pan-genome concept to the animal kingdom, to the best of our knowledge, these have so far mostly linked the dispensable fraction of animal genomes with intergenic regions, bringing little evidence in support of the association between accessory genomic regions and gene PAV with adaptation [37,38,39].
We here report an improved, highly contiguous reference genome assembly for M. galloprovincialis, obtained from the sequencing of a single female individual (nicknamed Lola) and provide evidence in support of massive gene PAV through the analysis of whole-genome resequencing (WGR) data from 14 additional individuals. The widespread observation of the gene PAV phenomenon, which involves 20,000 protein-coding genes significantly enriched in functions related with survival, provides strong evidence in support of the presence of an open pan-genome in the Mediterranean mussel.
An overview of the mussel reference genome
Our multi-step hierarchical de novo assembly strategy (Additional file 1: Data Note 1) resulted in a 1.28-Gb genome, of slightly smaller size compared to cytogenetic estimates , but of higher quality and contiguity compared to previous attempts [18, 19] (10,577 scaffolds; contig N50 = 71.42 kb; scaffold N50 = 207.64 kb). This genome shared some typical features of other bivalves, such as a low GC content (32%) and a widespread presence of repeats (43% of the assembly), but it was particularly rich in both protein-coding (60,338) and non-coding (75,973) genes. While these figures largely exceed those observed in most sequenced bivalve species [12, 14], they closely matched the numbers recently reported in the king scallop  and in the zebra mussel  (Additional file 1: Data Note 3). The reconstruction of the evolutionary relationships among M. galloprovincialis and 15 selected lophotrochozoan species [42,43,44,45], followed by an analysis of gene family trees , revealed that this large gene repertoire is the result of multiple lineage-specific duplication events that took place after the split between Mytilus and the rest of Mytilida (Fig. 1, Additional file 1: Data Note 5). Most mussel protein-coding genes were functionally annotated based on sequence similarity (56.17%) and were supported by transcriptomic evidence (78.70%). However, more than 5000 genes belong to recently acquired gene families specific of the Mytilus lineage, with uncharacterized function (Additional file 1: Data Note 20) .
A genome characterized by widespread heterozygosity and hemizygosity
The contribution of heterozygosity to the overall intraspecific genomic variation was estimated in Lola, Pura (a female individual subject of a previous assembly effort ), and in the 14 resequenced genomes by analyzing only those regions shared by all individuals. The average heterozygosity rate observed across individuals was 1.73 ± 0.24%, indicating that the mussel genome harbors a very high density of single-nucleotide polymorphisms (SNPs), 12–22-fold higher than the human genome [48,49,50] (Additional file 1: Data Note 6). This value is in line with previous reports [4, 7] and with genomic evidence collected in other mytilids [15, 16]. However, while this high heterozygosity rate may seem rather large when compared to other animal species, it does not appear to be the main source of intraspecific genomic diversity in M. galloprovincialis. Indeed, structural variation, and large insertion/deletion polymorphisms in particular, appear to be a key aspect in the genome of this species. In spite of the assembly strategy we adopted, aimed at the removal of sequence derived from alternative haplotypes, the haploid reference genome assembly still contained a high fraction (36.78%) of sequence with low coverage. The bimodal distribution of the read mapping coverage in Lola (Fig. 2b) clearly shows that such regions are found in a hemizygous state, i.e., they are present in only one of the two homologous chromosomes.
Massive gene presence-absence variation
The hemizygous fraction of the mussel genome does not only include intergenic regions, but also contains nearly one-third of the protein-coding genes annotated in the reference genome, as well as a significant fraction of non-coding genes (Additional file 1: Data Note 8 and 9). Even more surprisingly, our analyses revealed that 24.25% of the protein-coding genes and 16.71% of the non-coding genes were entirely missing in at least one of the resequenced genomes, i.e., they were subject to gene PAV [51, 52] (Fig. 2a–c). On average, each individual lacked 4829 (8.01%) protein-coding genes and 3744 (5.12%) non-coding genes found in Lola.
Unlike the 45,518 core protein-coding genes found in homozygous genomic regions in Lola and in all the resequenced genomes, the 14,820 genes subject to PAV are dispensable and often associated with hemizygous genomic regions, i.e., they can be present in either one, two, or in neither of the two homologous chromosomes of the different mussels analyzed. Indeed, most of the genes encoded by hemizygous genomic regions in Lola either displayed a sequencing coverage consistent with hemizygosity or were entirely absent in the resequenced genomes (58.50% and 23.23% on average, respectively, Additional file 1: Fig. S56). On the other hand, the overwhelming majority (98.05%) of the genes present in homozygous genomic regions in Lola were present in all the resequenced genomes, in 85.46% of cases with a sequencing coverage also consistent with homozygosity (Additional file 1: Fig. S57).
We ruled out the possibility that our observations were hampered by biases or confounding factors linked with the library preparation, sequencing, or bioinformatics analyses through a series of additional tests. First and foremost, the visual inspection on agarose gel of PCR amplification bands resulting from the analysis of twelve dispensable genes plus five core genes revealed a complete overlap between in silico predicted and experimentally observed PAV patterns (Fig. 3a, Additional file 1: Data Note 12). A similar PCR-based approach was extended to three independent families of full-sib mussels comprising the two parents and three first-generation offspring each. This allowed us to bring further support to the hypothesis that the dispensable genes are encoded by hemizygous genomic regions and that they follow a Mendelian mode of inheritance (Fig. 3d). Moreover, the presence of dispensable genes in hemizygous genomic regions was confirmed by the analysis of mapping data derived from a second round of sequencing of Lola obtained from a different tissue (gills, see Additional file 1: Fig. S33-S34), and the possible effect of mapping artifacts was excluded through computational simulations (Additional file 1: Data Note 10).
The mussel pan-genome
Our recursive pan-genome reassembly strategy (summarized in Additional file 1: Fig. S66), paired with a strict decontamination pipeline (Additional file 1: Fig. S67), led to the generation of 267,538 contigs unambiguously taxonomically assigned to M. galloprovincialis, accounting for 578.74 Mb sequence data not present in Lola. Consistently with the high contiguity observed for some dispensable genes contained in large (up to 30 kb) hemizygous genomic regions in Lola (Fig. 2d, Additional file 1: Data Note 17), several large assembled pan-genomic contigs had protein-coding potential. This process brought the cumulative size of the mussel pan-genome assembly to 1.86 Gb, and the total number of annotated protein-coding genes to 65,625 (20,106 of which are dispensable). On average, each resequenced genome included 1974 out of the 5286 newly annotated protein-coding genes. Overall, we estimate that each resequenced genome lacked, on average, 8141 dispensable genes found in the mussel pan-genome (Additional file 1: Data Note 15).
Characterization and functional enrichment of dispensable genes
Although mussel dispensable genes generally display a shorter ORF length and a lower gene architecture complexity than core genes, they mostly retain signatures of functionality, which include the presence of conserved regulatory elements and the lack of significant GC or codon usage bias (Additional file 1: Data Note 17). While dispensable genes display, on average, expression levels 3× lower than core genes, nearly 60% of them are supported by mild or strong transcriptional evidence, accounting for 3–10% of the global transcriptional activity, depending on the tissue considered (Additional file 1: Data Note 16). They are also on average evolutionarily younger, subject to an increased lineage-specific duplication rate and four times more likely to be taxonomically restricted than core genes (Additional file 1: Data Note 19–20).
We identified several mussel gene families significantly more prone to PAV than expected by chance (Fig. 4a). The functional annotation of the dispensable genes found both in the reference genome and in the recursively re-assembled pan-genomic contigs revealed an enrichment in functions related to survival. These may be provided by proteins with marked protein- or carbohydrate-binding properties (e.g., pattern recognition receptors like C1qDC proteins, FReDs, and Ig domain-containing proteins), involved in apoptosis pathways (e.g., DEATH and BIR), or playing a role in immune signaling (e.g., interferon-inducible and IMAP GTPases) (Additional file 1: Data Note 18).
Mussel gene families encoding antimicrobial peptides (AMPs) were also subject to massive gene PAV. Although several dozen different sequence variants were identified for each AMP family in the resequenced genomes, each individual mussel possessed a unique combination of a small number of variants, with very little overlap with other mussels from the same population (Fig. 4b, Additional file 1: Data Note 21). On the other hand, other gene families were significantly under-represented among dispensable genes. Notably, these included genes encoding transposable elements (and therefore found in multiple copies in the genome), such as reverse transcriptase and RNase H-like proteins, or with housekeeping functions (e.g., protein kinases and G protein-coupled receptors).
A pan-genome contains a set of core genes present in all individuals of the same species, which are fundamental for survival, and dispensable genes, which are only found in a subset of the individuals, and usually have accessory functions . The expansion of pan-genomic studies to Eukaryotes has later extended this concept to intergenic genomic regions and to the activity of mobile elements [37, 39, 50, 53]. However, we will refer here to the original pan-genome definition, associated with the accessory functions of dispensable genes and with the acquisition of the ability to quickly respond to selective pressure  and to colonize new ecological niches . The widespread occurrence of PAV in mussels is most certainly consistent with the gene-centric definition provided by Medini and colleagues  and reveals an “open” pan-genomic architecture with a high rate of dispensable to core genes, i.e., 1:3 (Fig. 3c).
The small size, simple organization and fast gene gain, loss, and horizontal transfer rates of bacterial and viral genomes [54,55,56] can explain the presence of a large number of dispensable genes in these organisms. However, pan-genomes have been also occasionally reported in eukaryotes, such as plants, fungi, and microalgae, where they may represent an unearthed source of intraspecific genomic diversity . For example, in some cultivated crops, dispensable genes contribute significantly to the development of agronomic traits [30,31,32], such as improved resistance to disease . Another key example of the adaptive importance of the pan-genome architecture is provided by some fungi, whose pathogenic potential, antimicrobial resistance, and host immune system avoidance are fueled by dispensable genes [33, 34, 58]. In the context of ecological adaptation, the cosmopolitan oceanic distribution and ability to thrive in different habitats of the coccolithophore Emiliania huxley can be explained by the acquisition of accessory metabolic activities provided by dispensable genes .
In spite of the growing number of reports of pan-genomes in eukaryotes, large-scale studies have been restricted to a few vertebrate species , targeting in particular human populations [37, 38, 50]. Moreover, the presence of accessory genomic regions not included in metazoan reference assemblies has never been associated with the massive occurrence of gene PAV, and the general impact of this phenomenon on animal intraspecific diversity has always been presumed to be minimal and in some cases deleterious .
Our observations, strongly supported by both experimental (Additional file 1: Data Note 12) and computational evidence (Additional file 1: Data Note 13), revealed that the dispensable fraction of the mussel pan-genome exceeds by approximately 5 and 15 times that of Sus scrofa and Homo sapiens, respectively [37, 39]. Moreover, while just 240 genes (i.e., 1.17% of the total) are presumably dispensable in humans , we show that 25% of mussel genes are subject to PAV and that each of the individual mussels resequenced in this study lacked on average 8141 dispensable genes identified in the pan-genome, pointing out that the fraction of protein-coding genes affected by this phenomenon in mussel is 20 times higher than in humans (Additional file 1: Data Note 22). To the best of our knowledge, PAV has been only marginally explored in bivalves as a phenomenon linked to a few gene families involved in immune functions, such as big defensins in Crassostrea gigas [22, 24] and myticins and myticalins in M. galloprovincialis [23, 25]. Therefore, this is the first study to report the widespread occurrence of gene PAV at a whole-genome scale in the animal kingdom.
Besides the 60,338 protein-coding genes present in Lola and the 5286 protein-coding genes associated with the recursively re-assembled contigs (Additional file 1: Data Note 15), the mussel pan-genome might include several thousand additional dispensable genes in natural populations which still remain unexplored (Additional file 1: Data Note 22). This open pan-genomic architecture is strongly supported by the recursive reassembly of 580 Mb DNA sequence not present in the reference assembly, by the observation that several dispensable genes were only observed in single individuals, and by the fact that the pan-genome assembly growth curve was far from reaching saturation (Additional file 1: Data Note 14).
The Mytilus genus has a complex evolutionary history, characterized by extensive gene flow among congeneric species, a process which is still ongoing in mosaic hybrid zones [60,61,62]. However, the analysis of nuclear and mitochondrial genetic markers ruled out the possibility that our resequenced individuals were hybrids between M. galloprovincialis and other Mytilus species (Additional file 1: Data Note 7), which suggest that dispensable genes are unlikely to be recently introgressed allelic variants that cannot be mapped to the reference genome due to their sequence divergence (Additional file 1: Data Note 10). While genetic admixture among contemporary mussel species cannot explain the mussel pan-genome architecture, the role of ancient hybridization and homologous recombination between ancestral Mytilus species remains to be investigated, as similar processes have been identified as the key drivers of PAV in plants . Similarly, the possible role of transposable elements in the origin and spread of the PAV phenomenon  will be fully elucidated only with the availability of a chromosome-scale assembly (Additional file 1: Data Note 17.4).
Regardless of the origin of mussel dispensable genes, their absence or presence in a hemizygous or homozygous state in the mussel genome suggests that the PAV phenomenon might be strictly dependent on the matching between paternal and maternal chromosomes during the fertilization process and that dispensable genes might have Mendelian inheritance (Fig. 3d). This hypothesis was confirmed by the observation of F1 proportions fully compatible with a Mendelian pattern in full-sibs resulting from a controlled cross between individuals showing PAV at the E3 ubiquitin ligase 1 gene (Fig. 3b).
Our finding that a large fraction of the mussel genome is in a single-allele state is congruent with the presence of chromosome structural variation  and with the significant intra-individual and inter-population variation in nuclear DNA content reported in previous cytogenetic studies . This also mirrors the situation previously described in other metazoans with high intraspecific genome diversity, such as the roundworm Caenorhabditis brenneri and the ascidian Ciona savignyi, which have genomes characterized by significant structural variations and frequent polymorphic indels [66, 67]. The very high amount of intraspecific genomic diversity revealed in our study may come at the cost of interfering with conventional homologous chromosome pairing, recombination, and segregation during meiosis.
The observation of highly skewed coverage profiles in the sequencing libraries from the gonadal tissue of some (but not all) male mussels, regardless of the stage of sexual maturation, may support this hypothesis (Additional file 1: Data Note 23). These results, confirmed by a second independent round of resequencing, were not obtained in non-reproductive tissues (i.e., gills) or in female individuals (Fig. 2a). We suspect that this observation may be the result of a significant presence of aneuploid gametes, potentially generated by an aberrant meiotic process linked with the high structural divergence between homologous chromosomes. Several studies have reported the presence of strong genetic barriers in Mytilus, acting both between and within species. Although intrinsic post-zygotic selection has been invoked as one of the most likely mechanisms underpinning the preservation of mosaic hybrid zones [62, 68], the nature of this process still remains to be elucidated. Here, we postulate that the reduced fertility of the offspring produced by individuals carrying “structurally incompatible” chromosomes may be key for explaining post-zygotic selection and the maintenance of the pan-genome architecture in mussels.
Whether the pan-genomic architecture of the mussel genome provides a selective advantage at the population level is a fundamental question. In our opinion, the large over-representation of genes involved in the response to stress and survival in the variable fraction of the mussel pan-genome (Additional file 1: Data Note 18) and the impact of PAV on the molecular diversification of AMPs (Fig. 4b, Additional file 1: Data Note 21) may suggest an adaptive role for the pan-genomic architecture. This would be consistent with the benefits provided by the development of a complex arsenal of immune molecules in sessile species characterized by high population densities such as mussels, where the spread of pathogens can be very efficient [69, 70]. We can speculate that the accessory functions provided by the 20,000 dispensable mussel genes might underpin an improved ability to adapt to challenging and varying environmental conditions, resulting in the cosmopolitan distribution and high invasiveness potential of this species [2, 6, 7] and explaining a high standing genetic variation in mussel populations . Since a large number of genes subject to PAV seem to belong to recently expanded, taxonomically restricted gene families with unknown function (Additional file 1: Data Note 19–20), the putative adaptive benefits of PAV might extend well beyond immunity and survival, with a potential impact on multiple aspects of mussel biology. At the present stage, in the absence of experimental data linking phenotypic variation and fitness to PAV in different ecological contexts, this remains a working hypothesis that needs to be formally tested.
Curiously, the genome of the congeneric non-invasive mussel M. coruscus , whose geographical distribution is limited to the Yellow Sea, displays a significantly lower number of protein-coding genes and a much lower level of heterozygosity compared with M. galloprovincialis (Additional file 1: Data Note 3 and 6), which suggests that the prevalence of PAV may vary from species to species. Future investigations, which will hopefully benefit from the release of additional chromosome-scale genome assemblies, should be aimed at investigating whether the pan-genomic architecture we described in the Mediterranean mussel is shared by other mollusks.
We provide, for the first time, significant evidence in support of the existence of widespread gene PAV in a metazoan pan-genome. The unusual structure of the mussel genome is the result of the massive presence of hemizygous genomic regions, which contain several thousand dispensable protein-coding genes. The enrichment of these genes in functions associated to resilience to stress and immune response warrants further investigation on the possible links between massive PAV and the evolutionary success of mussels, exemplified by the cosmopolitan distribution of this species in temperate marine coastal waters. Most likely, extensive PAV might be found in other cosmopolitan marine invertebrates characterized by broadcast spawning, very large effective population size and subject to similar environmental pressures, including other bivalve species where similarly high heterozygosity rates have been reported.
Reference genome sequencing and assembly
The genomic DNA extracted from the mantle tissue of a single female mussel individual nicknamed Lola, collected at Ría de Vigo (Spain), was processed to generate different sequencing libraries for sequencing on an Illumina HiSeq2000 platform. A short-insert (800 bp) paired-end (PE) library, whose output accounted for an expected 110X genome coverage , was complemented with two long-insert mate-pair (MP) libraries, with a fragment size of 3 and 5 kb, respectively. Moreover, a fosmid library with 150,000 clones was used to generate 150 pools containing 1000 clones each, and two additional independent fosmid-end (FE) libraries were also constructed and sequenced. Overall, 330 Gb of raw sequence data were produced by Illumina sequencing, and 15.63 Gb additional raw sequence data (10.5X coverage) was obtained from the sequencing of a SMRT library on a PacBio Sequel platform.
We followed a hybrid multi-step de novo assembly strategy, which combined algorithmic strategies from the de Bruijn graph and Overlap-Layout-Consensus methods (Additional file 1: Fig. S2), with the aim to produce a highly contiguous non-redundant haploid reference assembly of the mussel genome which, based on preliminary k-mer count analyses , was expected to display a considerable proportion of duplicated sequence and high heterozygosity.
Briefly, the non-redundant unitigs, built with ABySS  and merged with ASM  to remove large artefactual duplicated haplotype blocks, served as anchors for the hybrid assembly of PacBio reads with DBG2OLC . This noisy preliminary assembly was polished with Raccoon (https://github.com/lukud/raccoon-), using the sequencing data derived from the Illumina PE800 library. SSPACEv3.0  was then used to perform a first round of scaffolding using all the available PE, MP, and FE libraries available, and a second round of scaffolding with PacBio reads was performed with SSPACE-LongRead . The scaffolding procedure was re-iterated a second time, with both Illumina and PacBio reads, and was followed by a MP and FE libraries-derived gap-closing step performed with PBJelly .
The resulting assembly was subjected to an additional round of polishing with Proovread , and the coding portion of the genome was further refined with GATK , based on the alignment between the genome sequence and available transcriptome data generated with STAR . RNA-seq data was also used for an additional round of scaffolding with AGOUTI v0.2.4 . Finally, a local region of assembly, which included the myticin gene cluster, was improved by the combination of Platanus  and DBG2OLC .
All the aforementioned steps of the assembly were paired with strict decontamination procedures, which employed KRAKEN 2  and BLASTN [84, 85]. These were aimed at removing exogenous DNA sequence which may have resulted from accidental environmental or laboratory contamination, a common issue in NGS approaches . A final analysis of our final assembly (mg10) using Blobtools v1.1.1 [87, 88] confirmed the absence of known contaminants in the genome sequence. Detailed information concerning the DNA extraction, library preparation, sequencing, genome assembly, and decontamination processes are provided in Additional file 1: Data Note 1.2.
Quality evaluation, gene model construction, and functional annotation
The completeness of the genome assembly was estimated with BUSCO v.3 , using a set of 843 conserved metazoan single-copy orthologs as a reference, and the resulting data about the present, fragmented, duplicated, and missing gene models were compared with previous genome assembly efforts carried out in M. galloprovincialis [18, 19] (Additional file 1: Data Note 1.3.3). The residual presence of artefactual duplications was assessed with the Kmer Analysis Toolkit . Consensus gene models were obtained by combining transcript alignments generated with PASA v 2.0.2 , bivalve protein alignments created with SPALN v2.2.2 , and ab initio gene predictions obtained with GeneID , GeneMark-ES , GlimmerHMM , and Augustus . Evidences derived from these methods were assigned different weights and combined into consensus CDS predictions with EvidenceModeler-1.1.1. Gene models were subjected to an additional round of quality control to refine the annotation of UTRs and alternatively spliced exons (Additional file 1: Data Note 2.1 and 2.2). Gene models were functionally annotated with InterPro , KEGG , Blast2GO , SignalP , and NCBI CDsearch  (Additional file 1: Data Note 2.3). The gene models supported by PASA, but lacking a CDS, were considered as non-coding genes and included in a separate annotation track (Additional file 1: Data Note 2.5).
The completeness and integrity of gene models, as well as the genome assembly size and the number and density of gene models, were compared with several other recently sequenced molluscan genomes (Additional file 1: Data Note 3). Each gene model was assigned a support level (high, mild, or low) based on evidence obtained from Lola gills and digestive gland transcriptomes, as well as from several publicly available M. galloprovincialis RNA-seq datasets (Additional file 1: Data Note 4).
Whole-genome resequencing of 14 additional individuals and pan-genome recursive assembly
The genome of 14 additional adult M. galloprovincialis specimens, belonging to two independent European populations (Ría de Vigo, Spain, 9 individuals, and Goro lagoon, Italy, 6 individuals, Additional file 1: Data Note 6.1), was resequenced on an Illumina HiSeq 2500 platform, aiming at achieving a 35X genome sequencing coverage. Raw sequencing data from the previous assembly of Pura was also included in this analysis . In total, besides Lola, whole-genome resequencing (WGR) data of comparable quality was obtained for six female and eight male individuals. Trimmed sequencing reads were mapped on the mussel reference genome, and unmapped reads were collected and de novo assembled with the CLC Genomics Workbench v.20 (Qiagen, Hilden, Germany). Newly assembled contigs were added to the reference assembly, building a mussel pan-genome. This process, inspired by a similar approach previously carried out by other authors , was performed recursively for the 14 individuals (+ Pura), mapping the reads obtained from each genome against the growing pan-genome (Additional file 1: Fig. S66). All the de novo assembled contigs underwent a strict filtering process, aimed at removing exogenous contaminants, based on strict coverage and GC content criteria, and the detection of BLAST matches against known contaminants (Additional file 1: Fig. S67). Assembled contigs satisfying threshold quality criteria (detailed in Additional file 1: Data Note 14) were annotated following the same procedure described above for the reference genome.
Presence-absence variation analysis
Quality-trimmed sequencing reads obtained from all individuals were independently mapped to the reference assembly and to the accessory pan-genomic contigs, with BWA mem (v0.7.15) . As detailed in Additional file 1: Data Note 8, the mapping strategy we used aimed at tolerating multi-mappings (i.e., the alignment of reads with similar scores on different genomic positions). Exon mapping data, extracted with BEDtools , were used to calculate the average read coverage per base within the coding region of each gene. These values, normalized by the expected haploid genome size, allowed us to classify genes either as “present” or “absent,” depending on whether their normalized coverage exceeded 0.25 (i.e., less than 25% of expectations for a gene found in a hemizygous genomic region). This strict threshold was set to put a major focus on the high-confidence identification of putatively absent genes, at the cost of the detection of some false positives in the set of “present” genes. The same procedure was extended to non-coding genes annotated in the reference genome (Additional file 1: Data Note 9). The expected hemizygous and homozygous peaks of coverage for each genome were estimated with an accurate calibration pipeline, based on a set of over 4000 genes displaying high coverage stability across individuals, as explained in Additional file 1: Data Note 23.
The comparative analysis of gene PAV among individuals led to their categorization either as core (i.e., present on all individuals) or dispensable (i.e., absent in one or more individuals) genes. Please note that all the genes annotated in the accessory pan-genomic contigs were, by definition, dispensable, as they were not found in Lola, as verified by an accurate re-mapping of genomic reads obtained from both gills and mantle (Additional file 1: Data Note 14).
Validation of PAV patterns and further characterization of dispensable genes
We explored whether the PAV patterns observed could be explained by technical artifacts linked with mapping stringency criteria or by high sequence diversity between allelic variants, computationally simulating the effect of decreasing mapping stringency on mapping rates, and of increasing diversity between allelic variants on the drop of observed sequencing coverage (Additional file 1: Data Note 10). We further confirmed the widespread nature of PAV in mussels though the analysis of the distribution of the genes encoded by the accessory pan-genomic contigs in the resequenced individuals (Additional file 1: Data Note 15) and identified further cases of PAV through the analysis of several publicly available M. galloprovincialis transcriptomes (Additional file 1: Data Note 13).
The PAV phenomenon was further confirmed by PCR on 13 mussel genomes, through the evaluation of the presence-absence of specific amplification bands on agarose for 12 selected dispensable gene targets, expected to produce discordant PCR results across individuals due to PAV, and 5 core genes. These experimental observations were compared with in silico predictions (see the details in Additional file 1: Data Note 12.1). Moreover, similar PCR analyses were extended to three different families of full-sib mussels, produced after induced spawning of a single male and a single female individual, to test whether the presence-absence of dispensable genes could be explained by Mendelian inheritance (see the details in Additional file 1: Data Note 12.2.).
We assessed to what extent dispensable genes were associated with hemizygous genomic regions by evaluating whether their coverage was consistent with the presence of zero, one, or two alleles in each individual (Additional file 1: Data Note 11). Particular attention was focused on the analysis of a few selected large genomic regions characterized by the presence of several contiguous dispensable genes (see Additional file 1: Data Note 17.1). We characterized the transcriptional activity of dispensable genes in different tissues through the mapping of several RNA-seq datasets (as detailed in Additional file 1: Data Note 16) and evaluated whether they were associated with significant codon usage bias, functional promoters, architectural alterations, and flanking transposable elements (Additional file 1: Data Note 17).
Dispensable genes were subjected to functional enrichment analyses with hypergeometric tests , which identified over- or under-represented associated Gene Ontology terms and conserved domain annotations, based on a FDR-corrected p value threshold of 0.05 (Additional file 1: Data Note 18). The phylome data (see below, and Additional file 1: Data Note 5) allowed us to investigate the association of dispensable genes with recent linage-specific gene cluster expansions, through the comparison between the rates of gene duplication with the background rate of the genome (Additional file 1: Data Note 19), and to evaluate their overlap with taxonomically restricted gene (TRG) families (Additional file 1: Data Note 20).
The mussel phylome was reconstructed using the PhylomeDB pipeline , as detailed in Additional file 1: Data Note 5.1. This approach, which involved 16 target species, enabled the detection of orthology and paralogy relationships (Additional file 1: Data Note 5.2), lineage-specific gene duplications (Additional file 1: Data Note 5.4), and associated significantly enriched annotations, based on the genes annotated in Lola (Additional file 1: Data Note 5.5). Moreover, species trees were built using three different approaches: (i) a maximum likelihood analysis, carried out with PhyML v3.0 on a concatenated gene alignment dataset ; (ii) a gene-tree parsimony analysis, carried with the dup-tree algorithm ; and (iii) a coalescent-based analysis, performed with ASTRAL-III [44, 104] (Additional file 1: Data Note 5.3).
Assessment of genetic introgression from congeneric species
Exploiting previously published data and experimentally validated haplotypes, we inspected whether Lola, Pura, and the resequenced mussel genomes displayed genetic signatures consistent with their identification as part of a “pure” M. galloprovincialis lineage, or any evidence of hybridization with congeneric species M. edulis and M. trossulus existed. For this purpose, we recovered in each individual the two alleles for three target nuclear loci Glu-5′ [105, 106], mac-1 [107,108,109], and EFbis [110, 111], and the sequence of the mitochondrial markers 16S rRNA and COI [60, 112, 113]. As detailed in Additional file 1: Data Note 7, amplicon size was predicted by in silico PCR, and the nucleotide sequences, aligned with MUSCLE  with sequences of known taxonomic origin retrieved from GenBank, were used to build neighbor joining (NJ) phylogenetic trees .
Analysis of target PAV-associated gene families
We collected the nucleotide sequences of the core gene EEF1A1 and its dispensable paralogous gene EEF1A1-bis from Lola, Pura, and the 14 resequenced individuals. Similarly, all sequence variants available for the myticin, mytilin, big defensins, mytimycin, mytimacin, and myticalin gene families were recovered, with particular focus on the exons encoding the mature peptide region of these AMPs. We studied their association with the PAV phenomenon and investigated their molecular phylogeny and the ongoing process of pseudogenization of several variants with a NJ phylogenetic reconstruction approach (Additional file 1: Data Note 21).
Availability of data and materials
All the genome sequencing data obtained in this work, as well as the genome assembly and the annotation, are available in the European Nucleotide Archive (ENA) under the project IDs PRJEB24883 (Lola)  and PRJNA230138 (resequencing) . A genome browser and a blast server for this genome can be accessed in our local server (https://denovo.cnag.cat/mussel/). The mussel phylome is available for download or browsing at PhylomeDB .
FAO Fisheries and Aquaculture Department. Cultured aquatic species information programme. Mytilus galloprovincialis. Cultured aquatic species information programme. Rome: FAO Fisheries and Aquaculture Department; 2020. http://www.fao.org/fishery/culturedspecies/Mytilus_galloprovincialis/en.
Bonham V. Mytilus galloprovincialis. Invasive species compendium. Wallingford: CAB; 2017.
Gosling E. Bivalve molluscs: biology, ecology and culture. Hoboken: Blackwell Publishing Ltd; 2003.
Fraïsse C, Belkhir K, Welch JJ, Bierne N. Local interspecies introgression is the main cause of extreme levels of intraspecific differentiation in mussels. Mol Ecol. 2016;25:269–86.
El Ayari T, Trigui El Menif N, Hamer B, Cahill AE, Bierne N. The hidden side of a major marine biogeographic boundary: a wide mosaic hybrid zone at the Atlantic–Mediterranean divide reveals the complex interaction between natural and genetic barriers in mussels. Heredity. Nat Publ Group; 2019;122:770–784.
Freeman AS, Byers JE. Divergent induced responses to an invasive predator in marine mussel populations. Science. 2006;313:831–3.
Bitter MC, Kapsenberg L, Gattuso J-P, Pfister CA. Standing genetic variation fuels rapid adaptation to ocean acidification. Nat Commun. 2019;10:5821 Nature Publishing Group.
Goldberg ED. The mussel watch — a first step in global marine monitoring. Mar Pollut Bull. 1975;6:111.
Barbosa Solomieu V, Renault T, Travers M-A. Mass mortality in bivalves and the intricate case of the Pacific oyster, Crassostrea gigas. J Invert Pathol. 2015;131:2–10.
Xiao J, Ford SE, Yang H, Zhang G, Zhang F, Guo X. Studies on mass summer mortality of cultured zhikong scallops (Chlamys farreri Jones et Preston) in China. Aquaculture. 2005;250:602–15.
Pérez-García C, Morán P, Pasantes JJ. Karyotypic diversification in Mytilus mussels (Bivalvia: Mytilidae) inferred from chromosomal mapping of rRNA and histone gene clusters. BMC Genet. 2014;15:84.
Zhang G, Fang X, Guo X, Li L, Luo R, Xu F, et al. The oyster genome reveals stress adaptation and complexity of shell formation. Nature. 2012;490:49–54.
Du X, Fan G, Jiao Y, Zhang H, Guo X, Huang R, et al. The pearl oyster Pinctada fucata martensii genome and multi-omic analyses provide insights into biomineralization. Gigascience. 2017;6:1–12.
Wang S, Zhang J, Jiao W, Li J, Xun X, Sun Y, et al. Scallop genome provides insights into evolution of bilaterian karyotype and development. Nat Ecol Evol. 2017;1:s41559–017–0120–017.
Sun J, Zhang Y, Xu T, Zhang Y, Mu H, Zhang Y, et al. Adaptation to deep-sea chemosynthetic environments as revealed by mussel genomes. Nat Ecol Evol. 2017;1:0121.
Uliano-Silva M, Dondero F, Dan Otto T, Costa I, Lima NCB, Americo JA, et al. A hybrid-hierarchical genome assembly strategy to sequence the invasive golden mussel, Limnoperna fortunei. Gigascience. 2018;7:1–10.
Kenny NJ, McCarthy SA, Dudchenko O, James K, Betteridge E, Corton C, et al. The gene-rich genome of the scallop Pecten maximus. Gigascience. 2020;9 Oxford Academic:giaa037.
Murgarella M, Puiu D, Novoa B, Figueras A, Posada D, Canchaya C. A first insight into the genome of the filter-feeder mussel Mytilus galloprovincialis. PLoS One. 2016;11:e0151561.
Nguyen TTT, Hayes BJ, Ingram BA. Genetic parameters and response to selection in blue mussel (Mytilus galloprovincialis) using a SNP-based pedigree. Aquaculture. 2014;420–421:295–301.
Li R, Zhang W, Lu J, Zhang Z, Mu C, Song W, et al. The whole-genome sequencing and hybrid assembly of Mytilus coruscus. Front Genet. 2020;11 Frontiers:440.
Gerdol M, Gomez-Chiarri M, Castillo MG, Figueras A, Fiorito G, Moreira R, et al. Immunity in molluscs: recognition and effector mechanisms, with a focus on bivalvia. In: Cooper EL, editor. Advances in comparative immunology. Cham: Springer International Publishing; 2018. p. 225–341.
Rosa RD, Alonso P, Santini A, Vergnes A, Bachère E. High polymorphism in big defensin gene expression reveals presence–absence gene variability (PAV) in the oyster Crassostrea gigas. Dev Comp Immunol. 2015;49:231–8.
Leoni G, De Poli A, Mardirossian M, Gambato S, Florian F, Venier P, et al. Myticalins: a novel multigenic family of linear, cationic antimicrobial peptides from marine mussels (Mytilus spp.). Marine Drugs. 2017;15:261.
Gerdol M, Schmitt P, Venier P, Rocha G, Rosa RD, Destoumieux-Garzón D. Functional insights from the evolutionary diversification of big defensins. Front Immunol. 2020;11:758.
Rey-Campos M, Novoa B, Pallavicini A, Gerdol M, Figueras A. Comparative genomics reveals a significant sequence variability of myticin genes in Mytilus galloprovincialis. Biomolecules. 2020;10:943 Multidisciplinary Digital Publishing Institute.
Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R. The microbial pan-genome. Curr Opin Genet Dev. 2005;15:589–94.
de Brito AF, Braconi CT, Weidmann M, Dilcher M, Alves JMP, Gruber A, et al. The pangenome of the Anticarsia gemmatalis multiple nucleopolyhedrovirus (AgMNPV). Genome Biol Evol. 2015;8:94–108.
McInerney JO, McNally A, O’Connell MJ. Why prokaryotes have pangenomes. Nat Microbiol. 2017;2:17040.
Choudoir MJ, Panke-Buisse K, Andam CP, Buckley DH. Genome surfing as driver of microbial genomic diversity. Trends Microbiol. 2017;25:624–36.
Hirsch CN, Foerster JM, Johnson JM, Sekhon RS, Muttoni G, Vaillancourt B, et al. Insights into the maize pan-genome and pan-transcriptome. Plant Cell. 2014;26:121–35.
Marroni F, Pinosio S, Morgante M. Structural variation and genome complexity: is dispensable really dispensable? Curr Opin Plant Biol. 2014;18:31–6.
Golicz AA, Batley J, Edwards D. Towards plant pangenomics. Plant Biotechnol J. 2016;14:1099–105.
Plissonneau C, Hartmann FE, Croll D. Pangenome analyses of the wheat pathogen Zymoseptoria tritici reveal the structural basis of a highly plastic eukaryotic genome. BMC Biol. 2018;16:5.
McCarthy CGP, Fitzpatrick DA. Pan-genome analyses of model fungal species. Microb Genom. 2019;5:e000243.
Read BA, Kegel J, Klute MJ, Kuo A, Lefebvre SC, Maumus F, et al. Pan genome of the phytoplankton Emiliania underpins its global distribution. Nature. 2013;499:209–13.
Hübner S, Bercovich N, Todesco M, Mandel JR, Odenheimer J, Ziegler E, et al. Sunflower pan-genome analysis shows that hybridization altered gene content and disease resistance. Nat Plants. 2019;5:54–62 Nature Publishing Group.
Li R, Li Y, Zheng H, Luo R, Zhu H, Li Q, et al. Building the sequence map of the human pan-genome. Nature Biotechnol. 2010;28:57–63 Nature Publishing Group.
Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526:75–81.
Tian X, Li R, Fu W, Li Y, Wang X, Li M, et al. Building a sequence map of the pig pan-genome from multiple de novo assemblies and Hi-C data. Sci China Life Sci. 2020;63:750–63.
Ieyama H, Kameoka O, Tan T, Yamasaki J. Chromosomes and nuclear DNA contents of some species of Mytilidae. Venus. 1994;53:327–31.
McCartney MA, Auch B, Kono T, Mallez S, Zhang Y, Obille A, et al. The genome of the zebra mussel, Dreissena polymorpha: a resource for invasive species research. bioRxiv. 2019. https://www.biorxiv.org/content/10.1101/696732v1.
Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003;52:696–704.
Wehe A, Bansal MS, Burleigh JG, Eulenstein O. DupTree: a program for large-scale phylogenetic analyses using gene tree parsimony. Bioinformatics. 2008;24:1540–1.
Zhang C, Rabiee M, Sayyari E, Mirarab S. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics. 2018;19:153.
Gabaldón T. Large-scale assignment of orthology: back to phylogenetics? Genome Biol. 2008;9:235.
Huerta-Cepas J, Gabaldón T. Assigning duplication events to relative temporal scales in genome-wide studies. Bioinformatics. 2011;27:38–45.
Khalturin K, Hemmrich G, Fraune S, Augustin R, Bosch TCG. More than just orphans: are taxonomically-restricted genes important in evolution? Trends Genet. 2009;25:404–13.
Leffler EM, Bullaughey K, Matute DR, Meyer WK, Ségurel L, Venkat A, et al. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 2012;10:e1001388.
The International SNP Map Working Group. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001;409:928–33.
Sherman RM, Forman J, Antonescu V, Puiu D, Daya M, Rafaels N, et al. Assembly of a pan-genome from deep sequencing of 910 humans of African descent. Nat Genet. 2019;51:30–5.
Springer NM, Ying K, Fu Y, Ji T, Yeh C-T, Jia Y, et al. Maize inbreds exhibit high levels of copy number variation (CNV) and presence/absence variation (PAV) in genome content. PLoS Genet. 2009;5:e1000734.
Trowsdale J, Barten R, Haude A, Andrew Stewart C, Beck S, Wilson MJ. The genomic context of natural killer receptor extended gene families. Immunol Rev. 2001;181:20–38.
Morgante M, De Paoli E, Radovic S. Transposable elements and the plant pan-genomes. Curr Opin Plant Biol. 2007;10:149–55.
Tettelin H, Riley D, Cattuto C, Medini D. Comparative genomics: the bacterial pan-genome. Curr Opin Microbiol. 2008;11:472–7.
Kuo C-H, Ochman H. The fate of new bacterial genes. FEMS Microbiol Rev. 2009;33:38–43.
Aherfi S, Andreani J, Baptiste E, Oumessoum A, Dornas FP, Andrade AC dos SP, et al. A large open pangenome and a small core genome for giant pandoraviruses. Front Microbiol. 2018;9 Frontiers:1486.
Khan AW, Garg V, Roorkiwal M, Golicz AA, Edwards D, Varshney RK. Super-pangenome by integrating the wild side of a species for accelerated crop improvement. Trends in Plant Science. 2020;25:148–58 Elsevier.
Badet T, Oggenfuss U, Abraham L, McDonald BA, Croll D. A 19-isolate reference-quality global pangenome for the fungal wheat pathogen Zymoseptoria tritici. BMC Biol. 2020;18:12.
Stammnitz MR, Coorens THH, Gori KC, Hayes D, Fu B, Wang J, et al. The origins and vulnerabilities of two transmissible cancers in Tasmanian devils. Cancer Cell. 2018;33:607–619.e15.
Śmietanka B, Burzyński A, Hummel H, Wenne R. Glacial history of the European marine mussels Mytilus, inferred from distribution of mitochondrial DNA lineages. Heredity. 2014;113:hdy201423.
Bierne N, Borsa P, Daguin C, Jollivet D, Viard F, Bonhomme F, et al. Introgression patterns in the mosaic hybrid zone between Mytilus edulis and M. galloprovincialis. Mol Ecol. 2003;12:447–61.
Ayari TE, Menif NTE, Hamer B, Cahill AE, Bierne N. The hidden side of a major marine biogeographic boundary: a wide mosaic hybrid zone at the Atlantic–Mediterranean divide reveals the complex interaction between natural and genetic barriers in mussels. Heredity. 2019;122:770–84.
Hurgobin B, Golicz AA, Bayer PE, Chan CK, Tirnaz S, Dolatabadian A, et al. Homoeologous exchange is a major cause of gene presence/absence variation in the amphidiploid Brassica napus. Plant Biotechnol J. 2018;16:1265–74.
Martínez-Lage A, González-Tizón A, Méndez J. Chromosome differences between European mussel populations (genus Mytilus). Caryologia. 1996;49:343–55.
Bihari N, Mičić M, Batel R, Zahn RK. Flow cytometric detection of DNA cell cycle alterations in hemocytes of mussels (Mytilus galloprovincialis) off the Adriatic coast, Croatia. Aquat Toxicol. 2003;64:121–9.
Small KS, Brudno M, Hill MM, Sidow A. Extreme genomic variation in a natural population. Proc Natl Acad Sci U S A. 2007;104:5698–703.
Dey A, Chan CKW, Thomas CG, Cutter AD. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 2013;110:11056–60.
Bierne N, Bonhomme F, Boudry P, Szulkin M, David P. Fitness landscapes support the dominance theory of post-zygotic isolation in the mussels Mytilus edulis and M. galloprovincialis. Proc R Soc London B Biol Sci. 2006;273:1253–60.
Rolff J, Siva-Jothy MT. Invertebrate ecological immunology. Science. 2003;301:472–5.
Cremer S, Pull CD, Fürst MA. Social immunity: emergence and evolution of colony-level disease protection. Annu Rev Entomol. 2018;63:105–23.
Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–70.
Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol İ. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–23.
Cruz F, Julca I, Gómez-Garrido J, Loska D, Marcet-Houben M, Cano E, et al. Genome sequence of the olive tree, Olea europaea. GigaScience. 2016;5:29.
Ye C, Hill CM, Wu S, Ruan J, Ma Z(S). DBG2OLC: efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies. Sci Rep. 2016;6:31900.
Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011;27:578–9.
Boetzer M, Pirovano W. SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information. BMC Bioinformatics. 2014;15:211.
English AC, Richards S, Han Y, Wang M, Vee V, Qu J, et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One. 2012;7:e47768.
Hackl T, Hedrich R, Schultz J, Förster F. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics. 2014;30:3004–11.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21.
Zhang SV, Zhuo L, Hahn MW. AGOUTI: improving genome assembly and annotation using transcriptome data. GigaScience. 2016;5:31.
Kajitani R, Toshimoto K, Noguchi H, Toyoda A, Ogura Y, Okuno M, et al. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res. 2014;24:1384–95.
Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20:257.
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
Lusk RW. Diverse and widespread contamination evident in the unmapped depths of high throughput sequencing data. PLoS One. 2014;9:e110808.
Laetsch DR, Blaxter ML. BlobTools: interrogation of genome assemblies. F1000Res. 2017;6:1287.
Challis R, Richards E, Rajan J, Cochrane G, Blaxter M. BlobToolKit – interactive quality assessment of genome assemblies. G3 (Bethesda). 2020;10:1361–74.
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–2.
Mapleson D, Garcia Accinelli G, Kettleborough G, Wright J, Clavijo BJ. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics. 2017;33:574–6.
Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 2008;9:R7.
Iwata H, Gotoh O. Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features. Nucleic Acids Res. 2012;40:e161.
Parra G, Blanco E, Guigó R. GeneID in drosophila. Genome Res. 2000;10:511–5.
Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 2005;33:6494–506 Oxford Academic.
Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003;19(Suppl 2):ii215–25.
Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009;37:D211–5.
Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016;44:D457–62.
Conesa A, Götz S. Blast2GO: a comprehensive suite for functional analysis in plant genomics. Int J Plant Genomics. 2008;2008:619832.
Petersen TN, Brunak S, von Heijne G, Nielsen H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Meth. 2011;8:785–6.
Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, et al. CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res. 2011;39:D225–9.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.
Falcon S, Gentleman R. Hypergeometric testing used for gene set enrichment analysis. Bioconductor case studies. New York: Springer; 2008. p. 207–20.
Mirarab S, Warnow T. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics. 2015;31:i44–52.
Rawson PD, Joyner KL, Meetze K, Hilbish TJ. Evidence for intragenic recombination within a novel genetic marker that distinguishes mussels in the Mytilus edulis species complex. Heredity. 1996;77:599–607.
Inoue K, Waite JH, Matsuoka M, Odo S, Harayama S. Interspecific variations in adhesive protein sequences of Mytilus edulis, M. galloprovincialis, and M. trossulus. Biol Bull. 1995;189:370–5.
Daguin C, Borsa P. Genetic characterisation of Mytilus galloprovincialis Lmk. in North West Africa using nuclear DNA markers. J Exp Mar Biol Ecol. 1999;235:55–65.
Daguin C, Bonhomme F, Borsa P. The zone of sympatry and hybridization of Mytilus edulis and M. galloprovincialis, as described by intron length polymorphism at locus mac-1. Heredity (Edinb). 2001;86:342–54.
Ohresser M, Borsa P, Delsert C. Intron-length polymorphism at the actin gene locus mac-1: a genetic marker for population studies in the marine mussels Mytilus galloprovincialis Lmk. and M. edulis L. Mol Marine Biol Biotechnol. 1997;6:123–30.
Bierne N, David P, Boudry P, Bonhomme F. Assortative fertilization and selection at larval stage in the mussels Mytilus edulis and M. galloprovincialis. Evolution. 2002;56:292–8.
Bierne N, David P, Langlade A, Bonhomme F. Can habitat specialisation maintain a mosaic hybrid zone in marine bivalves? Mar Ecol Prog Ser. 2002;245:157–70.
Gérard K, Bierne N, Borsa P, Chenuil A, Féral J-P. Pleistocene separation of mitochondrial lineages of Mytilus spp. mussels from Northern and Southern Hemispheres and strong genetic differentiation among southern populations. Mol Phylogenet Evol. 2008;49:84–91.
Stewart DT, Sinclair-Waters M, Rice A, Bunker RA, Robicheau BM, Breton S. Distribution and frequency of mitochondrial DNA polymorphisms in blue mussel (Mytilus edulis) populations of southwestern Nova Scotia (Canada). Can J Zool. 2018;96:608–13.
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–7.
Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–25.
Gerdol M, Moreira R, Cruz F, Gómez-Garrido J, Vlasova A, Rosani U, et al. PRJEB24883. ENA. https://www.ebi.ac.uk/ena/browser/view/PRJEB24883 (2020).
Gerdol M, Moreira R, Cruz F, Gómez-Garrido J, Vlasova A, Rosani U, et al. PRJNA230138. ENA. https://www.ebi.ac.uk/ena/browser/view/PRJNA230138 (2020).
Huerta-Cepas J, Capella-Gutiérrez S, Pryszcz LP, Marcet-Houben M, Gabaldón T. Phylome ID: 599. PhylomeDB. http://phylomedb.org/phylome_599?q=phylome_browser&phyid=599 (2020).
We would like to thank the following members of the CNAG Data Analysis Team for their help evaluating intermediate assemblies and their useful comments on Variant Calling: Sophia Derdak, Marcos Fernández, Steve Laurie, Jordi Morata, and Raúl Tonda. We are grateful to Edoardo Turolla from Istituto Delta (Goro, Italy) for the support provided in mussel sampling. We would like to thank María Gasset for critically reading the manuscript and providing suggestions for its improvement.
Peer review information
Kevin Pang was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
The review history is available as Additional file 3.
This work was conducted with the support of the projects AGL2011-14507-E, AGL2015-65705-R, RTI2018-095997-B-I00 (Ministerio de Ciencia, Innovación y Universidades, Spain) and INCITE 10PXIB402096PR, IN607B 2016/12 (Consellería de Economía, Emprego e Industria - GAIN, Xunta de Galicia). Antonio Figueras, Beatriz Novoa, Rebeca Moreira, Alberto Pallavicini, Marco Gerdol, Paola Venier, and Umberto Rosani are supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 678589. David Posada is supported by the European Research Council, the Spanish Ministry of Economy and Competitiveness, and Xunta de Galicia. We acknowledge the support of the Spanish Ministry of Science and Innovation to the EMBL partnership, the Centro de Excelencia Severo Ochoa, the CERCA Programme/Generalitat de Catalunya, the Spanish Ministry of Science and Innovation through the Instituto de Salud Carlos III, the Generalitat de Catalunya through Departament de Salut and Departament d’Empresa i Coneixement, and the Co-financing by the Spanish Ministry of Science and Innovation with funds from the European Regional Development Fund (ERDF) corresponding to the 2014-2020 Smart Growth Operating Program.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Manuscript data notes. This file includes additional data notes 1–24, along with all the supplementary figures and most supplementary tables referenced in the main text.
Large supplementary tables. This file includes Table S1, Table S3, Table S4, Table S6, Table S16, Table S17, Table S18, Table S19, Table S20, Table S21, Table S22, Table S23, Table S24, Table S25, Table S26, Table S27, Table S28, Table S29, Table S30, Table S31, Table S32, Table S33, Table S34, Table S40, Table S52, Table S53, Table S54, Table S55, Table S56 and Table S57.
About this article
Cite this article
Gerdol, M., Moreira, R., Cruz, F. et al. Massive gene presence-absence variation shapes an open pan-genome in the Mediterranean mussel. Genome Biol 21, 275 (2020). https://doi.org/10.1186/s13059-020-02180-3