DNA extraction, Illumina/PacBio library preparation and sequencing
Fresh developing fruits of “Rojo Pasión” were frozen in liquid nitrogen immediately after being sampled in Murcia, Spain. After being shipped to the Max Planck Institute for Plant Breeding Research (MPIPZ, Cologne, Germany), DNA was extracted from the mesocarp and exocarp of the fruits using the Plant DNA Kit of Macherey-Nagel™ to create a PacBio sequencing library. Meanwhile, fresh leaves were sampled from the parental cultivars (“Currot” and “Orange Red”) at the experimental field of CEBAS-CSIC in Murcia, Spain, and Illumina short-read libraries were prepared after DNA extraction using the Plant DNA Kit of Macherey-Nagel™.
All libraries were sequenced with the respective sequencing machines (Illumina HiSeq 3000 and PacBio Sequel I) at Max Planck Genome-centre Cologne (MP-GC), which led to 19.9-Gb long reads for “Rojo Pasión” (PacBio; Additional file 1: Fig. S2) and 15.7- and 16.2-Gb short reads for the parental cultivars (Illumina). Note that the parental WGS data were only used for haplotype validation and for sorting the individual chromosome assemblies to two sets of eight chromosomes to match the inheritance of the chromosomes.
Pollen nuclei DNA extraction, 10x sc-CNV library preparation and sequencing
Dormant shoots of “Rojo Pasión” bearing developed flower buds were collected in Murcia, Spain. Then, the shoots were shipped at 4 °C to MPIPZ (Cologne, Germany) and were grown in long-day conditions in the greenhouse. Flowers at the pre-anthesis stage were frozen in liquid nitrogen. Anthers from ten “Rojo Pasión” [27] flowers were extracted with forceps and submerged in woody pollen buffer (WPB) [42]. Around 500,000 pollen grains were extracted from anthers by vortexing them in WPB. The nuclei were isolated from the pollen using a modified bursting method [30]. Isolated pollen was prefiltered (100 μm) and bursted (30 μm) using Celltrics™ sieves and woody pollen buffer. The nuclei were then stained with propidium iodide (PI) at 50 μg/mL just before sorting and counting by flow cytometry to remove pollen grain debris using a BD FACSAria Fusion™ with high-speed sort settings (70 μm nozzle and 70 PSI sheath pressure) and 0.9% NaCl as sheath fluid. The nuclei were identified by PI fluorescence, light scattering, and autofluorescence characteristics (Additional file 1: Fig. S4). A total of 12,600 nuclei were counted and collected in a solution of 4.2 μL phosphate-buffered saline with 0.1% bovine serum albumin.
According to manufacturer’s instructions, the nuclei were loaded into a 10x™ Chromium controller in two batches with 6300 nuclei each, i.e., two 10x sc-CNV libraries were prepared. In each library, DNA fragments from the same nucleus were ligated with a unique 16-bp barcode sequence (of A/C/G/T). Both libraries were sequenced using Illumina HiSeq3000 in the 2 × 151 bp paired-end mode, totaling 95 and 124 million read pairs, respectively (61.7 Gb).
Hi-C library preparation and sequencing
Approximately 0.5 g of flash-frozen leaf samples of “Rojo Pasión,” which were collected from the field, were thawed and fixed with 1% formaldehyde for 30 min at room temperature under vacuum. Subsequently, the in situ Hi-C library preparation was performed according to a protocol established for rice seedlings [43]. The libraries were sequenced on an Illumina HiSeq3000 instrument; in total, around 191.2 million pair-end reads were obtained.
RNA extraction and sequencing
Fruits tissue was collected in the same way for the PacBio sequencing library. Tissue from reproductive buds, vegetative buds, flowers, leaves, and bark tissues were collected from the same shoots used for pollen nuclei isolation. RNA was extracted from these tissues using the NucleoSpin® RNA Plant of Macherey-Nagel™ to prepare Illumina libraries.
All libraries were sequenced with Illumina HiSeq 3000 at Max Planck Genome-centre Cologne (MP-GC) in the 150 bp single-end mode, which respectively led to 32.8 (reproductive buds), 28.9 (vegetative buds), 30.2 (flowers), 23.8 (leaves), 18.6 (fruit), and 26.1 (bark) million reads, totaling 24.1 Gb.
Genome size estimation
After trimming off 10x Genomics barcodes and hexamers from the 61.7-Gb reads of the two 10x sc-CNV libraries, k-mer counting (k = 21) was performed with Jellyfish [44]. The k-mer histogram was provided to findGSE [29] to estimate the size of the “Rojo Pasión” genome under the heterozygous mode (with “exp_hom=200”; Additional file 1: Fig. S3).
Preliminary diploid-genome assembly and curation
With the 19.9-Gb raw PacBio reads of “Rojo Pasión” (Additional file 1: Fig. S2), a preliminary diploid assembly was constructed using Canu [28] (with options “genomeSize=242500000 corMhapSensitivity=high corMinCoverage=0 corOutCoverage=100 correctedErrorRate=0.105”).
All raw Illumina reads from the 10x libraries were firstly aligned to the initial assembly using bowtie2 [45]. Then, the purge haplotigs pipeline was used to remove haplotigs (i.e., haplotype-specific contigs inflating the true haploid genome) based on statistical analysis of sequencing depth and identify primary contigs to build up a curated haploid assembly [46]. To reduce the false-positive rate in defining haplotigs, each haplotig was blasted to the curated assembly; if over 50% of the haplotig could not be covered by any primary contigs, it was re-collected as a primary contig.
SNP marker selection
After trimming off 10x barcodes and hexamers, all pooled Illumina reads from the 10x sc-CNV libraries (61.7 Gb) were re-aligned to the curated haploid assembly using bowtie2 [45]. With 87.2% reads aligned, 989,132 raw SNPs were called with samtools and bcftools [47]. Three criteria were used to select potential allelic SNPs (578,209), including (i) the alternative allele frequency must be between 0.38 and 0.62, (ii) the alternative allele must be carried by 60–140 reads, and (iii) the total sequencing depth at a SNP must be between 120 and 280x (as compared with genome-wide mode depth of 208x; Fig. 2b).
Deletion marker selection and genotyping
The assemblies included 10,452 regions of over 2 kb without SNP marker (total: 110.9 Mb). If the average sequencing depth of such regions was less than or equal to 146x (i.e., the value at the valley between middle and right-most peaks in the sequencing depth distribution; Fig. 2b), it was selected as a deletion-like marker. This revealed 3253 deletion markers (36.4 Mb), including 237 on contigs without a single SNP marker. The remaining 7199 regions (74.5 Mb) were defined as conserved (homozygous regions) between two haplotypes (Fig. 2b). For each deletion marker in each gamete genome, we assessed the normalized read count (RPKM value) within the deletion using bedtools [48]. The genotype at such a deletion marker was initialized as a or n, where a refers to the presence of reads (and therefore relates to the haplotype without the deletion) and n refers to the absence of reads (either the deletion haplotype or not having enough information).
Haplotype phasing and CO identification
Barcodes in the raw reads were corrected using cellranger, with which 182.1 million read pairs (51.0 Gb) were clustered into 691 read sets. Reads of each read set were aligned to the curated assembly using bowtie2 [45], bases were called using bcftools [49], and a simple bi-marker majority voting strategy was applied to phase the SNPs along each contig (Fig. 2c). After phasing, we identified COs as consistent switches between the haplotypes.
Ploidy evaluation of single-cell sequencing
For each nucleus, with short-read alignment and base calling to the curated assembly, we counted the number of inter-genotype transitions (genotype a to b and b to a) at phased SNP markers over all contigs. Correlating this to the number of covered markers revealed two clusters of nuclei (Additional file 1: Fig. S5c). One cluster with 217 nuclei showed that inter-genotype transitions increased linearly with the number of covered markers (while there were high ratios of more than 5 transitions in every 100 markers), which indicated the sequencing data were mixed from more than one nucleus. The other cluster of 445 nuclei (31.2 Gb with 111.4 million read pairs) showed a limited increase (probably due to sequencing errors or markers from repetitive regions), which supported the expected haploid status.
Imputation of virtual markers at ends of contigs
Let a and b denote the parental genotypes. The genotype of a nucleus at both ends of a contig (referred to as virtual markers) can be represented by aa, bb, or ab (or ba) where aa/bb indicates an identical genotype along the contig while ab (or ba) indicates a CO event in the regions of contig. Then, we can build up genotype sequences at the two ends of all contigs (with SNP markers) by imputing at all nuclei. For example, given a contig, sequences of aaaaaababbbbbbb (marker 1) and aaaaaaaabbbbbbb (marker 2) means there is a CO (in bold) at the 7th (of 15) nuclei (Fig. 2c).
Linkage grouping and genetic mapping
All virtual markers (defined using SNP markers along contigs) were classified into 8 linkage groups (653 contigs: 212.9 Mb) after pairwise comparison of their genotype sequences using JoinMap 4.0 [32] (with haploid population type: HAP; and logarithm of the odds (LOD) values larger than 3.0).
After filtering out contigs with > 10% missing nuclei information or nuclei with > 10% missing contigs, a high-quality genetic map consisting of 216 contigs (147.3 Mb, corresponding to 622.0 cM; Fig. 3a) was first obtained using regression mapping in JoinMap 4.0 with the following settings: LOD larger than 3.0, a “goodness-of-fit jump” threshold of 5.0 for removal of loci and a “two rounds” mapping strategy [32]. Genotype sequences imputed at contig ends or deletions (i.e., respective virtual markers) were used to integrate the remaining 723 contigs into the genetic map. For example, given a deletion marker (e.g., p and q in Fig. 2c–e), if the respective contig had already existed in the genetic map, phasing was only performed at the deletion (according to surrounding phased SNPs); otherwise, phasing plus positioning to the genetic map would be applied. Both operations were based on finding the minimum divergence of the genotype sequence of the marker to that of the other contigs (in the corresponding genetic map). The final genetic map was completed as 891 contigs of 228.0 Mb.
Haplotype-specific PacBio read separation
PacBio reads (19.9 Gb) were classified based on three major cases after being aligned to the curated assembly using minimap2 [50]. First, a read covering phased SNP markers was directly clustered into the haplotype supported by the respective alleles in the read. Second, a read covering no SNP markers but overlapping a deletion marker was clustered into the respective genotype based on its phasing with neighboring imputed markers in genetic map. Third, a read in a conserved region was assigned to one of the haplotypes randomly.
Haplotype assembly and chromosome-level scaffolding
Independent assemblies were performed with 16 sets of reads, i.e., for every two haplotypes in each of the eight linkage groups using Flye [33] with the default settings.
Using the 891 contigs of the curated assembled that were assigned to chromosomal positions with the genetic mapping, we created a pseudo reference genome, with which the newly assembled contigs were scaffolded using RAGOO [51], leading to chromosome-level assemblies (i.e., those labeled with “scaf” in Fig. 3b).
Haplotype evaluation
The genotypes of the 16 assemblies were firstly identified by comparing k-mers in each assembly with Illumina WGS of the parental cultivar (k = 21; Fig. 3c). Although evaluation can always be performed in each linkage group, we combined the eight linkage-group-wise assemblies for “Currot”-genotype and the other eight for “Orange Red”-genotype, respectively.
After polishing the assemblies respectively with the “Currot”-genotype and “Orange Red”-genotype PacBio reads using apollo [52], we built up two sets of haplotype-specific k-mers from the assemblies, rC and rO. Correspondingly, a set of “Currot”-specific k-mers (with coverage from 10 to 60x), pC, was selected from the parental Illumina WGS that did not exist in “Orange Red” short reads (coverage over 1x) but in “Rojo Pasión” pollen short reads (coverage from 10 to 300x); similarly, a set of “Orange Red”-specific k-mers, pO, was also collected. Then, we intersected rC and rO with pC and pO respectively, leading to four subsets rC ∩ pC, rC ∩ pO, rO ∩ pC, and rO ∩ pO, which were used to calculate average haplotyping accuracy. All k-mer processing (counting, intersecting and difference finding) were performed with KMC [53]. After haplotype validation, the assemblies were further polished with the respective parental short-read alignment using pilon [54] (with options “--fix bases --mindepth 0.85”) generating v1.0 of the assemblies. Manual correction of the v.1.0 assemblies was performed according to focal and parental reads to generate assembly v1.1. Finally, k-mer-based assembly validation was performed with KAT [34] and Merqury [35].
Genome annotation
We annotated protein-coding genes for each haplotype assembly (v1.0) by integrating evidences from ab initio gene predictions (using three tools Augustus [55], GlimmerHMM [56], and SNAP [57]), RNA-seq read assembled transcripts, and homologous protein sequence alignments. We aligned protein sequences from the database UniProt/Swiss-Prot, Arabidopsis thaliana and Prunus persica to each haplotype assembly using the tool Exonerate [58] with the options “--percent 60 --minintron 10 --maxintron 60000”. We mapped RNA-seq reads from reproductive buds, vegetative buds, flowers, leaves, fruits (except seeds), and bark tissues, as well as a published Apricot RNA-seq dataset [36], using HISAT [59], and we assembled the transcripts using StringTie [60]. Finally, we used the tool EvidenceModeler [61] to integrate the above evidence in order to generate consensus gene models for each haplotype assembly.
We annotated the transposon elements (TE) using the tools RepeatModeler and RepeatMasker (http://www.repeatmasker.org). We filtered the TE-related genes based on their coordinates overlapping with TEs (overlapping percent > 30%), sequence alignment with TE-related protein sequences, and A. thaliana TE-related gene sequences (both requiring blastn alignment identity and coverage both larger than 30%).
We improved the resulting gene models using in-house scripts. Firstly, we ran a primary gene family clustering using orthoFinder [62] based on the resulting gene models from each haplotype to find haplotype-specific genes. We then aligned these specific gene sequences to the other haplotype using blastn [63] to check whether it was specific because the ortholog was unannotated in the other haplotype. For these potentially unannotated genes (blastn identity > 60% and blastn coverage > 60%), we checked the gene models from ab initio prediction around the aligned regions to add the unannotated gene if both the gene model and the aligned region had an overlapping rate larger than 80%. We also directly generated new gene models based on the Scipio [64] alignment after confirming the existence of start codon, stop codon, and splicing site. Finally, the completeness of assembly and annotation were evaluated by the BUSCO [41] v4 tool based on 2326 eudicots single-copy orthologs from OrthoDB v10 [65]. A similar process was used to filter for haplotype-specific genes (Additional file 2: Tables S2–3). Finally, a genome annotation lift-over was performed from v1.0 to v1.1 using liftoff [66] with default parameters.
Genome assembly comparison
All genome assemblies, including “Rojo Pasión” haplotypes, “Chuanzhihong” apricot (Prunus armeniaca) [36], Japanese apricot (Prunus mume) [37], and “Lovell” peach (Prunus persica) [38], were aligned to each other using nucmer from the MUMmer4 [67] toolbox with parameters “-max -l 40 -g 90 -b 100 -c 200”. The alignments were further filtered for alignment length (> 100 bp) and identity (> 90%), with which structural rearrangements and local variations were identified using SyRI [22]. To follow the nomenclature of the Prunus community, the “Rojo Pasión” chromosomes were numbered according to the numbering in ‘Lovell’ peach [38].
Hi-C data analysis
We used ALLHiC [8] and SALSA2 [39] for phasing and scaffolding. All 191.2 million Hi-C read pairs were aligned (using BWA version 0.7.15-r1140) to the haplotype-resolved unitigs assembled by Canu. Only uniquely mapped read pairs were selected using filterBAM_forHiC.pl from the ALLHiC package. The selected alignments were used as input for ALLHiC_partition (“ALLHiC_partition -b clean.bam -r unitigs.fa -e GATC -k 19”) and SALSA2 (“python run_pipeline.py -a unitigs.fa -l unitigs.fa.fai -g unitigs.gfa -m yes -b alignment.bed -e GATC -o SALSA2_out -i 8”, where the file alignment.bed was generated and sorted from clean.bam using bedtools bamtobed (version v2.29.0) and unitigs.gfa was collected from the Canu output). For ALLHiC, we had to set group number as 19 to get 16 linkage groups (of chromosome-level size), and 3 smaller groups below 2.5 Mb, which were not considered further. We continued with ALLHiC pipeline as SALSA2 did not achieve chromosome-level scaffolds. The subsequent pipeline of ALLHiC were run by default except for using “-RE GATC” in the “allhic extract” command. For comparison, we also aligned all raw Hi-C reads to haploid assemblies generated by gamete binning, and selected the uniquely mapped read pairs as described above. Hi-C maps were visualized using ALLHiC_plot at 300–500 kb resolution. Alignments of ALLHiC and gamete binning-based assemblies were obtained using minimap2 and dot plot was drawn with script pafCoordsDotPlotly.R at https://github.com/tpoorten/dotPlotly.
Crossover identification
All 220 million pollen nuclei-derived short-read pairs were pooled and aligned to the “Currot”-genotype assembly, from which 739,342 SNP markers were defined with an alternative allele frequency distribution of 0.38 to 0.62 and alternative allele coverage of 50 to 150x. Then, short reads of 445 nuclei were independently aligned to the “Currot”-genotype assembly using bowtie2 [45] and bases were called using bcftools [49]. Finally, TIGER [68] was used to identify COs. The landscape of COs from 369 nuclei with a sequencing depth over 0.1x was calculated within 500 kb sliding windows along each chromosome at a step of 50 kb (Fig. 5), where for each window, the recombination frequency (cM/Mb) was defined as C/n/(w/106)* 100%, where C is the number of recombinant nuclei in that window, n is the total number of nuclei (369) and w is the window size. SNP/Mb and gene/Mb were calculated for the same windows as x/(w/106), where x was the count of the feature in the respective window.