Plant materials and growth conditions
The 27 soybean accessions used in this study include 3 wild soybeans, 9 landraces, and 15 improved cultivars, which were the same accessions used in our previously reported pan-genome analyses of wild and cultivated soybeans [39]. Because the seeds of wild soybeans show the property of strong dormancy, their seed coat needed to be scarified to break dormancy before planting. The soybean seeds were sown and germinated in standard soil in the greenhouse, and the plants were grown under long-day conditions (16-h light/8-h dark). Daytime and nighttime temperatures were 26–28°C and 18–20°C, respectively, with a relative humidity of 40–60%.
Tissue collection
Fully expanded young leaves from 2-week-old seedlings were collected for Hi-C and RNA-seq experiments. For the Hi-C experiment, the collected tissues were washed three times with distilled water and were immediately used for subsequent library preparation. For the RNA-seq experiment, after collection, the leaves were immediately shock-frozen in liquid nitrogen and stored at −80°C until subsequent processing.
In situ Hi-C
In situ Hi-C was performed following previous description [8] with minor modifications. Fresh leaves were cut into 2- to 3-mm strips and cross-linked with 1% formaldehyde at room temperature. The cross-linked tissues were treated with vacuum three times under a vacuum pressure of −60 to −80 kPa, maintained for 10 min each time, after which atmospheric pressure was restored. Subsequently, a final concentration of 0.125 M glycine was added for 5 min at room temperature to terminate fixation. The fixed tissues were homogenized with liquid nitrogen and resuspended in nuclei isolation buffer. The isolated nuclei were lysed with 0.1% SDS at 65°C for 10 min and digested with DpnII at 37°C overnight. The restriction fragments were blunt-ended and biotinylated with biotin-14-dCTP, diluted, and ligated with T4 ligase at 16°C for 4 h. After the reversal of crosslinks at 65°C overnight, the ligated DNA was purified and sheared to lengths of 300–500 bp with Covaris M220, pulled down with streptavidin beads, and prepared for Illumina sequencing according to the standard Illumina library construction protocol. Libraries were quantified and sequenced using Illumina NovaSeq 6000 or HiSeq X Ten sequencing platform.
RNA-seq
Total RNA was extracted using the RNAprep Pure Plant Kit (Tiangen) according to the manufacturer’s instructions and quantified with NanoDrop spectrophotometer. RNA-seq libraries were constructed using the Illumina TruSeq Stranded mRNA Library Preparation Kit using 1–2 μg of total RNA and were then subjected to sequencing on the Illumina NovaSeq 6000 or HiSeq X Ten sequencing platform.
Hi-C data processing
The HiC-Pro suite (v.2.9.0) was used for valid mapping of the Hi-C reads [69]. Clean Hi-C data from individual accessions were mapped to their own assembled genomes (termed Self mapping in this study) and a common reference genome, ZH13 genome version 2 (ZH13 v2) (termed Reference mapping in this study), using Bowtie2 (v2.3.3) [70]. During mapping, the reads were first aligned using an end-to-end aligner, and the reads spanning ligation junctions were trimmed at their 3′ end and realigned to the genome. The aligned reads of both fragment mates were then paired in a single paired-end BAM file. Dangling-end reads, same-fragment reads, self-circled reads, self-ligation reads, and other invalid Hi-C reads were discarded from subsequent analyses.
After removing duplications, valid pairs were summed across two biological replicates and were used to generate raw Hi-C matrix at 5 kb, 10 kb, 25 kb, 40 kb, 50 kb, 100 kb, and 1 Mb resolutions. These matrices were normalized by the Knight-Ruiz (KR) method or the iterative correction and eigenvector decomposition (ICE) method. To investigate the contact domains, the valid pairs were converted into .hic format files with the juicer (v1.9.8) tool [71] pre command; to calculate compartment signals, the valid pairs were converted into .cool format files with hicConvertFormat command in HiCExplorer suite (v3.3.1) [72]; and to visualize the Hi-C contact map, the valid pairs were converted into .h5 format.
Reproducibility score
The reproducibility score was calculated using GenomeDISCO software [73]. Under this method, random walks on the contact map graph are applied for smoothing before comparing the contact maps, resulting in a concordance score that can be used for quality control of biological replicates. We calculated the GenomeDISCO reproducibility score based on 40-kb resolution contact maps for the biological replicates of Hi-C libraries.
Matrix resolution analysis
Matrix resolution analysis was performed using two methods. The first method was performed as described previously [8], in which the matrix resolution was defined as the smallest locus size at which 80% of loci showed at least 1000 contacts. The matrix resolution reflects the finest scale at which one can reliably discern local features. Using this method, we determined that the matrix resolution in this study was 10 kb.
The second method was to identify a “saturated” matrix resolution based on the valid pairs of Hi-C data of different soybean accessions. To this end, we first selected nine representative accessions with the highest, intermediate, and lowest numbers of valid pairs from the 27 soybean accessions, which included three wild soybeans, three landraces, and three improved cultivars. Next, we chose the number of valid pairs and the matrix resolution for a series of gradients. For the valid pair gradient, we randomly picked the same number of valid pairs from each representative accession. The initial number of valid pairs was 100,000,000, and the gradient was 40,000,000 until the number of valid pairs reached the total number of valid pairs of each accession. For the matrix resolution gradient, we chose a range of matrix resolutions: 5 kb, 10 kb, and 25 kb. We performed the evaluation of the valid pair gradient and the matrix resolution gradient with two methods: identification of TAD boundaries by insulation scores [16] method and identification of contact domains by Arrowhead [8] method. We determined that the insulation score method was relatively stable regardless of the valid pair gradient and the matrix resolution gradient, while the Arrowhead method reached saturation at a 25-kb matrix resolution for all representative accessions. Finally, we chose the robust resolution of 25 kb as the matrix resolution of the 27 soybean accessions.
A/B compartment identification
E1 values from the eigenvector decomposition on Hi-C contact maps were used to indicate the A/B compartment status. We used cooltools software (v0.3.2) [74] with the parameter “cooltools call-compartments” to obtain the E1, E2, and E3 values based on a 100-kb resolution Hi-C contact matrix. Because E2 or E3 values sometimes reflect A/B compartments, we manually checked the E1, E2, and E3 tracks with gene density and the plaid pattern in the Hi-C contact maps along each chromosome and obtained the final “E1” list. The direction of the eigenvalues is arbitrary; therefore, negative values were set to “A,” and positive values were set to “B” based on their association with gene density. Compartment border was defined as the edge bin of the A/B compartments.
A compartment percentage
A compartment percentage was defined as the percentage of the length of A compartment on one chromosome to the full length of the chromosome. In Fig. 1b, the average A compartment percentage was defined as the average of the A compartment percentages of 20 chromosomes in each soybean accession.
I region identification
I regions were defined as the regions of the A compartment and B compartment intersection. To identify these regions in the genome, we used a 500-kb sliding window (5 bins) and a 100-kb step size (1 bin) to extract these compartment intersection regions. If a single window contained A compartment and B compartment at the same time, the bin was defined as an I region. For example, when we traced whether one bin belonged to the I region, we traversed the compartment status of this bin and the following four compartment bins. If there were both A and B compartments in the window (5 bins), we identified the bin as belonging to I regions.
Pan-3D genome of A/B compartments
The pan-3D genome of the A/B compartments was based on the Reference mapping data, and A/B compartment identification was performed based on the ZH13 reference genome. We defined conservative compartments as compartment bins showing the same compartment status in all 27 accessions, which could be divided into three types: A compartments present in all 27 accessions, B compartments present in all 27 accessions, and NAs present in all 27 accessions. Similarly, we defined variable compartments as compartment bins that showed a different compartment status in at least one accession, which could be divided into three types: compartments with only A compartments (compartment bins containing only A compartments and NAs in all 27 accessions), compartments with only B compartments (compartment bins containing only B compartments and NAs in all 27 accessions), and compartments with both A and B compartments (AB variable compartments, compartment bins containing A and B compartments at the same time in all 27 accessions).
Coefficient of compartment variation
The CCV was defined as the natural logarithm of the number of B compartments divided by the number of A compartments in the variable bin. The formula for calculating CCV was as follows:
$$\textrm{CCV}=\log \left(\#\textrm{B}/\#\textrm{A}\right)$$
For CCV calculation, we selected compartments with both A and B compartments (AB variable compartments).
As expected, the CCV could impact the size and orientation of compartment variation. The absolute value of the CCV reflected the variation size, such that the smaller the absolute value was, the greater the variation within the bin and, thus, the smaller the difference between the numbers of A compartments and B compartments within the bin; conversely, the greater the absolute value was, the lower the variation within the bin and, thus, the greater the difference between the numbers of A compartments and B compartments within the bin. A positive or negative CCV reflected the variation orientation: if the CCV value was positive, the number of B compartments was greater than the number of A compartments in the bin, and the bin thus tended toward a B compartment state; if the CCV value was negative, the number of A compartments was greater than the number of B compartments in the bin, and the bin thus tended toward an A compartment state.
Genomic feature analysis
For GC content, we used bedtools software (v2.17.0) [75] with the option “nuc” to count the numbers of A, T, G, C, and N bases in the specific window of the soybean genome, and calculated the GC content of the window manually. For TSS density, we extracted all TSSs from gff3 files within the specific window of the soybean genome, after deduplication, we calculated the TSS density of the window and normalized it manually. For repeat percentage, we used the repeat datasets that we previously generated using repeatmasker [39] and calculated the repeat percentages in specific windows using our custom Perl script.
TAD boundary identification
TAD boundaries were identified using the insulation score method [16] based on a 25-kb resolution KR normalized matrix with the parameters “--is 500000 --ids 250000.” This method first calculated the average number of contacts that occurred across each bin. This could be visualized by sliding a 500 kb × 500 kb (20 bins × 20 bins) square along the matrix diagonal and aggregating all signals within the square. The mean signal within the square was then assigned to the 25-kb diagonal bin, and this procedure was subsequently repeated for all 25-kb diagonal bins except bins within 500 kb of the matrix start/end. The insulation score was then normalized relative to all of the insulation scores across each chromosome by calculating the log2 ratio of each bin’s insulation score and the mean of all insulation scores. Valleys or minima along the normalized insulation score vector represented loci of reduced Hi-C interactions that occurred across the bin. These valleys or minima were interpreted as insulation boundaries or areas of high local insulation. The valleys or minima were identified as follows: first, a delta vector was calculated to approximate the slope of the normalized insulation vector. The delta vector was defined as the difference between the amount of the insulation change 125 kb to the left of the central bin and 125 kb to the right of the central bin. The delta vector crossed the horizontal 0 at all peaks and all valleys. All bins where the delta vector crossed 0 were extracted. Zero-crossings occurring at peaks were removed, and the remaining zero-crossings, all occurring at potential valleys, were passed through a boundary strength filter. The boundary strength was defined as the difference in the delta vector between the local maximum to the left and local minimum to the right of the boundary bin. All boundaries with a boundary strength < 0.1 were removed. TADs were identified as regions between two boundaries.
Contact domain identification
Contact domains were identified with the Arrowhead algorithm using Juicer tools (v1.9.8) [71] based on a 25-kb resolution KR normalized matrix with the parameter “-r 25000.” Arrowhead is a heuristic algorithm for detecting the corners of domains to locate the boundaries of TADs. The algorithm first performed an “arrowhead” transformation on the normalized matrix, which is defined as Ai,i+d = (Mi,i-d– Mi,i+d)/(Mi,i-d+ Mi,i+d), where M is the normalized contact matrix and A is the arrowhead matrix. Based on this transformation, Ai,i+d will be strongly positive if locus i − d is inside a domain and locus i + d is not. If the reverse is true, Ai,i+d will be strongly negative. If the loci are both inside or both outside a domain, Ai,i+d will be close to zero. Consequently, each contact domain was replaced with an arrowhead-shaped motif after transformation and could be identified using dynamic programming.
Directionality index calculation
The directionality index (DI) can measure the tendency of a locus to interact with upstream sites and downstream sites. The DI was calculated by domaincaller software [7] at 25-kb resolution contact maps and 2-Mb windows. It was calculated using the following equation: DI = ((B − A)/|B − A|)×((A − E)2/E + (B − E)2/E), where A was the number of reads that mapped from a given 25-kb bin to the upstream 2 Mb, B was the number of reads that mapped from the same 25-kb bin to the downstream 2 Mb, and E was the expected number of reads under the null hypothesis, which was equal to (A + B)/2.
Repeat enrichment analysis
The repeat datasets of 27 soybean accessions were annotated as described above [39]. In our analysis, repeats were divided into 17 subtypes, including 14 subtypes of TEs and 3 subtypes of TRs. The TEs included DNA transposons (hAT, CACTA, Mutator, PIF/Harbinger, Tc1/Mariner, Tourist-like MITE, Stowaway-like MITE, Helitron, and unclassified DNA transposons), LTR retrotransposons (Gypsy, Copia, and unclassified LTR retrotransposons), and non-LTR retrotransposons (LINEs and SINEs). The TRs included simple repeats, low complexity repeats, and satellite repeats. We performed the repeat enrichment analysis of each subtype by calculating the percentages, numbers, and lengths of repeat fragments. For number and length calculations, we defined the repeat fragments within the bin based on the midpoint of the fragments.
Genome coordinate conversion
To convert the genome coordinates of TAD boundaries between the ZH13 reference genome and the query genomes, we first generated custom UCSC chain files between the reference genome and query genomes using the UCSC liftOver tool (http://genome.ucsc.edu). Thereafter, we performed genome coordinate conversion using CrossMap (v0.2.9) [76] with the option “CrossMap.py region.”
Comparative 3D genome of TAD boundaries
After genome coordinate conversion, we performed the pairwise comparison of the TAD boundaries of 26 query accessions with the TAD boundaries of reference accession ZH13 in the reference genome. Considering the bias in the identification of boundaries, we allowed these boundaries to be shifted by up to 25 kb (1 bin) in each flank. This meant that the 75-kb (3 bins) boundary regions were compared between reference accession ZH13 and the other 26 query accessions. If there was overlap between the reference boundary region and the query boundary region, we considered them to be matched; otherwise, we considered them to be mismatched. The matched boundaries of the reference accession were referred to as reference-common boundaries, and the mismatched boundaries of the reference accession were referred to as reference-specific boundaries. Analogously, the matched boundaries of the query accessions were referred to as query-common boundaries, and the mismatched boundaries of the query accessions and those boundaries that could not be translated to the reference genome via genome coordinate conversion were collectively referred to as query-specific boundaries. The comparisons were performed individually for each of the 26 query accessions.
Pan-3D genome of TAD boundaries
Sequence extraction
To increase the robustness of TAD boundary identification, we extracted the boundaries as well as the flanking 25-kb (1 bin) regions on both sides of the boundaries. This 75-kb (3 bins) region was defined as the boundary region of each boundary.
Pairwise alignment
To evaluate the similarity of each accession’s boundary regions, we performed pairwise alignment using the sequence alignment software MUMmer4 [77]. The alignment was performed by using the “nucmer” command with the parameters “-c 5000 -b 1000,” and filtering was performed by using “delta-filter” with the default parameters and the “show-coords” with parameters “-bcHlorT”. Alignment between two accessions was performed in the forward and reverse directions at the same time. Consequently, the total number of alignments was 729 (272 = 729).
Similarity statistics
After alignment, we calculated the sequence similarity of boundary regions. Similarity was calculated as the percentage of the mapped sequence to the full length of the boundary regions. Consequently, we considered two boundary regions to be the same boundary region based on the similarity of the mapped sequences of the two boundary regions.
Boundary clustering
Based on the similarity evaluation of boundary regions, we clustered the matching boundary regions. Clustering was performed using the 729 similarity files by our custom clustering script. Finally, we obtained the clustering results of all boundary regions of the 27 accessions.
Pan-3D genome
The final pan-3D genome was presented in files with two different formats: the first file was in csv format, and the second was in table format.
GO enrichment analysis
The GO term annotation of the protein sequences encoded by the transcripts of the 27 soybean accessions was performed using PANNZER2 [78]. Enrichment analysis was performed using the R package clusterProfiler (v3.10.1) [79]. Only statistically significant (FDR < 0.01) GO terms were used.
SV datasets
The SV datasets of the ZH13 soybean accession and 26 other soybean accessions were generated from our previous pan-genome analyses of wild and cultivated soybeans [39]. For PAV identification, we used SVMU to identify these variations based on the results provided by NUCmer. For CNV identification, we filtered synteny blocks of less than 100 bp in the results of MUMmer4 and identified CNVs from regions with two or more overlapping synteny blocks (>90% identity). INV and TRANS were identified by manual checking depending on their locations and orientations with respect to neighboring blocks according to the nonallelic homology blocks.
SV distribution in A/B compartments
For A/B compartments, we investigated the distribution of observed and expected values of SVs in the A compartments, where expected values were generated by the bootstrapping method of generating random A compartments 10,000 times in the genome. We calculated the numbers of SVs within A compartments and then calculated z scores and P values. Additionally, we observed the distribution of the four types of SVs in I regions using the same method described above. In this study, we analyzed each PAV, INV, CNV, and TRANS event independently, although these SVs might occur concurrently in a given sample.
SV distribution around TAD boundaries
For TAD boundaries, we calculated the percentage of four types of SVs around boundaries. This analysis was performed based on the comparative 3D genome and pan-3D genome of TAD boundaries.
SV classification
We classified the SVs according to the positional relationship between SVs and TAD boundaries. In brief, we divided SVs into three types: ABA SVs were defined as SVs spanning the whole length of one boundary, PBA SVs were defined as SVs spanning part of the length of one boundary, and NBA SVs were defined as SVs within TADs.
SV effects and contributions
To identify SV effects on and contributions to TAD boundary variation, we performed association analysis of SVs and comparative 3D genome of TAD boundaries. For each of the eight types of SVs (ABA-PAV, PBA-PAV, ABA-CNV, PBA-CNV, ABA-INV, PBA-INV, ABA-TRANS, and PBA-TRANS), we classified the boundaries into four types: common boundaries with SVs, common boundaries without SVs, specific boundaries with SVs, and specific boundaries without SVs. The contribution of each SV type is defined as the proportion of specific boundaries that can be explained by each type. The effect size of each SV type is defined as the proportion of specific boundaries with the SV relative to all the boundaries with the SV.
RNA-seq data processing
Read alignment and quantification
Clean reads were mapped to the genome of each accession by using HISAT2 (v2.1.0) with the default parameters [80]. An annotation file in GFF3 format was provided to StringTie (v1.3.4) using the -G option for the transcript assembly process [81]. Fragments per kilobase of transcript per million reads mapped (FPKM) values were also calculated by StringTie software.
Differential gene expression analysis
For differential expression analysis, clean reads were mapped to the ZH13 genome by HISAT2 with the default parameters. Gene expression FPKM values were calculated by StringTie software as described previously. Differential gene expression was performed using DESeq2 (v1.4.5) with the default parameters [82]. FDR <0.05 and |fold change| > 2 were used as cutoffs for significantly differentially expressed genes.
Enrichment of DEGs
To analyze the density of DEGs, we enriched all the DEGs around TAD boundaries and calculated their average number as the DEG density. For the enrichment of the comparative 3D genome and pan-3D genome of TAD boundaries, we performed the analysis in 26 query accessions one by one and averaged the results.
NonVariation-gene analysis
NonVariation-genes were defined as genes that had no variations in reference accession and query accession. We used genome variation data, which included single-nucleotide variants (SNVs), insertions and deletions (InDels), and SVs from the de novo genome assembly, to identify nonVariation-genes.
Genome scanning for selective signals
We performed a genome cross-population FST value scan using the whole-genome sequencing (WGS) data from 2898 soybean accessions [39]. FST values were calculated with a 25-kb window using VCFtools. Evidence for selection across the genome during domestication and improvement was evaluated in two comparisons: landraces versus G. soja for domestication and improved cultivars versus landraces for improvement. The highest FST values, accounting for 5% of the genome, were considered as selected regions.
Quantitation and statistical analysis
The R language was used for statistical analysis. The Wilcoxon rank-sum test was used for the assessment of statistical significance with the wilcox.test function in R. The Kolmogorov–Smirnov test was used to calculate statistical significance for the difference between densities of boundaries in the A and B compartments. Fisher’s exact test was used for statistical significance with the fisher.test function in R. * indicates P < 0.05, ** indicates P < 0.01, *** indicates P < 0.001; n.s. indicates P > 0.05.