Chromosome-level genome assembly of a regenerable maize inbred line A188

Background The maize inbred line A188 is an attractive model for elucidation of gene function and improvement due to its high embryogenic capacity and many contrasting traits to the first maize reference genome, B73, and other elite lines. The lack of a genome assembly of A188 limits its use as a model for functional studies. Results Here, we present a chromosome-level genome assembly of A188 using long reads and optical maps. Comparison of A188 with B73 using both whole-genome alignments and read depths from sequencing reads identify approximately 1.1 Gb of syntenic sequences as well as extensive structural variation, including a 1.8-Mb duplication containing the Gametophyte factor1 locus for unilateral cross-incompatibility, and six inversions of 0.7 Mb or greater. Increased copy number of carotenoid cleavage dioxygenase 1 (ccd1) in A188 is associated with elevated expression during seed development. High ccd1 expression in seeds together with low expression of yellow endosperm 1 (y1) reduces carotenoid accumulation, accounting for the white seed phenotype of A188. Furthermore, transcriptome and epigenome analyses reveal enhanced expression of defense pathways and altered DNA methylation patterns of the embryonic callus. Conclusions The A188 genome assembly provides a high-resolution sequence for a complex genome species and a foundational resource for analyses of genome variation and gene function in maize. The genome, in comparison to B73, contains extensive intra-species structural variations and other genetic differences. Expression and network analyses identify discrete profiles for embryonic callus and other tissues. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-021-02396-x.

produce genetically modified plants [4]. A popular maize transformation line, Hi-II, was isolated from offspring of a cross between A188 and B73, an elite maize reference inbred line [5,6]. Although highly valuable for plant regeneration and transformation, A188 is not agronomically desirable, having small ears and low grain yield. The line also exhibits a high degree response to environmental conditions, including sensitivity to abiotic and biotic stresses, such as drought, heat, and bacterial and fungal diseases, in comparison to elite maize lines [7]. A188, therefore, in addition to traits related to transformability, can serve as a model inbred line for the genetic dissection of many important agronomic traits, heterosis, and plant-environment interactions.
Efforts have been pursued to develop efficiency and quality strategies for maize genome sequencing and assemblies. The first maize reference genome for B73 was sequenced and assembled using bacterial artificial chromosomes (BACs) [8]. Since then, additional assemblies have been produced using so-called next-generation highthroughput sequencing, including both short-and long-read technologies [9][10][11][12][13][14][15]. Recently, long-read technologies were combined with optical DNA mapping to produce high-continuity maize assemblies, including Nested Association Mapping (NAM) founder lines [16][17][18]. Here, we used Nanopore long reads and optical DNA mapping to construct a chromosome-level maize genome of A188 for the discovery of structural variation as well as performed transcriptome and DNA methylome analyses of embryogenic callus.

Phenotype of A188
Inbred line A188 (PI 693339) has smaller ears and lower grain yield in comparison to the community reference line B73 and is amenable to plant transformation due to abundant production of the callus favorable to regeneration (Fig. 1, Additional file 1: Figure S1, Additional file 2: Table S1). Hybrids of A188 and B73 (PI 550473) exhibit extensive heterosis (Additional file 1: Figure S2).

Chromosome-level A188 assembly
Long reads, representing a 90X coverage, were generated from A188 genomic DNA using the Oxford Nanopore sequencing platform. The N50 of read lengths is 23.9 kb, and the longest read is 270.6 kb (Additional file 1: Figure S3). Genome assembly, performed using Canu, resulted in 1830 contigs, comprising approximately 2.2 Gb of total sequences. The N50 of contigs is 5.99 Mb (Additional file 2: Table S2).
Read depths for contigs were assessed using Illumina short reads generated independently from seedling and immature ear DNAs to identify potential contamination from organelle genomes or extraneous microbial DNA. Contigs from organelle genomes or extraneous microbial DNA were expected to have differential read depths between the two tissues. Based on this strategy, contigs identified as the chloroplast or mitochondrion sequences were replaced respectively with the previously complete assemblies of A188 organelle genomes [19,20] and contigs from extraneous contamination were discarded (Additional file 1: Figure S4). The remaining contigs were polished using raw Nanopore data and 80X PCR-free Illumina 2x250 paired-end whole-genome sequencing reads (Additional file 2: Table S3), followed by the scaffolding with 113 A188 Bionano Genome (BNG) optical maps (Additional file 2: Table S4), for which the total length is 2.17 Gb and the N50 is 103.4 Mb. The BNG aided assembly placed 875 contigs into 39 scaffolds, which consist of 2.15 Gb. Chromosome pseudomolecules were then generated using a genetic map constructed from 100 B73xA188 double haploid (DH) lines (Additional file 2: Table S5). The final assembly (A188Ref1) consists of 2.25 Gb, including 10 chromosomal pseudomolecules, a mitochondrial genome, a chloroplast genome, and 986 scaffolds or contigs ( Table 1).
The base accuracy of the A188Ref1 assembly was estimated at approximately 99.82% using the KAD pipeline [21]. Approximately 96.4% of the potential errors are in transposons or other repetitive sequences. The estimated accuracy of genic sequences was >99.97%. The completeness of the A188 assembly was assessed using the BUSCO software [22] and found to contain 97.25% (3189/3278) of the Liliopsida core gene set, similar to the 97.36% (3193/3278) in the B73 reference genome (B73Ref4) [9].

Presence of complex repeats and nuclear organelle sequences in A188Ref1
In total, 86.3% of the A188 genome sequence is annotated as repetitive elements. The long terminal repeat (LTR) retrotransposons Gypsy and Copia were the most prevalent elements, consisting of 44% and 23.9% of A188Ref1, respectively (Fig. 2, circos plot [23]). LTR centromere retrotransposon of maize (CRM) were largely co-localized with centromere-specific satellite repeat CentC, both of which were largely syntenic to the B73 centromeres [9]. Approximately 8.3% of A188Ref1 is annotated as DNA transposable elements (TEs), including helitron and Miniature Inverted-repeat Transposable   Elements (MITEs) (Fig. 2). Major knob clusters were found on the long arm of chromosomes 5 (5L), the short arm of chromosome 6 (6S), 7L, and 8L, and major subtelomeric repeats (4-12-1) were clustered on the distal regions of 1S, 3S, 4S, 5S, and 8L (Fig. 2). Through similarity alignments, the 45S and 5S ribosomal DNA (rDNA) clusters were localized on 6S and 2L, respectively (Fig. 2). Knob and rDNA locations were in agreement with previously reported A188 fluorescent in situ hybridization (FISH) data [24]. Most repetitive components were located in regions of low-recombination contexts except the 5S rDNA locus and subtelomeric clusters (Additional file 1: Figure  S5). Nuclear mitochondrial DNA (NUMT) and nuclear plastid (chloroplast) DNA (NUPT) were identified at 10 and 21 genomic loci, respectively (Fig. 3a, Additional file 1: Figure  S6). The largest nuclear organelle-like sequence (~136 kb) is a NUMT locus on the short arm of chromosome 8, which contains an array of DNA transposons likely inserted subsequent to the NUMT integration. FISH analysis corroborated the chromosome 8 location and confirmed a homolog on the distal end of 10S (Fig. 3b, c). In summary, the genomic locations of repetitive sequences and nuclear organelle sequences NUMT on A188 nuclear genomes. a NUMT sequence on 10 chromosomes of A188Ref1. Each dot on chromosomes designates a potential NUMT integration. Close-up alignments with the mitochondrion (mt) genome are shown along NUMTs. Each alignment requires at least 5 kb match and 95% identity. b Circos plot of alignments between the mt genome and 10 chromosomes. The same colors of green, orange, dark blue label duplicated regions in mt. "P" regions match the probe sequence used for FISH. Brown links highlight alignments on chromosomes 8 and 10. Note that the chromosomal scale is different from the mt scale. Numbers on the track are in Mb. c Physical mapping of a mt DNA (mtDNA) and knob repeats on the mitotic metaphase chromosomes of maize A188. The knob repeat probe (green signals) was used to identify the chromosomes. Two FISH sites of the mtDNA insertion on the chromosomes were detected: arrowheads, chromosome 8; arrows, chromosome 10. Bar = 10 μm are largely consistent with previous findings by FISH [25,26], supporting the largescale correctness of A188Ref1.

Gene annotation
Annotation of A188Ref1 was performed using the Maker pipeline with evidence from transcripts assembled with A188 long Nanopore direct cDNA sequencing data, A188 RNA-Seq Illumina short reads, and transcripts from other maize lines, as well as protein sequences from closely related plant species. The Maker genome annotation resulted in 40,747 high-confidence gene models with 62,142 transcripts (A188Ref1a1) ( Table 1). BUSCO evaluation showed that 97.8% of Liliopsida conserved genes were in A188Ref1a1. Comparison of protein sequences identified 52,971 orthologous pairs between A188 and B73, consisting of 27,273 A188 genes and 27,529 B73 genes. We also identified 178 gene clusters in A188 each of which contains at least three paralogous genes, comprising in total 694 genes. The clusters of genes encoding pectin methylesterase (PME) were identified on an unanchored scaffold c04_002 (two clusters with 25 and 18 genes), on chromosome 4 (one cluster with nine genes), and on chromosome 5 (one cluster with five genes) (Fig. 4a). Gene clusters also include eight clusters of 42 nucleotide-binding leucine-rich repeat (NLR) disease resistance (R) genes (Fig. 4a). One NLR gene cluster on chromosome 10 has 16 genes homologous to the rp1 gene that confers resistance to common rust [27] and was associated with Goss's wilt resistance [28] (Fig. 4b). Most paralogous clusters were not located in regions with high recombination (Fig. 4c). Exceptions include the rp1 locus, which has a high level of haplotype instability through frequent recombination among rp1 paralogs [29][30][31]. Divergent rp1 haplotypes were observed between A188 and B73 that contains 11 rp1 homologs at the syntenic locus (Fig. 4b).
We identified 2259 paralogous gene pairs of which one member was located in a high-recombination chromosomal compartment and the other in a low-recombination compartment ("Methods" section). Comparison of DNA methylation of A188 seedlings found that, on average, both CG and CHG methylation, where H represents A, C, or T, were higher near and within low-recombination paralogous genes as compared to highrecombination genes. No obvious differences were observed in CHH methylation ( Fig.  4d-f). Comparison of gene expression between members of the paralogous pairs using seedling RNA-Seq data showed most paralogs had similar expression levels and no expression bias to either high-or low-recombination genes was observed for those paralogs that did exhibit differential expression (Additional file 1: Figure S7). The result indicated that the genomic context of genes is a driver for a certain epigenomic modification but not a major driver for gene expression.

High-level structural variation between A188 and B73
Structural variation between the A188 and B73 genomes was identified through comparisons of whole-genome assemblies of both genomes using SyRI software (Additional file 3) [32] and through the analysis of whole-genome Illumina sequencing reads with Comparative Genomic Read Depth (CGRD) that is based on quantitative comparison of depths of short reads (Additional files 4 and 5) [33]. SyRI revealed~1.1 Gb of syntenic regions, 2302 translocations, and 4083 duplications in B73 and 2333 duplications in A188 using a minimum cutoff of 10 kb for each translocation or duplication event (Additional file 2: Table S6). In addition, SyRI identified 441.9 Mb of B73 and 543.8 Mb of A188 DNA sequences that were not aligned with the other respective genome. Further filtering with CGRD that compared read depths between the two genomes revealed 381.3 Mb of B73-specific sequences and 409.2 Mb of A188-specific sequences that represent presence and absence variance (PAV) or highly divergent sequences (HDS). These PAV/HDS regions contain 6728 genes in B73 and 7301 genes in A188 (Additional files 6 and 7). Gene ontology enrichment analysis indicated that genes related to endopeptidase inhibitor activity and extracellular activities are enriched in both PAV/HDS gene sets (Additional file 1: Figure S8).
Seventeen large inversions of 0.5 Mb or greater were identified between the two genomes (Fig. 5, Additional file 1: Figures S9-S17, Additional file 2: Table S7). Nine of the seventeen inversions are likely errors in B73Ref4 as the newly released B73Ref5 showed the same orientation as A188Ref1, including the largest inversion region (INV37083 on B73Ref4, 97.8-103.9 Mb on chromosome 4). FISH analysis of A188 and B73 corroborated the absence of inversion INV37083 (Additional file 1: Figure S18). Recombination and pairwise linkage disequilibrium (LD, R 2 ) values among single nucleotide polymorphisms (SNPs) within each inversion were determined, and out of eight remaining inversion candidates, six have recombination frequencies close to 0 and a high mean LD ranging from 0.56 to 0.79 of all pairs of SNPs that are separated by 0.2-0.3 Mb within Example of an NLR gene (rp1) cluster in A188 and their alignments with the B73 rp1 locus. Each rectangle box represents a gene with blue, tan, and red colors indicating plus, minus orientation, and rp1 homologous genes, respectively. All rp1 homologs are in the same minus orientation. Gray bands connect orthologs and orange bands highlight the top rp1 alignments with at least 98.5% identity and a 2500-bp match. c The scatter plot of numbers of genes per cluster versus the recombination rate estimated 1 Mb around the midpoint of the cluster. All clusters plotted are on 10 chromosomes. d-f Distribution of cytosine methylation in sequence contexts of CG, CHG, and CHH around paralogous genes. An average methylation rate per window across all examined genes from two replicates of seedling samples was determined and plotted versus the window order. A window in the gene body, from translation start site (TSS) to translation termination site (TSS), is 1/200 of the gene body in length. A window outside of the gene body is 20 bp an inversion, which are much higher than the genome-wide average LD of 0.2 between SNPs in separated by 0.2 Mb (Additional file 2: Table S7). These six inversions exhibiting marked recombination suppression characteristic of inversion [34], therefore, are strongly supported. The six inversions range from 0.7 to 2.1 Mb in size, of which two are located close to the centromere of chromosome 2 and four are on 3L, 4L, 5L, and 9L (Additional file 2: Table S7). Each of these six inversions can be identified between A188Ref1 and many other maize genomes, including the genomes of NAM founder lines (Additional file 1: Figure S19-S23, Additional file 2: Table S8). In total, the six inversion sequences harbor 69 genes in B73 and 75 genes in A188. The syntenic relationships of these genes were largely maintained between inverted sequences in the two genomes (example in Fig. 5d), although the gene sequences are divergent in a high The CGRD result using A188Ref1 as the reference genome. The Y-axis represents log2 values of ratios of read depths of B73 to A188, log2(B:A), signifying copy number variation (CNV). Regions with higher and lower sequence depths of B73 versus A188 were B73 plus (red) and B73 minus (blue), respectively. Green and orange represent conserved and ungrouped regions, respectively. b The SyRI result is displayed. Alignments of syntenic blocks larger than 10 kb and alignments of other rearrangements larger than 0.5 Mb are plotted. On each A188 and B73 chromosome, segments that were not aligned to the other genome or highly divergent with the other genome are highlighted. The red * labels a well-evidenced inversion. c The CGRD result using B73Ref4 as the reference genome. The similar color scheme to that in a is used. d Synteny of genes (rectangle blocks) in the well-evidenced inversion (ABinv4a) regions between A188 and B73. Blue and tan colors stand for plus and minus gene orientations. e A dot plot between the 1.8-Mb B73 region that was duplicated in A188 and its aligned regions in A188Ref1. f FISH of the PME probe on A188, B73, and F 1 (B73xA188). Cent4 probe (green) that is specifically from chromosome 4 centromere was used in F 1 FISH. Arrows and arrowheads point at PME signals of A188 and B73 chromosomes, respectively. Bar = 10 μm degree from each other (Fig. 5b, Additional file 1: Figures S10, S11, S12, S16). The divergence of these inversions indicated that the inversions were not recent events established in modern maize populations. Admixture structure analysis showed that both A188 and B73 haplotypes of 3/6 inversions (ABinv2a, ABinv2b, and ABinv3a) exist in teosinte, the maize wild ancestor (Additional file 1: Figure S24), and there is no clear evidence of the haplotype of the remaining three inversions (ABinv4a, ABinv5a, and ABinv9a) existed in teosinte (Additional file 1: Figure S25). Among landraces, both the A188 and B73 haplotypes of each of the six inversions could be identified based on SNP genotyping data, supporting that all these six inversions exist in landrace maize lines (Additional file 1: Figures S24 and S25).
CGRD analysis also identified an A188 duplication of a 1.8-Mb region from 8.68 to 10.45 Mb on chromosome 4 of B73Ref4 (Fig. 5c). In A188, a portion of the duplication was found in the unanchored scaffold c04_002 while most of the remaining duplicated sequences can be found in chromosome 4 (Fig. 5e). The duplication region overlapped with the Gametophyte factor1 (Ga1) locus conferring unilateral cross-incompatibility [35]. The underlying causal gene of B73, Zm00001d048936, encodes a PME, which is a wildtype allele. We designed a PME DNA probe that is from the duplication and repeatedly matches 35 loci in B73Ref4 and 78 loci in both the region on chromosome 4 and the scaffold c04_002 on A188Ref1. FISH using this probe resulted in strong hybridization signals on A188 chromosome 4S and weak signals on B73 chromosome 4S, indicating that the duplication occurred locally on 4S (Fig. 5f). The B73 Zm00001d048936 gene has no additional homologous copies in B73Ref4 but five homologous sequences can be identified on the duplicated sequence of A188Ref1, including the syntenic gene Zm00056a022745 that is identical to Zm00001d048936. Collectively, the result documented the complexity and the potential dynamic of the Ga1 locus of maize.

Associating structural variation with phenotypic variation
The CGRD result indicated that A188 had many more copies (A188plus) at a region from 155.23 to 155.24 Mb of chromosome 9 in B73Ref4 (Additional file 1: Figure S16). This region includes the carotenoid cleavage dioxygenase 1 (ccd1) gene catalyzing the cleavage of carotenoids to apocarotenoid products, which is located at the White cap locus (Wc1) conditioning kernel colors [36]. SyRI analysis supported a duplication of this region but failed to find a number of copies in A188. SyRI analysis also indicated the duplicated region is embedded in A188-specific sequences (Fig. 6a). Comparison of A188Ref1 with an A188 BNG optical map aligned to the duplication region indicated the incomplete assembly of the region. Previously, tandem repeats of an~27 kb sequence at the Wc1 locus were discovered [37]. Each repeat exhibits four discernible sites that can be detected via Bionano analysis, referred to as Type A repeat. Analysis of A188 sequences revealed a repeat variant containing an additional site, referred to as Type B repeat. Based on the BNG map, the A188 genome contains 13 intact tandem copies of the 27-kb sequence, consisting of 9 copies of Type A and 4 copies of Type B repeats, as well as partial copies of the 27-kb sequence on both ends of the array. Each repeat copy contains a ccd1 gene, indicating at least 13 copies of ccd1 in A188 (Fig. 6b), consistent with the A188plus result from the CGRD analysis. Neither intact Type A nor B repeat exists in B73, which, however, does contain a ccd1 gene.
A188 seeds are white, whereas B73 seeds are yellow (Fig. 1). Analysis of quantitative trait locus (QTL) of kernel colors of B73xA188 DH lines resulted in two major QTLs on chromosomes 6 and 9, both of which were discovered in a previous genome-wide association study [38], as well as a weaker peak on chromosome 2 (Fig. 6c, Additional file 1: Figure S26). Two known genes y1 and ccd1 in the major peaks are responsible for kernel colors (Fig. 6d) [37,39]. The dominant Y1 allele conditions yellow kernels [39]. Several variants exist between the A188 y1 (Zm00056a032392) and B73 Y1 (Zm00001d036345) alleles, including one amino acid polymorphism (Ser258Thr) in the Fig. 6 Structural variation and genetic analysis of the Wc1 locus. a Duplication alignments between A188 and a B73 region identified as A188plus by CGRD. b Tandem repeats of 13 intact copies of 27.4-kb sequences. Two DLE-1 restriction patterns in repeat units: Types A and B were identified. c The QTL result of kernel color using the DH population. Arrows point at locations of known causal genes. d A simplified carotenoid pathway. GGPP stands for geranylgeranyl diphosphate. e Seeds at 16 days after pollination (DAP16) were collected and used for quantifying gene expression (exp) of y1 and ccd1. Three biological replicates were used. Bars are color-coded based on the colors of mature seeds. Error bars represent standard deviation (SD). Letters on top of bars are statistical groups determined by Tukey tests. Y1 (wc1) and y1 (Wc1) stand for B73 and A188 alleles, respectively. Mature seeds from the same lines show slightly different colors from seeds of DAP16. Total carotenoids of mature seeds were measured and the values of mean ± SD are listed. Superscript letters are statistical groups determined by Tukey tests coding region (Additional file 1: Figure S27) and polymorphisms found in 5′ upstream and 3′ downstream regions, including a (CCA) n microsatellite variation in the 5′ untranslated region [40] (Additional file 1: Figure S28). Quantitative reverse transcription PCR (qRT-PCR) reveals higher expression of the y1 gene in B73 relative to A188 (Fig.  6e). In contrast, the B73 ccd1 expression was much lower than that of A188, presumably due to the differences in copy number (Fig. 6e). Because higher expression of functional alleles of the ccd1 and y1 genes is expected to reduce and increase the accumulation of carotenoids, respectively, the differences in the expression of the ccd1 and y1 genes in B73 and A188 explain yellow kernels of B73 and white kernels of A188 (Fig. 6d, e). Consistently, B73 mature seeds contained much high carotenoid contents as compared to A188 mature seeds (Fig. 6e). The expression levels of the alleles in two DH lines with different allele combinations of these two loci were similar, and the allele combination of y1 and ccd1 largely determined the level of carotenoids and seed colors (Fig. 6e).
In addition to kernel color, QTL analysis of cob glume color of which A188 is white and B73 is red mapped a single strong peak on chromosome 1S (LOD = 23.8) (Additional file 1: Figure S29). Pericarp color 1 (P1) encoding a MYB transcription factor located in the QTL peak was known to regulate pigment genes [41]. The CGRD result indicated that B73 had more copies of the P1 gene than A188, presenting another structural variation event associated with a phenotypic trait (Additional file 1: Figure S29).

Distinct gene expression and hypermethylation in calli relative to seedlings
Transcriptomic data were generated for 11 diverse tissues with three biological replicates each. Both principal component analysis and clustering of these tissue samples based on their genome-wide gene expression showed that the callus from tissue culture were closely related to root, leaf base, embryo, and ear, but distinct from middle leaf, leaf tip, and seedlings (Additional file 1: Figure S25, Additional file 8). A set of 734 callus featured genes were identified that exhibited at least 2-fold up-regulation in the callus as compared to any other tissues (Additional file 2: Table S9). Genes involved in cell wall biosynthesis, defense activity, heme binding, transmembrane transport, and transcription regulation are enriched in these featured genes (Additional file 1: Figure  S30). For example, a number of NLR and defense-related genes, including Pathogenesis-related protein 1 (PR1) (Zm00056a001451), were activated in the callus. The top six enriched transcription factor families are WOX, AUX/IAA, LBD, AP2, WRKY, and NAC, which included homologs of Baby boom (AP2) and Wuschel2 (Wox) genes relevant to cell division and expansion (Additional file 1: Figure S31) [42]. Of these callus featured genes, the homologs of Baby boom (Zm00056a020360), Wuschel2 (Zm00056a020673), and LBD (Zm00056a020860) are located in the vicinity of the chromosome 3 locus affecting callus development [43]. Three callus featured genes were located in the previously mapped regions (~3 Mb) [43], including Zm00056a020765 encoding protein upstream of flowering locus C (FLC), Zm00056a020767 encoding a zinc finger protein, and Zm00056a020775 encoding a caffeoyl shikimate esterase (CSE). The B73 CSE syntenic gene, Zm00001d042944, contains a 120-bp MITE insertion at 141 bp upstream of the gene start, while the A188 CSE gene does not (Additional file 1: Figure S32).
The callus and seedling tissues were selected for examination of genome-wide DNA methylation levels. The callus exhibited elevated methylation for all three sequence contexts as compared to the seedling, 89.3% vs 85.2% on CG, 74.5% vs 71.9% on CHG, and 3.2% vs 1.5% on CHH (Additional file 2: Table S10). The analysis of CG and CHG methylation over all genes did not find major differences between callus and seedling tissue (Fig. 7a, b). However, there were major differences in the level of CHH methylation (Fig. 7c). On average, there were no major changes in the level of CG or CHG methylation over repetitive elements but there was a consistent trend for slightly higher CG methylation callus for most classes of repetitive elements (p < 0.0001 from paired t-tests, Fig. 7d, e). Similarly, CHH methylation was slightly higher for most classes of repetitive elements with the most notable increase observed at MITE elements (p < 0.0001 from paired t-tests, Fig. 7f).
Differentially methylated regions (DMRs) were identified through comparison of the DNA methylation profiles of callus and seedling. In total, 6927 CG DMRs, 9631 CHG DMRs, and 11,275 CHH DMRs were identified (Additional file 2: Table S11, Additional files 9, 10, and 11). Hypermethylation in callus relative to seedling was the predominant type of DMRs for both CG and CHH methylation in both genic and intergenic regions while CHG exhibited roughly equal proportions of hypermethylation and hypomethylation DMRs with more hypermethylation in genic regions and more hypomethylation in intergenic regions (Fig. 7g). The analysis of the distribution of DMRs relative to genes revealed that the CG DMRs were enriched near TSS regions while CHH DMRs tended to be found in regions just upstream or downstream of genes, mirroring CHH island distributions (Fig. 7h) [44,45]. CHG DMRs exhibited different trends for localization for hypermethylation and hypomethylated DMRs with hypermethylation DNAs enriched at TSS and TTS regions and hypomethylated DMRs enriched in gene bodies (Fig. 7h). The high frequency of some types of DMRs near the TSS led us to assess whether these DMRs may be contributing to differential expression in callus relative to seedling tissue. Genes with DMRs were enriched for being differential expression (DE) in seedling relative to callus compared to genes without DMRs (χ 2 = 20.9, p-value = 4.9e−6). Based on prior studies in maize, we expected that gains of CG or CHG methylation near the TSS would be associated with down-regulation of expression while gains of CHH upstream of the promoter might be associated with up-regulation of expression [46,47]. We found that the DE genes with hypomethylated or hypermethylated DMRs at most regions exhibited roughly similar numbers of up-and down-regulated with exception at CG hypomethylation at 5′ upstream regions of genes and CHG hypomethylation in the gene body, both of which were associated with up-regulation of gene expression in the callus (Fig. 7i, Additional file 2: Table S12). These results reveal dynamic changes in some types of DNA methylation in callus relative to seedling and a marginal association of DNA methylation with gene expression changes.

Discussion
Here, the A188 genome assembly capitalized on long-read technologies, including Nanopore single molecule reads and long-range optical mapping, which adds a new high-quality reference genome to the collection of sequenced maize genomes [8][9][10][11][12][13][14][15][16]. The quality of the assembly was enhanced by the strategy of comparing read depths of short read data from two independent DNA sources to filter contigs before the scaffolding. The use of the independent DNA sources reduced contamination from DNA sequences of organelle genomes and microorganisms while preserving nuclearintegrated organellar sequences. In addition, a novel approach for the discovery of genome structural variation based on quantitative comparison of depths of sequencing reads, here named Comparative Genomic Read Depth or CGRD, was introduced. Detailed characterization of genomic structural variation in complex genomes such as maize is challenging. Comparisons using complete genome sequences based on their alignments would be an ideal method to reveal copy number variation and rearrangements. However, technically, alignment-based methods still suffer from a low ability of Violin plots of methylation on repetitive sequences. For each violin plot, the top half is the distribution of methylation in the callus and the bottom half is the distribution of methylation in the seedling. Each dot represents the median of methylation rates. Numbers stand for the mean methylation differences between the callus and the seedling, which are color-coded with blue and red to represent increased methylation and decreased methylation in the callus, respectively. All differences are significant (p-value < 0.0001) by paired t-test. g Barplots of DMRs on genic regions, including 2 kb beyond each of TSS and TTS, and the rest of the genome (intergenic regions). h Distribution of DMR sequences around genes. The definition of the gene body is the same as described in a-c. i Proportions of DE genes up-regulated in hyper DMR and hypo DMR regions (gene body, 1 kb 5′ upstream and 1 kb 3′ downstream regions). Numbers on top of bars are numbers of DE genes. Stars indicate significance (p < 0.05) from χ 2 tests for the independence of the DMR and DE changing directions. In g, h, and i, hyper and hypo stand for increased and decreased methylation in the callus relative to the seedling, respectively confident alignments of repetitive sequences. More critically, finding structural variation with assembled genome sequences is subject to the quality of assemblies. Unfortunately, assemblies of most plant genomes or other large complex genomes are generally not complete or error-free. B73Ref4, for example, is missing the topmost region of the short arm sequence of chromosome 6 (Additional file 1: Figure S13) and includes multiple assembly inversion errors. CGRD based on the comparison of depths of short reads complements the approaches that rely on whole-genome alignments, including SyRI [32]. In particular, the CGRD pipeline can detect copy number variation missed by SyRI due to incomplete assembly at structurally complex regions. CGRD identified a 1.8-Mb duplication at the Ga1 locus and a high-copy tandem duplication of Wc1 in A188, both of which were missed by SyRI. The two methods are complementary in that CGRD captures unbalanced structural variation due to copy number variation rather than balanced structural variation that SyRI can detect. Therefore, the combination of SyRI and CGRD provides an optimal strategy for the discovery of genomic structural variation, which is critical for further characterization of their impacts on gene expression and phenotypes.
Analysis of structural variation elucidated a repetitive structure of the ccd1 gene, which, in A188, consists of 13 copies. The high copy number of ccd1 corresponds to the high expression level of ccd1, which was previously observed and presumably leads to a high activity of the carotenoid cleavage enzyme and enhanced carotenoid degradation [37]. Furthermore, the expression of y1, which encodes for phytoene synthase and the entry reaction to the carotenoid pathway, in A188 is low during seed development, while the y1 expression in B73 seeds is relatively high [48]. Both alleles were highly expressed in some non-seed tissues, including leaves. The A188 y1 allele is likely functional since a low but perceptible level of carotenoids is produced at seed development. An additional minor kernel color QTL was identified from the DH lines and concordant with the QTLs from multiple other B73-derived biparental populations [49]. The underlying candidate gene, zep1 (Zm00001d003513) encoding zeaxanthin epoxidase, was also identified in an earlier association study [50]. However, functional validation for the involvement of zep1 in seed color is needed. At the same time, the three QTL loci are not sufficient to fully determine the kernel color variation of DH lines. Analysis with a larger B73xA188-derived population may reveal additional loci influencing kernel colors as not all DH lines shared the expected color based on the associated QTLs. In any event, carotenoid levels were expected based on the allele types of y1 and ccd1 and supported the hypothesis that higher levels of ccd1 and low y1 levels contribute to the differences in seed color of A188 and B73.
An important goal of A188 characterization is to gain insight into plant tissue culture. Development from a highly differentiated tissue to the callus for genetic engineering purposes involves a process of dedifferentiation to gain pluripotency [51]. The transition of differentiation status is, physiologically, stressful [52]. Somaclonal variation, including sterility, in plants produced through tissue culture may be the product of DNA damaging stress responses [53]. Transcriptomic data from this study revealed that, in fact, defense response genes were enriched among the callus featured genes that were up-regulated in the callus as compared to any other tissues. Hypermethylation is considered to be a protection mechanism against stresses, which enhances genome stability and safeguards genome integrity [54]. Our comparison between the callus and the seedling uncovered globally elevated methylation in the callus in all three sequence contexts. Consistently, hypermethylation in the callus relative to the immature embryo was found in a study that used another maize inbred line and methylated DNA immunoprecipitation sequencing (MeDIP-seq) [55]. In this study, 24 nt small RNA was shown to be positively correlated with DNA methylation. In rice, CG hypermethylation was seen in 1and 3-year callus relative to the shoot in the rice mutant MET1-2, which encodes a DNA methyltransferase with a major role in maintaining CG methylation. In wildtype rice, however, only CHH hypermethylation was observed [52]. In our study, the callus versus seedling comparison showed that the A188 MET1-2 homolog (Zm00056a035610) was~2× up-regulated in the callus, and mop1 (Zm00056a013519), a homolog of RNA-dependent RNA polymerase 2 that is involved in the production of 24 nt small RNA [56], was 5-6× up-regulated in the callus, indicating that the transcriptomic machinery was regulated to enhance global DNA methylation in the callus. In plants regenerated from calli, CG and CHG methylation tended to be lost as compared to non-regenerated plants and many methylation events were heritable [57]. Heritable hypomethylation in regenerated plants was observed in an earlier maize study [58]. In rice, as compared to nonregenerated plants in rice, pronounced hypomethylation was found in regenerated plants from tissue culture [59]. The discrepant DNA methylation levels between regenerated plants and calli indicated that most methylation gained from tissue culture is not stable or heritable. Collectively, DNA methylation was elevated during the formation of the callus, likely due to the cellular defense responses. The majority of DNA methylation gained appears to be demethylated during redifferentiation, resulting in hypomethylated regenerated plants.

Conclusions
The genome of a regenerable maize inbred line A188 was assembled with long reads and optical maps, producing a reference-quality genome sequence. Comparison of the A188 genome with the reference B73 genome identified structural variants, including those responsible for phenotypic discrepancies between A188 and B73. Examination of DNA methylation and gene expression with the newly generated A188 reference genome found hypermethylation in the callus as compared with the seedling and the activation of defense genes in the callus, indicative of the defensive state for cellular protection in the embryogenic callus.

Genetic materials
A188 (PI 693339) seeds were obtained from the North Central Regional Plant Introduction Station in Ames, IA. The A188 inbred line was derived from a cross between the inbred line 4-29 and the inbred line 64, also named A48, followed by four generations of backcross with 4-29. The line 4-29 was derived from the commercial variety Silver King and the line 64 was from a northwestern dent line [1]. Double haploid lines were developed from the F 1 of B73 (PI 550473) x A188 at the Doubled Haploid Facility at Iowa State University.
Nanopore A188 whole-genome sequencing A188 were grown in the greenhouse at 28°C and 23°C day/night, with a photoperiod of 14:10 h (light:dark). Nuclei were isolated from seedling leaves using a modified nucleus isolation protocol [60] and dissolved in buffer G2 (Qiagen). The lysate was used for DNA isolation with Qiagen DNeasy Plant Mini Kit (Qiagen) following the manufacturer protocol. A188 genomic DNA was size selected for 15-30 kb and above with the BluePippin cassette kit BLF7510 (Sage Science) high-pass-filtering protocol, followed by a library preparation with the SQK-LSK109 kit (Oxford Nanopore). Each DNA library was loaded on an FLO-MIN106D flowcell and sequenced on MinION (Oxford Nanopore). The basecaller Guppy (version 3.4.4) was used to convert FAST5 raw data to FASTQ data with default parameters.

Illumina A188 whole-genome sequencing
Three independent A188 leaf samples were collected for extracting nuclear DNAs. Two were used for PCR-free paired-end 2x125 bp Illumina sequencing and one was used for PCR-free paired-end 2x250 bp Illumina sequencing on Hiseq2500 at Novogene. In addition, genomic DNA was extracted from A188 immature ears for additional PCRfree paired-end 2x250 bp Illumina sequencing. Therefore, comparable 2x250 bp data were generated from the leaf and ear tissue samples. The 2x125 bp Illumina sequencing data were comparable with the previously generated 2x125 bp B73 whole-genome sequencing data (SRR4039069 and SRR4039070) [61], both of which were used for CGRD analysis.

Contigs filtering
Leaf and ear 2x250 bp data were aligned to the contigs with the "mem" module in bwa (0.7.12-r1039) [63]. Uniquely mapped reads with less than 15% mismatches were used to determine read count per contig with the "intersect" module of BEDTools (v2.29.2) [64]. The log2 of the ratio of read counts normalized by using total reads of leaf and ear samples was calculated for each contig. The contigs with a log2 value larger than 0.5 were considered as the contigs with variable counts from leaf and ear samples. The contigs (N = 21) that had variable counts and less than 100 kb and were not anchored to B73Ref4 via Ragoo (version 1.2) [65] were discarded. In addition, the contigs (N = 16) smaller than 15 kb were also discarded.
Through analysis of read counts, the contigs that had variable counts and matched with the previously sequenced mitochondrion genome sequence (Genbank accession: DQ490952.1) and the chloroplast genome sequence (Genbank accession: KF241980.1) were identified. One chloroplast contig and 13 mitochondrion contigs were found. The chloroplast contig had almost identical sequences to the Genbank accession KF241980.1. The failure of assembling mitochondrial contigs into one was likely due to heterogeneous forms of mitochondria. In A188Ref1, the previously assembled DQ490952.1 and KF241980.1, were used to represent the mitochondrion and chroloplast genomes, respectively.

Sequence polishing of assembled contigs
After filtering contigs that were derived from organelles or contamination, the remaining contigs were first polished with raw Nanopore reads that contained signal information using Nanopolish (0.11.0) (github.com/jts/nanopolish). Briefly, Nanopore reads were aligned with the contigs using the aligner Minimap2 (2.14-r892) [66]. Polymorphisms, including small insertions and deletions as well as single nucleotide polymorphisms, were called and corrected. The Nanopolish polishing was performed twice, followed by twice polishing with Illumina sequencing data using Pilon (version 1.23) [67]. In each Pilon polishing, reads were aligned to contigs with the module of "mem" in bwa (0.7.12-r1039) [63]. Contigs were corrected with the parameters of "--minmq 40 --minqual 15" using Pilon.

Hybrid scaffolding with Bionano data and polished contigs
Bionano raw molecules were filtered to remove molecules less than 100 kb. The remaining molecules were assembled into Bionano maps with the assembly module in the software Bionano Tools (v1.0). Five times extension and merge iterations and noise parameters were automatically determined by using the parameters of "-i 5 -y". The hybrid scaffolding module from the Bionano Tools was used for scaffolding polished contigs. The conflict filter level for both genome maps and sequences were set to 2 by using the parameters of "-B 2 -N 2".

Construction of a B73xA188 genetic map
Genomic DNA of DH lines was extracted by using BioSprint 96 DNA Plant Kit (Qiagen) and normalized to 10 ng/μL for Genotyping-By-Sequencing (GBS) modified from tunable GBS [68]. Briefly, for each genomic DNA sample, the restriction enzyme Bsp1286I (NEB) was used for DNA digestion for 3 h at 37°C, followed by ligation with a barcoded single-stranded oligo with T4 DNA ligase (NEB) for 1 h at 16°C. Enzymatic activity was inactivated at 65°C for 20 min and all samples of ligated DNA were pooled, followed by purification with Qiagen PCR purification kit (Qiagen). The purified ligated DNA was subject to PCR amplification with Q5 high-fidelity DNA polymerase (NEB), followed by purification with Agencourt AMPure XP (Beckman Coulter). The final sequencing library product was prepared by size selection at the range of 200 to 400 bp by a Pippin Prep run with 2% agarose gel cassettes (Sage Science). Illumina sequencing was performed on a HiseqX 10 at Novogene (USA).
Raw FASTQ data were demultiplexed to multiple samples and trimmed to remove barcode sequences and low-quality bases with Trimmomatic (version 0.38). Clean reads were aligned to polished contigs with the "mem" module of bwa and uniquely mapped reads with less than 8% mismatches were used for SNP analysis. SNPs were discovered by HaplotypeCaller of GATK (version 4.1.0.0) and filtered by SelectVariants of GATK to select biallelic variants [69]. SNP sites with at most 80% missing data, at least 10% minor allele frequency and at most 5% heterozygous rates remained. A segmentation (or binning) algorithm was implemented to determine genotypes of chromosomal segments in each DH line [68]. Genotypes of bin markers of 100 DH lines were used to construct a genetic map (BAgm.v01) with MSTmap [70].
Another genetic map was built using A188Ref1 as the reference genome with 137 DH lines (100 DH lines for BAgm.v01 and 37 additional DH lines). Recombination data was inferred from the genetic map (BAgm.v02) based on A188Ref1 (Additional file 2: Table S13).

ALLMaps to build pseudomolecules
The genetic map that was built with GBS markers was used for further scaffolding. The GBS markers were developed using polished contigs as the reference genome. Each scaffold harbored more than 10 markers. In total, 29 scaffolds were on the map. Scaffolds were aligned to B73Ref4 via NUCMer [71]. Based on the orientation of scaffolds relative to B73Ref4 chromosomes, the order of markers in each linkage group was either kept the same order or flipped to match their orders in B73Ref4. The software ALLMaps (JCVI utility libraries v1.0.6) [72] was run with default parameters, constructing 10 pseudomolecules corresponding to ten A188 chromosomes.

BUSCO assessment
Benchmarking Universal Single-Copy Orthologs (BUSCO) [22] was run in a mode of "genome" to assess the completeness of the assembly with default parameters. BUSCO was run in a mode of "transcriptome" to assess the completeness of the gene annotation with default parameters. Both assessments used the Liliopsida database (liliopsida_ odb10) that consisted of 3278 conserved core genes.

Estimation of base errors using KAD analysis
The module "KADprofile.pl" in the KAD tool (version 0.1.7) [21] was used to estimate errors in A188Ref1. The input read data were the merged trimmed Illumina 2x250 bp reads from leaf and immature ears. The k-mer length of 47 mer was used.

Estimation of recombination rates
Genetic distances of non-overlapping 1-Mb windows were estimated. Non-overlapping 1-Mb windows were generated by the module of "makewindows" in BEDTools (v2.29.2) [64]. The last window of each chromosome was discarded due to the smaller size than 1 Mb. The prediction of genetic distance per window utilized a method developed previously [73]. Briefly, a generalized additive model (GAM) was used for the prediction of the genetic distance of any physical interval.
The similar method was used to estimate recombination rates around each gene and repetitive element. For example, for a given element, we first find the midpoint of the element. The genetic positions were then predicted, by GAM, for the position 0.5 Mb upstream and the position 0.5 Mb downstream. The distance of the genetic positions was then used to represent the recombination context of the element.
The recombination rates that are lower than 0.6 cM/Mb and higher than 3 cM/Mb were categorized to low recombination and high recombination, respectively.
Illumina RNA-Seq, transcriptomic assembly, and differential expression Thirty-three RNA samples were extracted from 11 diverse tissue types of A188 with three biological replicates using RNeasy Plant Mini Kit (Qiagen) (Additional file 2: Table S14). Briefly, the 11 tissues included the root and the above-ground of 10-dayold seedling, three different parts of the 11th leaf tissue at V12, the meiotic tassel, anther, and immature ear at V18, the endosperm and embryo 16 days after pollination, and the callus after 39 days culture of DAP11 immature embryos. RNA quality control, library preparation, and sequencing were performed on an Illumina Novaseq 6000 platform at Novogene. Trimmomatic (version 0.38) [76] was used to trim the adaptor sequence and low-quality bases of RNA-Seq raw reads. The parameters used for the trimming is "ILLUMINACLIP:trimming_db:3:20:10:1:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:13 MINLEN:40". The trimming adaptor database (trimming_db) includes the sequences: adaptor1, TACACTCTTTCCCTACACGACGCTCTTCCGAT CT; adaptor2, GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT. Only paired reads both of which were at least 40 bp after trimming were retained for further analysis.
Trimmed reads were aligned to A188 (A188Ref1) using HISAT2 (version 2.1.0) with the parameters of "-p 8 --dta --no-mixed --no-discordant -k 5 -x" [77]. Alignments whose paired reads were concordantly paired were kept. The software StringTie2 (version 2.1.0) [78] was used to assemble the transcriptome with alignments from a dataset of each A188 sample with the default parameters. In total, 33 transcriptome assemblies from 33 samples were generated. All transcriptome assemblies were merged to build an A188 Illumina transcriptome assembly with the merge function in StringTie2.

Differential expression of the callus relative to other tissue types
Trimmed reads were aligned to A188Ref1 with STAR (2.7.3a) [79]. Uniquely mapped reads with at least 96% coverage and 96% identity were used for determining read counts per gene. DESeq2 (version 1.26.0) [80] was used to identify differential expression between the callus and each of other tissue types. Multiple tests were corrected with the FDR (false discovery rate) approach [81]. The FDR of 5% was set as the threshold.

Genome annotation
The Maker (2.31.10) pipeline was used for genome annotation [84]. The genome was masked by using Repeatmasker (4.0.7) [85] with the A188 repeat library built by the Extensive de novo TE Annotator (EDTA, v1.8.4) [86]. Two rounds of the maker prediction were performed. At the first round, the A188 assembled transcripts and B73Ref4 protein data were used as EST and protein evidence, respectively. The parameters "est2genome=1" and "protein2genome=1" were set to directly produce gene models from transcripts and proteins. At this round, no ab initio gene predictors were used. Prior to the second maker round, a snap model was trained using the confident gene set from the first round. Gene models produced from round 1 were input as one of the predicted gene models. These gene models were competed with gene models predicted by three gene predictors: snap (2013_11_29) [87], augustus (3.3.3) [88], and fgenesh (v.8.0.0) (softberry.com). ESTs from relative maize genotypes and proteins from closely related species were provided as additional evidence. Gene model output from Maker were further filtered. First genes matched with the following criteria "-evalue 1e-50 -qcov_hsp_perc 60" to the transposon database in Maker were filtered. Second, a transcript retained if it carried Pfam domains from the result of InterProScan (version 5.39-77.0) and/or had an annotation edit distance (AED) less than 0.4, which measured the level of discrepancy of an annotation from supporting evidence.

Functional annotation of transcripts
BLASTP was used to map all proteins to the SWISS-Prot database (https://www. uniprot.org/) with the e-value cutoff of 1e−6. Gene ontology (GO) was extracted from InterProScan.

Identification of a major transcript per gene
For a gene containing multiple transcripts, a major transcript per gene was selected if a transcript had the highest non-zero FPKM (Fragments Per Kilobase of transcript per Million mapped reads) determined from Illumina RNA-Seq datasets of diverse tissues by Cufflink (v2.2.1) [89], and/or the lowest BLASTP e-value to the SWISS-Prot database, and/or the longest transcript length. The BLASTP e-value had a priority relative to the transcript length. If data were not sufficient to make a decision, the one with the longest length was selected.

Paralogs in A188 and orthologs between A188 and B73
Paralogs in A188 and orthologs between A188 and B73 were identified with OrthoMCL [91]. Briefly, protein sequences of major transcripts with at least 20 amino acids were used for all-to-all BLASTP with the e-value cutoff of 1e−5. The BLASTP result was input to OrthoMCL to identify paralogous and orthologous groups.

Identification of gene clusters
A gene cluster was defined if at least three genes from a group of A188 paralogs identified by OrthoMCL were physically closely located on a chromosome. The maximum distance is 250 kb for two neighboring genes in a cluster.

Annotation of NLR genes
The NLR genes of A188Ref1 were annotated using the NLR-Annotator pipeline [92].

Analysis of NUMT and NUPT
The "nucmer" command from the software MUMmer 4 [93] was used to align the A188 mitochondrion or chloroplast genomes to A188Ref1. For mitochondrial alignments, each required at least 5 kb and 95% identity. For chloroplast DNA alignments, each required at least 3 kb and 95% identity based on the minimal requirement for a sufficient FISH signal [26]. Multiple alignments with the distance less than 100 kb were clustered into a block, considered to be a nuclear integration event.

Comparative genomic analysis via SyRI and CGRD
The "nucmer" command was used for whole-genome alignment of 10 chromosomal pseudomolecules between A188Ref1 and B73Ref4. The parameter of "--maxmatch -c 500 -b 500 -l 50" was used in the command "nucmer" and the parameter of "-i 95 -l 1000 -m" in the command of delta-filter, which resulted in best alignments with at least 1-kb matches and at least 95% identity between the two assembled genomes. The "show-coords" command with the parameter of "-THrd" was run to convert alignments to a tab-delimited flat text format. Alignment results were then used for identifying genomic structural variation and nucleotide polymorphisms through SyRI (v1.2) [32] with the parameter of "--allow-offset 100". Syri analysis discovered genome duplication, translocation, inversion, as well as syntenic, unaligned, divergent sequences. SNPs, small insertions, and deletions were identified as well. SyRI analyses using minimap2 alignments were also performed between A188Ref1 and each of the newly assembled genomes of NAM founders, including version 5 of B73Ref5 [18], as well as genome assemblies of Mo17 [10] and SK [11]. The parameter used for minimap2 alignments is "-ax asm10 -t 16 -K 800M -f 500 --eqx" [83].
The CGRD pipeline (v0.1) (github.com/liu3zhenlab/CGRD) was employed to find copy number variation (CNV) through comparing depths of Illumina reads from A188 and B73 with the default parameters [33]. A value of the log2 read depth ratio per sequence segment (LogRD) is the indication for CNV. For a segment, the LogRD is close to zero if sequences of two genotypes are identical and no CNVs. The sufficient deviation of the mean of LogRD from zero is likely due to CNV or a high level of divergence. CGRD was performed using A188Ref1 as the reference genome and identified sequences of A188Ref1 showing conserved (B73 = A188), copy number plus (B73 > A188), and copy number minus (B73 < A188) in B73 relative to A188. When B73Ref4 was used as the reference, the analysis found sequences of B73Ref4 showing conserved (A188 = B73), copy number plus (A188 > B73), and copy number minus (A188 < B73) in A188 relative to B73.

Identification of PAV or highly divergent sequences (HDS)
SyRI analysis listed B73Ref4 sequences that were not aligned to A188Ref1, and vice versa, as well as insertion/deletion polymorphisms between the two chromosomal sequences. Unaligned sequences or insertion/deletion polymorphisms identified by SyRI were compared with CGRD segments. For each SyRI event, a supporting score of read depth data from CGRD was determined by using the formula of P n i −LogRD i ÂO i L , where i represents the ith overlap between a CGRD segment and a SyRI event; LogRD stands for the LogRD of the CGRD segment and only negative values were taken into calculation; O is the overlapping length in bp; L is the length in bp of the SyRI event; and n is the total number of overlaps. The resulting value from the formula represents the degree of the differentiation in read depth between the two genotypes for the SyRI event. The higher the number, the more confidence the PAV or HDS event. A SyRI event is considered to be a PAV or HDS if a supporting score is larger than 3.

Identification of large inversion events
Inversion between A188Ref1 and B73Ref4 were revealed by SyRI. Large events with both A188 and B73 sequences larger than 0.5 Mb were extracted. First, the inversion sequences of B73Ref4 were aligned to B73Ref5 to confirm the inverted orientation relative to A188Ref1. For a given inversion, if >80% B73Ref4 sequences were aligned to B73Ref5 in the plus orientation, the inversion was supported by B73Ref5. If <20% B73Ref4 sequences aligned to B73Ref5 were in the plus orientation, the inversion was considered to not be supported by B73Ref5. Second, the recombination frequency between the start and the end of an inversion event was estimated and adjusted to cM per Mb. Third, SNPs between the two genomes and located on the inversion were identified. The common SNPs genotyped in the maize 282 [94] population were extracted for determining linkage disequilibrium (LD) between SNPs at a distance of 0.

Structure analysis of inversions in maize HapMap2 population
The software STRUCTURE (v2.3.4) [96] was used to analyze the inference of population structure for A188 inversions in maize HapMap2 population [97]. A188 and B73 SNPs between inversion regions were discovered by SyRI. HapMap2 genotyping data overlapping with inversion SNPs were extracted and the subset of SNPs with the missing rate less than 20% were input for STRUCTURE analysis. The major alleles, minor allele, and missing locus in SNP dataset were converted to 0, 1, and −1, respectively. K = 2 as the cluster number and 10 replicate runs of the admixture model were used, with a burn-in of 10,000 iterations and a run length of 20,000 steps.

Fluorescence in situ hybridization (FISH)
Mitotic and meiotic chromosomes were prepared as described by Koo and Jiang (2009) with minor modifications [98]. Root tips were collected from seedling plants and treated in a nitrous oxide gas chamber for 1.5 h, fixed overnight in ethanol:glacial acetic acid (3:1), and then squashed in a drop of 45% acetic acid. Anthers were squashed in 45% acetic acid on a slide and checked under a phase microscope. All preparations were stored at −70°C until use. DNA probes of the CentC, Knob, Cent4 [99], and the probes for examining NUMTs, the PME cluster, and a potential large inversion on chromosome 4 (Additional file 2: Table S15) were labeled with digoxigenin-11-dUTP (Roche, Indianapolis, IN), biotin-16-dUTP (Roche), and/or DNP-11-dUTP (PerkinElmer), depending on whether two or three probes were used in the FISH experiment [99]. The FISH hybridization procedure was according to a previously published protocol [100]. After post-hybridization washes, the probes were detected with Alexa Fluor 488 streptavidin (Invitrogen) for biotin-labeled probes and rhodamine-conjugated anti-digoxigenin for dig-labeled probe (Roche). The DNP-labeled probe was detected with rabbit anti-DNP, followed by amplification with a chicken anti-rabbit Alexa Fluor 647 antibody (Invitrogen). Chromosomes were counterstained with 4′,6-diamidino-2-phenylindole (DAPI) in Vectashield antifade solution (Vector Laboratories). The images were captured with a Zeiss Axioplan 2 microscope (Carl Zeiss Microscopy LLC) using a cooled CCD camera Cool-SNAP HQ2 (Photometrics) and AxioVision 4.8 software. The final contrast of the images was processed using Adobe Photoshop CS5 software (Adobe).

QTL mapping
Kernel colors of 125 B73xA188 DH lines were scored as 1 to 6 (1 = white, 6 = yellow, and 2 to 5 indicated colors between white and yellow). QTL mapping of kernel color was performed by using scanone function with the Haley-Knott regression method in the R package rqtl [101]. The LOD cutoff was the 5% highest LOD value from 1000 permutations of phenotypic data.

qRT-PCR
qRT-PCR was used to measure the gene expression of ccd1 and y1 gene in genotypes of A188 and B73 as well as two DH lines DH305 and DH312. Immature ears of the four genotypes were harvested from the summer nursery in Manhattan Kansas at 16 days after pollination (DAP16). Fifteen kernels were randomly sampled from the middle of an ear, five kernels of which were pooled as a biological replication for RNA isolation. cDNA was synthesized with Verso cDNA Kit (Thermo Scientific) following the manufacturer's protocol. qRT-PCR was performed in a reaction of 10 μL with the IQTM SYBR Green Supermix reagent (BioRad) on the CFX96 Real-Time PCR System (BioRad). The thermocycling conditions for the PCR included an initial denaturation at 95°C for 3 min, followed by 40 cycles of denature at 95°C for 15 s, annealing, and extension at 60°C for 30 s. The housekeeping reference gene actin1 was used as the internal control. Cycle threshold values (Ct) of two technical replicates were averaged and used to quantify relative gene expression levels. The relative expression levels of each of ccd1 and y1 genes in each sample were calculated using the formula 100 x 2 actinCt-geneC , where actinCt and geneCt stand for the Ct values of actin1 and ccd1 (or y1), respectively. The primers used are as follows: actin1: act1_qrt_2F and act1_qrt_2R; ccd1: ccd1_qrt_5F and ccd1_qrt_5R; y1: y1_qrt_4F and y1_qrt_4R. Sequences of primers are in Additional file 2: Table S15.

Kernel carotenoid content measurement
The seed total carotenoids of different genotypes with three biological replicates were measured following the protocol described by Mishra and Singh [102]. Briefly, 10 maize seeds from each replicate were milled into fine flour using mortar and pestle. After the flour was filtered using muslin cloth, 0.5 g of the fine powder was dissolved into 6 mL of 1% butylated hydroxytoluene. The well-mixed samples were treated following the protocol [102], and the optical density at 450 nm of the treated sample was measured using Genesys 20 spectrophotometer (Thermo scientific). The concentration of total carotenoids was calculated using the formula described by Mishra and Singh [102].

Whole-genome bisulfite sequencing (WGBS)
Genomic DNA from two biological replicates of the seedling and callus samples that were used for RNA-Seq were subjected to WGBS on a Novaseq 6000 at Novogene (USA). A Bismark pipeline (v0.22.1) was adapted to process bisulfite sequencing DNA methylation data [103]. Briefly, raw reads were subjected to Trimmomatic trimming (v0.38) [76] to remove adaptor and poor-quality sequences. Bowtie2 (v2.3.5.1) [104] built in Bismark was used for the alignment and alignments of duplicated reads were removed before methylation calling. The methylation levels per cytosine site of all three sequence contexts (CG, CHG, and CHH) were determined, which were used for identifying differentially methylated regions (DMRs) with the DSS R package (v2.34.0) (github.com/haowulab/DSS).

DNA methylation around genes and on repetitive sequences
Genomic regions (gene body) from the translation start site (TSS) to the translation termination site (TTS), which were based on genomic locations of major transcripts, were equally divided into 200 windows. For each gene, the 2 kb in the 5′ upstream region and the 2 kb in the 3′ downstream region of each gene were also extracted. The DNA methylation rate in three sequence contexts (CG, CHG, and CHH) on each window of the gene body or each 20 bp in upstream and downstream regions was separately determined to examine the distribution of DNA methylation on and around genes.
DMRs were located in the three regions, 5′ upstream 1 kb, gene body, 3′ downstream 1 kb. For each region, the independence between changes of DNA methylation, increased or decreased in the callus versus the seedling, and regulation in gene expression, up-or down-regulated in the callus versus the seedling from DE analysis, was examined through χ 2 statistical test. Tests were performed for all three methylation types: CG, CHG, and CHH.
DNA methylation rates per 100 bp of repetitive sequences were determined. Annotation of repetitive types was from EDTA and additional 45S rDNA alignment analysis. Paired t-test was performed between the two tissues: callus and seedling.

Tissue network and principal component analyses of A188 tissues
The A188 tissue network was constructed with the R package WGCNA (version 1.66) [105] using the expression of 29,222 genes in 33 RNA-Seq datasets from 11 A188 tissue types. WGCNA was performed to cluster A188 tissue samples with the parameters minModuleSize = 6 and soft-thresholding power = 9. The Gephi software (version 0.9.2) [106] was used to visualize tissue networks with the module and connectivity information from the WGCNA result. Principal component analysis (PCA) was also performed using the R functions prcomp with the expression per gene averaged from three replicates per tissue type.

Gene ontology (GO) enrichment analysis
The enrichment analyses were performed to determine if a certain GO was overrepresented in a selected group of genes. The resampling method in GOSeq [107] was employed.