Biological material
Hexaploid wheat (for example, ‘bread’ or ‘common’ wheat) formed around 8,000 years ago through a natural hybridization between cultivated tetraploid wheat (AABB genome) and a wild wheat relative, Ae. tauschii (DD genome) [60]. Commonly known as bread wheat, the hexaploid species is widely cultivated throughout the world. The tetraploid wheat species (also referred to as ‘Durum’ or ‘pasta’ wheat) represents an older group of wild and cultivated material. Durum wheat is the modern form of a 10-millenia aged crop complex represented by various taxa of the same Triticum turgidum spp. Durum wheat (Triticum turgidum ssp. turgidum var. durum (Desf) Husn.) is represented by landraces and elite inbred lines. T. turgidum is domesticated from wild emmer (Triticum turgidum ssp. dicoccoides) and is allotetraploid (2n = 4x = 28, genomes AABB). Durum wheat is a selfing species and commercial varieties are mostly pure lines. The diploid D genome species, Ae. tauschii, is a wild annual grass native throughout central Asia.
‘Synthetic W7984’ is a contemporary reconstitution of hexaploid wheat formed by hybridizing a tetraploid wheat Triticum turgidum L. subsp. durum var ‘Altar 84’ (AABB genotype) with the diploid goat grass Ae. tauschii (219; CIGM86.940) (DD genotype). Following chromosome doubling, this synthetic hexaploid is interfertile with bread wheat and is typically regarded as a variety of T. aestivum.
T. aestivum var ‘Opata M85’ is a hexaploid bread wheat cultivar developed in the wheat breeding program at the International Wheat and Maize Research Center (CIMMYT). It is a medium quality, medium maturity hard white spring wheat.
Synthetic W7984 and Opata M85 are parents of the widely used DH genetic reference population ‘SynOpDH’ [27]. For this population a total of 215 DH lines were produced from two F1 plants. The F1s were made from a cross between two single plants using W7984 as female and Opata as male. From the parental cross, two F1 plants were used to form the DH lines using the maize pollinator method [27].
Seeds for the Synthetic W7984, Opata M85 accessions and SynOpDH lines used in this study can be obtained upon request from the Wheat Genetics Resource Center at Kansas State University.
Shotgun sequencing of the synthetic wheat W7984
WGS Illumina libraries were prepared using DNA isolated from etiolated seedlings. For each of the parental lines, tissue from a minimum of 20 plants was sampled and pooled together for DNA extraction. A standard CTAB (cetyltrimethyl ammonium bromide) extraction was used with RNase treatment. For DH lines, six seedlings were sampled and DNA was extracted using the Qiagen BioSprint 96 Plant DNA extraction kits and robot. TruSeq Illumina fragment libraries of size approximately 250 bp and approximately 500 bp were sequenced using 2×150 chemistry on a HiSeq 2000 instruments. A summary of the dataset can be found in Table S1 in Additional file 1. Three ‘800 bp’ fragment libraries were prepared and sequenced using long run chemistry on the Illumina HiSeq 2500, producing nominal paired 250 bp reads. Two of the three attempted ‘800 bp’ libraries showed substantial bimodality when aligned to preliminary assemblies, including not only the desired peak insert size at approximately 800 bp but also a large collection of pairs with short inserts (<400 bp). All sequences were used for contig building, but only unimodal libraries were used for scaffolding. Two LFPE (ligation-free paired end) mate pair libraries were generated as follows. DNA fragments were generated using the 5500 SOLiD Mate-Paired Library Construction Kit (SOLiD®). Genomic DNA (5 μg) was sheared using the Covaris E210 (Covaris (Woburn, MA, USA)) and gel size selected to target an insert size of 1.5 kbp (library OAGT) and 4 kbp (PSWH). The sheared DNA was end repaired, and ligated with biotinylated internal linkers. The DNA was then circularized using intra-molecular hybridization of the internal linkers. The circularized DNA was treated with plasmid safe to remove non-circularized products. The circularized DNA was nick translated and treated with T7 exonuclease and S1 nuclease to generate fragments containing internal linkers with genomic tags on each end. The mate pair fragments were A-tailed and purified using Streptavidin bead selection (Invitrogen). The purified fragments were ligated with Illumina adaptors and amplified using 10 cycles of PCR with Illumina primers to generate the final library. Quantitative PCR was used to determine the concentration of the libraries and were sequenced on the Illumina Hiseq. The distribution of insert sizes is measured to be approximately 1.0 kbp for OAGT and approximately 4.2 kbp for PSWH (Figure S5 in Additional file 1).
Sequencing of T. aestivum ‘Opata M85’ and the SynOpDH population
To identify variants that differentiate Synthetic W7984 from Opata M85, we produced approximately 19× shotgun coverage for Opata M85 (Table S2 in Additional file 1). Since de novo assembly was not our aim, no mate pairs were generated. 51-mer depth is shown in Figure 1. Note that each read of length R is tiled by R - k + 1 k-mers, and each sequencing error affects k k-mers and therefore the k-mer depth is reduced by approximately ke/2, where e is the per base error rate and the factor of ½ roughly accounts for the fact that most errors occur near the end of a read. Thus, although the raw shotgun coverage is approximately 19×, the peak 51-mer frequency is approximately 11 × .
De novo whole genome assembly
Assembly was performed using meraculous [29] and is available for download [61]. The Perl code used to perform the assembly is available online [62] (with exceptions noted below). Several modifications to the core meraculous code-base were made to improve the performance of the assembler for this data set. These modifications are available for download [63].
The primary purpose of these code variants is to more fully take advantage of long (251 bp) reads in the assembly when the initial ‘UU’ contig generation procedure yields a highly fragmented preliminary result. In addition, a high-performance parallel version of the contig-generating k-mer-graph traversal phase of the assembly was developed with Unified Parallel C (UPC) and run on the NERSC Edison supercomputer (a Cray XC30) saving several days of compute time over the standard Perl implementation [64]. This high-performance implementation is based on a distributed hash table employing communication optimizations. We also leverage a lightweight synchronization scheme that relies on a state machine. De Bruijn graph traversal along uncontested ‘UU’ paths [29] took approximately 110 seconds on 3,072 cores or approximately 67 seconds on 6,144 cores. This code is available upon request.
The assembly was performed using an initial k-mer length of 51 (parameter -m = 51) and minimum k-mer frequency of three (parameter -D = 3). Contigs were generated using all short fragment libraries (but excluding mate-pair libraries). An initial round of scaffolding was performed using reads from all fragment libraries that were found to ‘splint’ pairs of contigs by 51-mer alignment. This splint-only-scaffolding protocol has not been used in previous meraculous assemblies, but was developed specifically to cope with the unique combination of insert sizes, depths of coverage, and genome complexity presented by this project. A minimum of three splinting alignments was required to accept a scaffolding link at this stage (parameter -p = 3).
Three additional rounds of scaffolding were performed following standard meraculous protocol for short (200 to 500 bp), medium (700 to 1000 bp), and long (4 kbp) libraries, each using a minimum of two spanning-pair alignments to accept a scaffolding link (parameter -p = 2). For the mate-pair libraries (OAGT, PSWH) reverse complementation and 3′ truncation (parameters -R, −U 3, respectively) were used to accommodate these library types. Additionally, short-pair elimination (parameter -D 600) was used for the UAXO library to deal with its moderate bi-modality, and the library H0036 was entirely excluded from this form of scaffolding due to extreme bimodality (Figure S5 in Additional file 1). Finally, gap-closing was performed using optional parameters -A, −D = 3, −R = 1.75.
With the exception of the contig-generation phase noted above, computations were performed on the JGI Genepool system (a 450-node sub-cluster with eight 48Gb, Intel Xeon L5520 2.27 Ghz cores per node and a dedicated 32-core 500 Gb SMP (Symmetric MultiProcessing) node were used). The k-mer counting and graph-generation steps required 5.6 k core-hours across 288 jobs. The read-alignment phase required a total of 30.8 k core-hours across 8.4 k jobs. The gap-closure phase required 3.5 k core-hours across 2.8 k jobs. These phases represent the vast majority of the computational resources required.
Contaminant screening of the assembly
Chloroplast, mitochondrial, prokaryotic, and fungal contaminants were sought by aligning the wheat scaffolds using blastx (parameters: −p blastx -a 7 -Q 11 -f 12 -W 3 -F ‘m S’ -U -e 1 -m 8 -b 10000 -v 10000) against the NCBI non-redundant proteins [65] for each category as the database. Ribosomal DNA was identified using megablast (parameters: −a 7 -b 0 -f T -D 3) against the NCBI non-redundant rDNA set. All alignments were initially filtered for a bit score ≥300, and scaffolds indicating a significant alignment were classified into bins. A total of 17,054 scaffolds (26.8 Mbp) were identified as likely contaminants, with 5,766 mitochondrion (5.6 Mbp), 451 chloroplast (338 kbp), and 10,837 prokaryote (21 Mbp). Contaminants included known sequencing-related microbial contamination, including Delftia spp. and Stenotrophomonas spp., but not obvious microbial or fungal commensals or pathogens associated with wheat. All subsequent analyses of the assembly excluded these contaminant scaffolds, unless otherwise noted.
Validation of assembly versus known transcripts and completeness relative to known transcribed genes
To assess the completeness of the genome assembly with respect to known transcribed sequence, we used a collection of 6,137 flcDNAs in the ‘Triticeae full length cDNA database’ [66] from T. aestivum var ‘Chinese Spring’ generated by Mochida et al. [35]. These flcDNAs are from hexaploid bread wheat and are expected to match our W7984 assembly with the exception of intra-specific polymorphisms and presence/absence or copy number variation. In contrast, they are expected to match the IWGSC ‘Chinese Spring’ assemblies identically. We used flcDNA rather than short-read RNAseq because the cDNA data are longer, of higher quality, and as clones are not subject to confounding effects arising from attempting to assemble homeologs in distinct scaffolds. We cleaned the flcDNAs by (1) trimming polyA tails with BioPerl ‘TrimEST’; (2) identifying non-wheat contaminations, using BLAST [67]; and (3) identifying putative transposable elements by comparison with RepBase [68].
Contamination
We identified three T. aestivum flcDNAs in GenBank as being in fact human sequences (RFL_Contig2039, 3209, and 5006) showing near 100% identity to human genes. These are presumably low-level contaminants of the wheat cDNA libraries. These sequences were excluded from further consideration.
Transposable elements
We found 99 T. aestivum flcDNAs from the Mochida et al. set (99/6,137 = 1.6%) with substantial BLAST alignments (BLASTN default word size, e-10, no DUST filter; >90% identity over >50% of their length) to RepBase entries. These were considered to be transposable elements and not considered in subsequent analyses.
Putative non-wheat sequences
To identify other likely non-wheat contaminations in Mochida et al. [35], we used BLASTN (e-10, no DUST filter; >90%) versus the GenBank non-redundant nucleotide database, and excluded from further consideration flcDNA sequences that (a) had no alignment to both our W7984 assembly and the ‘Chinese Spring’ assembly (>80% length, 1e-10) and (b) did not hit grass sequences in GenBank (>90% identity, >10% length). We found 52 flcDNA sequences that did not align to either assembly. Of these, 17 had alignments to grasses and were kept in further analyses; 32 had no GenBank hits to plants; 3 had only weak hits to non-grasses. These last two categories were not considered further.
Thus, after filtering for contaminants and transposons we consider 6,000 known, non-transposon T. aestivum flcDNAs = (6,137 initial flcDNA from Mochida et al.) - (99 RepBase transposon-related) - (3 human contamination) - (35 likely non-grass contamination not found in either assembly).
We also identified flcDNAs that have 10 or more alignments (>80% identity, >50% length) to one or both of the hexaploid wheat assemblies (126 to W7984, 198 to ‘Chinese Spring’). These are also likely to be repetitive elements, but may include recently diverged large gene families. These are included in all analyses.
Alignment to W7984 and ‘Chinese Spring’ assemblies
Non-transposon, non-contaminant cDNA sequences were aligned to both the meraculous W7984 WGS assembly database and to the IWGSC chromosome sorted ‘Chinese Spring’ assembly database with BLAST (BLASTN default word size, e-10, no DUST filter), initially requiring >80% identity over >50% of the cDNA or mRNA length. The high-scoring pairs (HSPs) of cDNAs aligned to genomic sequence correspond to exons, and minimally overlapping HSPs to a given scaffold were combined to produce a single percentage coverage (Total bases aligned/Total bases in cDNA) and percentage identity (Total positions matched/Total aligned positions excluding gaps).
Shotgun sequencing-based genotyping of the SynOpDH population
To genotype the SynOpDH mapping population we lightly shotgun sequenced 90 individuals. All sequencing was from unamplified fragment libraries nominally with 500 bp inserts, with 2×150 paired-end Illumina reads run on the HiSeq2000. Of these, three samples had less than 1× coverage, with the remaining samples having 1 to 2× read coverage (median: 1.38×, mean 1.37×, standard deviation 0.20×). (The estimated coverage was computed by dividing the total number of base pairs by 17 Gbp, without any attempted correction for contamination, adapters, and so on.)
A data summary is provided in Table S3 in Additional file 1. Briefly, sequences were indexed and pooled using Illumina TruSeq with indices as specified in Table S3 in Additional file 1. Estimated read depth is based on total sequence (Number of raw reads × Read length) divided by an estimated genome size of 17 Gbp. It does not include any correction for organellar contamination or artifacts. The ‘% artifact’ was estimated from 1% of reads; it was based on k-mer matches to a database of known sequencing artifacts at JGI. The ‘% organelle’ is estimated by comparing reads to the mtDNA and cpDNA of wheat.
The k-mer frequency distribution for the pooled reads of the mapping population is shown in Figure S7 in Additional file 1.
Note: SynOpDH IDs 0010, 0019, 0026, 0028, 0033, 0034, and 0117 were found to have deletions in chromosome 2D, 0031 in chromosome 3B, and 0083 in chromosome 7D. IDs 0030 and 0118 were found to have high rates of heterozygous markers, which is attributed to contamination. Data for these IDs were excluded from consideration in building the framework map.
Genetic map construction with POPSEQ
Read mapping and SNP calling
Shotgun sequence reads were mapped against all contigs ≥1 kbp of the meraculous W7984 WGS assembly using BWA-MEM version 0.7.7 [69]. Sorting of BAM files and duplicate removal were performed with PicardTools 1.100 [70]. SNPs and genotypes were called with the samtools mpileup/bcftools pipeline (version 0.1.19) [71]. The parameters ‘-B’ and ‘-D’ were supplied to samtools mpileup to disable BAQ calculation and record per-sample read depth. Genotype calls were filtered and converted into genotype matrix with an AWK script (available as Text S3 of Mascher et al. [54]). SNP calls with quality scores below 40, more than 90% missing data, or a minor allele frequency below 5% were discarded. The full genotype matrix is available as Dataset S2 in Additional file 4. The same procedures were also performed to produce a genotype matrix from the results of read mapping and SNP calling against the IWGSC assembly of cv. Chinese Spring [5].
Framework map construction
High-quality consensus genotypes were constructed for the meraculous scaffolds similar to the method described by Mascher et al. [24]. Only SNP positions at which both parents had successful genotype calls and were homozygous for opposite alleles were considered. Heterozygous calls in the DH progeny were set to missing. At least three successful genotype calls per individual and 95% concordance across all SNP positions on a scaffold were required to assign a scaffold genotype to an individual. Scaffold consensus genotypes with at least 10 genotype calls for each of the two parental alleles and less than four missing calls in the progeny were used as potential framework markers. The Hamming distance between all pairs of framework markers was calculated with a C program [24]. Groups of markers with pairwise Hamming distance 0 were put into the same bin of markers and the only the marker with the fewest number of missing genotype calls was selected as the representative of the bin. A total of 1,335 bin representatives were used as input for genetic map construction with MSTMap [42]. MSTMap was called with the following parameters: population_type DH, distance_function kosambi, cutoff_p_value 0.0000005, objective_function ML. All input bins were clustered in one of 21 linkage groups corresponding to the 21 chromosomes of wheat and positioned at distinct genetic positions in the output of MSTMap. The final map length was 2,826 cM. The genetic positions of framework markers are available as Dataset S3 in Additional file 4. Preliminary maps indicated the presence of large-scale deletions encompassing entire chromosome arms in 10 of the 90 DH lines. Additionally, two individuals showed an excess of heterozygous calls. These individuals were not used for map construction. Thus, the final framework map was made with genotypic data from 78 DH lines.
Anchoring scaffolds onto the framework map
Scaffolds of the meraculous assembly were placed into the framework map by finding the nearest neighboring genotype vectors in the set of framework markers as described by Mascher et al. [24]. Scaffold consensus genotypes were constructed as described above, but only a single successful genotype call per scaffold was required. Consensus genotypes with more than 70% missing calls were discarded. Nearest neighbor search was done with a C program [24]. Scaffold consensus genotypes having a Hamming distance >3 to their nearest neighbor(s) were discarded. If a scaffold had more than one nearest neighbor, we required ≥90% of the markers to come from the same chromosome and the median absolute deviation of genetic positions to be ≤5 cM. The genetic positions of scaffolds are available as Dataset S4 in Additional file 4. The same procedures were used to place IWGSC contigs onto our framework map. The genetic positions of contigs of the Chinsese Spring are available as Dataset S5 in Additional file 4.
Comparison to other datasets
All contigs ≥1 kbp of the IWGSC assembly of cv. Chinese Spring were aligned against all meraculous scaffolds of W7984 with megablast [72]. Only HSPs longer than 500 bp and sequence identity ≥98.5% were considered. The longest HSP of each IWGSC contig was used to assign it to a meraculous scaffold. Sequences of 64 bp genotyping-by-sequencing tags mapped previously in the Synthetic W7984 x Opata M85 DH population [28] were aligned to the meraculous assembly of W7984 with BWA-MEM (version 0.7.7) [69]. Only tags with the best possible mapping score (uniqueness) of 60 were retained. Coding sequences of barley high-confidence genes [45] were aligned to meraculous scaffolds using BLASTN [73] considering only hits with identity ≥90% and alignment length ≥200. Genetic positions of barley genes were taken from Mascher et al. [24]. Genetic positions of different maps were compared against each other and plotted with standard functions of the R statistical environment [74].
K-mer based genetic map
Defining 50 + 1-mer markers
A high-performance k-mer counting algorithm [64] was developed and used to count 51-mer frequencies in each of the two parental fragment data sets as well as the pooled SynOpDH population data. Using 9,600 cores of the NERSC Edison system, this counting was performed in less than 30 minutes using a distributed memory of 2.7 TB. A set of 2.2 million potential markers was derived from these counts using constraints described in the Results section. These constraints were imposed using an extension of the mer-counting software on 960 cores of Edison in 3 minutes of compute time using a distributed memory of 866 GB. The SynOpDH sequences were then individually genotyped against this panel of 2.2 million 50-mer markers using an extension of the mer-counting software running on 1,920 cores of Edison, requiring 23 minutes of compute time. After eliminating two SynOpDH individuals with outlying heterozygosity rates, any remaining markers with heterozygous calls in any individual were screened, leaving 1.7 million high-quality 50 + 1-mer markers. The marker sequences and associated genotype calls are available as Dataset S6 in Additional file 4.
Efficient clustering into linkage groups
This marker set was clustered into 21 linkage groups using a novel clustering algorithm (BubbleCluster [41]), which takes advantage of the underlying linear structure of genetic maps to produce a clustering of the markers in just over an hour of run time using one core of a quad-core AMD Opteron 8378 server. For this clustering a LOD threshold of 9 was used, and the resulting clusters included 1.34 million markers with no missing data in at least 46 of the 88 retained individuals. No significant minor clusters were found beyond the largest 21, which ranged in size from 5.2 k to 127.5 k markers.
Establishing a framework map
A framework map was derived from the 100,000 markers placed in clusters with the least missing data in the genotype array using MSTmap. This map was found to be in strong agreement (see Results) with the alternative map, which used markers derived from more conventional SNP-finding methods (see above), and is noteworthy in that it is produced directly from analysis of the shotgun sequence, requiring neither an existing assembly nor map (and was generated in less than 3 hours of wall-clock time using software specifically tailored to produce ultra-high-density genetic maps in a high-performance computing environment). Map locations of 50 + 1-mer markers are given in Dataset S7 in Additional file 4.
Attaching scaffolds to the map
By the uniqueness property of the underlying k-mers in a meraculous assembly, the set of 50 + 1-mer markers may be directly and uniquely assigned to scaffolds in the assembly by BLAST (or other suitable alignment method) with wordsize 51; 84% of markers in the set are assignable to scaffolds by this technique. These are assigned to 442 k scaffolds spanning 5.28 Gbp (267 k scaffolds larger than 1 kbp spanning 5.23 Gbp). Markers placed in linkage group clusters are assigned to 321 k scaffolds spanning 4.51 Gbp of the assembly (215 k scaffolds larger than 1 kbp spanning 4.48 Gbp). Of scaffolds with two cluster-assigned markers attached, 48/45,805 (0.10%) are found to have markers with conflicting linkage group designations, indicating a very low rate of potential misassembly (or marker mis-assignment). The net separation of marker pairs across this set indicates an inter-chromosomal misassembly rate of no more than one per 3.3 Mbp. We note that this assembly-independent framework map can be extended by identifying k + 1-mer markers on scaffolds, and combining the (sparsely sampled) markers on each scaffold into a haplotype ’super-marker’ with limited missing data. The placement of 50 + 1-mer marker on scaffolds given in Dataset S8 in Additional file 4.
Nucleotide diversity
We determined the average SNP rate per kilobase between two wheat genotypes by counting all base positions on the concatenated chromosome arm assemblies of cv. Chinese Spring [5] that are polymorphic in the respective pair of accessions and had at least 1× coverage in both W7984 and Opata M85. This analysis was based on the short read alignment against the Chinese Spring assembly (see ‘Read mapping and SNP calling’). Then, we divided this number by the number of all bases of the Chinese Spring assembly that have at least 1× coverage in both W7984 and Opata M85. These calculations were performed separately for the entire genome, the three subgenomes and for coding sequences. The predicted positions of coding sequences on the Chinese Spring assembly [5] (version July 2014) were downloaded from [75]. To find deletions in W7984, we calculated the read depth of the alignments of reads of Opata M85 and W7984 against the assembly of Chinese Spring using the programs ‘samtools depth’ [71] and BEDtools [76].
Data access
All shotgun reads are deposited into the Short Read Archive, with the following accession numbers: SRP037990, Triticum aestivum SynOpDH mapping population; SRP037781, Triticum aestivum Synthetic Opata M85; SRP037994, Triticum aestivum Synthetic W7984.
The WGS assembly of W7984 is accessible from the European Nucleotide Archive (accession PRJEB7074). The assembly can also be downloaded as a single multi-fasta file from [77]. Digital object identifiers (DOIs) were created with e!DAL [78].