Genomics reveals new landscapes for crop improvement

The sequencing of large and complex genomes of crop species, facilitated by new sequencing technologies and bioinformatic approaches, has provided new opportunities for crop improvement. Current challenges include understanding how genetic variation translates into phenotypic performance in the field.


Background
Genomics, the analysis of an organism's complete DNA sequence, has been one of the most transformative influences on biological studies. The genome sequences of organisms are fundamentally important for understanding the functions of individual genes and their networks, for defining evolutionary relationships and processes, and for revealing previously unknown regulatory mechanisms that coordinate the activities of genes. These genomics-based approaches are having a profound influence on both human disease diagnostics and treatment [1] and, equally importantly, on the improvement of crops for food and fuel production. In this review, we summarize progress in sequencing crop genomes, identify remaining technical challenges, and describe how genomics-based applications can aid crop improvement. We then assess the impact of genomics on plant breeding and crop improvement, showing how it is accelerating the improvement of staple and 'orphan' crops, and facilitating the utilization of untapped allelic variation. Finally, we speculate about the future impacts of genomics on plant biology and crop improvement by developing the concept of systems breeding, which integrates information on gene function, genome states, and regulatory networks across populations and species to create a predictive framework for estimating the contributions of genetic and epigenetic variation to phenotypes and field performance.

Progress in crop genome sequencing and analysis
Advances in sequencing crop genomes have mirrored the development of sequencing technologies (Table 1). Until 2010, Sanger sequencing of bacterial artificial chromosome (BAC)-based physical maps was the predominant approach used to access crop genomes such as rice, poplar and maize [1-3]. The rice genome comprises complete sequences of individual BACs assembled into physical maps that are anchored to genetic maps, whereas for maize, the sequences of individual BACs were not completely finished. For poplar, grapevine, sorghum and soybean [2,4-6], whole genome shotgun (WGS) reads of libraries of randomly sheared fragments of different sizes and of BAC end sequences (BES) were assembled with powerful assembly algorithms such as ARACHNE [7,8]. The trade-offs that shaped genome-sequencing strategies in the era before next-generation sequencing became available involved coverage, time and expense. Physical maps of BACs provide a good template for closing gaps and correcting errors, but genome coverage of physical maps can be non-representative due to cloning bias. In addition, intensive hand-crafting is required to assess physical map integrity and to close gaps; this effort scales directly with genome size and complexity.
The sorghum genome [1-3,5] was the first crop genome to be sequenced completely by the exclusive use of WGS sequence assemblies, which were then assessed for integrity using high-density genetic maps and physical maps. This pioneering analysis showed that scaffolds of Sanger sequence assemblies accurately span extensive repetitive DNA tracts and extend into telomeric and centromeric regions. The larger soybean genome was subsequently sequenced to similar high standards. The soybean genome is thought to be pseudo-diploid, derived from the diploidization of an allopolyploid in the past 50 million years [2,4-6,9], and this project successfully demonstrated that WGS assemblies are not confounded by large-scale genome duplication events.
Table 1 footnotes. (a) WGD alloploids have a whole-genome duplication in recent lineage. (b) A contig is an unambiguous linear assembly of sequences with no physical gaps in coverage, but which can contain errors. (c) The terms supercontig, scaffold or metacontig are used interchangeably to describe a set of contigs that are linked by a known physical distance but that contain sequence gaps. These scaffolds are usually created using mate-pair reads and BAC end sequences. (d) Pseudomolecule is a term applied to a chromosome-scale assembly of contigs and scaffolds that is anchored to a long-range framework using genetic markers and other chromosome features, including cytogenetic features and deletions.
By 2010 to 2011, a mixture of sequencing technologies, all using WGS assembly methods, was being successfully applied to trees (apple, cacao and date palm), fruit (strawberry), vegetables (potato and Chinese cabbage) and forage crops (alfalfa relative) [10-16]. The Medicago and tomato [17] projects, which were initiated in the BAC-based Sanger sequencing era, were completed using next-generation sequencing. The contiguity of assemblies varied according to genome composition and size, with very high contiguity being achieved in potato and alfalfa by alignment to BAC sequences. The Brassica genomes are among the most challenging to sequence with respect to achieving large-scale assemblies because they have undergone three recent whole-genome duplications followed by partial diploidization [18]. Polyploidy has a centrally important role in plant genome evolution and in the formation of important crop genomes. Figure 1 illustrates three examples of polyploidy and how these events contribute to crop genetic diversity in different ways. In Brassica species, polyploidy has led to extensive structural heterogeneity and gene copy number variation when compared with their close relative Arabidopsis. The Brassica rapa genome sequence remains fragmentary, but alignments of Brassica chromosome segments to the Arabidopsis genome are exceptionally useful for advanced genetic analysis [19].
In its early stages, crop genomics relied on many small-scale science laboratories joining forces to generate the sequence data. However, this has changed radically with the emergence and leadership of large-scale genome sequencing centers, which have focused their expertise and resources on important crop genomes. Two examples are the Joint Genome Institute (JGI) in the USA and the Beijing Genomics Institute (BGI, Shenzhen) in China, both of which provide exceptional expertise, capacity and levels of engagement with researchers. These centers and others are currently sequencing the genomes of many crucially important food and fuel crops, and are working in collaboration with science groups worldwide on improving our functional understanding of these genomes.
Since 2012, analyses of the sequences of 12 crop genomes have been published, accounting for nearly half of the total published (Table 1) [20-28]. This explosion of data has been driven by cheaper and more effective sequencing technologies (primarily the Illumina [29] and Roche 454 [30] methods) coupled with increasingly sophisticated sequence and assembly strategies [31], which are generally delivered by large genome centers. Access to these technologies makes even a reasonably large crop genome project affordable and feasible within the period of a single research grant, and is having a major influence on strategies in crop genomics. For example, the role of multi-partner coordination has changed from raising and coordinating research funding and managing the distribution of research activities to a focus on data analyses, distribution and applications. These changes will further accelerate and greatly diversify the range of plant species and varieties sequenced.
The date palm genome [12] was sequenced using just paired-end reads and remains fragmentary. Although this could be partly attributed to repeat composition, it is clear that the use of mate-pair libraries of different lengths, which provide accurately spaced pairs of sequence reads, substantially improves contiguity across medium-sized genomes of up to 1,000 Mb, as can be seen for citrus, diploid cotton, pigeonpea, chickpea and banana [21,24,25,32,33]. Contig and scaffold sizes were further increased in chickpea and pigeonpea by incorporating BES generated by Sanger sequencing, which have much longer read lengths and are paired over a 100 kb span. Increased lengths of Illumina reads, of up to 250 bases, are now available to users and should further improve contiguity. Using new assembly algorithms, the large genome of bamboo, a plant of major industrial and ecological significance, has recently been published [34]. Table 1 shows progress in sequencing two much larger Triticeae genomes, those of diploid barley (5,100 Mb) [27] and hexaploid bread wheat (17,000 Mb) [28]. Both the exceptional scale and high repeat content (approximately 80%) of these genomes provide significant challenges to straightforward WGS sequencing and assembly, with genes being separated by hundreds of kb of repeats such as nested retroelements [35]. In barley, a physical map of 67,000 BAC clones with a cumulative length of 4.98 Gb provided 304,523 BES reads as a framework for integration of 50X Illumina paired-end and 2.5 kb mate-pair reads. Contig median size was just 1.5 kb because the repeat content collapsed longer assemblies. Sequence assemblies were integrated with genetic and physical maps, and genic assemblies were assigned to chromosome arms. The chromosomal order of barley genes was then interpolated using synteny across multiple sequenced grass genomes and by ordering the genes according to the genetic or physical maps [36] (Figure 2).
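To make the contiguity figures concrete, the following minimal Python sketch computes the median contig length, the statistic quoted above for barley, alongside N50, a summary that is widely used for assemblies but is not quoted in this review. The contig lengths are invented for illustration and are not barley data.

```python
from statistics import median


def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths supplied")


# Hypothetical contig lengths in base pairs (illustrative only).
contigs = [500, 900, 1500, 1500, 2000, 8000, 40000]

median_len = median(contigs)  # dominated by the many short contigs
n50_len = n50(contigs)        # dominated by the few long contigs
```

In this toy set the median is small while the N50 is large, which shows why a repeat-rich assembly can report a short median contig size even when it contains some long contigs.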
The bread wheat genome is a recent hexaploid composed of three related genomes (A, B and D), each the size of the barley genome, which do not pair and recombine, leading to their independent maintenance [37] (Figure 1). The challenge for wheat WGS strategies was to provide independent assemblies covering and representing genes from each homoeologous genome. The two closest diploid progenitors of the A and D genomes were sequenced to identify polymorphisms that could be used to assess WGS gene assemblies. Low coverage (5X) Roche 454 sequence was generated, and orthologous gene sequences from multiple grasses were used to guide assemblies. Approximately 94,000 genes were assembled and positively assigned to the A and D genomes using genome-specific single nucleotide polymorphisms (SNPs), with the remaining assemblies tentatively assigned to the B genome. Wheat gene assemblies, which are fragmentary compared to barley gene assemblies, were assigned to chromosomes using high-density genetic maps and conserved gene order.
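The assignment logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used in the wheat project: the function name, the `min_support` threshold, the "ambiguous" category and all positions and alleles are hypothetical.

```python
def assign_subgenome(assembly_alleles, diagnostic_snps, min_support=2):
    """Assign a gene assembly to subgenome A or D from progenitor-specific SNPs.

    assembly_alleles: dict position -> base observed in the gene assembly
    diagnostic_snps:  dict position -> {"A": base, "D": base}, where the two
                      progenitor alleles differ (hypothetical input format)
    """
    support = {"A": 0, "D": 0}
    for pos, base in assembly_alleles.items():
        alleles = diagnostic_snps.get(pos)
        if alleles is None:
            continue  # position carries no diagnostic SNP
        for genome in ("A", "D"):
            if base == alleles[genome]:
                support[genome] += 1

    a, d = support["A"], support["D"]
    if a >= min_support and d == 0:
        return "A"
    if d >= min_support and a == 0:
        return "D"
    if a == 0 and d == 0:
        # Matches neither sequenced progenitor: tentatively B.
        return "B (tentative)"
    return "ambiguous"  # conflicting or weak evidence (assumed category)


# Hypothetical diagnostic SNPs and gene assemblies.
diagnostic = {
    101: {"A": "G", "D": "T"},
    250: {"A": "C", "D": "A"},
    377: {"A": "T", "D": "C"},
}
calls = [
    assign_subgenome({101: "G", 250: "C", 377: "T"}, diagnostic),
    assign_subgenome({101: "T", 250: "A"}, diagnostic),
    assign_subgenome({101: "C", 250: "G", 377: "A"}, diagnostic),
    assign_subgenome({101: "G", 250: "A"}, diagnostic),
]
```

Requiring more than one supporting SNP and no conflicting SNP is one conservative design choice; a real pipeline would also have to account for sequencing error and for variation within each progenitor species.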
The current wheat and barley gene-based assemblies are suitable for developing genetic markers [38] and for creating genetic maps for map-based cloning and marker-assisted breeding. To increase the gene coverage and contiguity of the barley genome, BACs in the physical map are being multiplex-sequenced using Illumina methods. This will result in chromosome assemblies with fewer gaps and more precisely ordered genes, and should establish barley as the pre-eminent genomic template and genetic reference for the Triticeae. Ongoing efforts in sequencing the bread wheat genome include sequencing purified flow-sorted chromosome arms to increase gene coverage and the complete assignment of homoeologous genes to the A, B, or D genome [39]. Constructing physical maps of BAC libraries made from purified chromosomes is also underway, with the chromosome 3B physical map [40] and BAC sequencing completed. Given sufficient funding and time, this strategy will provide the necessary high-quality reference genome. Since homoeologous genes can now be assembled and assigned to their genome, WGS can be used to improve the contiguity of wheat gene sequences by using long mate-pair spans, in non-overlapping increments up to 40 kb using fosmid vectors [41], coupled to longer read lengths. New template preparation methods, such as Illumina Moleculo, which breaks assemblies down into separate 10 kb units, could be used to span large repeat units and to facilitate accurate assemblies covering large tracts of repeats. Although a colossal amount of sequencing is required, a whole-genome strategy for wheat, supplemented by the flow-sorted chromosome arm data, has the potential to provide users with a high-quality draft sequence relatively quickly and cheaply.
Figure 1. (a) The progenitor of these Brassica species was hexaploid (compared to Arabidopsis) after two rounds of whole-genome duplication. Extensive gene loss, possibly via deletion mechanisms [18], has occurred in these species. Upon hybridization to form allotetraploid Brassica napus, gene loss is accelerated, producing novel patterns of allelic diversity [19]. (b) Bread wheat is an allohexaploid derived from the relatively recent hybridization of allotetraploid durum (pasta) wheat and wild goat grass, Aegilops tauschii. The Ph1 locus in the B genome [37] prevents pairing between the A, B and D genomes, leading to diploid meiosis and genome stability. This maintains the extensive genetic diversity from the three progenitor Triticeae genomes that underpins wheat crop productivity. (c) Sugarcane (Saccharum sp.) is a complex and unstable polyploid that is cultivated by cuttings. Hybrids between S. officinarum, which has high sugar content, and S. spontaneum, a vigorous wild relative, have variable chromosomal content from each parent. The genomes are closely related to the ancestral diploid Sorghum [42].
Several industrially important species, such as the conifers Norway spruce (Picea abies) [42] and loblolly pine (Pinus taeda), have very large genomes (approximately 20,000 and 24,000 Mb, respectively). They are being sequenced using WGS strategies involving fosmid pool sequencing and Illumina long mate-pair methods [43]. These tree species have particular characters that facilitate their genome analysis, including the absence of whole-genome duplication in their ancestry, relatively inactive retroelements and the presence of a large multicellular haploid gametophyte, the sequence of which does not exhibit heterozygosity.
Figure 2. The impact of whole genome sequencing on breeding. (a) Initial genetic maps consisted of few and sparse markers, many of which were anonymous markers (simple sequence repeats (SSR)) or markers based on restriction fragment length polymorphisms (RFLP). For example, if a phenotype of interest was affected by genetic variation within the SSR1-SSR2 interval, the complete region would be selected with little information about its gene content or allelic variation. (b) Whole genome sequencing of a closely related species enabled projection of gene content onto the target genetic map. This allowed breeders to postulate the presence of specific genes on the basis of conserved gene order across species (synteny), although this varies between species and regions. (c) Complete genome sequence in the target species provides breeders with an unprecedented wealth of information that allows them to access and identify variation that is useful for crop improvement. In addition to providing immediate access to gene content, putative gene function and precise genomic positions, the whole genome sequence facilitates the identification of both natural and induced (by TILLING) variation in germplasm collections and copy number variation between varieties. Promoter sequences allow epigenetic states to be surveyed, and expression levels can be monitored in different tissues or environments and in specific genetic backgrounds using RNAseq or microarrays. Integration of these layers of information can create gene networks, from which epistasis and target pathways can be identified. Furthermore, re-sequencing of varieties identifies a high density of SNP markers across genomic intervals, which enable genome-wide association studies (GWAS), genomic selection (GS) and more defined marker-assisted selection (MAS) strategies.
Sugarcane, another important crop plant, is a hybrid between Saccharum officinarum and Saccharum spontaneum. These species are closely related to sorghum [44] and have haploid contents of 8 and 10 base chromosomes, respectively. Both S. officinarum and S. spontaneum have a monoploid genome size close to that of sorghum (760 Mb), but they are highly autopolyploid (2n = 80 and 2n = 40-128, respectively), resulting in a genome size of >15 Gb for hybrid sugarcane. Commercial cultivars are derived by backcrossing hybrids to S. officinarum, resulting in lines that have different chromosome contributions from each parental species [45]. The highly variable and heterozygous composition of commercial sugarcane genomes is a major challenge to genome sequencing. The sequencing of progenitor genomes, using WGS strategies and sorghum genes as templates, could create high-stringency orthologous gene assemblies. As in the analysis of the wheat draft genome, this strategy would generate information on ortholog copy number and identify sequence polymorphisms that could be used to genetically map desirable traits in the two progenitor species. Upon the development of commercial hybrids from sequenced progenitors, re-sequencing could identify desired genotypes and gene copy numbers.
A similar approach could be used for the biomass crop Miscanthus x giganteus, a sterile triploid derived from Miscanthus sinensis and tetraploid Miscanthus sacchariflorus. A recent genetic analysis has shown that M. sinensis has recently undergone whole-genome duplication [46] and a single dysploid chromosome fusion [47], neither of which occurred in the closely related sorghum genome [48]. The WGS strategy developed for wheat could also be applied to M. sinensis and its hybrids to determine gene copy numbers and to identify genetic variation in homoeologous gene copies.

Accessing and measuring sequence variation and the epigenome
It is reasonable to predict that within the next two years useful genome sequences will be available to support the genetic improvement of most of the important food and fuel crops. Crop improvement will depend, however, on the identification of useful genetic variation and its utilization by breeding and transformation. Such variation can be identified at a genome scale by comparison of multiple sequence reads to a single 'reference'. For example, in rice, low-coverage sequence of 1,083 Oryza sativa and 466 Oryza rufipogon (the progenitor species of cultivated rice) accessions [49] provided deep insights into the domestication of rice and the geographical distribution of variation, while providing material for quantitative trait loci (QTL) and genome-wide association studies (GWAS) [50]. The gene spaces of maize and wheat varieties are being re-sequenced using sequence capture methods that are based on the solution hybridization of sheared genomic DNA with biotinylated long overlapping oligos designed from gene sequences [51,52]. The captured DNA is highly enriched in genic sequences, and its deep sequencing can distinguish closely related genes, including wheat homoeologs [53]. These approaches will facilitate the high-throughput sequencing of the gene space of multiple lines of crops, even those with very large polyploid genomes. These methods offer the ability to sequence rapidly the genomes or gene space of multiple accessions, wild relatives and even new species, which will undoubtedly accelerate the incorporation of unexplored and underutilized genetic variation into crops worldwide [54].
DNA sequence variation remains a primary focus, but extensive evidence from several crop species [55,56] suggests that epigenetic changes are responsible for a range of stably heritable traits, and that epigenetic variation can be both induced and selected for during domestication [57]. The methylation status of captured DNA can be measured using bisulfite treatment followed by deep sequencing, a method called reduced representation bisulfite sequencing (RRBS) [58]. These important technological advances in sequence template preparation will permit the exceptionally detailed and cost-effective definition of variation in the sequences and epigenomes of multiple lines or species of crops, independently of their genome size and polyploid status [59].
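As a rough illustration of how bisulfite sequencing reports methylation, the Python sketch below relies on the standard chemistry: bisulfite converts unmethylated cytosine so that it is read as T, whereas methylated cytosine is protected and is still read as C. The function name, the input format and the toy reads are hypothetical, and the sketch assumes gap-free forward-strand alignments and complete conversion.

```python
def methylation_levels(reference, reads):
    """Estimate per-cytosine methylation from bisulfite-converted reads.

    reference: reference sequence (forward strand)
    reads:     list of (start, sequence) tuples, assumed to be aligned to
               the forward strand without gaps
    Returns a dict: reference position -> fraction of informative reads
    that still show C at that cytosine.
    """
    counts = {}  # position -> [reads showing C, reads showing T]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos >= len(reference) or reference[pos] != "C":
                continue
            tally = counts.setdefault(pos, [0, 0])
            if base == "C":
                tally[0] += 1  # protected from conversion: methylated
            elif base == "T":
                tally[1] += 1  # converted: unmethylated
    return {pos: c / (c + t) for pos, (c, t) in counts.items() if c + t > 0}


# Toy reference with cytosines at positions 1, 4 and 5.
reference = "ACGTCCGA"
reads = [
    (0, "ACGTTCGA"),
    (0, "ATGTTCGA"),
    (1, "CGTCCG"),
    (4, "TCGA"),
]
levels = methylation_levels(reference, reads)
```

Production pipelines additionally handle reverse-strand reads (where the conversion appears as G to A), align against converted references, and correct for incomplete conversion; none of that is attempted here.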

Applying next-generation genomics to crop improvement
Accessing genome-wide sequence variation by re-sequencing significantly improves the availability of information that can be used to develop markers, thereby enhancing the genetic mapping of agronomic traits. For example, in wheat, fewer than 500 SNP markers were available in 2008 [60], with that number increasing to 1,536 in 2010 [61], 10,000 in 2011 and over 90,000 in 2012 [38]. This relatively high-density SNP information is proving extremely useful across different systems, including QTL mapping in bi-parental crosses and recombinant inbred lines, GWAS, and mapping QTL in advanced inter-cross lines such as those in multi-parent advanced generation inter-cross (MAGIC) [62] and nested association mapping (NAM) [63] populations. These approaches generally identify loci and causal genes for traits with relatively large phenotypic effects. The genomic segments that contain desired allelic variation can then be bred and combined in a single genetic background using markers to track the segments through marker-assisted selection (MAS).
Many important agricultural traits such as yield, however, result from relatively small effects across multiple loci. This implies that these loci might not be optimally identified through QTL or GWAS approaches and that their pyramiding through MAS will be inefficient. Therefore, breeders have begun to address these problems by developing a knowledge base of associations of polymorphic markers with phenotypes in breeding populations [64,65]. These associations are used to develop a breeding model in which the frequency of desired marker alleles is optimized, thereby maximizing the estimated breeding value [66,67]. Multiple cycles of selection are used to accumulate favorable alleles that are associated with desired phenotypes, although no causal relationship between a specific gene and a phenotype is established. This approach, termed genomic selection (GS), is incorporated into industrial-scale breeding processes that require very cheap high-throughput marker assays [68]. Next-generation sequencing of parental lines is influencing GS in several ways: by continuing to identify polymorphisms throughout the genome in both genic and inter-genic regions; by providing estimates of gene expression levels; and by providing information on the epigenetic states of genes (Figure 2). The first removes any limitations on marker density, while the latter two are 'genomic features' that will surely have predictive power for complex traits. Speculatively, the Encyclopedia of DNA Elements (ENCODE) concept [69] of total genomic knowledge could eventually be incorporated into models for predicting performance from genomic information revealed by next-generation sequencing.
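The idea of scoring candidates by an estimated breeding value can be illustrated with a minimal additive model in Python. This is a sketch under stated assumptions rather than the model used by any breeding program: the marker names, effect sizes and genotypes are invented, and the marker effects are taken as given here, whereas in practice they would be estimated from a phenotyped training population.

```python
def gebv(genotype, marker_effects):
    """Genomic estimated breeding value under a purely additive model:
    the sum over markers of allele dosage (0, 1 or 2 copies of the
    counted allele) times that marker's estimated effect."""
    return sum(dosage * marker_effects[marker]
               for marker, dosage in genotype.items())


# Hypothetical marker effects (trait units per copy of the counted allele).
marker_effects = {"snp1": 0.30, "snp2": -0.10, "snp3": 0.05, "snp4": 0.20}

# Hypothetical candidate lines genotyped at the same markers.
candidates = {
    "line_A": {"snp1": 2, "snp2": 0, "snp3": 1, "snp4": 0},
    "line_B": {"snp1": 0, "snp2": 2, "snp3": 2, "snp4": 2},
    "line_C": {"snp1": 1, "snp2": 1, "snp3": 0, "snp4": 2},
}

scores = {name: gebv(g, marker_effects) for name, g in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Lines are ranked and selected on the summed score alone, so no individual marker needs a large or causally understood effect, which is the point made in the paragraph above.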
Breeding uses natural allelic variation to improve crop performance. Sequence variation can be experimentally enhanced using, for example, ethyl methanesulphonate (EMS) to alkylate bases. TILLING (targeted induced local lesions in genomes) [70] is then used to screen for base changes in genes of interest to assess gene function and to create advantageous alleles for breeding. It is now feasible to use genome capture to sequence an entire mutant population, even in complex polyploid genomes such as wheat [52]. Here, polyploidy provides an advantage by buffering the influence of otherwise deleterious mutation loads.
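When a mutant population is sequenced, one simple first filter is the type of base change. EMS is generally reported to induce predominantly G:C to A:T transitions; that property is background knowledge rather than something stated in this review, and the Python sketch below uses it only as an illustrative assumption. The gene names, positions and variant records are hypothetical.

```python
# Base changes expected from EMS when reported on the forward strand
# (assumption: G:C to A:T transitions, seen as G>A or C>T).
EMS_TRANSITIONS = {("G", "A"), ("C", "T")}


def is_ems_type(ref, alt):
    """Return True if the ref>alt change is a G>A or C>T transition."""
    return (ref.upper(), alt.upper()) in EMS_TRANSITIONS


# Hypothetical variant calls: (gene, position, reference base, mutant base).
variants = [
    ("geneX", 120, "G", "A"),
    ("geneX", 348, "A", "T"),
    ("geneY", 57, "c", "t"),
    ("geneZ", 901, "T", "C"),
]

ems_candidates = [v for v in variants if is_ems_type(v[2], v[3])]
```

Variants that fail this filter are not necessarily false, but in an EMS population they are more likely to be pre-existing polymorphisms or sequencing errors and may deserve closer scrutiny.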
Genetic manipulation using the Agrobacterium tumefaciens-mediated transfer of genes from any other organism is a mature technology that has been adapted for use in many of the crop species listed in Table 1. The precise modification of gene sequences using zinc-finger nucleases (ZFN) that can be engineered to recognize specific DNA sequences has been applied to a target locus in maize [71]. More recently, a new type of precision tool for genome engineering has been developed from the prokaryotic clustered regularly interspaced short palindromic repeats (CRISPR) Cas9 immune system [72,73]. The Cas9 nuclease is guided to specific target sequences for cleavage by an RNA molecule. Several types of genome editing are possible, such as the simultaneous editing of multiple sites, inducing deletions, and inserting new sequences by nick-mediated repair mechanisms.
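To show what RNA-guided targeting means in practice, the following Python sketch lists candidate Cas9 target sites in a DNA sequence. It assumes the commonly used Streptococcus pyogenes Cas9, which requires a 20-nucleotide target immediately followed by an NGG motif (the PAM); neither the enzyme variant nor the PAM rule is specified in this review, and the function names and example sequence are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_cas9_sites(seq, spacer_len=20):
    """List (strand, protospacer, pam) for every NGG motif that is
    preceded by spacer_len bases, scanning both strands."""
    seq = seq.upper()
    sites = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(spacer_len, len(s) - 2):
            pam = s[i:i + 3]
            if pam[1:] == "GG":
                sites.append((strand, s[i - spacer_len:i], pam))
    return sites


# Hypothetical 25-base target region.
target = "ATGCATGCATGCATGCATGCAGGTT"
sites = find_cas9_sites(target)
```

The sketch only enumerates sites; choosing a guide in a crop genome would also require checking each candidate against the rest of the genome, which matters especially in polyploids where homoeologous copies may share the same target sequence.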

Genomic features for future breeding
Genomics has radically altered the scope of genetics by providing a landscape of ordered genes and their epigenetic states, access to an enormous range of genetic variation, and the potential to measure gene expression directly with high precision and accuracy (Figure 2). This not only has important practical advantages for breeding but also facilitates systematic comparison of gene functions across sequenced genomes, bringing the wealth of knowledge of gene function and networks obtained in experimental species directly into the ambit of crop improvement. Given a suitable cyber-infrastructure, the integration of biological knowledge and models of networks across species, in a two-way flow from crops to experimental species and back again, will begin to generate new layers of knowledge that can be used for crop improvement. One layer is provided by ENCODE-level analyses [69]; although yet to start in plants, these analyses can guide the interpretation of gene function and variation, thus providing new information to inform the prediction of phenotype from genotype. Another information layer is provided by the systems-level integration of gene function into networks, such as those controlling flowering time in response to day-length and over-wintering (Figure 2). These networks have been identified in Arabidopsis and rice, with allelic variation in key 'hubs' strongly influencing network outputs. Evolutionary processes, such as gene duplication, and the possible footprints of domestication can be mapped to networks such as those controlling flowering time [74,75]. Such 'systems breeding' approaches can use diverse genomic information to increase the precision with which phenotype can be predicted from genotype, thereby accelerating crop improvement and helping to address food security.