Current challenges in de novo plant genome sequencing and assembly
© BioMed Central Ltd 2012
Published: 27 April 2012
Genome sequencing is now affordable, but assembling plant genomes de novo remains challenging. We assess the state of the art of assembly and review the best practices for the community.
The plant kingdom is filled with amazing diversity and significance. Plants form the base of the food chain that provides food for all living organisms, and just 15 crop plants provide 90% of the world's food intake. Plant species are responsible for maintaining the balance of the carbon cycle, for developing soil and protecting it from erosion, and are promising sources of renewable energy. Plant byproducts are used in many human medicines, and plants have been essential model organisms for studying biological systems such as the role of transposons and epigenetics. For all these reasons and many more, there is great interest in sequencing plant genomes, but relatively few plant species have been sequenced compared with the hundreds of thousands of species around the world.
The first free-living organisms were sequenced less than 20 years ago, starting with simple microbial genomes, and increasing in complexity to the first eukaryotic genomes, the first multicellular species, and then on to plant genomes, including Arabidopsis thaliana (thale cress), Oryza sativa (rice), Carica papaya (papaya) and Zea mays (maize) in 2009, using first-generation capillary sequencing. Since then many others have been sequenced leveraging second-generation sequencing, including Fragaria vesca (strawberry), Solanum lycopersicum (tomato) and Cajanus cajan (pigeonpea), and dozens more are nearing completion. This increase in sequenced plant genomes has largely been driven by technological improvements: whereas the first generation of automated DNA sequencing instruments could sequence thousands of base pairs per day, current state-of-the-art second-generation sequencing instruments can sequence many billions of bases per day for hundreds or thousands of dollars per gigabase instead of millions or billions of dollars per gigabase. These technologies have been applied to study thousands of genomes across the tree of life, enabling rich annotation of their gene networks, the development of comparative genomics approaches to infer evolutionary and domestication forces, the cataloging of genomic markers to optimize plant breeding, and numerous other studies that use the genome sequence as the backbone of the analysis.
In contrast to the tremendous advances in throughput, assembling sequencing reads remains a substantial endeavor, much greater than the sequencing efforts alone would suggest [22–24]. Large complex plant genomes remain a particularly difficult challenge for de novo assembly for a variety of biological, computational and biomolecular reasons. Plant genomes can be nearly 100 times larger than the currently sequenced bird, fish or mammalian genomes. In addition, they can have much higher ploidy, with polyploidy estimated to occur in up to 80% of all plant species, and higher rates of heterozygosity and repeats than their counterparts in other kingdoms. Furthermore, the gene content in plants can be very complex, as shown by the presence of large gene families and abundant pseudogenes with nearly identical sequences derived from recent whole genome duplication events and transposon activity. Plants tend to have high-copy chloroplast and mitochondrial organelles, which complicate assembly of their remnants in the nuclear genome and skew coverage levels. Finally, it is often very difficult to extract large quantities of high-quality DNA from plant material, which hampers the preparation of proper libraries for sequencing.
For all of these reasons, sequencing and de novo assembling a plant genome can create a highly fragmented result. Instead of the large contigs and scaffolds spanning large chromosome regions seen in recent vertebrate genome assemblies, the sequencing reads are more likely to assemble into isolated gene islands among a background of high copy repeats. Furthermore, the gene sequences may not always be correct, considering that nearly identical gene families are notoriously difficult to assemble and may collapse into a mosaic sequence without necessarily representing any member of the family. If the level of fragmentation and mis-assembly is too great, downstream analysis will be noisy, and could even lead to false conclusions about the biology.
Knowing how to assemble these genomes accurately, how to best make use of the potentially highly fragmented assemblies and how to perform these applications at the lowest cost are important in today's funding environment. Genome assembly has always been an incremental process, and there are only a handful of truly finished large genomes today - even the latest release of the 'finished' human reference genome has millions of unresolved nucleotides. Therefore, we need to assess when an assembly is good enough to be useful to the community, and how the agencies can get the most out of the available funding. Finally, technology is evolving so quickly that it is challenging to know what the guidelines for plant assembly will be in 12 months or beyond, which raises the question of how researchers can stay afloat in this rapidly changing landscape. Here we assess the state of the art of de novo assembly, consider what can be expected to develop, and review the best practices for the plant community.
Assembling any genome requires the proper combination of coverage, read length and read quality. If any of these requirements is not met, then it is a mathematical certainty that the assembly will be fragmented into many small contigs. The Lander-Waterman model offers an analytic, if optimistic, prediction of the minimum coverage needed to assemble large contigs. Using this model, a minimum of 15-fold coverage is required to assemble 100 bp reads into large contigs. However, once coverage has been adjusted for errors, ploidy, sequence biases and other complicating factors, the minimum required coverage level may be much higher and sequencing to at least 100-fold coverage is recommended.
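The Lander-Waterman prediction can be made concrete with a short calculation. The sketch below is a minimal illustration, assuming the standard closed-form expressions of the model: an expected contig count of N·exp(−cσ) and an expected contig length of L·[(exp(cσ) − 1)/c + (1 − σ)], where c is the fold coverage, L the read length, N the number of reads and σ = 1 − T/L for a minimum detectable overlap T. The function name, the 3 Gbp genome size and the 20 bp value of T are illustrative assumptions, not values taken from the text.

```python
import math


def lander_waterman(genome_size, read_length, coverage, min_overlap):
    """Idealized Lander-Waterman estimates (no repeats, no errors, uniform sampling).

    min_overlap is the shortest overlap (bp) the assembler is assumed to detect;
    it is a hypothetical parameter chosen for illustration.
    """
    n_reads = coverage * genome_size / read_length
    # sigma: fraction of a read that lies outside the required overlap
    sigma = 1.0 - min_overlap / read_length
    growth = math.exp(coverage * sigma)
    return {
        # probability that a given base is covered by no read at all
        "uncovered_fraction": math.exp(-coverage),
        "expected_contigs": n_reads / growth,
        "expected_contig_length": read_length
        * ((growth - 1.0) / coverage + (1.0 - sigma)),
    }


if __name__ == "__main__":
    # Hypothetical 3 Gbp genome, 100 bp reads, 20 bp minimum overlap.
    for c in (5, 15, 30):
        est = lander_waterman(3_000_000_000, 100, c, 20)
        print(
            f"{c:>2}x: ~{est['expected_contigs']:.3g} contigs, "
            f"mean length ~{est['expected_contig_length']:.3g} bp"
        )
```

Under these idealized assumptions the expected contig length grows roughly exponentially with coverage, which is why the model is optimistic: it says nothing about repeats, sequencing errors or coverage bias.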
This statistical model also does not consider repeat composition, and short reads alone may never have the information content to resolve complex repetitive sequences. Resolving large or complex repeats fundamentally requires longer spanning information to bridge across the repeats back to unique sequence in the form of longer reads, mate-pairs, long-range mapping information or a method for fragment localization. Read quality is also not directly considered in the Lander-Waterman model, but low-quality reads will reduce effective coverage and obscure true overlaps between sequencing reads, thus fragmenting the assembly and risking collapsing more repeats.
Overcoming these challenges depends on advances in both sequencing technology and assembly technology. Sequencing technology needs: (1) instrumentation improvements, including improvements in throughput, cost, read lengths and accuracy; and (2) molecular protocols, including developing new types of libraries and also new techniques for multiplexing samples to take advantage of the tremendous throughput available per instrument run. Assembly technology needs: (1) improved algorithms for accurately assembling complex genomes at scale; and (2) improved analytics to record, manipulate, analyze and visualize features to translate the salient assembly information to the broader plant biology community.
The highest capacity sequencing instruments available today, such as the Illumina HiSeq 2000, can sequence nearly 100 Gbp per day, and make it possible to sequence a 3 Gbp genome to high coverage for less than US$10,000. Using these technologies, it is also possible to sequence paired-end or mate-pair libraries ranging in size up to a few thousand base pairs. As such, even large plant genome projects can count on relatively inexpensive, deep coverage with approximately 100 bp reads and 1 to 5 kbp mate-pair libraries. However, these short reads and small libraries have substantial limitations for large genomes with large repetitive content. Constructing high-quality draft genome assemblies for the largest plant genomes absolutely requires enhanced sequencing approaches to generate longer reads and mate-pair libraries, and protocols for localizing the sequencing and assembly problem.
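As a rough illustration of what this throughput means in practice, the following sketch estimates the instrument time needed to reach a target coverage. It assumes the "nearly 100 Gbp per day" figure as an upper bound on throughput and a 100-fold coverage target; the function name and the chosen genome sizes (3 Gbp, and the approximately 21 Gbp loblolly pine genome) are used purely as examples.

```python
GBP = 1_000_000_000

# Upper-bound throughput from the text ("nearly 100 Gbp per day").
THROUGHPUT_BP_PER_DAY = 100 * GBP


def sequencing_days(genome_size_bp, coverage, throughput_bp_per_day):
    """Instrument days needed to generate coverage x genome_size bases."""
    total_bases = genome_size_bp * coverage
    return total_bases / throughput_bp_per_day


if __name__ == "__main__":
    # Genome sizes are illustrative; 100x is an assumed coverage target.
    for name, size in [("3 Gbp genome", 3 * GBP), ("~21 Gbp genome", 21 * GBP)]:
        days = sequencing_days(size, 100, THROUGHPUT_BP_PER_DAY)
        print(f"{name}: about {days:.0f} instrument days at 100x")
```

The estimate covers raw base output only; library preparation, run setup and multiple library types are not modeled.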
One of the strongest needs is for protocols for efficiently generating a mix of larger libraries, such as 10 kbp, 40 kbp or 150 kbp in addition to standard 5 kbp libraries. Currently available protocols for these larger sizes, such as with fosmids, or bacterial artificial chromosome (BAC)-end sequencing, are effective but are laborious, costly and time consuming relative to the sequencing itself. Furthermore, the larger libraries inevitably have increased size variance and less reliable mate information. The sequencing itself needs to be improved to reduce the biases from GC composition, chimeric reads and mates, and other effects so that the coverage along the genome will be uniform and complete.
One promising approach for substantially longer reads and unbiased coverage is the rise of third-generation sequencing technologies such as that from Pacific Biosciences and the newly announced instruments from Oxford Nanopore. These platforms promise to generate longer reads that can be used to sequence through complex repeats, link gene islands and phase haplotypes. However, these technologies are still too immature for immediate widespread application to all large genomes of interest. Sequencers from Roche/454 make it possible to sequence approximately 700 bp reads, but at greater cost than short read sequencing, and this length may not be sufficient to span the largest repeats.
Optical mapping technologies are another possibility for generating very long range linking information between sequence contigs and have a successful history in plant genomics [43, 44], although the current worldwide capacity is also below the demand. New technologies such as nanocoding, and new instruments from commercial vendors, including OpGen and BioNanoGenomics, are expected in the next couple of years and they could expand the capacity for optical mapping similar to that seen in sequencing.
A complementary approach to improved sequencing and mapping is to develop methods for localizing sequencing and thus simplifying the assembly problem. There is a successful history of BAC-by-BAC sequencing of plant genomes [10, 11], and this is effective in the sense that assembling an isolated BAC is far simpler than assembling the entire genome. However, this technology is now prohibitively expensive without significant enhancement. For example, sequencing large genomes such as maize using a BAC-by-BAC approach costs tens of millions of dollars and requires hundreds of thousands of BAC clones. While next-generation sequencing would certainly reduce this cost, it is not readily possible to use next-generation sequencing efficiently on the number of BAC clones needed. This, coupled with the high cost of making and storing the large numbers of libraries needed, greatly limits the feasibility of BAC-by-BAC sequencing in the next-generation world.
Versions of BAC-by-BAC sequencing that use pools of BACs or pools of fosmids are an attractive option for localizing the problem, assuming such libraries can be efficiently made and barcoding protocols can be effectively applied to tag the molecules. However, to utilize the capacity of current sequencers fully, so many BACs need to be pooled in a lane that it would not effectively localize the assembly problem unless the BACs can be multiplexed and barcoded to a very high degree. Furthermore, preparing and storing these libraries will still require a substantial cost unless they can be made in a fully automated fashion. Alternative molecular isolation technologies that can be used for localizing individual chromosomes in the sample, such as flow sorting, are promising alternatives and are starting to become more widely available [49, 50].
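The pooling trade-off can be illustrated with simple arithmetic. In the sketch below, the lane yield (30 Gbp), the BAC insert size (150 kbp) and the per-BAC coverage target (100-fold) are all hypothetical values chosen only to show the scale of the problem, and the function name is likewise an assumption.

```python
def bacs_per_lane(lane_yield_bp, bac_size_bp, coverage):
    """BACs that must share a lane to use its full yield at a given per-BAC coverage."""
    return lane_yield_bp // (bac_size_bp * coverage)


# All three values are hypothetical, for illustration only.
LANE_YIELD_BP = 30_000_000_000  # assumed yield of one sequencing lane
BAC_SIZE_BP = 150_000           # assumed BAC insert size
COVERAGE = 100                  # assumed fold coverage per BAC

if __name__ == "__main__":
    n_bacs = bacs_per_lane(LANE_YIELD_BP, BAC_SIZE_BP, COVERAGE)
    pooled_bp = n_bacs * BAC_SIZE_BP
    print(f"BACs pooled per lane: {n_bacs}")
    print(f"Total insert sequence in the pool: {pooled_bp / 1e6:.0f} Mbp")
```

With these assumed values, thousands of BACs share a single lane and the pool contains hundreds of megabases of insert sequence, so the assembly is only localized if each BAC, or each small sub-pool, carries its own barcode.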
Several short-read assembly packages have been proven for mammalian-sized genomes up to the 3 Gbp human genome, including ABySS, ALLPATHS-LG, the Celera Assembler [52, 53], Newbler, SGA and SOAPdenovo. These assemblers can produce high-quality assemblies from short reads, although they generally require servers or clusters with 512 gigabytes of RAM and many terabytes of disk space available for a gigabase-sized genome. However, these servers are decreasing in cost and can be purchased for under US$35,000 from several major computer vendors, and supercomputing centers make them available at no cost. This is promising, but assembling the largest plant genomes currently being sequenced, such as the loblolly pine genome of approximately 21 Gbp, will increase the computational demands by nearly an order of magnitude, for which there is no proven technology. Enhanced algorithms for compression and distributing the computation are actively being researched.
Two major efforts to evaluate the state-of-the-art in assembly technology were published last year: the Assemblathon and the Genome Assembly Gold-Standard Evaluation (GAGE). Both projects evaluated the performance of various genome assemblers in a competitive framework with both simulated and real datasets. They showed that the quality of the results differed greatly depending on the assembler and pipelines used. Researchers planning to assemble a genome of any size are encouraged to study their results, such as the need for error correction, the recommended assemblers and the evaluation criteria. However, the genomes studied in these projects were relatively small and simple compared with the most complex plant genomes. The plant community would be well served by hosting regular competitions with plant genomes, especially since all of the major assemblers have been developed targeting vertebrate genomes, and no assembler has been proven with higher levels of ploidy or heterozygosity.
Related to the de novo assembly problem, research is greatly needed to help improve the representation of assembled genomes, including creating graph-centric and population-aware formats that can represent the complexities of plant genomes, particularly those that are only partially assembled [60–62]. Incremental algorithms that can update the assembly and annotation as new data become available would also be extremely useful. Finally, continued research into assembly validation is necessary for determining when an assembly is correct and conclusions can be trusted [32, 63].
Sequencing and assembling a genome are often just the first stages of a larger study. Immediately following the assembly, the genome will need to be annotated to catalog genes and other features of interest, or aligned to other genomes to enable comparative genomics studies. Several sequencing-based assays, such as RNA-seq and Methyl-seq, can be used with the assembly to study transcriptionally or epigenetically active regions of the genome, and population studies will often attempt to build higher-order relationships, such as gene networks, or relate genotype to phenotype.
Currently, pipelines are available for carrying out these operations and displaying results in a 'genome browser', but continued research is needed to make the pipelines and results more accessible to different types of users. Systems such as Galaxy, Gramene and Drupal are among the leading graphical systems for executing workflows, visualizing sequencing assay results, and enabling collaborative discussions, respectively, but they operate as separate systems. A fully integrated system, such as those proposed by the iPlant and DOE Systems Biology Knowledgebase initiatives, would lower the barrier for learning to operate these functions. In either case it is critical that the community enhance these systems and the underlying algorithms to better support the complexity of plant genomes and their evolving assemblies.
The plant kingdom has incredible variation and diversity, and as a result each plant sequencing project seems to have its own unique analysis needs. Sequencing and assembly technologies are evolving so rapidly it is impossible to predict what will be available even one year in the future. Despite these complexities, certain trends are emerging as best practices.
For economic and technological reasons, the majority of sequence produced in the next 18 months will continue to originate from short reads of approximately 100 to 200 bp. Fortunately, sequences of this length can be assembled into high-quality draft assemblies for genomes as complex as human when sequenced in a mixture of libraries. In particular, Gnerre et al. recommend 45× paired-end (2 × 100 bp at 180 bp), 45× short jump (2 × 100 bp at 3 kbp), 5× long jump (2 × 100 bp at 6 kbp) and 1× fosmid (2 × 26 bp at 40 kbp) to generate high-quality draft assemblies. Since the paired-end reads designed in this way overlap by approximately 20 bp, they can be preassembled into pseudo-long reads of approximately twice the original length using the built-in capabilities of ALLPATHS-LG or by a standalone preassembler such as FLASH. Assemblers that do not include built-in error correction greatly benefit from then applying software such as Quake to identify and fix sequencing errors before assembly. The larger libraries are then needed for ordering the initial contigs into progressively larger scaffolds.
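The library recipe above translates directly into read counts and overlap lengths. The following sketch works through that arithmetic for a hypothetical 3 Gbp genome; the genome size and the function names are assumptions made for illustration, while the coverage, read length and fragment size values are those quoted from Gnerre et al.

```python
GENOME_SIZE_BP = 3_000_000_000  # hypothetical genome size for illustration

# (library, fold coverage, read length in bp, fragment size in bp)
LIBRARIES = [
    ("paired-end", 45, 100, 180),
    ("short jump", 45, 100, 3_000),
    ("long jump", 5, 100, 6_000),
    ("fosmid", 1, 26, 40_000),
]


def read_pairs_needed(genome_size, coverage, read_length):
    """Read pairs needed so that sequenced bases total coverage x genome size."""
    return round(coverage * genome_size / (2 * read_length))


def paired_end_overlap(read_length, fragment_size):
    """Bases shared by the two reads of a pair (0 if they do not overlap)."""
    return max(0, 2 * read_length - fragment_size)


def merged_length(read_length, fragment_size):
    """Length of the pseudo-long read after merging, or None if no overlap."""
    overlap = paired_end_overlap(read_length, fragment_size)
    return 2 * read_length - overlap if overlap > 0 else None


if __name__ == "__main__":
    for name, cov, read_len, frag in LIBRARIES:
        pairs = read_pairs_needed(GENOME_SIZE_BP, cov, read_len)
        print(
            f"{name:<11} {cov:>2}x: {pairs:,} pairs, "
            f"overlap {paired_end_overlap(read_len, frag)} bp, "
            f"merged length {merged_length(read_len, frag)}"
        )
```

Only the 180 bp paired-end library yields overlapping pairs, which is why it alone can be preassembled into pseudo-long reads, whereas the jumping and fosmid libraries contribute linking information for scaffolding.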
For the largest and most complex plant genomes, even these libraries may not be sufficient to span the largest or most complex repeats, and it may be necessary to employ a hybrid approach using a combination of short and long reads, and even long-range mapping technologies or localization methods. Long reads over 800 bp are available today from Roche/454, albeit at higher cost than short read sequencing, and third-generation sequencing technologies promise to provide even longer reads. As sequencing costs and instrument runtimes continue to drop, researchers are also advised to sequence a low coverage 'genome snapshot' to evaluate the genome and library composition before attempting to sequence the genome to high coverage.
Assembling and analyzing raw sequence data still require substantial bioinformatics effort and expertise. Before attempting a complex assembly, plant biologists are strongly encouraged to develop partnerships with bioinformatics laboratories that have sufficient skills and resources to handle the onslaught of data and diagnose problems as they occur. Fortunately, the funding agencies are aware of these challenges, and we hope they will be responsive to requests for appropriate bioinformatics funding.
Principal investigators need to become better informed about the current best practices for genome assembly and develop a better understanding of the effort involved to sequence, assemble, annotate and analyze a new genome. More classes and training are needed for graduate and undergraduate students to learn the fundamentals of sequence analysis and quantitative techniques. Better training is needed to teach non-experts to use the software packages, and to educate everyone about the resources that are available. The plant sequencing community would benefit from forming and hosting plant genome analysis competitions in the spirit of the Assemblathon or GAGE to evaluate the state-of-the-art for assembly, annotation and other assays. The best practices of today are certain to change as new sequencing, mapping and computational technologies are introduced, and this will be the only way to monitor these developments.
We are still many years away from push-button sequencing and assembly of complex plant genomes into completely finished genomes at low cost. Nevertheless, it is now possible and affordable to sequence and assemble great numbers of interesting plant genomes into highly useful draft genome assemblies if one is mindful of the biotechnology and algorithmic challenges involved. The next frontier for plant genomics is to characterize the diversity of genomic variations across large populations, deeply annotate their functional elements, and develop predictive quantitative models relating genotype to phenotype. Improved sequencing technology and sequencing assays are certain to play a large role in these studies as well, and we envision a tight relationship between biology, biotechnology and analytics for years to come.
Abbreviations: BAC, bacterial artificial chromosome; GAGE, Genome Assembly Gold-Standard Evaluation.
We thank all of the participants of the meeting on the future of plant genome sequencing and analysis held at the Banbury Conference Center at Cold Spring Harbor in the summer of 2010. This work was funded, in part, by NSF award IOS-1135736, the US Department of Energy, Office of Biological and Environmental Research under Contract DE-AC02-06CH11357, and NIH RO1 HG006677-12.