Timber! Felling the loblolly pine genome
© Hamilton and Buell; licensee BioMed Central Ltd. 2014
Published: 31 March 2014
Conventional short read sequences derived from haploid DNA were extended into long super-reads enabling assembly of the massive 22 Gbp loblolly pine, Pinus taeda, genome.
See related research http://genomebiology.com/2014/15/3/R59
The first plant genome sequenced was that of Arabidopsis thaliana, a homozygous inbred diploid, which had been selected specifically for its relatively small size (119 Mb assembly), limited repetitive sequence content, and homozygosity. However, not all species in the plant kingdom have genomes as simple as that of Arabidopsis. Indeed, among the vascular land plants, the mean C value for angiosperms is 5.79 pg (5,662 Mbp), while that for gymnosperms is 18.08 pg (17,682 Mbp). These large genome sizes are attributable to amplification of repetitive sequences, whole or segmental genome duplication, and/or polyploidy - all of which pose technical challenges for genome assembly with current short-read approaches. In addition, many species are self-incompatible and/or outcrossing in nature, which can result in a high degree of heterozygosity that further confounds genome assembly. An article in this issue of Genome Biology overcomes many of these technical challenges to achieve a near-complete assembly of the 22 Gbp loblolly pine genome.
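The C values quoted above pair a DNA mass in picograms with a length in base pairs. The conversion factor is not stated in the text; the short sketch below assumes the commonly used approximation of 1 pg ≈ 978 Mbp, and the function name is purely illustrative.

```python
# Assumption: 1 pg of double-stranded DNA corresponds to roughly 978 Mbp.
# This factor is not given in the article; it is the widely used approximation.
MBP_PER_PG = 978.0


def pg_to_mbp(c_value_pg: float) -> float:
    """Convert a C value in picograms to an approximate genome size in Mbp."""
    return c_value_pg * MBP_PER_PG


if __name__ == "__main__":
    for group, c_value in [("angiosperms", 5.79), ("gymnosperms", 18.08)]:
        print(f"{group}: {c_value} pg is about {pg_to_mbp(c_value):,.0f} Mbp")
```

Under that assumption, the two mean C values land within 1 Mbp of the figures given in the text; small differences can arise from rounding of the underlying averages.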
Reducing genome complexity by exploiting biology
The article in this issue of Genome Biology by Neale et al., together with two companion papers [5, 6], is an excellent example of how a clever yet simple computational approach, coupled with the availability of sufficient haploid genomic DNA, has enabled the assembly of the 22 Gbp heterozygous loblolly pine (Pinus taeda) genome. In conifer seeds such as those of the loblolly pine, the haploid, maternally derived megagametophyte surrounds the developing embryo, providing it with nutrients, and is analogous to the endosperm in maize kernels. Because the megagametophyte is relatively large, sufficient DNA for 11 paired-end libraries could be isolated from a single haploid loblolly pine megagametophyte, bypassing the heterozygosity present in diploid tissues such as needles. Using a novel computational approach described below, the authors complemented these haploid-derived libraries with 48 mate-paired and 9 fosmid ditag libraries constructed from ample DNA isolated from diploid needles, and constructed an assembly that spans 23.2 Gbp (20.1 Gbp non-gapped sequence), with an impressive N50 scaffold size of 66.9 Kbp. A total of 50,172 genes were annotated in the assembly, and the quality of the assembly was evident in the detection of 201 of the 248 core CEGMA genes, of which 91% were full-length. Interesting features of the loblolly pine genome included a high percentage of transposable elements (79% of the assembly), extremely large introns (the largest was 318 Kbp), and 1,551 genes unique to conifers, of which 154 were absent in two related conifer species, Picea abies and Picea sitchensis, and thus restricted to loblolly pine.
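For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is usually computed: sort the scaffold lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The scaffold lengths are made-up toy values, not data from the loblolly pine assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first scaffold that brings the running sum to >= 50%.
        if running * 2 >= total:
            return length
    return ordered[-1]  # not reached for positive lengths


if __name__ == "__main__":
    toy_scaffolds = [80, 70, 50, 40, 30, 20]  # hypothetical lengths in Kbp
    print(n50(toy_scaffolds))
```

In the toy set the total is 290, so the two longest scaffolds (80 + 70 = 150) already cover half of it and the N50 is 70.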
Innovations in genome assembly parallel innovations in sequencing technologies
The first plant genomes were sequenced using Sanger sequencing and initially utilized bacterial artificial chromosome (BAC)-by-BAC approaches, which later gave way to whole genome shotgun (WGS) sequencing. Assembly of the BACs and genomes was achieved using the overlap layout consensus (OLC) approach, in which all-versus-all pairwise read overlaps are identified, the read layout is calculated, and a contig consensus sequence is generated. The OLC assembly approach works well for high-quality long reads generated through Sanger sequencing. When ultra-high-throughput next-generation sequencing-by-ligation (SBL) and sequencing-by-synthesis (SBS) technologies emerged, a burst of innovation in genome assembly software occurred. The shorter read lengths, higher error rates and increased sequencing depth were incompatible with OLC assemblers because of very high memory usage and long assembly run times. A number of assemblers were developed based on the de Bruijn graph, an approach initially used for short reads generated by sequencing-by-hybridization and resurrected to assemble SBL and SBS reads efficiently. In the last few years, numerous improvements in short-read assembly have been made, including error correction of reads, trimming and filtering of low-quality reads, merging of overlapping paired-end reads, scaffolding of contigs using large-insert mate-pairs or RNA-sequencing reads, incorporating long reads, and gap filling using paired-end reads.
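To make the contrast with OLC concrete, here is a toy sketch of a de Bruijn graph built from short reads. It is an illustration only: the reads, the value k = 4 and the function names are assumptions, and real assemblers also handle reverse complements, sequencing errors and branching paths. Each k-mer becomes an edge between its (k-1)-base prefix and suffix, so the graph grows with the number of distinct k-mers rather than with the number of reads.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer node -> set of successor nodes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Follow edges from start while each node has exactly one successor."""
    contig, node, visited = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # guard against cycles
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    # Hypothetical overlapping reads sampled from the toy sequence ATGCGTACCTGA.
    reads = ["ATGCGT", "GCGTAC", "GTACCT", "ACCTGA"]
    graph = de_bruijn(reads, k=4)
    print(walk_unambiguous(graph, "ATG"))
    # Tenfold deeper coverage adds no new nodes or edges to the graph.
    print(de_bruijn(reads * 10, k=4) == graph)
```

No read is ever compared directly with another read here, which is why this representation copes with deep short-read coverage better than all-versus-all overlap detection.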
Computer scientists to the rescue: super-reads
For large and complex genomes, the computational requirements of de Bruijn graph-based short-read assemblers can exceed what is available, even on large-memory multi-core servers. Parallel assemblers that distribute the memory and computational requirements across a cluster of interconnected nodes are one way to address this issue. Other options include assemblers optimized for low memory usage or assemblers that run in a specialized cloud-based distributed computing environment. However, more innovation in genome assembly and/or reduction in genome complexity will be needed to access large, complex genomes using current short-read sequences. One such strategy is that developed for the loblolly pine genome, which introduces a novel approach for assembling large complex genomes [4, 5]. The core innovation is the reduction of approximately 15 billion error-corrected WGS reads (with an average length of 160 nucleotides) derived from haploid megagametophyte tissue into approximately 150 million longer super-reads (with an average length of 362 nucleotides) using the MaSuRCA assembler. The super-reads represent a 27-fold reduction in raw sequence and a 100-fold reduction in the number of paired-end reads. The long, high-quality super-reads could then be assembled with an OLC assembler and scaffolded using short-read mate-pairs and fosmid ditag libraries derived from diploid genomic DNA, thereby generating a higher quality assembly. The hybrid approach of generating super-reads and then assembling them with an OLC assembler leverages the cost efficiencies of short-read sequencing platforms and leaves the door open to utilizing reads from current and future long-read sequencing technologies.
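The text does not spell out how super-reads are built. As a rough illustration of the general idea, the sketch below extends each read one base at a time in both directions for as long as the k-mer table offers exactly one possible continuation, then collapses reads that grow into the same sequence. This is a simplified toy under stated assumptions and not the MaSuRCA implementation: the reads, k = 4 and all function names are hypothetical, and reverse complements, sequencing errors and paired-end information are ignored.

```python
def kmer_table(reads, k):
    """Collect every k-mer observed in the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def extend_right(seq, kmers, k):
    """Append bases while exactly one k-mer continues the current suffix."""
    seen = set()
    while True:
        suffix = seq[-(k - 1):]
        choices = [b for b in "ACGT" if suffix + b in kmers]
        if len(choices) != 1 or suffix + choices[0] in seen:
            return seq  # dead end, branch, or cycle
        seen.add(suffix + choices[0])
        seq += choices[0]


def extend_left(seq, kmers, k):
    """Prepend bases while exactly one k-mer precedes the current prefix."""
    seen = set()
    while True:
        prefix = seq[:k - 1]
        choices = [b for b in "ACGT" if b + prefix in kmers]
        if len(choices) != 1 or choices[0] + prefix in seen:
            return seq
        seen.add(choices[0] + prefix)
        seq = choices[0] + seq


def super_reads(reads, k):
    """Extend every read in both directions and collapse identical results."""
    kmers = kmer_table(reads, k)
    return {extend_left(extend_right(r, kmers, k), kmers, k) for r in reads}


if __name__ == "__main__":
    reads = ["ATGCGT", "GCGTAC", "GTACCT", "ACCTGA"]  # hypothetical reads
    print(super_reads(reads, k=4))
```

In this toy case four 6-base reads collapse into a single 12-base super-read, mirroring on a tiny scale the reduction in read number that makes a subsequent OLC assembly tractable.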
With regard to biology, genome sequencing technology improvements that began in the late 1990s have continued through the present day, igniting research across a wide range of biological disciplines and creating entirely new fields of research, including genome biology and personal genomic medicine. None of this would have been possible without parallel improvements in algorithms, computer-processing capabilities, and applied computing solutions. Current long-read sequencing technologies such as Pacific Biosciences SMRT Sequencing and Illumina TruSeq synthetic Long-Read Sequencing are comparatively expensive for large genomes and are used mainly for scaffolding genomes and resolving haplotypes. Going forward, the cost of long-read sequencing will decrease, and improvements in genome assembly software will leverage these long reads to generate higher quality and more complete assemblies of complex genomes at a reduced cost. Until then, the approach used by Neale et al. to sequence and assemble the loblolly pine genome provides a method of generating a high-quality draft assembly of a complex 22 Gbp plant genome using current, cost-efficient sequencing platforms.
BAC: Bacterial artificial chromosome
OLC: Overlap layout consensus
WGS: Whole genome shotgun sequencing
Work in the Buell lab on plant genome assembly is funded by grants from the US National Science Foundation MCB-1121650 and DBI-1202724.
- Neale DB, Wegrzyn JL, Stevens KA, Zimin AV, Puiu D, Crepeau MW, Cardeno C, Koriabine M, Holtz-Morris AE, Liechty JD, Martínez-García PJ, Vasquez-Gross HA, Lin BY, Zieve JJ, Dougherty WM, Fuentes-Soriano S, Wu L-S, Gilbert D, Marçais G, Roberts M, Holt C, Yandell M, Davis JM, Smith KE, Dean JFD, Lorenz WW, Whetten RW, Sederoff R, Wheeler N, McGuire PE, et al: Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies. Genome Biol. 2014, 15: R59. 10.1186/gb-2014-15-3-r59.
- Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, Vezzi A, Legeai F, Hugueney P, Dasilva C, Horner D, Mica E, Jublot D, Poulain J, Bruyère C, Billault A, Segurens B, Gouyvenoux M, Ugarte E, Cattonaro F, Anthouard V, Vico V, Del Fabbro C, Alaux M, Di Gaspero G, Dumas V, et al: The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007, 449: 463-467. 10.1038/nature06148.
- Hirsch CN, Buell CR: Tapping the promise of genomics in species with complex, nonmodel genomes. Annu Rev Plant Biol. 2013, 64: 89-110. 10.1146/annurev-arplant-050312-120237.
- Shulaev V, Sargent DJ, Crowhurst RN, Mockler TC, Folkerts O, Delcher AL, Jaiswal P, Mockaitis K, Liston A, Mane SP, Burns P, Davis TM, Slovin JP, Bassil N, Hellens RP, Evans C, Harkins T, Kodira C, Desany B, Crasta OR, Jensen RV, Allan AC, Michael TP, Setubal JC, Celton JM, Rees DJ, Williams KP, Holt SH, Ruiz Rojas JJ, Chatterjee M, et al: The genome of woodland strawberry (Fragaria vesca). Nat Genet. 2011, 43: 109-116. 10.1038/ng.740.
- Zimin A, Stevens KA, Crepeau M, Holtz-Morris A, Koriabine M, Marçais G, Puiu D, Roberts M, Wegrzyn JL, de Jong PJ, Neale DB, Salzberg SL, Yorke JA, Langley CH: Sequencing and assembly of the 22-Gb loblolly pine genome. Genetics. 2014, 196: 875-890. 10.1534/genetics.113.159715.
- Wegrzyn JL, Liechty JD, Stevens KA, Wu L, Loopstra CA, Vasquez-Gross H, Dougherty WM, Lin BY, Zieve JJ, Martínez-García PJ, Holt C, Yandell M, Zimin A, Yorke JA, Crepeau M, Puiu D, Salzberg SL, de Jong PJ, Mockaitis K, Main D, Langley CH, Neale DB: Unique features of the loblolly pine (Pinus taeda L.) megagenome revealed through sequence annotation. Genetics. 2014, 196: 891-909. 10.1534/genetics.113.159996.
- Parra G, Bradnam K, Korf I: CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics. 2007, 23: 1061-1067. 10.1093/bioinformatics/btm071.
- Zimin AV, Marcais G, Puiu D, Roberts M, Salzberg SL, Yorke JA: The MaSuRCA genome assembler. Bioinformatics. 2013, 29: 2669-2677. 10.1093/bioinformatics/btt476.
- Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B, Bibillo A, Bjornson K, Chaudhuri B, Christians F, Cicero R, Clark S, Dalal R, Dewinter A, Dixon J, Foquet M, Gaertner A, Hardenbol P, Heiner C, Hester K, Holden D, Kearns G, Kong X, Kuse R, Lacroix Y, Lin S, et al: Real-time DNA sequencing from single polymerase molecules. Science. 2009, 323: 133-138. 10.1126/science.1162986.
- Kuleshov V, Xie D, Chen R, Pushkarev D, Ma Z, Blauwkamp T, Kertesz M, Snyder M: Whole-genome haplotyping using long reads and statistical methods. Nat Biotechnol. 2014, 32: 261-266. 10.1038/nbt.2833.
This article is published under license to BioMed Central Ltd. The licensee has exclusive rights to distribute this article, in any medium, for 12 months following its publication. After this time, the article is available under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.