Evaluating short-read sequence data from the highly redundant, novel transcriptome of Polarella glacialis
© Gibbons et al; licensee BioMed Central Ltd. 2011
Published: 19 September 2011
Dinoflagellates are a diverse group of ecologically important eukaryotic algae, the global impact of which ranges from the large-scale primary production of oxygen [1] to devastating toxic algal blooms [2]. These organisms have exceptionally large genomes (10^9 to 10^11 bases) [3] and highly duplicated genes (which can occur thousands of times within a single genome) [4]. These and other unusual characteristics have made dinoflagellates difficult to study using traditional molecular biology techniques. Sequence data for dinoflagellates are correspondingly sparse, and not a single genome sequence has been published to date.
As part of our project called Assembling the Dinoflagellate Tree of Life (DAToL), our laboratory has sequenced the transcriptome of Polarella glacialis. Its genome is estimated to be only 3 Gb in size, making it one of the smallest known dinoflagellate genomes. Because we had to rely on de novo assemblers that had been tested using data from organisms that are extremely divergent from dinoflagellates, we took special care in our attempts to validate the data. Before expanding our analyses to include additional dinoflagellates, we compared the results from different sequencing and assembly methods.
Total RNA was extracted from cultured P. glacialis. This sample was then divided and shipped to Macrogen for rRNA degradation, library preparation and sequencing. One library was sequenced on one-eighth of a Roche/454 GS FLX picotiter plate using Titanium chemistry. A second library was sequenced using one lane on an Illumina GAIIx sequencer for 78 cycles in both directions (paired end). The sequences were assembled using Newbler, MIRA, Oases and Trinity, and they were analyzed using various custom scripts.
The unassembled 454 sequence data totaled less than one-third of the combined length of only those Trinity transcripts that had a significant BLAST hit against a sequence in GenBank, indicating that we did not achieve complete coverage with our 454 data.
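The comparison described here reduces to summing sequence lengths and taking a ratio. The following is a minimal sketch of that arithmetic, not the custom scripts used in the study: the function names, the FASTA-as-text input, and the idea of passing in a set of transcript IDs with significant BLAST hits are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {record_id: sequence} mapping."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def total_length(records: Dict[str, str],
                 ids: Optional[Iterable[str]] = None) -> int:
    """Sum sequence lengths, optionally restricted to the given record IDs."""
    if ids is None:
        return sum(len(seq) for seq in records.values())
    wanted = set(ids)
    return sum(len(seq) for rid, seq in records.items() if rid in wanted)


def coverage_ratio(read_bases: int, reference_bases: int) -> float:
    """Ratio of raw read bases to the combined length of reference transcripts."""
    if reference_bases <= 0:
        raise ValueError("reference_bases must be positive")
    return read_bases / reference_bases
```

A ratio below 1 means the reads cannot cover every reference base even once. In practice, the set of IDs with hits would come from whatever BLAST output was produced for the Trinity transcripts; that step is omitted here.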
Our primary hypothesis was that the longer read lengths of the 454 data might allow the corresponding assemblers to better resolve repetitive sequences, which could be instrumental for assembling conserved regions within highly duplicated genes. Our failure to obtain complete coverage with the 454 dataset undermined our ability to test this hypothesis, although we made several other interesting observations. Notably, despite the vast disparity in the depth of the coverage between the 454 and Illumina assemblies, we observed unique, apparently real sequences within some of the 454 contigs.
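One simple way to flag contigs from one assembly that contain sequence absent from another is to compare k-mer content. The sketch below is a hypothetical illustration of that idea and is not the method used to make the observation above; the function names and the default k of 25 are assumptions, and an exact k-mer match is a crude proxy that ignores sequencing errors and allelic variation.

```python
from typing import Iterable, Set

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_fraction(contig: str, reference_seqs: Iterable[str], k: int = 25) -> float:
    """Fraction of the contig's distinct k-mers found on neither strand of any reference."""
    reference_kmers: Set[str] = set()
    for ref in reference_seqs:
        ref = ref.upper()
        reference_kmers |= kmers(ref, k)
        reference_kmers |= kmers(reverse_complement(ref), k)
    contig_kmers = kmers(contig.upper(), k)
    if not contig_kmers:
        # Contig shorter than k: nothing to compare.
        return 0.0
    novel = sum(1 for km in contig_kmers if km not in reference_kmers)
    return novel / len(contig_kmers)
```

Contigs with a high novel fraction would then be candidates for closer inspection, for example by similarity search, to judge whether the extra sequence looks real or is an assembly artifact.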
1. Yang EJ, Choi JK, Hyun JH: Distribution and structure of heterotrophic protist communities in the northeast equatorial Pacific Ocean. Mar Biol. 2004, 146: 1-15. 10.1007/s00227-004-1412-9.
2. Wang DZ: Neurotoxins from marine dinoflagellates: a brief review. Mar Drugs. 2008, 6: 349-371.
3. Hou Y, Lin S: Distinct gene number-genome size relationships for eukaryotes and non-eukaryotes: gene content estimation for dinoflagellate genomes. PLoS ONE. 2009, 4: e6978. 10.1371/journal.pone.0006978.
4. Bachvaroff TR, Place AR: From stop to start: tandem gene arrangement, copy number and trans-splicing sites in the dinoflagellate Amphidinium carterae. PLoS ONE. 2008, 3: e2929. 10.1371/journal.pone.0002929.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.