Evaluating short-read sequence data from the highly redundant, novel transcriptome of Polarella glacialis


Dinoflagellates are a diverse group of ecologically important eukaryotic algae, the global impact of which ranges from the large-scale primary production of oxygen [1] to devastating toxic algal blooms [2]. These organisms have exceptionally large genomes (10^9 to 10^11 bases) [3] and highly duplicated genes (which can occur thousands of times within a single genome) [4]. These and other unusual characteristics have made dinoflagellates difficult to study using traditional molecular biology techniques. Sequence data for dinoflagellates are correspondingly sparse, and not a single genome sequence has been published to date.

As part of our project called Assembling the Dinoflagellate Tree of Life (DAToL), our laboratory has sequenced the transcriptome of Polarella glacialis. Its genome is estimated to be only 3 Gb in size, making it one of the smallest known dinoflagellate genomes. Because we had to rely on de novo assemblers that had been tested using data from organisms that are extremely divergent from dinoflagellates, we took special care in our attempts to validate the data. Before expanding our analyses to include additional dinoflagellates, we compared the results from different sequencing and assembly methods.


Total RNA was extracted from cultured P. glacialis. This sample was then divided and shipped to Macrogen for rRNA degradation, library preparation and sequencing. One library was sequenced on one-eighth of a Roche/454 GS FLX picotiter plate using Titanium chemistry. A second library was sequenced using one lane on an Illumina GAIIx sequencer for 78 cycles in both directions (paired end). The sequences were assembled using Newbler, MIRA, Oases and Trinity, and they were analyzed using various custom scripts.


The total amount of unassembled 454 sequence data amounted to less than one-third of the combined length of only those Trinity transcripts that had a significant BLAST hit against a sequence in GenBank, indicating that we did not achieve complete coverage with our 454 data.
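The comparison above is simple arithmetic: total bases in the unassembled reads divided by the combined length of the transcripts that had a BLAST hit. A minimal Python sketch of that calculation follows. The function names, the toy sequences and the set of hit identifiers are all hypothetical and are not taken from the study; the sketch also assumes the BLAST results have already been reduced to a set of transcript identifiers.

```python
from io import StringIO


def fasta_lengths(handle):
    """Return {record_id: sequence_length} for a FASTA-formatted text stream."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def coverage_ratio(read_lengths, transcript_lengths, hit_ids):
    """Total read bases divided by total bases in transcripts with a hit."""
    total_reads = sum(read_lengths.values())
    total_hits = sum(
        length for tid, length in transcript_lengths.items() if tid in hit_ids
    )
    if total_hits == 0:
        raise ValueError("no transcripts with hits; ratio is undefined")
    return total_reads / total_hits


# Toy data invented for illustration only (not from the study).
toy_reads = StringIO(">r1\nACGT\n>r2\nACGTAC\n")
toy_transcripts = StringIO(
    ">t1 example\nACGTACGTAC\nACGTACGTAC\n"
    ">t2\nACGTACGTACGTACGT\n"
    ">t3\nACGTACGT\n"
)
toy_hit_ids = {"t1", "t2"}  # hypothetical: transcripts with a significant hit

read_lengths = fasta_lengths(toy_reads)
transcript_lengths = fasta_lengths(toy_transcripts)
ratio = coverage_ratio(read_lengths, transcript_lengths, toy_hit_ids)
print(f"read bases / hit-transcript bases = {ratio:.3f}")
```

A ratio below 1 means the raw reads contain fewer bases than the reference set of transcripts, so they cannot cover every base of it even once.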


Our primary hypothesis was that the longer read lengths of the 454 data might allow the corresponding assemblers to better resolve repetitive sequences, which could be instrumental for assembling conserved regions within highly duplicated genes. Our failure to obtain complete coverage with the 454 dataset limited our ability to test this hypothesis, but we made several other interesting observations. Notably, despite the vast disparity in depth of coverage between the 454 and Illumina assemblies, we observed unique, apparently real sequences within some of the 454 contigs.


  1. Yang EJ, Choi JK, Hyun JH: Distribution and structure of heterotrophic protist communities in the northeast equatorial Pacific Ocean. Mar Biol 2004, 146:1–15.

  2. Wang DZ: Neurotoxins from marine dinoflagellates: a brief review. Mar Drugs 2008, 6:349–371.

  3. Hou Y, Lin S: Distinct gene number-genome size relationships for eukaryotes and non-eukaryotes: gene content estimation for dinoflagellate genomes. PLoS ONE 2009, 4:e6978.

  4. Bachvaroff TR, Place AR: From stop to start: tandem gene arrangement, copy number and trans-splicing sites in the dinoflagellate Amphidinium carterae. PLoS ONE 2008, 3:e2929.

Cite this article

Gibbons, T.R., Concepcion, G.T., Bachvaroff, T.R. et al. Evaluating short-read sequence data from the highly redundant, novel transcriptome of Polarella glacialis. Genome Biol 12 (Suppl 1), P5 (2011).

  • Duplicate Gene
  • Complete Coverage
  • Molecular Biology Technique
  • Eukaryotic Alga
  • Single Genome