Transcriptome analysis of functional differentiation between haploid and diploid cells of Emiliania huxleyi, a globally significant photosynthetic calcifying cell

An EST analysis of the phytoplankton Emiliania huxleyi reveals genes involved in haploid- and diploid-specific processes and provides insights into environmental adaptation.


Background
Coccolithophores are unicellular marine phytoplankton that strongly influence carbonate chemistry and sinking carbon fluxes in the modern ocean due to the calcite plates (coccoliths) that are produced in intracellular vacuoles and extruded onto the cell surface [1]. Coccolithophores are members of the Haptophyta [2,3], a basal-branching division of eukaryotes with still uncertain phylogenetic relationships with other major lineages of this domain [4,5]. Intricately patterned coccoliths accumulated in marine sediments over the past 220 million years have left one of the most complete fossil records, providing an exceptional tool for evolutionary reconstruction and biostratigraphic dating [3]. Coccolith calcification also represents a potential source of nanotechnological innovation. Fossil records indicate that Emiliania huxleyi arose only approximately 270,000 years ago [6], yet this single morphospecies is now the most abundant and cosmopolitan coccolithophore, seasonally forming massive blooms reaching over 10 7 cells l -1 in temperate and sub-polar waters [7]. Many studies are being conducted to determine how the on-going anthropogenic atmospheric CO 2 increases affect E. huxleyi calcification, with conflicting results [8,9]. Because of its environmental prominence and ease of maintenance in laboratory culture, E. huxleyi has become the model coccolithophore for physiological, molecular, genomic and environmental studies, and a draft genome assembly of one strain, CCMP1516, is now being analyzed [10]. However, coccolithophorid biology still is in its infancy.
E. huxleyi exhibits a haplo-diplontic life cycle, alternating between calcified, non-motile, diploid (2N) cells and non-calcified, motile, haploid (1N) cells, with both phases being capable of unlimited asexual cell division [11,12]. Almost all laboratory and environmental studies on this species have focused only on 2N cells, and lack of information about the ecophysiology and biochemistry of 1N cells represents a large knowledge gap in understanding the biology and evolution of E. huxleyi and coccolithophores. More generally, a major question remaining in understanding eukaryotic life cycle evolution is the evolutionary maintenance of haplo-diplontic life cycles in a broad diversity of eukaryotes [13,14], and E. huxleyi represents a prominent organism in which new insights might be gained.
E. huxleyi 1N cells are very distinct from both calcified and non-calcified 2N cells in ultrastructure [12] and ecophysiological properties [15]. 1N cells have two flagella and associated flagellar bases, whereas 2N cells completely lack both flagella and flagellar bases. The coccolith-forming apparatus is present in both calcified and naked-mutants of 2N cells but is absent in 1N cells [7]. 1N cells are also differentiated from 2N cells by formation of particular non-mineralized organic body scales (and thus are not 'naked') [7,11]. 1N cells show different growth preferences relative to 2N cells [16] and do not have the exceptional ability to adapt to high light exhibited by 2N cells [15]. As 1N cells of E. huxleyi are not recognizable by classic microscope techniques, little is yet known about their ecological distribution. Recent advances in fluorescent in situ hybridization now allow detection of non-calcified E. huxleyi cells in the environment [17], although it is still impossible to distinguish 1N cells from non-calcified 2N cells. However, 1N cells of certain other coccolithophore species are recognizable due to the production of distinct holococcolith structures and appear to have a shallower depth distribution and preference for oligotrophic waters compared to 2N cells of the same species [18]. Recently, E. huxleyi 1N cells were demonstrated to be resistant to the EhV viruses that are lethal to 2N cells and are involved in terminating massive blooms of 2N cells in nature [19]. This suggests that 1N cells might have a crucial role in the long-term maintenance of E. huxleyi populations by serving as the link for survival between the yearly 'boom and bust' successions of 2N blooms.
The pronounced differences between 1N and 2N cells suggest a large difference in gene expression between the two sexual stages. In this study, we conducted a comparison of the 1N and 2N transcriptomes in order to: test the prediction that expression patterns are, to a large extent, ploidy level specific; identify a set of core genes expressed in both life cycle phases; identify genes involved in important cellular processes known to be specific to one phase or the other (for example, motility for 1N cells and calcification for 2N cells); provide insights into transcriptional/epigenetic controls on phase-specific gene expression; and provide the basis for the development of molecular tools allowing the detection of 1N cells in nature. For our analysis we selected isogenic cultures originating from strain RCC1216 because strain CCMP1516, from which the genome sequence will be available, has not been observed to produce flagellated 1N cells. Pure clonal 1N cultures (RCC1217) originating from RCC1216 have been stable for several years and can be compared to pure 2N cultures originating from the same genetic background [15,16]. We produced separate normalized cDNA libraries from pure axenic 1N and 2N cultures. Over 19,000 expressed sequence tag (EST) sequences were obtained from each library. Interlibrary comparison revealed major compositional differences between the two transcriptomes, and we confirmed the predicted ploidy phase-specific expression for some genes by reverse transcription PCR (RT-PCR).

Strain origins and characteristics at time of harvesting
E. huxleyi strains RCC1216 (2N) and RCC1217 (1N) were both originally isolated into clonal culture less than 10 years prior to the collection of biological material in this study (Table 1). Repeated analyses of nuclear DNA content by flow cytometry have shown no detectable variation in the DNA contents (the ploidy) of these strains over several years ([20] and unpublished tests performed in 2006 to 2008). Axenic cultures of both 1N and 2N strains were successfully prepared.
The growth rates of the 2N and 1N cultures used for library construction were 0.843 ± 0.028 day -1 (n = 4) and 0.851 ± 0.004 day -1 (n = 2), respectively. These rates were not significantly different (P = 0. 70). Two other 1N cultures experienced exposure to continuous light for one or two days prior to harvesting due to a failure of the lighting system. The growth rate of these 1N cultures was 0.893 ± 0.008 day -1 (n = 2). These cultures were not used for library construction but were included in RT-PCR tests. Flow cytometric profiles and microscopic examination taken during harvesting indicated that nearly 100% of 2N cells were highly calcified (indicated by high side scatter) and that no calcified cells were present in the 1N cultures [21] (Figure 1). No motile cells were seen in extensive microscopic examination of 2N cultures over a period of 3 months. 1N cells were highly motile, and displayed prominent phototaxis in culture vessels (not shown).
Both 1N and 2N cultures maintained high photosynthetic efficiency measured by maximum quantum yield of photosytem II (Fv/Fm) throughout the day-night period of harvesting. The Fv/Fm of phased 1N cultures was 0.652 ± 0.009 over the whole 24-h period; it was slightly higher during the dark (0.661 ± 0.003) than during the light period (0.644 ± 0.001; P = 9.14 × 10 -5 ). The Fv/Fm of 2N cells was 0.675 ± 0.007, with no significant variation between the light and dark periods. These data suggest that both the 1N and 2N cells were maintained in a healthy state throughout the entire period of harvesting.
Cell division was phased to the middle of the dark period both in 2N cultures and in the 1N cultures on the correct light-dark cycle ( Figure S1 in Additional data file 1). The 1N cultures exposed to continuous light did not show phased cell division. Nuclear extraction from the phased 1N cultures showed that cells remained predominantly in G1 phase throughout the day, entered S phase 1 h after dusk (lights off), and reached the maximum in G2 phase at 3 to 4 h into the dark phase (Figure 2). A small G2 peak was present in the morning hours and disappeared in the late afternoon. These data show that we successfully captured all major changes in the diel and cell cycle of actively growing, physiologically healthy 1N and 2N cells for library construction (below).

Global characterization of haploid and diploid transcriptomes General features, comparison to existing EST datasets, and analysis of transcriptome complexity and differentiation
High quality total RNA was obtained from eight time points in the diel cycle ( Figure S2 in Additional data file 1) and pooled for cDNA construction. We performed two rounds of 5'-end sequencing. In the first round, 9,774 and 9,734 cDNA clones were sequenced from the 1N and 2N libraries, respectively. In the second round, additional 9,758 1N and 9,825 2N clones were selected for sequencing. Altogether our sequencing yielded 19,532 1N and 19,559 2N reads for a total of 39,091 reads (from 39,091 clones). Following quality control, we finally obtained 38,386 high quality EST sequences ≥ 50 nucleotides in length (19,198 for 1N and 19,188 for 2N). The average size of the trimmed ESTs was 582 nucleotides with a maximum of 897 nucleotides (Table 2). Their G+C content (65%) was identical to that observed for ESTs from E. huxleyi strain CCMP1516 [22], and was consistent with the high genomic G+C content (approximately 60%) of E. huxleyi.
Sequence similarity searches between the 1N and 2N EST libraries revealed that only approximately 60% of ESTs in one library were represented in the other library. More precisely, 56 to 59% of 1N ESTs had similar sequences (≥ 95% identity) in the 2N EST library, and 59 to 62% of the 2N ESTs had similar sequences in the 1N EST library, with the range depending on the minimum length of BLAT alignment (100 nucleotides or 50 nucleotides). To qualify this overlap between the 1N and 2N libraries, we constructed two artificial sets of ESTs by first pooling the ESTs from both libraries and then re-dividing them into two sets based on the time of sequencing (that is, the first and the second rounds). Based on the same similarity search criteria, a larger overlap (73 to 79%) was found between the two artificial sets than between the 1N and 2N EST sets. Given the fact that our cDNA libraries were normalized towards uniform sampling of cDNA species,      250  200  150  100  50  0   Sybr Green I fluorescence   250  200  150  100  50  0  250  200  150  100  50  0   250  200  150  100  50  0  250  200  150  100  50  0   0  50  100 150  200 Figure S3 in Additional data file 1). The Shannon diversity indices were found close to the theoretical maximum for both libraries, indicating a high evenness in coverage and successful normalization in our cDNA library construction (Table 4). Crucially, the fact that the rank-size distributions of the two libraries were essentially identical also shows that the normalization process occurred comparably in both libraries ( Figure S3 in Additional data file 1).
Interestingly, a larger number of singletons was obtained from the 2N library (3,704 singletons, 19% of 2N ESTs) than from the 1N library (2,651 singletons, 14% of 1N ESTs), suggesting that 2N cells may express more genes (that is, RNA species) than 1N cells. To test this hypothesis, we assessed transcriptome richness (that is, the total number of mRNA species) of 1N and 2N cells using a maximum likelihood (ML) estimate [23] and the Chao1 richness estimator [24]. These estimates indicated that 2N cells express 19 to 24% more genes than 1N cells under the culture conditions in this study, supporting the larger transcriptomic richness for 2N relative to 1N (Table 4). To assess the above-mentioned small overlap between the 1N and 2N EST sets, we computed the abundance-based Jaccard similarity index between the two sam- ples based on our clustering data. This index provides an estimate for the true probability with which two randomly chosen transcripts, one from each of the two libraries, both correspond to genes expressed in both cell types (to take into account that further sampling of each library would likely increase the number of shared clusters because coverage is less than 100%). From our samples, this index was estimated to be 50.6 ± 0.9% and again statistically supports a large transcriptomic difference between the haploid and diploid life cycles.

Functional difference between life stages
In the NCBI eukarote orthologous group (KOG) database, 3,286 clusters (25.2%) had significant sequence similarity to protein sequence families (Additional data file 3 provides a list of all clusters with their top homologs identified in Uni-Prot, Swiss-Prot, and KOG, and also the number of component mini-clusters and ESTs from each library). Of these KOG-matched clusters, 2,253 were associated with 1N ESTs (1,385 shared core clusters plus 868 1N-unique clusters), and 2,418 were associated with 2N ESTs (1,385 shared core clusters plus 1,033 2N-unique clusters). The distributions of the number of clusters across different KOG functional classes were generally similar among the 1N-unique, the 2N-unique and the shared core clusters, with exceptions in several KOG classes ( Figure 4a). The 'signal transduction mechanisms' and 'cytoskeleton' classes were significantly over-represented (12.3% and 4.15%) in the 1N-unique clusters relative to the 2N-unique clusters (7.36% and 1.55%) (P < 0.002; Fisher's exact test, without correction for multiple tests). These classes were also less abundant in the shared clusters (6.06% and 2.02%) compared to the 1N-unique clusters (P = 3.49 × 10 -7 for 'signal transduction mechanisms'; P = 0.00395 for 'cytoskeleton'). In contrast, the 'translation, ribosomal structure and biogenesis' class was significantly under-represented (3.69%) in the 1N-unique clusters compared to the 2N-unique (6.97%) and the shared clusters (7.58%). Similar differences were observed when the 1N-unique and 2Nunique sets were further restricted to clusters containing two or more ESTs ( Figure S4 in Additional data file 1).
We used Audic and Claverie's method [25] to rank individual EST clusters based on the significance of differential representation in 1N versus 2N libraries. An arbitrarily chosen  The maximum likelihood (ML) estimate of transcriptome richness was calculated following Claverie [23] using the two separate rounds of EST sequencing. The Chao1 estimator of transcriptome richness and the Shannon diversity index was computed for each library separately and for the combined library using EstimateS with the classic formula for Chao1. The range of estimated coverage was calculated by dividing the number of clusters observed by the two estimates of transcriptome richness. The similarity of content of the 1N and 2N libraries was also determined: the Chao abundance-based estimator of the Jaccard similarity index (accounting for estimated proportions of unseen shared and unique transcripts) was 0.506 ± 0.009, calculated with 200 bootstrap replicates and the upper abundance limit for rare or infrequent transcript species set at 2. The maximum possible Shannon diversity index was calculated as the natural log of the number of clusters. threshold of P < 0.01 provided a list of 220 clusters predicted to be specific to 1N (Additional data file 4) and a list of 110 clusters predicted to be specific to 2N (Additional data file 5). A major caveat is that normalization tends to reduce the confidence in determining differentially expressed genes between cells. As a first step to examine the prediction, we were particularly interested in transcripts that may be effectively absent in one life phase but not the other. Namely, we focused on 198 (90.0%) that are specific and unique to 1N as well as 89 (80.9%) clusters that are specific and unique to 2N, which we termed 'highly 1N-specific' (Tables 5 and 6; Additional data file 4) and 'highly 2N-specific' clusters (Tables 7  and 8; Additional data file 5).

Taxonomic distribution of transcript homology varies over the life cycle
To characterize the taxonomic distribution of the homologs of EST clusters, we performed BLASTX searches against a combined database, which includes the proteomes from 42 selected eukaryotic genomes taken from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (see Additional data file 6 for a list of selected genomes from the KEGG database) as well as prokaryotic/viral sequences from the UniProt database. There were 4,055 clusters (31.1%; 1,731 shared, 1,083 1N-unique and 1,241 2N-unique clusters) with significant homology in the database (E-value <1 × 10 -10 ), with Viridiplantae, stramenopiles, and metazoans receiving Distribution of clusters and reads by KOG functional class and library Figure 4 Distribution of clusters and reads by KOG functional class and library. Distributions of clusters over KOG class for clusters shared between the 1N and 2N libraries and clusters unique to each library. Fisher's exact test was used to determine significant differences in the distribution of clusters by KOG class between the 1N-unique and 2N-unique sets (asterisks indicate the KOG classes exhibiting significant differences between the 1N-unique and 2N-unique sets); P < 0.002 without correction for multiple tests). The same test was applied to determine differences in the distribution of clusters by KOG class between the set of shared clusters and both 1N-unique and 2N-unique clusters (the at symbol (@) indicates KOG classes exhibiting significant differences between the 1N-unique and shared sets; P < 0.002 without correction for multiple tests). . These clusters were classified by the taxonomic group of their closest BLAST homolog (that is, 'best hit'). The distribution of the taxonomic group was found to substantially vary among the shared, 1Nunique and 2N-unique clusters. Shared clusters had a significantly higher proportion of best hits to stramenopiles compared to both 1N-unique and 2N-unique clusters, while 1Nunique clusters had a significantly lower percentage of best hits to stramenopiles than 2N-unique clusters. In contrast, metazoans received a significantly greater portion of best hits from 1N-unique than from 2N-unique and shared clusters. Consistent with the above functional analysis, the KOG class 'signal transduction mechanisms' was over-represented in clusters best-hitting to metazoans (11.0%) compared to all clusters with homologs in KEGG (5.0%) or clusters best-hitting to Viridiplantae (4.8%) (P = 2.9 × 10 -13 and 5.0 × 10 -6 , respectively; Fishers exact test). There was no difference among 1N-unique, 2N-unique, and shared clusters in the proportion of clusters with best hits to Viridiplantae ( Figure 5). However, among the Viridiplantae best hits, a significantly greater proportion of 1N-unique clusters was found to be best-hitting to Chlamydonomas reinhardtii ( Figure 5), the only free-living motile, haploid genome from Viridiplantae represented in our database.
Of all clusters best-hitting to either Viridiplantae, stramenopiles, or metazoans, the shared clusters had the highest percentage of clusters (53.6%) with homologs in all three groups, and the lowest percentage of clusters (3.1%) with homologs only in metazoans ( Figure S5 in Additional data file 1). Clusters with homologs in stramenopiles were significantly overrepresented among shared clusters and under-represented in 1N-unique clusters relative to 2N-unique clusters.
The vast majority (7,442 clusters; 57.0%) of the total EST clusters were orphans (Figure 6a). One of the main causes of the high orphan proportion might be the presence of many short EST clusters with only one or a few ESTs. The nonorphan clusters (having matches in UniProt, KOG, or the conserved domains database (CDD)) exhibited a significantly higher average number of reads per cluster (3.67, combining reads from both libraries) than orphan clusters (2.39; P < 0.0001, Mann-Whitney test). In a similar way, the orphan proportion decreased to 39.4% for the shared core clusters (Figure 6b), which have an average of 6.25 ESTs per cluster. However, a more detailed analysis indicated that the size of clusters (that is, the number of ESTs in the cluster) may not be the sole reason for the abundance of the orphan clusters. For instance, 58.6% of 1N-unique clusters with two or more ESTs were orphan clusters ( Figure 6c). Furthermore, an even higher orphan proportion (63.9%) was obtained when these 1N-unique clusters were limited to the 119 clusters repre- Only clusters with zero ESTs originating from the 2N library are shown. The number of 1N EST reads in each cluster and the P-value for significance of the difference between libraries are shown. When no Swiss-Prot homolog was detected, ID and homology values for the top Uniprot homolog are given (indicated by an asterisk), or the CDD name and homology values are given (indicated by †). Clusters are arranged by KOG class. Clusters in bold were chosen for RT-PCR validation. Additional data file 4 gives a complete list of all clusters predicted to be 1N-specific by statistical comparison of libraries. Only clusters with zero ESTs originating from the 2N library are shown, and only the orphans confirmed by RT-PCR are included in this table.
Homolog IDs are marked as in Table 5. Additional data file 4 gives a complete list of all clusters predicted to be 1N-specific by statistical comparison of libraries. Only clusters with zero ESTs originating from the 1N library are shown. The number of 2N EST reads in each cluster and the P-value for significance of the difference between libraries are shown. Homolog IDs are marked as in Table 5. Clusters are arranged by KOG class. Clusters in bold were chosen for RT-PCR validation. Additional data file 5 gives a complete list of all clusters predicted to be 2N-specific by statistical comparison of libraries.  Validation and exploration of the predicted differential expression of selected genes

KOG-assigned EST clusters predicted to be highly 2N-specific based on statistical comparison of libraries
We examined how well our in silico comparison of the two normalized libraries successfully identified gene content differentiating the two transcriptomes based on in-depth sequence/bibliographic analysis and RT-PCR assays (summarized in Tables S1 and S2 in Additional data file 7). We began with homologs of eukaryotic flagellar-associated proteins. This large group of proteins is well-conserved across motile eukaryotes. Genes for proteins known to be exclusively present in flagellar or basal bodies are expected to be specifically expressed in the motile 1N stage of E. huxleyi, whereas those for proteins known to also serve functions in the cell body may also be expressed in non-motile cells. Thus, flagella-related genes serve as a particularly useful initial validation step. Next, we examined several other clusters with strong in silico signals for differential expression between the 1N and 2N libraries. Finally, we explored clusters homologous to known Ca 2+ and H + transporters, potentially involved in the calcification process of 2N cells, and histones, which might play roles in epigenetic control of 1N versus 2N differentiation. In total, we tested the predicted expression patterns of 39 clusters representing 38 different genes. The predicted expression pattern (1N-specific, 2N-specific, or shared) was confirmed for 37 clusters (36 genes), demonstrating a high rate of success of the in silico comparison of transcriptome content.

Motility-related clusters
A total of 156 E. huxleyi EST clusters were found to be homologous to 85 flagellar-related or basal body-related proteins from animals or C. reinhardtii, a unicellular green alga serving as a model organism for studies of eukaryotic flagella/cilia [26][27][28] (Tables 9 and 10). This analysis combined a systematic BLAST searche using 100 C. reinhardtii motility-related proteins identified by classic biochemical analysis [27] with additional homology searches (detailed analysis provided in Additional data files 8 and 9). Of the 100 C. reinhardtii proteins, 64 were found to have one or more similar sequences in the E. huxleyi EST dataset. We could also identify homologs for six of the nine Bardet-Biedl syndrome (BBS) proteins known to be basal body components [29,30]. Excluding 64 clusters closely related to proteins known to play additional roles outside the flagellum/basal body (such as actin and calmodulin) and 10 clusters showing a relatively low level of sequence similarity to flagellar-related proteins, 82 of the 156 clusters were considered highly specific to motility. Remarkably, these clusters were found to be represented by 252 ESTs from the 1N but 0 ESTs from the 2N library (Table 9). In contrast, clusters related to proteins with known possible roles outside of flagella tended to be composed of ESTs from both 1N and 2N libraries, as expected (Table 10).
The abundance of 1N-unique EST clusters with the closest homolog in Metazoa ( Figure 5) appears to be partially due to the expression of genes related to flagellar components in 1N cells. In fact, 58 (37.2%) of the 156 motility-related clusters had best-hits to Metazoa in the KEGG database, compared to only 789 (14.1%) of all 5,614 non-orphan clusters (P = 2.9 × 10 -13 ).
Six core structural components of the flagellar apparatus were chosen for RT-PCR tests (Figure 7). These included three flagellar dynein heavy chain (DHC) paralogs (GS00667, GS02579 and GS00012), a homolog of the outer dynein arm docking complex protein ODA-DC3 (GS04411), a homolog of FAP189 and FAP58/MBO2, highly conserved but poorly characterized coiled-coil proteins identified in the C. rein- The taxonomic distribution of homology Figure 5 The taxonomic distribution of homology. Shown are the percentages of clusters with KEGG homologs that have the 'best hit' in each taxonomic group. Indicated are cases where the proportion of clusters best hitting to the taxonomic group differs between 1N-unique and 2N-unique (asterisks) or between 1N-unique and shared clusters (at symbol (@)), tested as above. The inset shows the proportion of all assigned clusters that are accounted for by best-hits to Chlamydomonas reinhardtii (a subset of those which are best-hits to Viridiplantae). The differences between 1N-unique and 2N-unique, and between 1N-unique and shared clusters were significant (P < 0.002). hardti flagellar proteome [27] (GS02724), and a homolog of the highly conserved basal body protein BBS5 (GS00844) [31]. All showed expression restricted to 1N cells; no signal could be detected for these five clusters in any 2N RNA samples. Curiously, three non-overlapping primer sets designed to GS000844 (BBS5) all detected evidence of incompletely spliced transcript products, suggesting its regulation by alternative splicing.

0%
GS05223, containing three ESTs from the 1N library and none from the 2N, showed a significant sequence similarity to C. reinhardtii minus and plus agglutinins (BLASTX, E-values 3 × 10 -5 and 8 × 10 -6 , respectively), flagellar associated proteins involved in sexual adhesion [32]. RT-PCR confirmed that expression of GS05223 was highly specific to 1N cells, being undetectable in 2N cells ( Figure 7). However, inspection of the BLASTX alignment between GS05223 and C. reinhardtii agglutinins revealed that the sequence similarity was associated with the translation of the reverse-complement of GS05223. We also found that all of the three ESTs in GS05223 contained poly-A tails, so must be expressed in the forward direction. Therefore, we concluded that GS05223 represents an unknown haploid-specific gene product that may not be related to flagellar functions.
Next we investigated four clusters that are homologous to proteins known to often have additional, non-flagellar roles in the cytoplasm, but that were represented only in the 1N library. Two clusters (GS02889 and GS03135) displayed homology to cytoplasmic dynein heavy chain (DHC), which is associated with flagella/cilia due to its role in intraflagellar transport. In animals and amoebozoa, it also has non-flagellar functions such as intracellular transport and cell division [33]; however, both clusters showed potential 1N-specific expression, being represented by two and five 1N ESTs and zero 2N ESTs, respectively, and RT-PCR confirmed the predicted highly 1N-specific expression pattern (Figure 7).
The flagellar-related clusters included five homologs of phototropin. In C. reinharditii, phototropin is found associated with the flagellum and plays a role in light-dependent gamete differentiation [34]. However, phototropin is a light sensor involved in the chloroplast-avoidance response in higher plants [35], so can have roles outside the flagellum. Clusters GS00132, GS01923, and GS00920 showed the highest similarities to the C. reinharditii phototoropin sequence (E-values 1 × 10 -22 , 1 × 10 -21 , and 1 × 10 -22 , respectively) and were all only represented in the 1N library (four, four, and three ESTs, respectively). In contrast, GS04170, which showed weaker The proportion of orphan clusters  Genome Biology 2009, 10:R114 Table 9 Distribution of EST reads and clusters related to proteins highly specific to cilia/flagella or basal bodies

Distribution of EST reads and clusters related to proteins highly specific to cilia/flagella or basal bodies
homology to phototropins (E-value 3 × 10 -9 ), was represented by four ESTs in the 2N library and zero from the 1N library. These four clusters all aligned well over the highly conserved LOV2 (light, oxygen, or voltage) domains [35,36] of C. reinhardtii and Arabidopsis thaliana phototropins ( Figure S7 in Additional data file 1). The fifth phototropin homolog, GS01944, was represented by ESTs from both libraries. GS01944 did not correspond to the LOV2 domain. GS00132 and GS00920 were selected for RT-PCR validation, which confirmed that expression of these clusters was indeed highly restricted to 1N cells (Figure 7), as predicted by in silico comparison of the two libraries.
We found that several of the selected flagellar-related EST clusters (GS00012, GS04411, GS00844, GS00132 and GS00920) showed a strongly diminished RT-PCR signal in the samples collected during the time of S-phase (Figure 7). Because many genes tested in this study did not display this pattern (for example, GS00217, GS00508, and GS00234), it might be due to real differences in the circadian timing of flagellar gene expression.

Use of digital subtraction to identify other 1N-or 2N-specific transcripts
Fourteen of the 199 clusters predicted to be highly 1N-specific and 10 of the 89 clusters predicted to be highly 2N-specific were tested by RT-PCR (Tables 5, 6, 7 and 8; Tables S1 and S2 in Additional data file 7). Twenty-three out of these 24 clusters did show the predicted strong phase-specific expression pattern, confirming that in silico subtraction of the two libraries identifies true phase-specific transcripts with a high success rate. Two (the DHC homologs GS00667 and GS00012) were discussed previously and the remaining 22 are discussed in this and the following sections.

1N-specific conserved flagellar-related cluster and 1N-specific possible signal transduction clusters
GS00242 had a moderate level of sequence similarity to the C. reinhardtii predicted protein A8J798 (E-value 4 × 10 -14 ) and the human spermatogenesis-associated protein SPT17 (Evalue 8 × 10 -11 ). Although A8J798 is not among the previously confirmed flagellar protein components listed in Table S3 of Pazour et al. [27], these authors identified peptides derived from A8J798 in the C. reinhardtii flagellar proteome (listed as C-6350001    a putative calmodulin-dependent kinase) was also confirmed by RT-PCR ( Figure S8 in Additional data file 1).

1N-specific Myb homologs
Myb transcription factors control cell differentiation in plants and animals [37][38][39]. Of the three Myb homologs predicted to be highly 1N-specific, GS00273 was chosen for validation because it had the highest homology to known Myb proteins (Gallus gallus c-Myb transcription factor; E-value 3 × 10 -34 ). The amino acid sequence derived from GS00273 was readily aligned over the conserved R2-R3 DNA binding regions of Myb family members [37] ( Figure S9 in Additional data file 1). RT-PCR confirmed that GS00273 was strongly differentially expressed in 1N cells (Figure 8).

1N-specific cluster GS02894
Cluster GS02894 displayed a sequence similarity to the E. huxleyi 'glutamic acid-proline-alanine' coccolith-associated glycoprotein (GPA) (E-value 7 × 10 -7 ) and was represented by six ESTs from the 1N library and zero from the 2N library. RT-PCR confirmed that GS02894 was highly differentially expressed in 1N cells ( Figure 8). Through visual inspection of alignment, we found that GS02894 in fact was aligned poorly with the GPA sequence ( Figure S10 in Additional data file 1) and that the alignment did not cover the Ca 2+ -binding loops of the EF-hand motifs previously identified in GPA. GS02894 thus represents a haploid-specific gene product of unknown function.

Orphan 1N clusters
GS01257 and GS01805 were orphan clusters highly represented in the 1N library by 25 and 16 ESTs and none in the 2N library in either case (P = 1.50 × 10 -8 and 7.66 × 10 -6 , respectively). RT-PCR confirmed that both showed highly 1N-specific expression patterns (Figure 8) be successfully identified with full-length sequencing or they might represent transcripts that do not encode proteins.

Other highly 1N-specific clusters tested by RT-PCR
A putative β-carbonic anhydrase (GS00157) and a putative cyclin (GS00508) both showed the predicted highly 1N-specific pattern of expression ( Figure S8 in Additional data file 1). Two other predicted highly 1N-specific clusters (GS01285 and GS02990) were also confirmed by RT-PCR and are discussed in a later section.

2N-specific SLC4 family homolog
GS05051 was a homolog of the Cl -/bicarbonate exchanger solute carrier family 4 proteins (SLC4) [40]. This cluster was represented by seven 2N ESTs and zero 1N ESTs, which comprised six separate mini-clusters that only partially overlapped; these might represent alternative transcripts. Primers designed to separate putative alternative transcripts both detected the expected products from 2N RNA samples but no product from 1N RNA samples in RT-PCR tests ( Figure  8), confirming strong differential expression and the existence of alternatively spliced transcripts.

2N-specific SNARE homolog
GS02941, represented by nine 2N ESTs and zero 1N ESTs, was homologous to the SNARE protein family syntaxin-1 involved in vesicle fusion during exocytosis [41]. GS02941 had a top UniProt hit to Dictyostelium discoidium Q54HM5, a t-SNARE family protein (E-value 3 × 10 -32 ) and a top Swiss-Prot hit to the Caenorhabditis elegans syntaxin-1 homolog STX1A (E-value 2 × 10 -19 ). RT-PCR confirmed that GS02941 expression was detectable exclusively in RNA from 2N cells using three independent primer sets ( Figure 8). The cluster was composed of six different mini-clusters, representing possible different alternative transcripts. Primers designed to mini-cluster e02941.1, one potential alternative transcript form, successfully amplified the predicted 317-nucleotide product but also amplified at least one other product of approximately 400 nucleotides. Only a single approximately 1,500-nucleotide product was amplified from genomic DNA. This suggests that the gene encoding GS02941 contains several (or large) introns that might be subjected to alternative splicing.

Orphan 2N clusters
GS02507, GS01164, GS01802, and GS11002 were orphan clusters highly represented in the 2N library with no reads from the 1N library. The longest open reading frames were 171, 309, 236, and 87 amino acids, respectively. GS02507, GS01164, and GS01802 could only be detected from 2N RNA samples, and not at all in 1N RNA samples (Figure 8). In contrast, GS11002 was easily detected in both 1N and 2N RNA samples ( Figure 8). PCR amplification of GS01802 from genomic DNA of 2N cells revealed two products, differing by about 50 nucleotides but both larger than the single 444 nucleotides product from cDNA. Only the larger band was von Dassow et al. R114.21 Genome Biology 2009, 10:R114 visible from 1N genomic DNA. This suggests that two alleles of GS01802 exist in 2N cells, differentiated by the length of an intron, and that only the larger of these alleles was inherited by the clonal 1N cells.

Other highly 2N-specific clusters tested by RT-PCR
GS00451 represents a putative aquaporin-type transporter. GS03351 was weakly homologous to a putative arachidonate lipoxygenase previously identified in E. huxleyi but to no other proteins in the searched databases, so it may represent a protein of unknown function. Both clusters were confirmed by RT-PCR to be highly 2N-specific ( Figure S8 in Additional data file 1). Two other predicted highly 2N-specific clusters, GS00463 and GS02435, are discussed in the next sections.

Ca 2+ and H + transport and potential biomineralization-related transcripts
We chose to specifically examine Ca 2+ and H + transporters that might play a role in calcification and to determine whether any of them might display highly 2N-specific expression (Table 11). Five clusters had homology to vacuolar-type Ca 2+ /H + antiporters (VCX1). Although these sequences were aligned with matching regions of known VCX1 proteins at the amino acid level ( Figure S11 in Additional data file 1), these clusters could not be well aligned at the nucleotide level (not shown), indicating that they represent paralogs. Only one of these, GS00304, showed possible 2N-specific expression, being represented by four ESTs in the 2N library and zero in the 1N library. GS00304 had a top Swiss-Prot hit to the A. thaliana VCX1 homolog CAX2_ARATH (E-value 3 × 10 -60 ). We confirmed by RT-PCR that GS00304 was strongly overexpressed in 2N cells using two independent primer sets (Figure 9).
Three clusters showed similarity to sarcoplasmic/endoplasmic membrane (SERCA)-type Ca 2+ -transporting ATPases (Table 11). However, none of these clusters showed strong evidence of differential expression by in silico comparison of the two libraries.
Six clusters displayed sequence similarities to the K + -dependent Na + /Ca 2+ exchanger (NCKX) family of Ca 2+ pumps. These clusters did not align well with each other at the nucleotide level, indicating that they are likely to be distant paralogs (and not alleles). Two of these (GS05506 and GS00463) were Clusters are arranged by KOG hit classification. Clusters in bold were tested by RT-PCR.

E. huxleyi EST clusters related to Ca 2+ and H + transporters
RT-PCR determination of expression patterns of selected genes potentially related to biomineralization only present in the 2N library (two EST reads, P = 0.1249, and eight EST reads, P = 0.00195, respectively). 2N-specific expression of GS00463 was confirmed by RT-PCR with two independent primer sets ( Figure 9).
Homologs of 11 out of the 14 subunits of vacuolar-type H + -ATPases were identified, comprising a total of 16 clusters. Seven of these clusters were represented by both 1N-and 2N-ESTs. Only two clusters showed potential differential expression. GS01501 (top Swiss-Prot hit to Saccharomyces cerevisiae V-ATPase V0 domain subunit c', E-value 4 × 10 -38 ) was present only in the 1N library (four ESTs, P = 0.03129) whereas GS09780 (top Swiss-Prot hit to A. thaliana V1 domain subunit F) was represented only in the 2N library (four ESTs, P = 0.03129). Five clusters were homologous to V0 domain subunit a, the presumed path for proton transport. These clusters did not align at the nucleotide level, thus likely representing distant paralogs (and not alleles). Of the V0 domain subunit a homologs, three shared the highly conserved 20 amino acid motif that contains the R735 residue critical for H + transport ( Figure S12 in Additional data file 1). The other two clusters, each represented by a single EST, were short and did not cover this conserved region. Clusters GS03783 and GS01934 were closely homologous (E-values 7 × 10 -38 and 4 × 10 -38 , respectively) to the V0 domain proteolipid subunit (subunit c/c') previously identified as a singlecopy gene in the coccolithophore Pleurochrysis carterae [42]. These two clusters aligned poorly at the nucleotide sequence level and showed divergence at the amino acid sequence level, thus probably representing paralogs.
The glycoprotein GPA was previously identified to be closely associated with E. huxleyi coccoliths by biochemical and immunolocalization studies [43]. Cluster GS09822 was aligned perfectly over its entire length with the amino-terminal 86 codons of the previously sequenced GPA (AAD01505; Figure S10 in Additional data file 1), with minor differences in the 3' UTR (not shown). Surprisingly, GS09822 was represented by one 1N EST and one 2N EST, suggesting expression in both non-calcified 1N cells and calcifying 2N cells, and RT-PCR confirmed that this transcript was abundantly expressed in both calcifying 2N and non-calcifying 1N cells (Figure 9), as predicted from inter-library comparisons.
A previous study identified 45 transcripts with potential roles in biomineralization using microarrays and quantitative RT-PCR comparing expression levels in strain CCMP1516 under phosphate-replete (non-calcifying) and phosphate-limited (weakly calcifying) conditions and in calcifying cells of strain B39 [44]. We attempted to determine whether any of these transcripts might show highly 2N-specific expression patterns (see analysis in Additional data file 10). Of the 45 transcripts in Table 3 of Quinn et al. [44], only 23 could be unambiguously identified in public databases based on the provided information and three were each associated with more than one unique EST sequence in GenBank. Fifteen of these transcripts had BLAST matches to clusters in our dataset; ten of these clusters were represented by both 1N and 2N ESTs. Four of the remaining five were represented by only single ESTs from the 2N library. The last cluster, GS03082, similar to GenBank EST sequence DQ658351 from CCMP1516, was composed of two ESTs from the 2N library and zero from the 1N library. However, the transcript for GS03082 was easily detected in RNA from both 1N and 2N cells ( Figure 9). Thus, we could not confirm 2N-specific expression of the transcripts described in [44].

Possible epigenetic regulation of 1N versus 2N differentiation by histones
We selected the KOG class 'chromatin structure and dynamics' for closer examination because chromatin packaging might differ between 2N cells and 1N cells as the cells are similar in size but contain different DNA quantities. Also, chromatin factors are known to regulate gene expression. Within this class, two clusters with homology to H4 histones were found to exhibit potential differential expression. GS02435 was composed of six ESTs from the 2N library and zero from the 1N library (P = 0.0078). In contrast, GS09138 was composed of 13 ESTs from the 1N library and 0 from the 2N library (P = 6 × 10 -5 ). A sequence alignment analysis of GS09138 and two other H4 histone homologs (GS07034 and GS07988) showed that these shared high nucleotide identity over the coding region and 100% amino acid sequence identity ( Figure S13 in Additional data file 1), suggesting that 1N and 2N cells may preferentially utilize alternative genes for what appear to be the same functional gene product. The 2Nspecific GS02435 differed from other H4 histone homologs in the predicted amino acid sequence. The other H4 histone homologs were almost identical along their 103 amino acid predicted length to H4 histones from other eukaryotes but the longest reading frame of GS02435 exhibited an additional ≥ 50 residues in its amino-terminal sequence and lacked 3 carboxy-terminal conserved residues, making this predicted protein at least 27 amino acids longer (by taking the most downstream starting methionine codon) than the typical 103 amino acid residue H4 histones ( Figure S13 in Additional data file 1). We confirmed by RT-PCR that GS02435 was detectable only in 2N RNA samples ( Figure 10). Surprisingly, genomic DNA-positive controls showed that GS02435 was detected only in 2N genomic DNA and not in 1N genomic DNA ( Figure 10). All of the other clusters examined in this study were detected in both 1N and 2N genomic DNA ( Figures  7, 8, 9 and 10). The absence of GS02435 from the 1N genome was confirmed by PCR using three independent, non-overlapping PCR primer sets.

RT-PCR determination of expression patterns of selected histone genes
There were five clusters with homology to the H2A histone. Alignments of the predicted polypeptides with other eukaryotic H2A histones showed high conservation ( Figure S14 in Additional data file 1). GS10455 and GS07154 were identical to each other across the predicted amino acid sequences, although they diverged in nucleotide sequence, particularly in the predicted 5' and 3' UTRs. GS06864 and GS07501 were also identical in predicted amino acid sequence but diverged in nucleotide sequence. GS06749 was divergent from all the other E. huxleyi predicted H2A homologs, yet it still grouped well within other eukaryotic histone H2As in preliminary phylogenetic analysis. In particular, it was grouped within the H2A variant class H2AV ( Figure S15 in Additional data file 1). GS06749 was composed of four ESTs from the 1N library and three ESTs from the 2N library, and RT-PCR confirmed that it was well-expressed in both 1N and 2N RNA samples ( Figure  10). Only one H2A histone homolog, GS10455, showed signs of differential transcription, albeit not statistically significant (two ESTs in the 1N library compared to zero in the 2N library, P = 0.1251). We confirmed by RT-PCR that GS10455 was highly expressed in 1N cells with no detection in 2N phase cells ( Figure 10).

Potential use of the new EST dataset for environmental surveys and understanding the recent evolution of the Emiliania huxleyi morpho-species
Two EST datasets were already available from different E. huxleyi strains, but in both cases only 2N, day-phase transcripts were represented. The new E. huxleyi EST dataset, from 1N and 2N life phases integrated over the day-night cycle, dramatically expands the existing transcriptomic information of this species. The three EST datasets come from strains with widely different geographic origins and morphotypes. The average sequence identity among ESTs from different genetic backgrounds was ≥ 99.5%. Therefore, the limited overlap between the EST sets may be due to physiological or technical differences in the generation of cDNA libraries (for instance, the cDNA libraries here were integrated over the diel and cell cycles, whereas the other cDNA libraries were constructed only from cells harvested during the day, presumably in G1 phase), but is not likely due to sequence divergence within the E. huxleyi species-complex. This has several important implications. Practically, this suggests that EST and genomic sequence information from laboratory cultures can be successfully used to design probes for investigating in situ gene expression of E. huxleyi cells in environmental samples (for example, using microarrays or quantitative RT-PCR). Such probes will be particularly useful as this species frequently dominates phytoplankton communities. Second, the limited sequence variability among strains is consistent with the fossil records, indicating a very recent origin of E. huxleyi, which may have rapidly colonized and adapted to a wide range of ocean environments. Limited intra-strain sequence variability suggests that the adaptation perhaps instead involved changes in gene regulation and gain/loss of genes.

Transcriptome differentiation of haploid and diploid cells
The dramatic phenotypic differentiation between 1N and 2N cells is reflected in the limited overlap between the 1N and 2N EST libraries. Both libraries were normalized, which suppresses highly abundant transcripts to enhance the probability that rare transcripts are sampled. However, the high rate of RT-PCR validation of potentially differentially expressed genes and the fact that homologs of motility-related proteins were distributed exactly as expected according to library origin supports the successful use of in silico subtraction of two normalized libraries in this case. The 82 EST clusters related to proteins known to be highly specific for flagella originated exclusively from the 1N library, consistent with the fact that only 1N cells synthesize these structures. In contrast, many clusters homologous to proteins that can have non-flagellar functions originated from both libraries. For example, six of the nine clusters homologous to actin included ESTs from the 2N library. Because of the use of normalization, we focused our analyses on estimates of presence/absence differences and differences in the representation of functional classes of genes, rather than quantitative expression differences of specific genes between the two transcriptomes. Our analysis likely underestimates the true transcriptomic difference between the two cell phases.
The two libraries were estimated to share only 50% of total transcript clusters (by the abundance-based Jaccard similarity index). Such a level of transcriptome differentiation has been seen between mammalian germ and somatic cells [45][46][47] but it is much greater than that seen in vascular plant germ cell development, where only less than 10% of transcripts in mature male pollen are exclusively expressed in that tissue [48,49].
The estimated transcriptomic richness of 2N cells was approximately 20% larger than that of 1N cells. The same tendency has been seen in our preliminary analysis of 2N versus 1N ESTs from the closely related coccolithophore Gephryocapsa oceanica (unpublished data). The 1N cells in this study are clonal. It cannot be ruled out that the 2N cells have undergone sexual recombination since isolation (as clones), because these cells can still produce 1N cells. However, we have never observed 2N cells to be formed in cultures of either clonal 1N cells or non-clonal 1N populations originating from the same 2N clonal parent, suggesting heterothally and/or strong barriers to inbreeding. Thus, we believe the 2N cells have remained clonal and the higher transcriptome richness in 2N cells is not due to increased diversity of genotypes present in these cultures.
The higher transcriptome richness of 2N cells compared to 1N cells has implications for life cycle function in coccolithophores in particular and for life cycle evolution in eukaryotes more broadly. A smaller transcriptome richness in haploid relative to diploid cells was also seen in studies of vascular plant gametophyte development: mature pollen grains express 40 to 50% fewer genes than the diploid progenitor tissues [48,49]. Likewise, the set of genes specifically expressed in post-meiotic spermatids in mammals is smaller than the set of genes specifically expressed in diploid tissues [45], suggesting a similar drop in transcriptome richness in the haploid stage. The large decrease in total expressed genes in the vascular plant pollen grain may mostly reflect that they represent a haploid gametophyte that does not live independently of the parent diploid and is capable of only a limited number of mitoses. A similar explanation would apply for highly specialized short-lived animal sperm that cannot undergo mitosis. This explanation would not apply to haploid coccolithophorid cells, which are capable of unlimited mitotic division and live independently of the diploids. An increase in transcriptome richness with ploidy has not previously been reported in studies of autopolyploid organisms [50,51]. Only one study (done in S. cerevisiae using microarrays) has compared global gene expression between haploid and diploid cells where both represent free-living life stages. No decrease in transcriptome richness of 1N cells was observed [50]. Proposed selective advantages allowing the maintenance of haplo-diplontic life cycles in eukaryotes include the ability for each life stage to adapt to alternative 'niches' [52], with 1N stages possibly better adapted to low-resource environments [14,53,54]. Available data on coccolithophore life stages is consistent with this hypothesis, as the holocolith-producing 1N stages of several species are associated with nutrient-poor waters compared to the heterococcolith-producing 2N stages of the same species [18]. Perhaps a reduced transcriptome allows 1N cells to be more streamlined to adapt to specific niches and an intrinsically more rich transcriptome allows 2N cells to be versatile in exploiting a variety of productive environments. There is a tendency of diploid cells to be the dominant building blocks of the most complex multicellular organisms, including animals, vascular plants, and some algae, albeit with many exceptions [13,14]. There might be a more general constraint, such as differential expression of alternative alleles due to heterozygosity, that permits diploid cells to express a larger number of genetic loci (counting alleles as a single entity), and hence a more complex transcriptome, than haploid cells.

Enhanced motility and sensory systems of 1N cells
The 1N library displayed over-representation of signal transduction-related transcripts compared to the 2N library. This trend was seen when measured by the number of distinct EST clusters, by the number of ESTs, and also by the over-representation of predicted and validated '1N-specific' clusters in the 'signal transduction' functional class. Three clusters related to signal transduction processes were demonstrated to be highly specific to the 1N cells. The motility of 1N cells may require an enhanced repertoire for rapid signal perception and processing, leading to a more sophisticated behavioral repertoire.
A combination of homology analysis and digital subtraction successfully identified a large set of motility-related transcripts in the 1N library. Of the motility-related proteins identified in C. reinhardtii, 68% had identifiable homologs in our E. huxleyi EST datasets. Likewise, homologs for six of the nine BBS basal body proteins queried were identified, and the identification of 12 distinct flagellar DHC homologs and one cytoplasmic DHC homolog is similar to the number of total flagellar DHCs and cytoplasmic DHCs identified in other organisms [26,55,56]. These proportions are similar to what would be expected from the estimated sampling coverage of the 1N library (Table 4). We conclude that the flagellar elements are highly conserved between C. reinhardtii and E. huxleyi. Conserved core flagellar structural components, such as flagellar dyneins and basal body components, will probably be intriguing new targets for phylogenomic studies of deep relationships among the eukaryotes, as most eukaryote branches contain flagellated members, and the 'unikont/ bikont' split is originally based on flagellar/basal body characters [57].
Motility and sensory systems also appear to account, in part, for the over-representation of clusters with highest homology to metazoans and C. reinhardtii among the 1N-unique clusters. C. reinhardtii is the only member of the Viridiplantae represented in the KEGG database that has an independently reproducing motile stage. Clusters best-hitting to the moss Physcomitrella patens were not over-represented among 1Nunique clusters. Mosses have a flagellated gamete stage but it does not reproduce asexually and moss flagella are lacking components present in C. reinhardtii [58]. The gene content differentiation between eukaryotes that have ciliated/flagellated cells in the dominant phase of the life cycle and those that do not is thus reflected in the transcriptome content differentiation between 1N and 2N E. huxleyi. This result suggests that the transcriptional differentiation of E. huxleyi might involve regulation of functional modules with distinct phylogenetic distribution among diverse eukaryotes.

Light sensing
The three 1N-specific phototropin homologs might have a role in the pronounced phototaxis exhibited by 1N cells. Although no information is yet available on the environmental cues that might lead to syngamy (fusion of 1N cells to produce 2N cells) or meiosis (formation of 1N cells from 2N cells), light could be an important trigger to examine, given the identification of four phototropin-related genes in this study. Phototropin has been shown to mediate the light-regulation of gamete differentiation, light re-activation of gametes, and zygote development in C. reinhardtii [34], and light is also a key regulator of centric diatom gametogenesis [59] and diploid cyst germination in dinoflagellates [60]. The absence of any detectable rhodopsin homologs in the E. huxleyi EST database further suggests that phototropin-like proteins might represent the major light-sensing proteins in coccolithophores.

Identification of new putative elements involved in biomineralization of 2N cells
Coccolith production has been calculated to require a massive sustained flux of Ca 2+ from outside the cell into the specialized Golgi-derived coccolith-deposition vesicles where calcification occurs; a major question in coccolithophore cell biology is how the 2N cells can sustain such ion fluxes while avoiding Ca 2+ toxicity [61]. Calcification also releases H + and requires alkalinization of the coccolith deposition vesicles, so cells must simultaneously manage Ca 2+ and H + fluxes [61]. The 2N-specific VCX1 (Ptype) Ca 2+ /H + exchanger might participate in such a function. However, P-type Ca 2+ -stimulated ATPase activity was localized to plasma membrane fractions rather than coccolith vesicle or Golgi vesicle fractions in the coccolithophore Pleurochrysis sp. [63]. Vacuolar-type (V-type) H + ATPase activity has previously been located to the Pleurochrysis sp. coccolith vesicle [63] and the V0 subunit c/c' has been immunolocalized there as well [42]. Those studies assumed that the coccolith vesicle must pump H + into the cytoplasm, functioning in reverse to other eukaryotic V-type H + ATPases, which pump protons out of the cytoplasm. Here we identified homologs of five out of the six subunits of the V0 domain and six out of the eight subunits of the V1 domain known in yeast. Many of these subunits were represented by multiple homologs but, in general, were found in both the 1N and the 2N libraries; there was no clear candidate for a highly 2N-specific V-type H + ATPase. This suggests that the same genes are utilized for both forward H + pumping (out of the cytosol) and possible reverse H + pumping (into the cytosol from the coccolith vesicle) as there is no obvious reason for the non-calcifying 1N cells to engage in reverse H + pumping. Alternatively, it may be that VCX1 activity does localize to the coccolith deposition vesicle in E. huxleyi, differing from Pleurochrysis sp. VCX1-mediated Ca 2+ /H + exchange has been estimated to occur with a stoichiometry of 3H + :1Ca 2+ [64,65], whereas calcification releases one H + for every Ca 2+ precipitated [61]. In this scenario, VCX1 would pump excess H + out of the vesicle and V-type H + ATPases would function in the normal direction (H + into the vesicle).
We detected a highly 2N-specific homolog of SLC4 Cl -/bicarbonate exchangers, which are well known to play roles in intracellular pH regulation in animal cells [40]. The SLC4 homolog might function to maintain optimal balance of pH and carbonate/bicarbonate in the coccolith deposition vesicle for calcification. None of the 12 carbonic anhydrases identified showed evidence for highly 2N-specific expression in inter-library comparisons. The putative carbonic anhydrase identified with highly 1N-specific expression, GS00157, was a member of the β-family (closest homology to a putative carbonic anhydrase identified in Pleurochrysis sp.). Members of this family can localize to either the mitochondria, the chloroplast, or potentially other locations in C. reinhardtii [66]. Target prediction was not possible with available EST sequence from GS00157, so its function remains unclear. Identification of the Ca 2+ , H + and carbonate transport-related transcripts will allow quantitative comparisons of their activities and functions to understand how 2N cells control the large Ca 2+ and H + fluxes involved in calcification.
The 2N-specific SNARE homolog could likely play a role in coccolith secretion, as SNARE proteins are the machinery involved in fusion of secretory vesicles with the plasma membrane during exocytosis [41]. The secretion of coccoliths to the cell surface is a massive and highly coordinated exocytosis event specific to 2N cells: each coccolith is an ellipsoidal plate with axial dimensions ranging from 2 to 4 μm, a substantial fraction of the dimension of an individual E. huxleyi cell (4 to 5 μm) [7]. In contrast, 1N cells produce much smaller ellipsoid organic scales with dimensions of 0.4 to 0.6 μm [11].
The glycoprotein GPA was previously found associated with coccolith polysaccharides and displays Ca 2+ -binding activity, so it likely plays a role in biomineralization [43]. However, the GPA gene (GS09822) is well expressed by both calcified 2N cells and non-calcified 1N cells. The set of transcripts hypothesized by Quinn et al. [44] to be involved in biomineralization based on differential expression between calcifying and non-calcifying cells (that were presumably 2N) showed a similar pattern. Although Quinn et al. did not provide sufficient information to retrieve all of these transcripts from public databases, most of the ones that could be obtained were represented in both of our libraries. Those genes might still be subjected to differential expression or post-transcriptional control. It is also intriguing to speculate that GPA (and perhaps other biomineralization proteins) might have a structural role in non-calcified 1N cells, potentially as a component of 1N-specific organic scales.

Possible mechanisms for control of the transcriptome
The identification of 1N-specific Myb-like transcription factors and alternative histones provides the first insight into transcriptional and epigenetic regulatory mechanisms in haptophytes. Myb transcription factors are known to control a large variety of cell differentiation processes in plants, including sperm formation [38,39]. Histone variants are differentially expressed in different mammalian tissues [67] and exert epigenetic control in both animals and plants [68,69].
Of the several H2 homologs identified, GS10455 was detected only in 1N cells. It was identical in amino acid sequence to GS07154, expressed in both cells. Both grouped phylogenetically within the canonical H2A histone group. GS06749 was expressed in both 1N and 2N cells yet it was associated phylogenetically with the H2AZ and H2AV variant histones involved in epigenetic controls in plants, yeast, and animals [69]. Therefore GS06749 might be involved in epigenetic controls in E. huxleyi. We did not identify a histone H2A variant highly specific to either phase. The potentially 2N-specific H4 histone variant GS02435 is intriguing because H4 histones are nearly 100% identical across all eukaryotes [67,70], yet this H4 variant is unusually long with an extended amino-terminal region with no homology in other eukaryotic H4s. The amino-terminal region of histones, including H4, is the site of specific acetylations and methylations that are thought to influence transcriptional activity of the bound DNA [71]. A strange aspect of GS02435 was its absence from the 1N genome. GS02435 could have been lost from RCC1217 during the culture of this clonal isolate or it could represent a mutated allele present in one of the sets of chromosomes of the 2N genome that was not transmitted to the 1N genome.
Alternatively, E. huxleyi might be heterothallic (as mentioned above). If the GS02435 gene is on a sex-specific genome segment, half of the 1N cell products of meiosis would not contain it.

Conclusions
The expanded EST dataset on E. huxleyi represents an essential community resource for the on-going annotation and manual curation of the whole genome sequence of E. huxleyi and for improved understanding of coccolithophorids and haptophytes. The greatest gap in current understanding of the ecology and natural history of E. huxleyi and other coccolithophores is lack of knowledge of the occurrence, distribution, and role of 1N cells. Currently, there are no techniques that can reliably recognize the presence of 1N E. huxleyi cells in plankton samples. In addition to revealing potent new hints about specific elements of coccolithophorid cell biology (for example, calcification, motility, and signaling), our dataset will help create useful molecular markers of the occurrence of 1N cells in the field. For example, quantitative RT-PCR using primers specific to E. huxleyi flagellar genes might detect the presence of active 1N cells in natural plankton samples.
The heteromorphic haplo-diplontic life cycle and excellent fossil record of coccolithophores makes them an especially attractive group in which to study the influence of the life cycle on gene evolution. An interesting observation was that clusters shared between both life cycle phases were more likely to have homologs among many taxonomic groups whereas clusters unique to one phase or the other were more likely to be orphans. The coming availability of the whole genome sequence of E. huxleyi will allow the generation of longer transcript models. This will increase the number of genes for which homologs can be identified, permitting true phylogenomic analyses to track gene origins and more confidence in assignment of genes as orphans. At the same time, genomic and EST resources are being developed for other coccolithophores and haptophytes. It will then be possible to

Measurement of cell properties, cell cycle, and photosynthetic efficiency
Cell abundance was measured daily and at each harvesting time point using a Becton-Dickinson FACSort flow cytometer equipped with a 488 nm laser (BD Biosciences, San Jose, California, USA). Cell optical properties were characterized by forward scatter (proportional to protoplast size), side scatter (proportional to optical complexity), and chlorophyll fluorescence. A 3.8% NaCl sheath fluid was used for whole cell measurements to match the refractive index of seawater permitting measurements of forward scatter [75]. Cell cycle phase was determined at each point in harvesting by flow cytometric analysis of Sybr Green I-stained extracted nuclei. Nuclear extraction using previously published protocols [20,76] was successful on 1N cells but not on 2N cells. Photosynthetic efficiency was measured as Fv/Fm with a Phy-toPAM pulse-amplitude modulated fluorometer.

RNA isolation, cDNA library construction, and sequencing
Cells were harvested by filtration onto 1.2 μm pore-size membrane filters (Millipore) and extracted following the Trizol protocol. Ethanol-washed total RNA pellets were resuspended in RNA-free H 2 O, treated with DNase I and further purified using an RNeasy Minikit following the manufacturer's recommendations (Qiagen, Valencia, California, USA). RNA quantity and quality was assessed using a Nanodrop spectrophotometer (Thermo Fisher Scientific, Waltham, Massachussets, USA) and a Bioanalyzer (Agilent Technologies, Inc., Santa Clara, California, USA). The 260:280 ratios were typically greater than 2.2 and absence of degradation was evidenced by sharp 18 s and 28 s bands. Equal amounts of total RNA from each time point were pooled before sending to Vertis Biotechnologies (Freising-Weihenstephan, Germany) for library construction following a protocol for fulllength enriched cDNA that includes two successive poly-A RNA purifications, first-strand cDNA synthesis at 42°C for 45 minutes, followed by a ramp up in temperature for a final 10 minutes at 55°C. cDNA was amplified by 19 cycles of PCR. cDNA was normalized by one cycle of denaturation-reassociation followed by separation of reassociated double-stranded cDNA from normalized single-stranded cDNA by hydroxylapatite chromatography. Normalized cDNA was amplified with nine cycles of PCR followed by limited exonuclease digestion. The >0.5 kb size fraction was isolated by agarose gel electrophoresis and directionally ligated into the EcoR1 and BamH1 sites of the pBS II sk+ vector. Ligations were elec-troporated into the T1 Phage resistant TransforMax EC100-T1R (Epicentre Biotechnologies, Madison, WI, USA) electrocompetent E. coli cells. Libraries were then sent to Genoscope (Evry, France) for 5' Sanger sequencing.

EST sequence processing and analysis
Sequences derived from the 5'-end reads were trimmed according to Phred TRIM values [77,78]. Vector, adaptor and polyA sequences were removed using in-house software and the NCBI/UniVec database [79]. The longest high quality regions of each read were used as ESTs. At this stage, we found no obvious contaminations in our EST dataset from other organisms using BLASTN against NCBI/GenBank [80]. ESTs ≥ 50 nucleotides long were selected for further analysis. Initial single-linkage clustering was performed using BLAT version 34 [81] following the criteria that there was ≥ 98% identity across the BLAT alignment and an additional constraint on the alignment (that is, either the alignment was ≥ 150 nucleotides or ≥ 90% of the length of the shorter trimmed read or was long enough that the extremities of the alignment were within a few bases from the end of one of the ESTs in comparison). Next we used the CAP3 program (the version as of December 2007) [82] to generate one or more 'mini-clusters' for each of the initial clusters. A consensus sequence was simultaneously obtained for each mini-cluster. We assigned identifiers (such as 'e00001.1') to mini-clusters. Finally, we performed a third round of clustering based on the overlap of consensus sequences after BLAT mapping on the JGI E. huxleyi draft genomic sequences [10] as well as pairwise sequence similarity criteria as above, provided that the latter were consistent with the BLAT genomic mapping data. If the longest consensus sequence of the mini-clusters composing a cluster was shorter than 90 nucleotides, we discarded the corresponding cluster. The clusters finally obtained, each containing one or more mini-clusters, were denoted by distinct identifiers (for example, 'GS00001') in this study. All the EST sequences determined in the present study were deposited in the EMBL database with the assigned accession numbers provided in Additional data file 2. At the final stage of the present study, we noticed a possible contamination of yeast cloning vector (GS12427 with 15 ESTs matching to yeast expression vector pYAA-ZP-MCS EU882163.1). We re-performed BLASTN searches of all mini-cluster consensus sequences against the latest version of GenBank, and found no additional possible contaminations in our EST dataset. Thus, we concluded that possible contaminations did not affect our conclusions in this study. The GS12427 cluster was removed from all lists of clusters in this manuscript but not from Additional data file 2.
Transcriptome complexity, or an estimate of the total number of expressed genes represented in each library and in the combined library, was assessed with the Chao1 estimator [24], which has been recommended for estimating microbial diversity in rDNA libraries [95] and has previously been used for analysis of EST libraries [96], and using a ML estimator developed for EST analysis [23]. The ML analysis was performed by artificially dividing the ESTs into two sets based on the time of sequencing (that is, the first and the second rounds of sequencing) and by counting the number of clusters represented by either or both of the two sets. The ML analysis assumes a uniform distribution for the probability of finding an object (in our case, an EST cluster). This assumption may not exactly apply to our dataset, although our ESTs were derived from normalized libraries and the distribution of EST reads per cluster followed a negative exponential curve characteristic of Poisson processes ( Figure S3 in Additional data file 1). The ML estimates should thus be considered as qualitative. Chao1 and transcriptome diversity were calculated using EstimateS [97]. Sampling coverage of the libraries was estimated to be >50% for both libraries according to the ratio of total clusters to the estimates of transcriptome richness (  1N, 2N, and combined libraries, respectively) but the same trends between libraries were seen. Empirical estimates of coverage based on identification of well-known and highly conserved flagellar-related genes were between the two higher coverage estimates, so those are used in Table 4. Shannon diversity H is a function of the total number of genes detected and the distribution of ESTs among the genes [99]. H is maximal when every gene that is detected is represented by an equal number of ESTs: Hmax = lnS (where S is the total number of genes that are represented in the library). When H is close to Hmax, it suggests that a new EST generated has a nearly equal probability of being assigned to any of the genes that exist in the library, as expected after normalization. The abundance based estimator of Jaccard similarity index, which estimates the overlap between the 1N and 2N libraries taking sample coverage into account [100], and the Shannon diversity of each library and the combined library were also calculated using EstimateS. Statistical analysis for differential representation between libraries, for both KOG classes and individual clusters, was performed using the method by Audic and Claverie [25]. MUSCLE alignment and phylogenetic analysis werer performed with the web service Phylogeny.fr [101].

DNA isolation
We filtered 25 ml of dense but growing cultures of RCC1216 (2N) and RCC1217 (1N) onto 25 mm 1.2 μm pore filters (Millipore) and DNA from these were extracted using the Qiagen DNeasy Plant Minikit.

Primer design, reverse transcription and PCR for confirming gene expression patterns
Oligonucleotide primers were designed using on-line software Primer3 [102] and double-checked for possible self and primer-primer interactions using the on-line Oligo Analysis and Plotting Tool (Operon) [103]. Custom primers were constructed by Eurogentec (Angers, France) and are provided in Table S1 in Additional data file 7.
All RNA samples were diluted to 37.5 ng μl -1 prior to reverse transcription (RT; final reaction concentration 16.9 ng μl -1 ) using the Thermoscript RT-PCR system (Invitrogen) with oligo-dT 20 mers following the manufacturer's protocol with the following temperature selections: RNA was denatured with primer and dNTPs at 65°C for 5 minutes followed by transfer to ice immediately prior to addition of enzyme and buffer. RT was performed at 55°C for 10 minutes, 60°C for 30 minutes, 65° for 10 minutes, and terminated at 85°C for 5 minutes. RT-negative (RT-) reactions were performed in parallel, substituting water for enzyme. Following the RT termination step, samples were treated with RNase following the manufacturer's recommended protocol. All RT+ cDNA and RT-samples were diluted 1:10 prior to testing by PCR. PCRs were performed using the GoTaq PCR Core System I kit (Promega, Madison, Wisconsin, USA) with 1 mM MgCl 2 , 0.2 mM dNTPs, and 0.2 μM of forward and reverse primers. The thermocycler protocol included an initial 2-minute denaturation at 95°C followed by 35 cycles of 45 s denaturation at 95°C, 30 s annealing at 60°C, 90 s extension at 72°C. When preliminary PCR tests showed that the product from genomic DNA was less than 1 kb, the extension was typically shortened to 60 s.

Authors' contributions
PvD, CdV, and IP together conceived the project. PvD prepared axenic cultures of E. huxleyi, managed all experimental work, and wrote the manuscript. IP provided the initial nonaxenic E. huxleyi cultures and assisted with initial experimental work. PW and CDS managed sequencing at Genoscope. HO, implemented the bioinformatics pipeline for EST processing, clustering, and BLAST searching, and made major contributions at all stages of manuscript preparation. HO, SA, JMC, and PvD performed statistical analyses.

Additional data files
The following additional data files are available with the online version of this paper: Figures S1 to S15 (Additional data file 1); an Excel-format list of all clusters with component mini-clusters, cDNA clones, and associated EMBL accession numbers (Additional data file 2); an Excel-format list of clusters and mini-clusters with read numbers by library, and top homologies in UniProt, Swiss-Prot, KOG, and CDD (Additional data file 3); an Excel-format list of all clusters predicted by statistical analysis to be 1N-specific, including all orphan clusters and clusters with reads in the 2N library that have P < 0.01 associated with the difference in read number between 1N and 2N libraries (Additional data file 4); an Excel-format list of all clusters predicted by statistical analysis to be 2Nspecific, including all orphan clusters and clusters with reads in the 1N library that have P < 0.01 associated with the difference in read number between 1N and 2N libraries (Additional data file 5); an Excel-format list of selected KEGG genomes used for the taxonomic search (Additional data file 6); a list of all oligonucleotide primers and a summary of RT-PCR results in Tables S1 and S2 (Additional data file 7); a detailed description of identification and analysis of flagellar-related homologs (Additional data file 8); an Excel file spreadsheet indicating the results of automatic queries with C. reinhardtii flagellar-related proteins against E. huxleyi EST clusters, and an analysis of these clusters compared to the KEGG database (Additional data file 9); an Excel-format spreadsheet detailing results of the search of Table 3 of Quinn et al. [44], suspected biomineralization-related transcripts (Additional data file 10).  Table 3 of Quinn et al. [44], suspected biomineralization-related transcripts results of the search of Table 3 of Quinn et al. [44], suspected biom-ineralization-related transcripts. Click here for file