LongSAGE profiling of nine human embryonic stem cell lines
Genome Biology volume 8, Article number: R113 (2007)
To facilitate discovery of novel human embryonic stem cell (ESC) transcripts, we generated 2.5 million LongSAGE tags from 9 human ESC lines. Analysis of these data revealed that ESCs express proportionately more transcripts encoding RNA binding proteins than terminally differentiated cells, and identified novel ESC transcripts, at least one of which may represent a marker of the pluripotent state.
Embryonic stem cells (ESCs) can be derived from the inner cell mass of blastocysts and are defined by their ability to be propagated indefinitely as undifferentiated cells with the potential, upon appropriate stimulation, to generate cell types representing all three embryonic germ layers. Since the first reported isolation of human cells with these properties, the derivation of more than 150 such lines has been described. This large collection of human ESC lines provides opportunities for understanding the earliest stages of human embryo and tissue development, as well as for elucidating the mechanisms that can permanently maintain pluripotency. Studies of mouse ESCs have defined a number of genes that appear to play key roles in this process, including those encoding Oct4, Nanog [4, 5], Sox2, FoxD3 and fibroblast growth factor-4 [8, 9]. Comparisons of mouse and human ESCs have also revealed a number of conserved signaling pathways, including those involving JAK/STAT, transforming growth factor-β and fibroblast growth factor [10–12]. However, cross-species analyses of microarray data [13, 14] and expressed sequence tag (EST) resources [15–18] suggest that additional molecular regulators of ESC self-renewal may exist and that likely candidates are heterochronic genes, microRNAs, genes involved in telomeric regulation and polycomb group repressors.
Microarray-based approaches have been used to define the transcriptomes of numerous human ESC lines, including BG01, BG02, WA01, WA07, WA09, WA13, WA14, TE06, UC01 and UC06 [19–22]. These studies provide a rich resource for cell line comparisons; however, incomplete annotation of the genome and inherent biases in the microarray technology limit interpretation to well characterized, abundantly expressed transcripts [23–25]. A number of DNA sequence-based approaches have also been used to study the human ESC transcriptome, including EST analysis, serial analysis of gene expression (SAGE) and massively parallel signature sequencing (MPSS) [16, 18]. Comparisons of these datasets have been used to search for genes that might be required for maintenance of pluripotency [13, 15, 16, 22] but, interestingly, the datasets exhibit limited overlap, in some cases as low as 1% [26–28], possibly because of the different technologies employed in different studies. The fact that a large proportion of transcripts expressed in ESCs do not correspond to annotated genes has further confounded such comparisons. To generate a transcript discovery resource complementary to previous work, we undertook a large scale gene expression analysis of nine different human ESC lines, maintained as undifferentiated cells, using the long serial analysis of gene expression (LongSAGE) approach.
Results and discussion
Digital gene expression profiling of nine human ESC lines reveals an enrichment of RNA binding proteins
LongSAGE libraries were constructed using total RNA purified from nine different human ESC lines cultured as undifferentiated cells by serial passaging on mouse embryonic fibroblast (MEF) feeder layers  (Table 1). To enable detection of the majority of the moderately to abundantly expressed transcripts, we sequenced most libraries to a depth of approximately 200,000 tags. However, in one case (the library prepared from WA09 cells), we generated 468,252 tags. To ensure that tags included in the libraries were not contaminated with transcripts expressed from the MEF feeder layers, all tags matching the mouse reference genome sequence were excluded from further analysis (Additional data file 1). SAGE libraries were analyzed individually, and also as an electronically pooled 'meta-library' containing 2.5 million tags representing 379,645 different tag sequences. Of these, 73% were observed only once ('singletons'). Our previous experience indicated that singletons are enriched for experimental artifacts (sequencing errors, reverse transcriptase artifacts, and so on) as well as rare transcripts . To reduce the artifacts, we assigned confidence values to each tag sequence and selected for analysis only high quality tags as described . This filtering reduced the total number of different tag sequences to 268,515 (Additional data file 2). Of these, 40% of the singletons and 87% of the non-singletons could be mapped to publicly available gene expression resources.
To investigate the similarities and differences between the libraries, we performed hierarchical clustering using Pearson correlation coefficients . For this comparison, we included data from four LongSAGE libraries generated from terminally differentiated cells (available from the Cancer Genome Anatomy Project ) to provide an 'out-group'. Figure 1 shows that the libraries for all nine human ESC lines form a cluster distinct from the libraries for the four terminally differentiated cell preparations, as expected. The ESC libraries also do not cluster together based on obvious commonalities between the lines, such as the MEF feeder lines used, sex chromosome karyotype or passage number.
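The clustering step can be illustrated with a minimal sketch that computes Pearson correlation coefficients between tag-count vectors and then merges libraries by average linkage on the distance 1 − r. The library names and counts below are toy values, and the choice of average linkage is an assumption, since the text does not state which linkage rule was used.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def cluster(libraries):
    """Average-linkage agglomerative clustering on distance 1 - r.

    Returns the list of merges, in the order they were made.
    """
    names = list(libraries)
    dist = {
        frozenset((a, b)): 1 - pearson(libraries[a], libraries[b])
        for a, b in combinations(names, 2)
    }

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges

# Toy tag counts (hypothetical): two ESC-like and two differentiated-like libraries.
toy_libraries = {
    "ESC_A": [120, 80, 5, 2],
    "ESC_B": [110, 90, 7, 1],
    "DIFF_A": [4, 6, 150, 90],
    "DIFF_B": [3, 9, 140, 100],
}
merges = cluster(toy_libraries)
```

With these toy counts the two ESC-like libraries and the two differentiated-like libraries each merge first, and the final merge joins the two groups, mirroring the out-group behaviour described for Figure 1.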
To assess the representation of known genes in the nine human ESC transcriptomes, we compared our data to other human sequence tag-based resources [15–18]. Highly expressed genes in each of the human ESC libraries showed significant overlap with previously published ESC SAGE and MPSS datasets, but the diversity of genes identified by our LongSAGE data was significantly greater (Figure 2). To explore the functions encoded by transcripts detected in the LongSAGE libraries, we divided the genes (identified by uniquely mapping tags) into their respective Gene Ontology (GO) slim categories. Pair-wise comparisons of individual human ESC libraries showed little difference in the relative proportions of each of the GO slim categories (Additional data file 3). In contrast, a similar comparison of individual or pooled ESC libraries to the differentiated cell lines showed a statistically significant increase, in the ESC libraries, in the proportion of transcripts encoding RNA binding proteins and mitochondrial proteins (P = 1.8 × 10⁻⁷ and 1.0 × 10⁻⁶, respectively, by one-sided t-tests).
To investigate the potential functional significance of increased expression of transcripts for RNA binding proteins, we compared the global splicing profile of the ESCs and the four libraries of terminally differentiated cells. This was done by performing pair-wise comparisons across all transcripts in both the ESC and the terminally differentiated cell meta-libraries with the position of each uniquely mapped LongSAGE tag for which a transcript was known. These analyses did not reveal any difference in global transcript splicing patterns between the two meta-libraries, although differences in the relative abundance of specific transcript isoforms were identified. Of a total of 70 transcript isoforms found to be differentially expressed between the ESC and differentiated cell meta-libraries, 8 demonstrated statistical significance (P < 3.0 × 10⁻⁵; Additional data file 4). The most significantly affected transcript (lowest P value) encoded Secreted frizzled-related protein-1 (Sfrp1), a well characterized antagonist of WNT signaling. Our analysis suggested that the two isoforms of Sfrp1 we identified either retained or lost the 3' untranslated region (UTR; Figure 3). The isoform lacking the 3' UTR was found exclusively in the ESCs. Closer examination of the 3' UTR region revealed putative miRNA target sites for two evolutionarily conserved miRNAs, the mouse homologues of which were found previously to be expressed in murine ESCs (Figure 3). Given that activation of the canonical WNT signaling pathway induces differentiation and cell proliferation, we speculate that the expression of Sfrp1 may be regulated through miRNA-directed translational repression and that this regulation is bypassed through alternative 3' end formation in pluripotent ESCs.
We next examined the expression of transcripts that encode previously identified markers of undifferentiated ESCs. These include the transcription factors Oct4 and Nanog [4, 5], the cell surface proteins tdgf-1 and thy-1, Lck, connexin cx43, Rex1, and Lefty-A and Lefty-B. In addition, we looked for transcripts from six genes associated with early stages of ESC differentiation. Table 2 shows the normalized gene expression levels across all cell lines. A similar pattern of expression is observed across all lines, with the exception of HSF-6, which exhibited a decrease in expression of ESC marker genes and a concomitant increase in expression of genes associated with differentiation, including alpha-fetoprotein. Notably, expression of Nanog, a divergent homeodomain protein that directs propagation of undifferentiated mouse ESCs, was not detected in the HSF-6 library. These features are consistent with the closer relationship of the HSF-6 library to libraries from differentiated tissues than to other ESC libraries (Figure 1). We therefore excluded the HSF-6 library from further analysis.
A previous analysis of SAGE data generated using ES03 and ES04 cells showed that Rex1 was within the top 25 differentially expressed transcripts, with no Rex1 tags detected in the ES04 line and an absence of Rex1 expression in ES04 cells confirmed by quantitative and semi-quantitative real time (RT)-PCR . Interestingly, in our LongSAGE libraries, tags for Rex1 were present in all nine ESC libraries, including the library prepared from ES04 cells and there was less than a three-fold difference in Rex1 expression between ES03 and ES04 (Table 2).
To generate a list of transcripts common to all libraries (excluding the HSF-6 library because of the differentiation markers found therein), we first identified tags from each library that uniquely mapped to transcripts within RefSeq and the Mammalian Gene Collection (MGC). This analysis identified a set of 4,337 LongSAGE tags present in all libraries (Additional data file 5). Comparison of this list to those generated by previous MPSS and SAGE approaches revealed extensive (80%) concordance between the SAGE-based transcriptomes. In contrast, 52% of genes identified by MPSS were not found in either of the SAGE common gene lists. Some of this lack of concordance may be explained by differences in the tagging restriction enzyme used by the two protocols (NlaIII for SAGE and DpnII for MPSS) and the fact that different mRNA preparations were used in each study. To further explore this lack of concordance, we compared the LongSAGE and MPSS-derived gene lists to a common gene list derived from Affymetrix expression arrays generated from the same RNAs used to construct our LongSAGE libraries. The Affymetrix common gene set contained more than 80% of the LongSAGE common gene list (Additional data file 5) while MPSS contained only 68% of the genes on this list.
Identification of novel ESC-specific transcripts
LongSAGE offers opportunities for discovering novel transcripts. These can be identified as tags that map uniquely to the genome but not to any available transcript resources. To look for these, we used the 2.5 million tag meta-library, which contained 379,645 unique tag sequences. Grouping LongSAGE tags that mapped to genomic locations in close proximity to one another resulted in the identification of 24,593 transcription units. Of these, 14,588 did not overlap with known genes and were classified as novel. Most of these novel transcription units were expressed at low levels, with 46% (6,672) identified by a single LongSAGE tag. Even though singletons are enriched for artifacts, many of these are likely to represent real transcripts, for two reasons: first, they map to the genome; and second, we and others have shown previously that at least 70% of novel, singleton, high quality LongSAGE tags identify rare transcripts whose expression can be confirmed in RNA-dependent RT-PCR experiments.
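Grouping tags into transcription units by genomic proximity can be sketched as a single pass over sorted tag positions. The distance threshold (`max_gap`) and the input layout are hypothetical, because the text does not give the proximity cut-off that was applied.

```python
from collections import defaultdict

def group_tags(tag_hits, max_gap=2000):
    """Group tag positions into candidate transcription units.

    tag_hits: iterable of (chromosome, strand, position) tuples.
    max_gap:  hypothetical maximum distance (bp) between neighbouring tags
              that are assigned to the same unit.
    Returns a list of (chromosome, strand, start, end, n_tags) tuples.
    """
    by_locus = defaultdict(list)
    for chrom, strand, pos in tag_hits:
        by_locus[(chrom, strand)].append(pos)

    units = []
    for (chrom, strand), positions in sorted(by_locus.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= max_gap:
                # Close enough to the previous tag: extend the current unit.
                prev = pos
                count += 1
            else:
                # Gap too large: close the current unit and start a new one.
                units.append((chrom, strand, start, prev, count))
                start = prev = pos
                count = 1
        units.append((chrom, strand, start, prev, count))
    return units

# Toy positions (hypothetical).
hits = [("chr3", "+", 100), ("chr3", "+", 900), ("chr3", "+", 10000), ("chr3", "-", 150)]
units = group_tags(hits)
```

Units supported by a single tag (`n_tags == 1`) correspond to the singleton class discussed above.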
To further characterize these putative novel, low-abundance ESC library specific transcripts, we compared the ESC meta-library to publicly available data derived from 247 non-ESC SAGE libraries that together contained 654,491 unique tag sequences. This comparison identified 20,047 tag sequences found only in the human ESC meta-library (Additional data file 6). For subsequent analyses, we focused on those tags that uniquely mapped at least 2 kb away from any known gene. This analysis reduced the number of tags to 634 (Additional data file 7), of which 301 were found within genomic regions exhibiting sequence conservation between human and mouse or rat (Additional data file 8). We used rapid amplification of cDNA ends (RACE) [47, 48] to clone the 5' ends of 52 of these (Additional data file 9). Alignment of the resulting sequences to the human genome revealed that 22 (40%) were spliced. An open reading frame (ORF) scan of the 52 RACE clone sequences using BioPerl tools and custom scripts identified 6 transcripts that encoded peptides longer than 100 amino acids. However, with the exception of one transcript (HA_003333) that overlapped the 3' end of the MAPK2 gene, none of the identified ORFs demonstrated Ka/Ks ratios suggestive of purifying selection. Hence, these transcripts may not encode proteins but may instead represent non-coding RNAs (ncRNAs).
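The ORF length criterion can be illustrated with a small scanner. This is a plain-Python sketch, not the BioPerl-based scripts used in the study: it assumes ORFs start at ATG, end at the first in-frame stop codon, and lie on the forward strand of the clone sequence; ORFs without a stop codon are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf_codons(seq):
    """Return the length, in codons (stop excluded), of the longest
    ATG-initiated ORF in the three forward reading frames of seq."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None  # ORF closed; look for the next start
    return best

def encodes_long_peptide(seq, min_residues=100):
    """True if the longest ORF encodes a peptide longer than min_residues."""
    return longest_orf_codons(seq) > min_residues

# Toy sequence (hypothetical): ATG followed by 100 alanine codons and a stop.
toy_clone = "CC" + "ATG" + "GCT" * 100 + "TAA" + "GG"
```

Here `longest_orf_codons(toy_clone)` counts the start codon plus 100 further codons, so the toy clone passes the 100-residue threshold.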
Four RACE clones were found to have genomic coordinates that overlapped with those of known transcripts (Additional data file 9). One of these (HA_003240; Figure 4) is of particular interest because it contains the entire coding sequence of the Foxb1 gene within its first intron. Foxb1 encodes a winged helix transcription factor involved in the development of the vertebrate central nervous system and Foxb1-/- mice display phenotypes consistent with a requirement for this gene in both embryonic and postnatal stages of development [51–53]. Interestingly, the ESC meta-library did not contain any tags corresponding to known Foxb1 transcripts except for a single Foxb1 tag in the HSF-6 library. This general lack of Foxb1 expression in ESCs and the genomic location of the Foxb1 gene within the first intron of HA_003240 are consistent with the notion that Foxb1 expression is repressed by expression of HA_003240, possibly by steric inhibition of the transcription initiation complex . The HA_003240 sequence overlaps partially with an EST obtained from an undifferentiated human ESC line (CD049816), as well as with ESTs from an embryonic carcinoma line, a kidney carcinoma line and hypothalamus tissue (for example, DA713666, DB173211 and BI458015, respectively). Examination of the promoter region of HA_003240 revealed the presence of highly conserved sequences containing an Oct/Sox binding element, suggesting that HA_003240 expression may be maintained in pluripotent ESCs through the recruitment of an Oct4/Sox2 complex (Figure 4). Oct4 encodes a transcription factor that regulates a number of key human ESC markers, including Nanog, through co-operative binding with a Sox family member . Given the documented role for Foxb1 in controlling the differentiation of neuronal cell types, the genomic organization of the Foxb1 locus is intriguing and suggests an interesting mechanism for negatively regulating Foxb1 expression in Oct4-expressing cells.
Many pseudogenes have been identified in the human genome using homology-based approaches [56–58]. Pseudogenes are generally not transcribed due to their lack of functional promoters [59, 60]. However, there are examples of pseudogenes that have retained or acquired functional promoters, leading to their transcription . Because of the low levels of expression of the 52 novel transcripts (on average, only 3 tags per million) we asked whether the 5' RACE clones were derived from expressed pseudogenes. Comparison of the RACE clone sequences to three computationally generated lists of known human pseudogenes [56–58] revealed only one clone (HA_003350) with a predicted pseudogene contained within its exon. Furthermore, with the exception of HA_003333, none of the novel transcript sequences showed significant sequence similarity to any known ORF (using a 70% ORF threshold ). Taken together, these analyses do not support the notion that the novel genes identified by our analysis are enriched for expressed pseudogenes.
To more fully characterize a transcript identified by a singleton tag (Additional data file 9), we attempted to recover a full length transcript using 5' and 3' RACE and primers annealing within the terminal exon of the putative transcript. Alignment of the resulting candidate full length sequence to the human genome revealed a transcript that contained two introns (Figure 5). Examination of the genomic region surrounding this transcript showed that it resides in a region of the long arm of chromosome 3 (chr3:110,539,351-110,584,565) lacking annotated transcripts. The putative transcriptional start is located 266 bp from the transcriptional start site of Dppa4, a gene known to have an expression pattern in ESCs that is similar to that of Oct4 (Figure 5). To investigate the possibility that this promoter region is regulated directly by Oct4, we looked for the presence of conserved Octamer and Sox (high mobility group (HMG)) elements. A single 20 bp region of cross-species sequence conservation was found that contains a consensus binding element for an Octamer/Sox dimer, suggesting that the novel gene is regulated by Oct4/Sox2 (Figure 5; chr3:110,539,180-110,539,200). In support of this finding, the conserved region was found to reside within a probe identified by chromatin immunoprecipitation (ChIP)-chip analysis as a target of Oct4 and Sox2 (probe spans chr3:110,539,028-110,539,588). Taken together, these analyses suggest that Dppa4 and the novel transcript are divergently transcribed from a common promoter bound by an Oct4/Sox complex. Based on its proximity to the Dppa4 gene we have named this novel transcript Spd4 (for 'shares promoter with Dppa4').
Comparison of the 5' RACE clone sequences to publicly available ESTs revealed 36 (69%) with matches to other ESTs, of which 7 were found only in data derived from pluripotent human ESC lines. One RACE clone that overlapped an EST derived from pluripotent human ESC lines (HA_003152) was also found to be expressed in all nine ESC lines studied here. BLAT  alignment of the 5' RACE clone sequence to the human reference genome sequence revealed that HA_003152 contained two introns and resided within a genomic region that exhibited sequence similarity to long interspersed nuclear elements. An ORF scan revealed a 129 amino acid peptide encoded in the second exon with homology to the carboxyl terminus of the LINE p40 ORF.
To explore the expression pattern of the HA_003152 transcript, we used quantitative RT-PCR (qPCR) to compare transcript levels in RNA purified from human ESCs maintained in an undifferentiated state with those in RNA obtained from human ESCs that had been stimulated to differentiate into embryoid bodies. To provide a comparative dataset we selected five additional novel transcripts for qPCR. In all cases, qPCR amplicons were designed to cross exon-exon boundaries. As controls we also monitored expression of Oct4, Lin28 and Msx1 in the same RNA preparations. Figure 6 shows the expected expression pattern for the control gene set, with a reduction in expression of Oct4 and Lin28 in the human ESCs stimulated to differentiate into embryoid bodies and an up-regulation of expression of the early differentiation marker Msx1. Significant reduction of expression was observed in four of the six transcripts tested, including HA_003152, whose expression was undetectable at day 30 (Figure 6). These transcripts are hence potential markers of pluripotency.
As part of the ongoing effort to elucidate mechanisms regulating ESC self-renewal, we generated 2.5 million LongSAGE tags from nine human ESC lines. Comparison of these data to libraries prepared from differentiated tissues identified a group of ESC-library specific transcripts and an enrichment of transcripts encoding mitochondrial and RNA binding proteins (by comparison to differentiated cells). RNA binding proteins play a role in the regulation of mRNA processing, and examination of non-canonical LongSAGE tags in the human ESC libraries suggests that these cells express a distinct collection of gene isoforms. One such isoform may bypass translational down regulation through the expression of a transcript lacking predicted miRNA target sequences.
An emerging theme in digital gene expression profiling is the identification of a large class of transcripts that map uniquely to the genome, but cannot be localized to any known or computationally predicted transcripts. Tags in this class are predominantly found at relatively low levels. Analysis of the 2.5 million LongSAGE tags generated in the course of this study revealed 14,588 such tag sequences, a subset of which were found exclusively in human ESCs. As a first step towards understanding the relevance of these transcripts to ESC biology we generated 5' RACE clones for 52 novel apparently ESC-specific transcripts. Analyses of these transcripts revealed that the majority do not appear to encode proteins and do not overlap existing pseudogene predictions. One transcript was found to be expressed across all nine ESC lines we profiled and matched ESTs generated by others from ESCs. Its restricted expression pattern suggests that it may represent a novel transcriptional marker for the maintenance of pluripotentiality. In addition to the discovery of this potential marker, we also identified four novel transcripts that may participate in the regulation of expression of known genes, one of which is known to play a direct role in differentiation. Our analyses indicate that there are many previously undiscovered transcripts expressed in human ESCs and support the contention that sampling of SAGE libraries to depths beyond currently accepted practice is required to fully explore the coding potential of the mammalian transcriptome. To assess possible functions associated with such rare transcripts, we are actively pursuing the cloning and characterization of the remaining novel human ESC-specific transcripts identified in this study.
Materials and methods
Cell culture and RNA isolation
Detailed information regarding the human ESC lines used in this study can be found at the NIH Stem Cell Information website . The passage numbers of the cells analyzed in this study are presented in Table 1. Total RNA was prepared using Trizol reagent (Invitrogen, Burlington, ON, USA) following the manufacturer's protocol and was assayed for quality and quantified using an Agilent 2100 Bioanalyzer (Agilent Technologies) and RNA 6000 Nano LabChip kit (Caliper Technologies, Hopkinton, MA, USA).
LongSAGE library construction
Nine LongSAGE  libraries were constructed from 5-20 μg of DNase I-treated total RNA as described  (DNase I from Invitrogen). LongSAGE data generated for this study are available through our embryonic stem cell transcriptomes website  and through the CGAP web portal .
Novel transcript identification
LongSAGE tags of at least 99.9% accuracy (calculated using Phred [66, 67] quality scores) from the meta-library were compared to 247 publicly available human SAGE libraries (GEO, Discovery db). To allow direct comparison of the LongSAGE data to the 14 bp SAGE tags available in the public libraries, the 3' ends of the 21 bp tags were truncated in silico to form 14 bp tags. A total of 2,508,608 tags corresponding to 222,337 unique 14 bp tag sequences (derived from 379,645 parental 21 bp sequences) were utilized in this analysis. These tags were directly compared to all unique tags from the human SAGE libraries to generate a list of tags found solely in the ESC meta-library.
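The filtering, truncation and comparison steps can be sketched as follows. The accuracy formula is an assumption: it treats tag accuracy as the probability that every base call is correct, using the standard Phred relation between quality score Q and error probability (10^(−Q/10)). The tag sequences and quality scores are toy values.

```python
from math import prod

def tag_accuracy(phred_scores):
    """Probability that all bases in a tag are correct, assuming independent
    per-base error probabilities of 10 ** (-Q / 10)."""
    return prod(1 - 10 ** (-q / 10) for q in phred_scores)

def esc_only_tags(long_tags, public_short_tags, min_accuracy=0.999, short_len=14):
    """Return the truncated tags found only in the ESC data.

    long_tags:         dict mapping each 21 bp tag to its list of Phred scores.
    public_short_tags: iterable of 14 bp tags from the public SAGE libraries.
    """
    truncated = {
        tag[:short_len]  # keep the 5' end, i.e. drop bases from the 3' end
        for tag, quals in long_tags.items()
        if tag_accuracy(quals) >= min_accuracy
    }
    return truncated - set(public_short_tags)

# Toy input (hypothetical sequences and quality scores).
toy_long_tags = {
    "CATGAAACCCGGGTTTACGTA": [50] * 21,  # high quality, not in public data
    "CATGTTTTTTTTTTAAAAAAA": [50] * 21,  # high quality, present in public data
    "CATGCCCCCCCCCCGGGGGGG": [30] * 21,  # fails the 99.9% accuracy filter
}
toy_public = {"CATGTTTTTTTTTT"}
esc_specific = esc_only_tags(toy_long_tags, toy_public)
```

Because several 21 bp tags can share the same first 14 bases, the set of truncated tags is smaller than the set of parental sequences, as in the counts reported above.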
Tag-to-gene mapping was performed using the comprehensive mapping of SAGE tags (CMOST) software  as follows. Tags were mapped to various publicly available transcript databases in a hierarchical fashion with the highest quality transcript databases used first. As tags were mapped to a known transcript in a higher quality database, they were excluded from further analysis with subsequent lower quality databases to mitigate redundancies arising from lower quality DNA sequence resources. The following databases were used for CMOST tag-to-gene mapping in this order: MGC , RefSeq , Ensembl transcripts  (exon sequences only), Genbank Human Mitochondrial Sequence (accession AY289102.1), Genbank Non-coding sequences , Ensembl genes  (1,000 bp UTR and intron sequences included), Ensembl ESTs , and Golden path genomic contigs (Genbank Human Genome Assembly Contigs build 34, January 2004 ). In addition to allowing perfect matches, the CMOST approach attempts to account for single base permutations, insertions and deletions, improving the rate of tag-to-gene mapping.
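The hierarchical mapping order can be illustrated with a short sketch in which each tag is assigned to the first (highest quality) database that contains it and is then withheld from all later databases. This is not the CMOST implementation: it handles perfect matches only, and the database contents and identifiers below are placeholders.

```python
def map_tags_hierarchically(tags, databases):
    """Assign each tag to the first database in which it is found.

    databases: ordered list of (name, {tag: transcript_id}) pairs,
               highest quality resource first.
    Returns (assignments, unmapped), where assignments maps
    tag -> (database_name, transcript_id).
    """
    assignments = {}
    remaining = set(tags)
    for name, index in databases:
        hits = {tag for tag in remaining if tag in index}
        for tag in hits:
            assignments[tag] = (name, index[tag])
        # Mapped tags are excluded from all lower-quality databases.
        remaining -= hits
    return assignments, remaining

# Placeholder tags and identifiers (hypothetical).
toy_databases = [
    ("MGC", {"TAG1": "mgc_a"}),
    ("RefSeq", {"TAG1": "ref_a", "TAG2": "ref_b"}),
]
assignments, unmapped = map_tags_hierarchically(["TAG1", "TAG2", "TAG3"], toy_databases)
```

In this example the first tag is credited to the higher-ranked resource even though it also occurs in the second, which is the redundancy-reducing behaviour the ordering is meant to provide.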
SAGE tag-to-gene mapping
LongSAGE tags were mapped to known and computationally predicted transcripts using versions of the following databases available as of March, 2005: RefSeq , RefSeqX , Mammalian Gene Collection , and RefSeqGS . Tags were also mapped to human genomic sequence using the NCBI Reference Sequence Genome database , release 35, August 2004. From the genome sequence, a table was generated containing all 27.4 million potential SAGE tags adjacent to genomic NlaIII restriction sites (CATG). Of these, our analysis defined a subset of 19.4 million genomic tag sequences that were unique within the genome.
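Building the table of potential genomic tags can be sketched as a scan for CATG sites. The sketch assumes that a 21 bp tag consists of the CATG site plus the 17 bases downstream, and that each site yields one candidate tag per strand (CATG is its own reverse complement). The toy sequence is hypothetical; a real run would iterate over genome contigs.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def genomic_tags(seq, tag_len=21):
    """List (tag, site_position, strand) for every CATG site in seq."""
    seq = seq.upper()
    tags = []
    i = seq.find("CATG")
    while i != -1:
        # Forward-strand tag: CATG plus the bases to its right.
        if i + tag_len <= len(seq):
            tags.append((seq[i:i + tag_len], i, "+"))
        # Reverse-strand tag: CATG plus the bases to its left, reverse complemented.
        j = i + 4 - tag_len
        if j >= 0:
            tags.append((revcomp(seq[j:i + 4]), i, "-"))
        i = seq.find("CATG", i + 1)
    return tags

def unique_tags(tags):
    """Keep only tags whose sequence occurs exactly once in the table."""
    counts = Counter(tag for tag, _, _ in tags)
    return [entry for entry in tags if counts[entry[0]] == 1]

# Toy contig (hypothetical) containing a single CATG site.
toy_contig = "A" * 20 + "CATG" + "C" * 20
toy_table = genomic_tags(toy_contig)
toy_unique = unique_tags(toy_table)
```

Restricting the table to sequences that occur once is what allows an observed tag to be assigned to a single genomic position.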
A second table was generated that stored information about exons: genome sequence contig, transcript orientation, exon number, exon boundary type and nucleotide positions of exon boundaries for all approximately 267,000 exons annotated on release 35 of the Reference Sequence genome. The LongSAGE tag sequences were compared to the unique genomic tag table, yielding sets of genomic positions for all tags in the library. These in turn were compared to the table of exon information, producing a mapping for each tag relative to annotated exons.
For the GO category comparisons, a standard t-test comparing two samples was used. The null hypothesis was that the two samples arose from populations with the same mean and standard deviation. The values within each sample were the number of GO categories represented in each library of the set, nine in the ESC set and four in the normal set. To account for variation due to library size, only the transcripts with the top 1,000 expression values were included. A one-sided p value was reported. Microsoft Excel was used to perform the computation.
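For reference, the equal-variance two-sample t statistic implied by the stated null hypothesis can be written as below, with the sample sizes given above (nine ESC libraries and four libraries from differentiated cells). That the pooled-variance form was the one used is an assumption based on the description of the null hypothesis.

```latex
t = \frac{\bar{x}_{\mathrm{ESC}} - \bar{x}_{\mathrm{diff}}}
         {s_p \sqrt{\dfrac{1}{n_{\mathrm{ESC}}} + \dfrac{1}{n_{\mathrm{diff}}}}},
\qquad
s_p^2 = \frac{(n_{\mathrm{ESC}} - 1)\, s_{\mathrm{ESC}}^2 + (n_{\mathrm{diff}} - 1)\, s_{\mathrm{diff}}^2}
             {n_{\mathrm{ESC}} + n_{\mathrm{diff}} - 2},
\qquad
n_{\mathrm{ESC}} = 9,\; n_{\mathrm{diff}} = 4,\; \mathrm{df} = 11
```

The one-sided p value is then the upper-tail probability of this statistic under a t distribution with 11 degrees of freedom.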
To select differentially expressed LongSAGE tags, the ESC and CGN meta-libraries were compared on a tag per tag basis to obtain a p value for the null hypothesis that the two tag frequencies arose from Poisson distributions with the same mean. This was derived using a normal approximation to the Poisson as described by Kal et al. . All transcripts that showed differences with a significance of p < 0.05 were selected. Tag counts were converted to tags per million, and transcripts that differed by less than three-fold were eliminated. All pairs of tags existing within the same transcript were then listed if the differential expression for the two tags was in the opposite direction.
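The tag-by-tag selection can be sketched as a two-proportion Z-test followed by a fold-change filter. The pooled-proportion form of the statistic is one common formulation of the test attributed to Kal et al.; the use of a two-sided p value and the treatment of zero counts (regarded as passing the fold-change filter) are assumptions, and the counts are toy values.

```python
from math import sqrt
from statistics import NormalDist

def kal_z_test(x1, n1, x2, n2):
    """Z statistic and two-sided p value for tag counts x1 of n1 versus x2 of n2."""
    p1, p2 = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)  # pooled tag proportion under the null
    se = sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def is_differential(x1, n1, x2, n2, alpha=0.05, min_fold=3.0):
    """Apply the significance and three-fold filters described in the text."""
    if x1 + x2 == 0:
        return False  # tag absent from both libraries
    _, p_value = kal_z_test(x1, n1, x2, n2)
    tpm_low, tpm_high = sorted((1e6 * x1 / n1, 1e6 * x2 / n2))
    fold_ok = tpm_low == 0 or tpm_high / tpm_low >= min_fold
    return p_value < alpha and fold_ok

# Toy counts (hypothetical), each in a library of one million tags.
strong_difference = is_differential(50, 1_000_000, 5, 1_000_000)
not_significant = is_differential(50, 1_000_000, 40, 1_000_000)
small_fold = is_differential(1000, 1_000_000, 500, 1_000_000)
```

The third example is highly significant but differs by only two-fold, so it is removed by the fold-change filter.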
First strand 5' and 3' RACE ready cDNA was synthesized from 2.0 μg of DNase I (DNA-free™ kit; Ambion, Austin, TX, USA) treated RNA using the BD SMART RACE cDNA Amplification kit following the manufacturer's recommended protocol (BD Biosciences Clontech, Mountain View, CA, USA). Gene specific 5' RACE primers were designed using custom scripts and Primer 3  to lie downstream of the target LongSAGE tag with an optimal Tm of 68°C (Additional data file 10). For 3' RACE reactions a series of primers were designed manually based on the 5' RACE clone sequence (Additional data file 10). The cDNA was amplified using the Phusion™ High-Fidelity PCR Kit (MJ Research, Inc., Waltham, MA, USA) following the manufacturer's recommended protocol with the addition of DMSO to a final concentration of 3%. The cycling conditions consisted of an initial denaturation at 98°C for 30 seconds followed by 10 touchdown PCR cycles starting with 98°C for 10 seconds, 72°C (decreased by 1°C in each subsequent cycle) for 15 seconds, 72°C for 30 seconds; then 29 cycles of 98°C for 10 seconds, 62°C for 15 seconds, 72°C for 30 seconds; followed by an extension at 72°C for 10 minutes. PCR product for each sample (10 μl) was loaded on a 1.2% agarose gel and subjected to electrophoresis for 3.5 hours at 110 mA in 1× TBE buffer (Tris/Boric Acid/EDTA). The gel was stained with SYBR Green (Mandel, Guelph, ON, Canada) and visualized using a Typhoon 9400 Variable Mode Imager (Amersham, Baie d'Urfe, PQ, Canada). Amplicons were extracted from the gel, purified and cloned into the pCR4®-TOPO® vector using the TOPO TA Cloning® Kit for Sequencing (Invitrogen). Plasmid vectors were electroporated into bacterial cells, and recombinant clones were selected on agar plates containing appropriate antibiotics as described . Glycerol stocks were prepared from 12 individual clone isolates per amplicon and stored in 384-well plates. 
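The cycling conditions can be written out step by step to check the touchdown schedule. The sketch below only encodes the temperatures and hold times stated above; the step labels are illustrative.

```python
def touchdown_program():
    """Return the PCR program as a list of (step, temperature_C, seconds)."""
    steps = [("initial denaturation", 98, 30)]
    # 10 touchdown cycles: annealing starts at 72 C and drops 1 C per cycle.
    for cycle in range(10):
        steps += [("denature", 98, 10), ("anneal", 72 - cycle, 15), ("extend", 72, 30)]
    # 29 further cycles at a fixed 62 C annealing temperature.
    for _ in range(29):
        steps += [("denature", 98, 10), ("anneal", 62, 15), ("extend", 72, 30)]
    steps.append(("final extension", 72, 600))
    return steps

program = touchdown_program()
annealing_temps = [temp for step, temp, _ in program if step == "anneal"]
total_hold_seconds = sum(seconds for _, _, seconds in program)
```

The touchdown phase ends at 63°C, so the fixed 62°C annealing step of the remaining 29 cycles continues the same 1°C decrement. Ramp times between steps are not included in `total_hold_seconds`.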
Clone inserts were sequenced on an ABI PRISM 3730 XL DNA Analyzer using BigDye primer cycle sequencing reagents (Applied Biosystems, Foster City, CA, USA).
RNA was obtained from H9 cells before and after induction of differentiation using a 30-day embryoid body protocol. Undifferentiated H9 cells maintained for 7 days on matrigel (BD Biosciences, San Jose, CA, USA) in media conditioned by mouse embryonic fibroblasts and supplemented with 4 ng/ml fibroblast growth factor (bFGF-2) were harvested for embryoid body formation. Briefly, the cells were incubated with TrypLE (Invitrogen) for 10 minutes at 37°C and then collected by scraping. Resultant cell aggregates were subsequently cultured in non-adherent dishes using KOSR-based media without FGF2 for 15 to 30 days. At appropriate time-points RNA was extracted into Trizol. cDNA was synthesized from 2.0 μg of DNase I (DNA-free™ kit, Ambion) treated total RNA using the SuperScript Choice System following the manufacturer's recommended protocol (Invitrogen). Gene specific primer pairs were designed using custom scripts and Primer 3 to amplify approximately 150 bp of the target gene with an optimal Tm of 68°C (Additional data file 10). Whenever possible amplicons were designed to cross exon/intron boundaries. Amplification was performed in a 10 μl reaction mixture containing 5 μl of 2× SYBR Green PCR Master Mix (Applied Biosystems), 2 μl of template cDNA, and 250 pmol of the forward and reverse primer pair. After preparation of the reaction mixtures in 96-well plates, the plates were centrifuged at 800 rpm for 1 minute in an Eppendorf 5810 swing rotor centrifuge (Eppendorf, Westbury, NY, USA). Amplification and detection were performed on an ABI Prism 7600 Sequence Detection System (Applied Biosystems). The PCR protocol consisted of a single cycle of 10 minutes at 95°C followed by 40 two-step cycles, each consisting of 15 seconds at 95°C and 60 seconds at 60°C. Results were analyzed as described using a GAPDH probe for normalization.
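Normalization of qPCR data to a reference gene is often done with the comparative Ct (2^−ΔΔCt) calculation, sketched below. That this specific calculation was used here is an assumption, since the text states only that GAPDH was used for normalization; the Ct values are hypothetical.

```python
def relative_expression(ct_target, ct_reference, ct_target_calibrator, ct_reference_calibrator):
    """Fold change of a target transcript relative to a calibrator sample,
    normalized to a reference gene (comparative Ct method, assuming
    approximately 100% amplification efficiency for both amplicons)."""
    delta_sample = ct_target - ct_reference
    delta_calibrator = ct_target_calibrator - ct_reference_calibrator
    return 2 ** -(delta_sample - delta_calibrator)

# Hypothetical Ct values: differentiated sample versus undifferentiated
# calibrator, with GAPDH as the reference gene.
fold_change = relative_expression(
    ct_target=28.0, ct_reference=18.0,
    ct_target_calibrator=24.0, ct_reference_calibrator=18.0,
)
```

In this toy case the target crosses threshold four cycles later in the differentiated sample at equal GAPDH levels, corresponding to a 16-fold reduction.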
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a summary of mouse specific tag types identified. Additional data file 2 is a table of genomic mappings for 268,515 unique tag sequences found in nine independent human embryonic stem cell lines. Additional data file 3 is a Gene Ontology analysis of nine independent human embryonic stem cells. Tag counts are expressed for each GO category for the top 1,000 by tag count. Additional data file 4 lists statistically significant differentially expressed LongSAGE tags found between embryonic stem cells and terminally differentiated tissues. Additional data file 5 is a table listing the 4,337 genes found in common across 8 undifferentiated human embryonic stem cell lines. Additional data file 6 is a table listing the 20,047 LongSAGE tags exclusively expressed in embryonic stem cell lines. Additional data file 7 is a table listing the 634 LongSAGE tags exclusively expressed in ESCs that uniquely map to the human genome at least 2 kb away from an annotated transcript. Additional data file 8 is a table listing the 301 LongSAGE tags exclusively expressed in ESCs that uniquely map to species conserved regions of the human genome at least 2 kb away from an annotated transcript. Additional data file 9 is a table listing the 52 ESC specific transcripts identified by 5' RACE. Additional data file 10 lists the RACE and qPCR primer sequences used in this study.
Evans MJ, Kaufman MH: Establishment in culture of pluripotential cells from mouse embryos. Nature. 1981, 292: 154-156. 10.1038/292154a0.
Thomson JA, Itskovitz-Eldor J, Shapiro SS, Waknitz MA, Swiergiel JJ, Marshall VS, Jones JM: Embryonic stem cell lines derived from human blastocysts. Science. 1998, 282: 1145-1147. 10.1126/science.282.5391.1145.
Scholer HR, Balling R, Hatzopoulos AK, Suzuki N, Gruss P: Octamer binding proteins confer transcriptional activity in early mouse embryogenesis. EMBO J. 1989, 8: 2551-2557.
Mitsui K, Tokuzawa Y, Itoh H, Segawa K, Murakami M, Takahashi K, Maruyama M, Maeda M, Yamanaka S: The homeoprotein Nanog is required for maintenance of pluripotency in mouse epiblast and ES cells. Cell. 2003, 113: 631-642. 10.1016/S0092-8674(03)00393-3.
Chambers I, Colby D, Robertson M, Nichols J, Lee S, Tweedie S, Smith A: Functional expression cloning of Nanog, a pluripotency sustaining factor in embryonic stem cells. Cell. 2003, 113: 643-655. 10.1016/S0092-8674(03)00392-1.
Avilion AA, Nicolis SK, Pevny LH, Perez L, Vivian N, Lovell-Badge R: Multipotent cell lineages in early mouse development depend on SOX2 function. Genes Dev. 2003, 17: 126-140. 10.1101/gad.224503.
Sutton J, Costa R, Klug M, Field L, Xu D, Largaespada DA, Fletcher CF, Jenkins NA, Copeland NG, Klemsz M, et al: Genesis, a winged helix transcriptional repressor with expression restricted to embryonic stem cells. J Biol Chem. 1996, 271: 23126-23133. 10.1074/jbc.271.38.23126.
Wilder PJ, Kelly D, Brigman K, Peterson CL, Nowling T, Gao QS, McComb RD, Capecchi MR, Rizzino A: Inactivation of the FGF-4 gene in embryonic stem cells alters the growth and/or the survival of their early differentiated progeny. Dev Biol. 1997, 192: 614-629. 10.1006/dbio.1997.8777.
Yuan H, Corbi N, Basilico C, Dailey L: Developmental-specific activity of the FGF-4 enhancer requires the synergistic action of Sox2 and Oct-3. Genes Dev. 1995, 9: 2635-2645. 10.1101/gad.9.21.2635.
Xu RH, Chen X, Li DS, Li R, Addicks GC, Glennon C, Zwaka TP, Thomson JA: BMP4 initiates human embryonic stem cell differentiation to trophoblast. Nat Biotech. 2002, 20: 1261-1264. 10.1038/nbt761.
Thomson JA, Odorico JS: Human embryonic stem cell and embryonic germ cell lines. Trends Biotechnol. 2000, 18: 53-57. 10.1016/S0167-7799(99)01410-9.
Sato N, Meijer L, Skaltsounis L, Greengard P, Brivanlou AH: Maintenance of pluripotency in human and mouse embryonic stem cells through activation of Wnt signaling by a pharmacological GSK-3-specific inhibitor. Nat Med. 2004, 10: 55-63. 10.1038/nm979.
Sato N, Sanjuan IM, Heke M, Uchida M, Naef F, Brivanlou AH: Molecular signature of human embryonic stem cells and its comparison with the mouse. Dev Biol. 2003, 260: 404-413. 10.1016/S0012-1606(03)00256-2.
Rao M: Conserved and divergent paths that regulate self-renewal in mouse and human embryonic stem cells. Dev Biol. 2004, 275: 269-286. 10.1016/j.ydbio.2004.08.013.
Richards M, Tan SP, Tan JH, Chan WK, Bongso A: The transcriptome profile of human embryonic stem cells as defined by SAGE. Stem Cells. 2004, 22: 51-64. 10.1634/stemcells.22-1-51.
Brimble SN, Zeng X, Weiler DA, Luo Y, Liu Y, Lyons IG, Freed WJ, Robins AJ, Rao MS, Schulz TC: Karyotypic stability, genotyping, differentiation, feeder-free maintenance, and gene expression sampling in three human embryonic stem cell lines derived prior to August 9, 2001. Stem Cells Dev. 2004, 13: 585-597. 10.1089/scd.2004.13.585.
Brandenberger R, Wei H, Zhang S, Lei S, Murage J, Fisk GJ, Li Y, Xu C, Fang R, Guegler K, et al: Transcriptome characterization elucidates signaling networks that control human ES cell growth and differentiation. Nat Biotechnol. 2004, 22: 707-716. 10.1038/nbt971.
Brandenberger R, Khrebtukova I, Thies RS, Miura T, Jingli C, Puri R, Vasicek T, Lebkowski J, Rao M: MPSS profiling of human embryonic stem cells. BMC Dev Biol. 2004, 4: 10. 10.1186/1471-213X-4-10.
Sperger JM, Chen X, Draper JS, Antosiewicz JE, Chon CH, Jones SB, Brooks JD, Andrews PW, Brown PO, Thomson JA: Gene expression patterns in human embryonic stem cells and human pluripotent germ cell tumors. Proc Natl Acad Sci USA. 2003, 100: 13350-13355. 10.1073/pnas.2235735100.
Ginis I, Luo Y, Miura T, Thies S, Brandenberger R, Gerecht-Nir S, Amit M, Hoke A, Carpenter MK, Itskovitz-Eldor J, et al: Differences between human and mouse embryonic stem cells. Dev Biol. 2004, 269: 360-380. 10.1016/j.ydbio.2003.12.034.
Bhattacharya B, Miura T, Brandenberger R, Mejido J, Luo Y, Yang AX, Joshi BH, Ginis I, Thies RS, Amit M, et al: Gene expression in human embryonic stem cell lines: unique molecular signature. Blood. 2004, 103: 2956-2964. 10.1182/blood-2003-09-3314.
Abeyta MJ, Clark AT, Rodriguez RT, Bodnar MS, Pera RA, Firpo MT: Unique gene expression signatures of independently-derived human embryonic stem cell lines. Human Mol Genet. 2004, 13: 601-608. 10.1093/hmg/ddh068.
Mah N, Thelin A, Lu T, Nikolaus S, Kuhbacher T, Gurbuz Y, Eickhoff H, Kloppel G, Lehrach H, Mellgard B, et al: A comparison of oligonucleotide and cDNA-based microarray systems. Physiol Genomics. 2004, 16: 361-370. 10.1152/physiolgenomics.00080.2003.
Kuo WP, Jenssen TK, Butte AJ, Ohno-Machado L, Kohane IS: Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics. 2002, 18: 405-412. 10.1093/bioinformatics/18.3.405.
Jenssen TK, Langaas M, Kuo WP, Smith-Sorensen B, Myklebost O, Hovig E: Analysis of repeatability in spotted cDNA microarrays. Nucleic Acids Res. 2002, 30: 3235-3244. 10.1093/nar/gkf441.
Ramalho-Santos M, Yoon S, Matsuzaki Y, Mulligan RC, Melton DA: "Stemness": transcriptional profiling of embryonic and adult stem cells. Science. 2002, 298: 597-600. 10.1126/science.1072530.
Ivanova NB, Dimos JT, Schaniel C, Hackney JA, Moore KA, Lemischka IR: A stem cell molecular signature. Science. 2002, 298: 601-604. 10.1126/science.1073823.
Fortunel NO, Otu HH, Ng HH, Chen J, Mu X, Chevassut T, Li X, Joseph M, Bailey C, Hatzfeld JA, et al: Comment on "'Stemness': transcriptional profiling of embryonic and adult stem cells" and "a stem cell molecular signature". Science. 2003, 302: 393. 10.1126/science.1086384.
Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW, Velculescu VE: Using the transcriptome to annotate the genome. Nat Biotechnol. 2002, 20: 508-512. 10.1038/nbt0502-508.
Siddiqui AS, Khattra J, Delaney AD, Zhao Y, Astell C, Asano J, Babakaiff R, Barber S, Beland J, Bohacec S, et al: A mouse atlas of gene expression: large-scale digital gene-expression profiles from precisely defined developing C57BL/6J mouse tissues and cells. Proc Natl Acad Sci USA. 2005, 102: 18485-18490. 10.1073/pnas.0509455102.
Pearson K: Mathematical contributions to the theory of evolution III. Regression, heredity and panmixia. Phil Trans R Soc Lond Series A. 1896, 187: 253-318. 10.1098/rsta.1896.0007.
The Cancer Genome Anatomy Project. [http://cgap.nci.nih.gov]
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS: Human MicroRNA Targets. PLOS Biol. 2004, 2: e363. 10.1371/journal.pbio.0020363.
Houbaviy HB, Murray MF, Sharp PA: Embryonic stem cell-specific MicroRNAs. Dev Cell. 2003, 5: 351-358. 10.1016/S1534-5807(03)00227-2.
Dravid G, Ye Z, Hammond H, Chen G, Pyle A, Donovan P, Yu X, Cheng L: Defining the role of Wnt/B-catenin signaling in the survival, proliferation and self-renewal of human embryonic stem cells. Stem Cells Express. 2005, 23: 1489-1501. 10.1634/stemcells.2005-0034.
Nichols J, Zevnik B, Anastassiadis K, Niwa H, Klewe-Nebenius D, Chambers I, Scholer H, Smith A: Formation of pluripotent stem cells in the mammalian embryo depends on the POU transcription factor Oct4. Cell. 1998, 95: 379-391. 10.1016/S0092-8674(00)81769-9.
Baldassarre G, Romano A, Armenante F, Rambaldi M, Paoletti I, Sandomenico C, Pepe S, Staibano S, Salvatore G, De Rosa G, et al: Expression of teratocarcinoma-derived growth factor-1 (TDGF-1) in testis germ cell tumors and its effects on growth and differentiation of embryonal carcinoma cell line NTERA2/D1. Oncogene. 1997, 15: 927-936. 10.1038/sj.onc.1201260.
Henderson JK, Draper JS, Baillie HS, Fishel S, Thomson JA, Moore H, Andrews PW: Preimplantation human embryos and embryonic stem cells show comparable expression of stage-specific embryonic antigens. Stem Cells. 2002, 20: 329-337. 10.1634/stemcells.20-4-329.
Wong RC, Pebay A, Nguyen LT, Koh KL, Pera MF: Presence of functional gap junctions in human embryonic stem cells. Stem Cells. 2004, 22: 883-889. 10.1634/stemcells.22-6-883.
Rao RR, Stice SL: Gene expression profiling of embryonic stem cells leads to greater understanding of pluripotency and early developmental events. Biol Reprod. 2004, 71: 1772-1778. 10.1095/biolreprod.104.030395.
Besser D: Expression of nodal, lefty-a, and lefty-B in undifferentiated human embryonic stem cells requires activation of Smad2/3. J Biol Chem. 2004, 279: 45076-45084. 10.1074/jbc.M404979200.
Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 2001, 29: 137-140. 10.1093/nar/29.1.137.
Strausberg RL, Feingold EA, Klausner RD, Collins FS: The mammalian gene collection. Science. 1999, 286: 455-457. 10.1126/science.286.5439.455.
Embryonic Stem Cell Transcriptomes. [http://www.transcriptomes.org]
Chen J, Sun M, Lee S, Zhou G, Rowley JD, Wang SM: Identifying novel transcripts and novel genes in the human genome by using novel SAGE tags. Proc Natl Acad Sci USA. 2002, 99: 12257-12262. 10.1073/pnas.192436499.
Frohman MA, Dush MK, Martin GR: Rapid production of full-length cDNAs from rare transcripts: amplification using a single gene-specific oligonucleotide primer. Proc Natl Acad Sci USA. 1988, 85: 8998-9002. 10.1073/pnas.85.23.8998.
Chenchik A, Diachenko L, Moqadam F, Tarabykin V, Lukyanov S, Siebert PD: Full-length cDNA cloning and determination of mRNA 5' and 3' ends by amplification of adaptor-ligated cDNA. Biotechniques. 1996, 21: 526-534.
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, et al: The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002, 12: 1611-1618. 10.1101/gr.361602.
Yang Z, Nielsen R, Goldman N, Pedersen AM: Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 2000, 155: 431-449.
Alvarez-Bolado G, Zhou X, Cecconi F, Gruss P: Expression of Foxb1 reveals two strategies for the formation of nuclei in the developing ventral diencephalon. Dev Neurosci. 2000, 22: 197-206. 10.1159/000017442.
Alvarez-Bolado G, Zhou X, Voss AK, Thomas T, Gruss P: Winged helix transcription factor Foxb1 is essential for access of mammillothalamic axons to the thalamus. Development. 2000, 127: 1029-1038.
Labosky PA, Winnier GE, Jetton TL, Hargett L, Ryan AK, Rosenfeld MG, Parlow AF, Hogan BL: The winged helix gene, Mf3, is required for normal development of the diencephalon and midbrain, postnatal growth and the milk-ejection reflex. Development. 1997, 124: 1263-1274.
Uptain SM, Kane CM, Chamberlin MJ: Basic mechanisms of transcript elongation and its regulation. Annu Rev Biochem. 1997, 66: 117-172. 10.1146/annurev.biochem.66.1.117.
Kuroda T, Tada M, Kubota H, Kimura H, Hatano SY, Suemori H, Nakatsuji N, Tada T: Octamer and Sox elements are required for transcriptional cis regulation of Nanog gene expression. Mol Cell Biol. 2005, 25: 2475-2485. 10.1128/MCB.25.6.2475-2485.2005.
Ohshima K, Hattori M, Yada T, Gojobori T, Sakaki Y, Okada N: Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates. Genome Biol. 2003, 4: R74. 10.1186/gb-2003-4-11-r74.
Torrents D, Suyama M, Zdobnov E, Bork P: A genome-wide survey of human pseudogenes. Genome Res. 2003, 13: 2559-2567. 10.1101/gr.1455503.
Zhang Z, Harrison PM, Liu Y, Gerstein M: Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 2003, 13: 2541-2558. 10.1101/gr.1429003.
Balakirev ES, Ayala FJ: Pseudogenes: are they "junk" or functional DNA?. Annu Rev Genet. 2003, 37: 123-151. 10.1146/annurev.genet.37.040103.103949.
Mighell AJ, Smith NR, Robinson PA, Markham AF: Vertebrate pseudogenes. FEBS Lett. 2000, 468: 109-114. 10.1016/S0014-5793(00)01199-6.
Harrison PM, Zheng D, Zhang Z, Carriero N, Gerstein M: Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability. Nucleic Acids Res. 2005, 33: 2374-2383. 10.1093/nar/gki531.
Bortvin A, Eggan K, Skaletsky H, Akutsu H, Berry DL, Yanagimachi R, Page DC, Jaenisch R: Incomplete reactivation of Oct4-related genes in mouse embryos cloned from somatic nuclei. Development. 2003, 130: 1673-1680. 10.1242/dev.00366.
Boyer L, Lee TI, Cole MF, Johnstone SE, Zucker JP, Young RA: Core transcriptional regulatory circuitry in human embryonic stem cells. Cell. 2005, 122: 947-956. 10.1016/j.cell.2005.08.020.
Kent WJ: BLAT - the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before March 2002.
Stem Cell Information. [http://stemcells.nih.gov]
Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998, 8: 186-194.
Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998, 8: 175-185.
Gene Expression Omnibus. [http://www.ncbi.nlm.nih.gov/geo]
Discovery Space. [http://www.bcgsc.ca/bioinfo/software/discoveryspace/]
The Mammalian Gene Collection. [http://mgc.nci.nih.gov]
NCBI Reference Sequence. [http://www.ncbi.nlm.nih.gov/RefSeq]
Ensembl Genome Browser. [http://www.ensembl.org]
Kal AJ, van Zonneveld AJ, Benes V, van den Berg M, Koerkamp MG, Albermann K, Strack N, Ruijter JM, Richter A, Dujon B, et al: Dynamics of gene expression revealed by comparison of serial analysis of gene expression transcript profiles from yeast grown on two different carbon sources. Mol Biol of the Cell. 1999, 10: 1859-1872.
Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol. 2000, 132: 365-386.
Baross A, Butterfield YS, Coughlin SM, Zeng T, Griffith M, Griffith OL, Petrescu AS, Smailus DE, Khattra J, McDonald HL, et al: Systematic recovery and analysis of full-ORF human cDNA clones. Genome Res. 2004, 14: 2083-2092. 10.1101/gr.2473704.
Muller PY, Janovjak H, Miserez AR, Dobbie Z: Processing of gene expression data generated by quantitative real-time RT-PCR. Biotechniques. 2002, 32: 1372-1374.
We are grateful to MF Pera (Monash Institute of Medical Research, Monash University and the Australian Stem Cell Center, Clayton, Victoria, Australia), MT Firpo (Department of Obstetrics, Gynecology and Reproductive Sciences, University of California San Francisco, San Francisco, CA) and BresaGen Inc. (Athens, GA), for providing human ESC RNA samples. This project was supported by funds from the National Cancer Institute, National Institutes of Health, under Contract No. N01-C0-12400 and by grants from Genome Canada, Genome British Columbia and the Canadian Stem Cell Network to MAM and CE. MAM is a Scholar of the Michael Smith Foundation for Health Research and is a Terry Fox Young Investigator of the National Cancer Institute of Canada. The content of this publication does not necessarily reflect the views or policies of the US Department of Health and Human Services, nor does mention of trade names, commercial products, or organization imply endorsement by the US Government.
Hirst, M., Delaney, A., Rogers, S.A. et al. LongSAGE profiling of nine human embryonic stem cell lines. Genome Biol 8, R113 (2007) doi:10.1186/gb-2007-8-6-r113