GeneChip analysis of human embryonic stem cell differentiation into hemangioblasts: an in silico dissection of mixed phenotypes

Transcriptional profiling of human embryonic stem cells differentiating into blast cells reveals that erythroblasts are the predominant cell type in the blast cell population. In silico comparisons with publicly available data sets revealed the presence of endothelia, cardiomyocytes and hematopoietic lineages.


Background
The establishment of human embryonic stem cells (hESCs) raised the possibility of being able to treat/cure many human diseases that are nowadays untreatable. This therapeutic potential, however, largely relies on the efficient and controlled differentiation of hESCs towards a specific cell type and the generation of homogeneous cell populations. Many differentiation protocols utilize the formation of progenitors through a stepwise approach. Thus, characterizing and understanding mixed populations of progenitor stages will be of increasing importance in stem cell research.
hESCs have been shown to be able to differentiate into a variety of cell types, including hematopoietic precursors and endothelial cells, in vitro under various culture conditions [1][2][3][4][5][6][7][8][9]. Hemangioblasts are the precursors of both hematopoietic and endothelial cells [10]. The existence of hemangioblasts was first demonstrated using an in vitro differentiation system of mouse ESCs. Replating of embryonic bodies (EBs) of mouse ESCs resulted in the formation of blast colony forming cells (BL-CFCs), which possessed hemangioblastic characteristics: BL-CFCs generated both hematopoietic and endothelial cells upon transfer to appropriate conditions [11,12]. Cells with hemangioblastic characteristics have been reported in both mouse and human adult tissues [13][14][15][16][17][18]. In an hESC system, Wang et al. [3] found that a fraction of a percent (0.18%) of CD45 neg FVP cells with hemangioblast-like properties in hESCs derived from EBs. Zambidis et al. [8] demonstrated the formation of multi-potential colonies from hEBs, although it is unclear whether these colonies can be expanded and/or whether they have any functional activity in vivo. Umeda et al. [19] also identified the presence of CD34+/ KDR+ bipotential cells in non-human (Cynomolgus) ESCs. Kennedy et al. [20] recently reported the generation of BL-CFCs from hESCs. However, the rarity of the cells with hemangioblast properties both from adult tissues and from ESC systems precluded comprehensive analysis of gene expression and comparison with other populations.
We have recently developed a two-step strategy that can efficiently and reproducibly generate blast colonies (BCs), the human counterparts of BL-CFCs, from hESCs [21]. These BC cells expressed gene signatures characteristic of hemangioblasts, and could be differentiated into multiple hematopoietic cell lineages as well as endothelial cells. When the BC cells were injected into animals with spontaneous type II diabetes or ischemia/reperfusion injury of the retina, they homed to the site of injury and showed robust reparative function of the damaged vasculature. The cells also showed a similar regenerative capacity in NOD/SCID β2-/mouse models of both myocardial infarction (50% reduction in mortality rate) and hind limb ischemia, with restoration of blood flow in the latter model to near normal levels, demonstrating the functional properties of hemangioblasts in vivo [21]. In contrast to previous studies, these cells could be readily obtained in large scale, which allowed us to perform comprehensive analysis of gene expression in these cells and compare this with other cell populations from which the BC cells originated.
Microarrays assess the total amount of RNA in a population and can be influenced by a predominating cell type. Variation in the homogeneity of the population can influence the number of genes identified as differentially expressed. Here, we show how comparisons to publicly available tissues in silico can identify differentially expressed genes representative of the various cell types within a heterogeneous population.
In the current study, we analyzed the global gene expression profiles with robust multi-chip average (RMA) normalization to provide a relative value of gene expression between two samples. The first analysis consisted of direct comparisons with ESCs and their derivates (EBs and BCs). Genes enriched in BCs relative to hESCs revealed a genetic signature indicative of erythroblasts, suggesting that erythroblasts are the predominant cell type in the BC population. The next analysis consisted of multiple but biologically meaningful in silico comparisons to publicly available data sets that identified other progenitor cell types within the BC population. The significance of this microarray study is in its ability to assess and identify heterogeneous cellular populations through biologically relevant in silico comparisons.

Strategy
Microarrays assess the total amount of RNA in a population and can, therefore, be influenced by a predominating cell type. Variations in the homogeneity of the population can influence the number of genes identified as differentially expressed, especially if both populations are relatively homogeneous. Here, we show how comparisons to publicly available tissues in silico can identify differentially expressed genes representative of the various cell types within the heterogeneous population of BCs.
We describe our method of assessing heterogeneous samples in three levels of analysis. The first level consists of making direct comparisons within the ESCs and their differentiated derivatives (EBs and BCs). The advantage of this technique is that it provides a kinetic-like relationship of changes in gene expression upon differentiation. The second level of analysis consists of indirect comparisons to a baseline, or reference tissue. Breast epithelium was chosen as a reference tissue because it represents a genetically distinct cell type that BCs are not capable of differentiating into. ESCs, EBs, and BCs were compared to breast epithelial tissue and differentially expressed genes were compared and contrasted to each other. Because genes that are up-regulated in BCs when compared to breast epithelia could represent those that are underexpressed in breast tissue, we removed those that were also up-regulated in a genotypically similar but different cell type (hESCs), when compared to breast epithelia.
The third level of analysis consists of comparing BCs to tissues they are capable of differentiating into as a way to mask that cell type's 'genetic signature' and reveal signatures of the more minor cell types. Samples were chosen based on type of GeneChip and their public availability -leukocytes, and endothelial and stromal cells. These biologically relevant comparisons identified tissue specific genetic signatures that would have otherwise been missed in the level I and II analyses.
The reliability of the microarray data generated from our multi-comparison analysis is demonstrated by the consistent identification of a set of genes among multiple comparisons, of which a subset of genes were confirmed by immunocytochemistry (Table 1) and RT-PCR ( Figure 1). To summarize, comparing BCs to leukocytes identified genes involved in vasculogenesis, to endothelial cells identified genes involved in hematopoiesis, and to stromal cells identified genes involved in heart development.

Level I analysis Genes down-regulated upon differentiation of ESCs into EBs and BCs
We began our data analysis by verifying the expression of 'stemness' genes that are down-regulated in ESCs upon differentiation into EBs and BCs. We identified 87 genes that were down-regulated upon differentiation into EBs. Genes with the highest fold change include SOX2, LEFTY1, GAL, NODAL, OCT4, and THY1, which play a critical role in maintaining the undifferentiated status of ESCs [22][23][24]. To uncover enriched processes, data sets were analyzed by DAVID, a web-based tool that identifies over-represented biological themes in a data set based on their Gene Ontology (GO) terms. GO provides consistent descriptions of genes in terms of biological processes and molecular function. When these genes were clustered with DAVID based on their GO terms, processes involved in development, cell differentiation, and proliferation were identified. The genes identified in the development ontology were DNA methyltransferase 3B, FGF2, THY1, SFRP2, LEFTY1, GREM1, and NODAL (Additional data file 3).
We also identified 267 genes that were down-regulated upon differentiation of ESCs to BCs. These genes include GAL, TDGF, NANOG, LEFTY1, and OCT4, most of which are stemness genes [22][23][24]. When genes were clustered with DAVID using their GO terms, the processes included development, cell differentiation, and morphogenesis (Additional data file 4). These data demonstrate that OCT4, NODAL, GAL, and THY1 are initially down-regulated in stage 1 (ESCs→EBs) and are further down-regulated in stage 2 (EBs→BCs).

Genes up-regulated upon differentiation of ESCs into EBs
While the focus of this paper is to evaluate the pathways involved in hemangioblast differentiation, we begin by identifying those genes that were up-regulated in the early stage of differentiation into EBs (day 3.5) from which BCs were derived [21]. We identified 128 genes that were up-regulated upon differentiation of ESCs into EBs (Additional data file 5). These genes include HAND1, WNT5, HEY1, LMO2, BMP4, TBX3, and MYL4. Clustering these genes with EASE identified processes involved in development, transcription, organ development and system development, some of which are related to hemangioblastic differentiation. These genes include SOX9, HOXB2, HOXB3, Neuregulin 1, LMO2 [25] and GATA2 [26]. This data set also included numerous genes encoding transcription factors, such as MESP1, HAND1, TBX3, GATA2, SOX7, SOX9, HOXB2, and HOXB3.

Genes down-regulated upon differentiation of EBs into BCs
When EBs were compared to BCs, 185 genes were identified as down-regulated upon differentiation. This data set contained processes that were similar to those that were downregulated upon ESC differentiation into EBs, such as tissue and organ development. The most significantly down-regulated genes included NANOG, WNT5, OCT4, GAL, TDGF1, BMP4, endothelin receptor B, and VEFG (Additional data file  ). These data demonstrate that BMP4, WNT5, and HEY1 are initially up-regulated upon differentiation into EBs but then down-regulated upon further differentiation into BCs.

Genes up-regulated upon differentiation of EBs into BCs
In contrast, 82 genes were up-regulated upon differentiation of EBs into BCs. The genes with the greatest fold change were hemoglobin genes and erythropoietic genes, such as hemoglobins γ, ζ, α, and ε, Alas2, AFP, TUBB1, GYPA, and RHAG (fold change, (FC) 31x-886x). Genes with moderate increases in expression (FC 6.2x-7.2x) were KLF1, TAL1/SCL, GATA1 and CD71. When genes were clustered with DAVID using their GO terms, processes characteristic of erythropoiesis (heme and porphyrin biosynthesis and oxygen transport) were identified (Additional data file 7).

Genes up-regulated upon differentiation of ESCs into BCs
There were 107 genes up-regulated upon differentiation of ESCs into BCs. Similar to the data set above (EBs→BCs), the genes with the greatest fold change (FC 29x-810x) were involved in hemoglobin synthesis (hemoglobins γ, ε, and α, ALA2, GYPA, and TAL1/SCL), similar to the comparison of EBs to BCs. In addition, this data set contained many key transcription factors involved in hemangioblastic differentiation, such as GATA2 [26], LMO2 [25] and TAL1/ SCL [27,28] (Additional data file 8). GATA2 and MYL4 were The reliability of the microarray data generated from our multi-comparison analysis is demonstrated by the consistent identification of a set of genes among multiple comparisons, of which a subset of genes were confirmed by immunocytochemistry. For immunochemistry (IC): +, moderate to strong staining; -, negative staining; +/-, very weak staining. For Affymetrix arrays:+, detected as up-regulated in BCs; --, not detected as up-regulated in BCs. up-regulated upon differentiation into EBs and remained at a constant level upon differentiation into BCs. EASE analysis of up-regulated genes in BCs identified biologically relevant themes, such as oxygen and gas transport, and development (TAL1/SCL, KLF1, LMO2, GATA1; Table 2).

Level II analysis Genes enriched in ESCs
Since genes that are up-regulated in ESCs when compared to breast epithelia could represent those genes that are underexpressed in breast epithelium, we filtered out those that were also up-regulated in BCs when compared to breast epithelium. This analysis identified 2,108 genes, which comprised GO processes involved in cell cycle, DNA and RNA metabolism, and DNA replication, as expected (Additional data file 9). Genes with the highest fold change include TDGF1, GAL, LEFTY1/2, OCT4, and NANOG (FC 130.0x, 69.6x, 68.7x, 47.6x, 43.8x, and 31.5x). When this data set was clustered based on their GO terms, processes involved in development, cell differentiation and nervous system development were identified (data not shown). This data set was then analyzed with GenMapp, and then used for pathway analysis. Each genetic signature was assigned a color: ESCs, green; EBs, orange; and BCs, red. The ESC pathway confirms that most of the embryonic genes were not removed when compared to breast epithelial cells (Additional data file 1).

Genes enriched in EBs
Since genes up-regulated in EBs when compared to breast epithelia could similarly represent those that are underexpressed in breast epithelium, we also filtered out those that were also up-regulated in ESCs when compared to breast epithelium. We identified 939 genes as up-regulated in EBs relative to breast epithelium and filtered out those that are enriched in ESCs relative to breast epithelium (Additional data file 10). When these genes were clustered with DAVID, processes involved in development, transcription, wnt and frizzle signaling, cell cycle and blood vessel morphogenesis (KDR, VEGF, Neuropilin-1 and 2, and FLT1) were identified The EBs also express genes involved in organ development, suggesting a heterogeneous mixture of cell types (GATA2/4/ 5/6, BMP4, NCAM1, NOG, ISL2, NKX2.5), and mesoderm genes (HAND1, T-brachyury, MESP1). Further examination of the data set identified multiple genes involved in the BMP signaling pathway in the differentiation of blood and endothelial cells. These genes include BMPR1A, BMP4, Tbrachyury, KDR, GATA2, and TAL1/SCL [26,28].

Genes enriched in BCs
We identified 2,735 genes that were up-regulated in BCs relative to breast epithelium after removing genes that were enriched in ESCs when compared to breast epithelium and genes enriched in BCs when compared to breast epithelium (Additional data file 11). When genes were clustered based on their GO, we identified processes characteristic of lymphocytic cells (response to stimulus, defense response, immune response), erythrocytes (heme and porphyrin biosynthesis), coagulation, neurophysiology, development, and mesoderm and heart development ( 2), and hemangioblasts (GATA2 [26], RUNX1 [29], LMO2 [25] and TAL1/SCL [27,28]). Some of the genes identified in the coagulation ontology are not only involved in coagulation but also angiogenesis, such as thrombin [30], plasminogen [31], and possibly coagulation factor 2 receptor like 2/3 [30].
This was also recapitulated by DAVID analysis, which identified the following pathways as statistically over-represented: porphyrin metabolism, acute myocardial infarction, hematopoietic cell lineage pathways and calcium signaling pathways. GenMapp was then used for pathway analysis. Each genetic signature was assigned a color: ESC, green; EBs, orange; and BCs, red. GenMapp analysis of pathways involved in whole blood, bone marrow, coagulation and complement, heme and porphyrin synthesis are indicative of hematopoietic cell types ( Figure 2). Genes identified in Gen-Mapp's heme biosynthesis pathway are indicative of erythoblasts ( Figure 3). We also identified myogenic (cardiac and smooth) pathways, which correlates with the GO analysis.

Level III analysis Genes enriched in BCs relative to leukocytes
We identified 2,101 genes that were up-regulated in BCs relative to leukocytes (after removing genes that were enriched in both hESCs and BCs when compared to leukocytes (Additional data file 12). When these genes were clustered based on their GO, we identified processes involved in development, nervous system development, blood vessel development and angiogenesis, and erythrocytes ( Table 4). The presence of development ontology not only indicates a 'progenitor' status of BCs, but contains genes involved in hemangioblast development, such as LMO2, TAL1/SCL, and RUNX1. This comparison identified genes that are characteristic of endothelia (PECAM1, VE-Cadherin, CD34, vWF, EPOR [32], endothelin 1 [33], and thrombin receptor). There were also genes that indicate the presence of erythrocytes (GATA1, spectrin and ankyrin), blood vessel development (neuropilin-1 and 2, stabilin 1 and 2, EGFR, FGF1 and 6, NOTCH4), and neurons/ neuronal junctions (glutamate receptor 1/6, serotonin Of particular note is the absence of leukocytic processes, such as response to stimulus and defense response identified in the level two analyses. Thus, this comparison allowed for the masking of the 'lymphocytic' signature and, thus, the identification of other endothelial and blood vessel development genes (vWF, bradykinin receptor b1, and thrombin receptor). When this data set was analyzed using GenMapp, more endothelial genes were mapped to the coagulation cascade pathway (vWF, bradykinin receptor b1, thrombin receptorpathway; data not shown).

Genes enriched in BCs relative to endothelial cells
We identified 904 genes that were up-regulated in BCs relative to prostate-derived endothelium after filtering out those genes that are enriched in ESCs relative to endothelium (Additional data file 13). Comparing BCs to endothelial cells identified fewer genes when compared to other comparisons of BCs in level II and III analyses, and, thus, identified fewer GO terms. However, it did identify more erythrocytic processes (nine in total; Table 5) than the other comparisons. Another predominant theme in this data set was development ( When genes were clustered based on their GO, we identified processes characteristic of lymphocytic cells (response to stimulus, defense response, immune response), erythrocytes (heme and porphyrin biosynthesis), coagulation, neurophysiology, development, and mesoderm and heart development.

Gene ontologies of up-regulated processes in BCs versus epithelial cells
GenMAPP of complement and coagulation Figure 2 GenMAPP of complement and coagulation. Genes that are up-regulated in BCs (red), EBs (orange), and ESCs (green) compared to breast epithelia as baseline were mapped onto a pre-existing pathway. This pathway contains genes that are mostly up-regulated in BCs relative to breast epithelia.

Genes enriched in BCs relative to stromal cells
The BCs were then compared to prostate-derived stromal fibromuscular (CD49a immunoselected) tissue. This comparison identified the most number of genes (3,277 genes; Additional data file 14), and had the most diverse GO terms (lymphocytic, developmental, erythrocytic, coagulation, synapses and neurogenesis, and heart development; When these genes were clustered for pathways analysis with DAVID, we identified Nfat, hypertrophy of the heart and Alk in cardiac myocytes as a statistically over-represented pathway (data not shown). GenMapp identified a similar pathway involved in myometrial contraction and calcium regulation in the cardiac cell (data not shown). These genetic signatures from this analysis would indicate the presence of progenitors (indicated by the number of developmental genes) of erythrocytes, leukocytes, neurons/neuronal-muscular junctions, and cardiomyocytes. Thus, comparing BCs to stromal cells masked a connective tissue-like signature, allowing for the identification of tissue-specific processes.

Ingenuity analysis
To identify signaling pathways involved in hemangioblast differentiation, each of the data sets was analyzed by Ingenuity.
Ingenuity is a program that converts large data sets into networks containing direct and indirect relationships between genes based on known interactions in the literature. Genetic networks were created using the EB and BC data sets. The EB data set from the level 2 analysis contained a network of genes (VEGF, GATA4, BMP4) that are interconnected and involved in blood vessel development (VEGF), heart development (GATA4), and cellular development (BMP4) (Figure 4a). For example, Bmp-4 has been shown to promote blood vessel development by increasing VEGF production [34] and VEGF induces or binds to KDR FLT1, NRP1, and NRP2 [35][36][37][38]. This network suggests that BMP4 inhibits cardiac development by increasing HEY1, a transcriptional repressor of GATA4 and 6 ( Figure 5a) [39,40], and inhibits heart development by inducing DKK1 (dickkopf homolog 1) [41], which then inhibits WNT11 mRNA expression, GATA4, and NKX2-5 [42][43][44]. In conclusion, this network suggests that BMP4 induces blood vessel development through VEGF signaling and inhibits cardiac differentiation through HEY1 and DKK1.
We then looked for these and other signaling pathways in the level 2 BC data set. Here we identified genes involved in cardiovascular development (SHH [45], RAR-B, TBX5 [46], WNT11 [43]) acting through GATA4 (Figure 4b). However, unlike the EB data set, we did not identify the cardiac repressor HEY1, VEGF and BMP4 in the BC data set. Instead, the BC network contained cardiac and skeletal genes, such as HAND2 and ANKRD1 [47][48][49][50], and HIF3A, an inhibitor of VEGF expression [51] (Figure 5b). Thus, these networks demonstrate that when EBs differentiate into BCs, we see some angiogenic and some cardiac pathways.
Another signaling network we identified as differentially expressed between EBs and BCs is the network containing GATA2. GATA2 has been shown to play a vital role in hemangioblast development [26] by up-regulating BMP4, KDR, and TAL1/SCL expression. In the EB data set, the GATA-2 net-GenMAPP of heme biosynthesis: Genes that are up-regulated in BCs (red), EBs (orange), and ESCs (green) compared to breast epithelia as baseline were mapped onto a heme biosynthesis pathway  work contained EPOR, TAL1/SCL, TCF3 and PITX2 ( Figure  6a). Pitx2 is a homeobox gene involved in regulating the balance between proliferation and differentiation of progenitor cells [52] and is highly expressed in EBs (FC 24x) and is absent in BCs. PITX2 is not only rapidly down-regulated upon hematopoietic stem cell differentiation [52], but may also promote hemangioblast differentiation by inducing GATA2 expression [53]. In the BC data set, the GATA2 net-work contained the hemangioblastic and hematopoeitic genes TAL1/SCL and LMO2 [25,27,54], FOG/Zfpm1 [55], CD41/GPIIB/Igta2B [56] and GATA1 [57] (Figure 6b).
A predominant network we identified in all three data sets of the level III analysis involved GATA1. GATA1 is a globin transcription factor and is present in all BC but not EB data sets. For example, we see GATA1 interacting with other nuclear When these genes were clustered based on their GO, we identified processes involved in development, nervous system development, blood vessel development and angiogenesis, and erythrocytes.  This comparison did not identify as many genes because of their similar origin and thus fewer gene ontologies. However, it did identify more erythrocytic processes than the other comparisons.

Discussion
To delineate the mechanisms involved in the development of hemangioblasts from murine ES cells, Lugus et al. [26] performed global gene expression profiling of mouse Flk1+ cells that can form BL-CFCs. However, no such analysis has been reported for the human counterpart due to availability of such cells. In the present study, we carried out a large scale tran-scriptional l analysis to profile undifferentiated hESCs, early stage EBs, and BCs as an initial effort to understand the differentiation program of hESCs towards hemangioblasts. The availability of a well-annotated genome database for hESCs, EBs and BCs will provide a foundation that allows one to map and identify genes involved in human hemangioblast development, and to optimize conditions for efficient generation of hemangioblasts from hESCs. Our studies in general are consistent with previous reports that a cluster of genes (OCT4, NANOG) expressed at significantly high levels in hESCs were down-regulated in early stage EBs and BCs [23,[58][59][60][61][62][63] This comparison identified the most number of genes and had the most diverse gene ontologies.

Gene ontologies for up-regulated processes in BCs versus stromal cells
Ingenuity pathway analysis shows a network of genes expressed in EB and BC data sets from the level II analysis Figure 4 Ingenuity pathway analysis shows a network of genes expressed in EB and BC data sets from the level II analysis. This network contains nodes (genes/gene products) and edges (relationships between the nodes). The shaded genes, known as focus genes, were identified by microarrays and are the starting point to generate the network. The asterisks indicate that duplicates were identified in each dataset. The EB data set contains a network of genes (VEGF, GATA4, BMP4) that are interconnected and involved in blood vessel development (VEGF), heart development (GATA4), and cellular development Ingenuity pathway analysis identified a network of genes expressed in EB and BC data sets associated with VEGF Figure 5 Ingenuity pathway analysis identified a network of genes expressed in EB and BC data sets associated with VEGF. In the EB data set, VEGF is associated with 35 'focus' genes, including KDR, FLT1, NRP1, and NRP2. In the BC data set, 13 focus genes were associated with VEGF, even though it is not present in the data set. The intensity of the node color indicates the degree of expression. Nodes are displayed using various shapes that represent the functional class of the gene product (diamond-enzymes, ovals-transcription factors, triangles-kinase, circles-others). A solid line indicates a direct interaction while a dashed line indicates an indirect interaction. A line without an arrowhead indicates binding and a plus sign indicates that othernetworks contain this gene product.
Ingenuity pathway analysis of a network of genes expressed in EB and BC data sets associated with GATA2 Figure 6 Ingenuity pathway analysis of a network of genes expressed in EB and BC data sets associated with GATA2. In the EB data set, the GATA2 network contained EPOR, TAL1, TCF3 and PITX2. In the BC data set, the GATA2 network contained the hemangioblastic and hematopoeitic genes TAL1, LMO2, FOG/ZFPM1, IGTA2B/CD41/GPIIb, and GATA1. The intensity of the node color indicates the degree of expression. Nodes are displayed using various shapes that represent the functional class of the gene product (diamond-enzymes, ovals-transcription factors, triangles-kinase, circles-others). A solid line indicates a direct interaction while a dashed line indicates an indirect interaction. A line without an arrowhead indicates binding and a plus sign indicates that othernetworks contain this gene product.
Our previous study has shown that BCs contain a mixed progenitor population of cells capable of forming hemangioblasts, and hematopoietic and endothelial cells [21]. To assess the heterogeneous populations in BCs, comparisons to publicly available data sets were performed in silico. Biologically relevant comparisons allowed us to mask particular genetic signatures within the heterogeneous population, allowing us to identify others, such as myogenic, vasculogenic, and hematopoietic progenitors within BCs. Our level I analysis compared hESCs to their differentiated progeny, providing a kinetic-like relationship of gene expression. When comparing the hESCs to BCs, many of the down-regulated genes were involved in development, cell differentiation, and morphogenesis. The predominant up-regulated genes identified in this level I analysis were those involved in the development of hemangioblasts and primitive erythroblasts. These genes include those encoding the transcription factors TAL1/SCL, LMO2 and GATA1 and 2, heme synthesis genes, genes encoding erythroblast membrane proteins, and embryonic and fetal globin genes, but very low levels of adult globin gene, suggesting the yolk sac primitive status of BCs. This observation is consistent with findings obtained from both mouse and zebrafish models, in which hemangioblasts were first developed from primitive streaks and yolk sacs [65,66]. The results may also represent the fact that BC expansion medium contains several hematopoietic cytokines, which could push the developmental pathway towards the hematopoietic lineage, although BCs harvested at day 6 retain the potential of differentiating into endothelial cells under appropriate conditions [21]. Optimization of BC expansion conditions would, therefore, be valuable to keep these cells bipotential.
When applying this technique of multiple tissue type comparisons, some of the genes that are identified as 'up-regulated' in our tissue of interest could be 'under expressed' in the reference tissue. We controlled for this by filtering for those 'under expressed' genes with a comparison to a genotypically similar but different tissue type. For example, in our level II analysis, we identified those genes that are up-regulated in BCs relative to breast epithelial cells (3,700 genes) and then removed those genes that were up-regulated in hESCs relative to epithelial cells (2,735 genes). This removed 965 genes that could be thought of as being up-regulated in both BCs and hESCs relative to breast epithelium or as genes that are under-expressed in breast epithelium. GO analysis of this data set identified cell cycle genes as the predominant theme. This observation correlates with their biology because the BCs and hESCs are actively dividing cells in vitro, while the breast epithelium comprises relatively senescent cells freshly isolated from in vivo. Most importantly, we did not identify any tissue specific processes that we would expect to find in BCs.
GO analysis of the genes up-regulated in BCs with respect to epithelial cells after filtering out those that are up-regulated in hESCs relative to breast epithelial cells identified biological themes involved in erythropoiesis, as in the level I analysis (heme and porphyrin biosynthesis), but also the angiogenic components of coagulation, and synapses. This analysis identified genetic signatures representative of not just erythrocytes, as in the above level I analysis, but also the other cellular components of the BC population, such as muscle, cardiac, and hematopoietic cells and hemangioblasts. This in silico comparison to a biologically distinct reference tissue and filtering allows one to identify statistically significant genes that might otherwise be missed within a heterogeneous population.
In the level III analysis, biologically relevant comparisons were made in silico to adjust for particular cell types represented in the BC population. As in the level II analysis, this analysis also identified genetic signatures of the erythrocytic population in addition to other cellular components of BCs. When BCs were compared to leukocytes in silico, we identified a genetic signature representative of vasculogenesis, endothelia, neurons/synapses, hemangioblasts, and erythrocytes, with the relative absence of leukocyte genes. When BCs were compared to endothelia in silico, we identified a signature of erythrocytic and developmental genes with the relative absence of the vasculogenic signature. When BCs were compared to stromal cells in silico, we identified processes involved in hematopoiesis, synapses, angiogenesis/endothelia, development and more genes involved in cardiomyogenesis. Although EASE analysis for both epithelial and stromal comparisons identified similar heart GO terms, the stromal comparison identified different heart development genes. We believe that the epithelial comparison revealed more of the mesodermal aspect of BCs while the stromal comparison masked the mesodermal components and revealed more cardiomyocytic genes.
It has been shown that murine BL-CFCs are able to differentiate into hematopoietic, endothelial and smooth muscle cells, but failed to give rise to cardiomyocytes [67]. Molecular analyses showed that BL-CFCs expressed genes indicative of the hematopoietic and endothelial lineages, but not cardiomyocytes [26]. Kattman et al. [68] recently identified a cardiovascular progenitor with the same phenotype as the BL-CFC progenitor, brachyury+ and Flk-1+ cells from day 4.25 EBs (one day later than the BL-CFC progenitor) derived from mouse ESCs. These studies strongly suggest that BL-CFCs and cardiomyocytes, at least in the mouse ESC system, are derived from two different progenitors. Two other studies demonstrate the existence of multipotential progenitors for cardiomyocytes and muscle cells, but with different surface markers [69,70]. In the present study, several genes restricted to cardiomyocytes or their progenitor were detected with relatively high levels of expression in BCs. This observation suggests that human BCs may posses the potential to differentiate into cardiomyocytes, which will need further investigation. Alternatively, the purified BCs from multiple colonies could contain dissimilar blast clones origi-nated from different differentiation stages, some of which may have the potential to develop into cardiomyocytes, as demonstrated recently by three groups in the mouse ESC system [68][69][70]. Simultaneous isolation and characterization of functionally distinct colonies and analysis of gene expression in these colonies might serve to determine whether human BCs posses the potential to differentiate into hematopoietic, endothelial and cardiomyocyte lineages.

Conclusion
The identification and characterization of cell types within a heterogeneous population will be of increasing importance in stem cell research since differentiation protocols require the formation of progenitors through a multi-stage approach.
Our previous study has shown that BCs contain a mixed progenitor population of cells capable of forming hemangioblasts, and hematopoietic and endothelial cells [21]. To assess the heterogeneous populations in BCs, comparisons to publicly available data sets were performed in silico. Biologically relevant comparisons allowed us to mask particular genetic signatures within the heterogeneous population, allowing us to identify others, such as myogenic, vasculogenic, and hematopoietic progenitors, within BCs. The significance of this microarray study is in its ability to assess and identify cellular populations within a heterogeneous population through biologically relevant in silico comparisons of publicly available data sets. In conclusion, multiple in silico comparisons were necessary to characterize tissue-specific genetic signatures within a heterogeneous hemangioblast population.

Materials and methods hESC culture and BC growth
Culture of hESCs and growth of BCs were as reported previously [21]. In brief, undifferentiated hESCs (H1, H9) were cultured with inactivated mouse embryonic fibroblast cells in complete hESC media until they reached 80% confluence. Undifferentiated hESCs were dissociated by 0.05% trypsin-0.53 mM EDTA (Invitrogen, Carlsbad, CA, USA) for 2-5 minutes and collected by centrifugation at 1,000 rpm for 5 minutes. To induce hemangioblast precursor (mesoderm) formation, hESCs (2-5 × 10 5 cells/ml) were plated on ultralow dishes (Corning, Corning, NY, USA) in Stemline II media with the addition of BMP4 and VEGF 165 (50 ng/ml; R&D Systems (Minneapolis, MN, USA)) and cultured in 5% CO 2 . Forty eight hours later, half the media was removed and fresh medium was added with the same final concentrations of BMP4 and VEGF, plus SCF, Tpo and FLT3 ligand (20 ng/ml; R&D Systems), and PTD-HoxB4 (1.5 μg/ml) to expand out BCs and their precursor. After 3.5 days, EBs were collected and dissociated by 0.05% trypsin-0.53 mM EDTA (Invitrogen) for 2-5 minutes, and a single cell suspension was prepared by passing through a 22G needle 3-5 times. To expand BCs, a single cell suspension derived from differentiation of 2-5 × 10 5 hESCs were mixed with 2 ml hemangioblast expan-sion medium plated on ultra-low dishes and incubated at 37°C in 5% CO 2 for 6 days, and BCs were then collected and subjected to RNA isolation.

Affymetrix GeneChip analysis
Total RNA was isolated from purified BCs, day 3.5 EBs and undifferentiated ESCs (from two hESC lines, H1 and H9) using the Qiagen (Valencia, CA, USA) RNAeasy kit and amplified as previously described [21]. A total of six microarrays were performed (two biological replicates per time point using different hESC lines). Fragmented antisense cRNA was used for hybridizing with human U133 Plus 2.0 arrays (Affymetrix, Inc. Santa Clara, CA, USA) at the Core Genomic Facility of University of Massachusetts. The validation of differentially expressed genes was confirmed by immunocytochemistry in our previous studies [21] and by semiquantitative RT-PCR analyses ( Figure 1). The data discussed in this publication have been deposited in NCBI's Gene Expression Omnibus (GEO) [71] and are accessible through GEO series accession numbers GSE8884 and GSE9196, in accordance with MIAME standards. The demographics of the publicly available GeneChip data sets are breast-derived epithelial cells (n = 7), leukocytes (n = 6), prostate-derived endothelial cells (n = 5), and prostate-derived stromal cells (n = 5). These samples were chosen based on their homogeneity (cells were immunoselected or enriched) and the number of replicates. These data sets were downloaded from the NCBI's GEO and are accessible through accession numbers GSE9086 (breast-derived epithelial cells), GSE9091 (leukocytes), GSE9090 (stromal), and GSE9089 (endothelia).

Data analysis
Raw CEL files were provided by the Core Genome Facility of the University of Massachusetts and were then analyzed with a software package AffylmGUI (Affymetrix LIMMA, Linear Models for Microarray Data, Graphical User Interfaces) [72,73]. Within AffylmGUI, gene expression values were summarized with RMA. RMA adjusts for background noise, performs a quantile normalization, transforms the data into log base 2, and then summarizes the multiple probes into one intensity [74][75][76]. Quantification of relative differences in gene expression among the groups of interest was accomplished using AffylmGUI, the sister package of limmaGUI [72,73]. AffylmGUI reads the raw Affymetrix CEL files directly, summarizes the gene expression values using RMA, and then uses LIMMA to identify statistically significant differences in gene expression [77]. LIMMA fits a linear model for every gene (like ANOVA or multiple regression analysis), and adjusts P values for multiple testings [77]. Differentially expressed genes were identified with a B statistic >0. The B statistic, also known as a likelihood of odds (LOD) score is a moderated t-statistic with posterior residual standard deviations. Subsequent analyses were performed in Microsoft Excel and Microsoft Access.

EASE
The application Expression Analysis Systematic Explorer (EASE) was used to determine biologically relevant themes in a list of differentially expressed genes. EASE identifies overrepresented biological themes in terms of their GO [78]. GO was developed to provide consistent descriptions of genes in terms of biological processes and molecular function. Genes with a B value >0 are incorporated into EASE where each gene is matched to all possible GOs. The results of this analysis are compared to all possible GOs for all genes on the microarray platform and calculates a P value, based on a conservative variant of the Fischer's exact probability test. We selected those pathways/processes with P < 0.05.

DAVID
DAVID (Database for Annotation, Visualization and Integrated Discovery) provides tools and statistical methods for uncovering enriched processes and pathways within diverse and disparate gene lists [79]. DAVID also identifies over-represented biological themes in terms of their GO [78] and provides tools to visualize the distribution of genes on BioCarta and KEGG pathway maps.

GenMapp
GenMapp is designed to visualize gene expression data on maps representing biological pathways and grouping of genes. It consists of hundreds of pre-made pathway maps and is used to identify pathway level changes amongst multiple data sets [80]. In this program, data sets are assigned a color corresponding to cell type and the direction of gene expression changes.

Ingenuity
Networks were constructed using Ingenuity Pathways Analysis (Ingenuity ® Systems, Redwood City, CA, USA). A data set containing gene identifiers of genes with a B > 0 was uploaded into the applications. These genes, called focus genes, were overlaid onto a global molecular network developed from information contained in the Ingenuity Pathway Knowledge Base. Networks of these focus genes were then algorithmically generated based on their connectivity. Each network is a graphical representation of the molecular relationships between genes/gene products. Genes or gene products are represented as nodes, and the biological relationship between two nodes is represented as an edge (line). All edges are supported by at least 1 reference from the literature stored in the Ingenuity Pathways Knowledge Base. The intensity of the node color indicates the degree of expression. Nodes are displayed using various shapes that represent the functional class of the gene product (diamond-enzymes, ovals-transcription factors, triangles-kinase, circles-others). A solid line indicates a direct interaction while a dashed line indicates an indirect interaction. A line without an arrowhead indicates binding and a plus sign indicates that othernetworks contain this gene product.

RNA isolation and gene expression quantification by semi-quantitative PCR
Total RNA was isolated from hESCs, day 3 EBs and hemangioblasts (BCs) using an RNAeasy Mini Kit (Qiagen) following the procedure recommended by the supplier with DNase I digestion, which eliminates the contamination of genomic DNA. RNA was subjected to first-strand cDNA synthesis with SMART II and CDS primers (Clontech, Mountain View, CA, USA), using Superscript II reverse transcriptase (Invitrogen), and cDNA pools were constructed using the SMART cDNA synthesis kit (Clontech) as described previously [2,81]. Complementary DNA pools generated by the SMART procedure have been shown to preserve the relative abundance relationship of the original mRNA populations [82][83][84]. The Table 7 Primer sequences DNA templates in cDNA pools of human ES cells, EBs and hemangioblasts were adjusted to equal amounts based on the relative expression level of the hypoxanthine phosphoribosyltransferase gene (HPRT) and gene expression quantification by semi-quantitative PCR was performed as described previously [2,81]. The sense and anti-sense primer sequences, and the corresponding cDNA PCR product sizes, are shown in Table 7. The conditions for PCR amplification were as described with annealing temperatures and concentrations of MgCl 2 for each specific gene as shown below. PCR products (10 μl) were separated on 1.5 to 2.0% agarose gel and visualized by ethidium bromide staining. The relative expression levels in cDNA from hESCs, EBs and hemangioblasts were estimated visually.

Direct and indirect comparison of microarray data of differentially expressed genes
The direct analysis (level I) consists of making comparisons between ESCs, EBs, and BCs. Since there are two possible comparisons for each cell type, for example, ESCs relative to EBs or ESCs relative to BCs, fold changes were determined by comparing each to a cell type that did not detect the gene in question. Fold change levels for Oct-4 and Nanog were determined by comparing ESCs to BCs and EBs to BCs. Fold change levels for Gata-2 were determined by comparing EBs to ESs and BCs to ESs. Fold change levels for SCL/Tal1 were determined by comparing BCs to ESCs. Fold change levels for γ/ε-globin were determined by comparing BCs to EBs. For the indirect (level II) analysis, fold changes were simply determined by comparing each cell type to breast epithelia. If more than one probe-set was identified as differentially expressed, fold changes were averaged.

Authors' contributions
SJL and QF designed and performed the cell culture, differentiation and RNA isolation. JAH and JDH performed the microarray analysis. This study was conceived by JDH, SJL, and JAH. SJL and JAH wrote the manuscript with input from JDH, RL and AA.

Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a figure showing a GenMAPP generated embryonic stem cell pathway that corresponds to genes enriched in ESCs (green), EBs (orange) and BCs (red) with epithelial cells as a baseline. Additional data file 2 is a figure showing a predominant Ingenuity generated network that was identified in all four data sets of the level III analysis. GATA1 interacts with other nuclear genes such as TAL1, LMO2, and KLF1, which induceeythropoetic genes such as the hemoglobin family (HBG1, HBG2, HBE, HBB, and HBZ), and those involved in heme synthesis (ALAS2). Additional data files 3,4,5,6,7,8,9,10,11,12,13,14 are tables listing genes that were up-regulated in ESCs, EBs, or BCs relative to each other or a specific reference tissue. Additional data file 3 contains a list of genes that are down-regulated upon differentiation of ESCs into EBs. Additional data file 4 contains a list of genes that are down-regulated upon differentiation of ESCs into BCs. Additional data file 5 contains a list of genes that are up-regulated upon differentiation of ESCs into EBs. Additional data file 6 contains a list of genes that are down-regulated upon differentiation of EBs into BCs. Additional data file 7 contains a list of genes that are up-regulated upon differentiation of EBs into BCs. Additional data file 8 contains a list of genes that are up-regulated upon differentiation of ESCs into BCs. Additional data file 9 contains a list of genes that are up-regulated in ESCs when compared to breast epithelia. Additional data file 10 contains a list of genes that are up-regulated in EBs when compared to breast epithelia. Additional data file 11 contains a list of genes that are up-regulated in BCs when compared to breast epithelia. Additional data file 12 contains a list of genes that are up-regulated in BCs when compared to leukocytes. Additional data file 13 contains a list of genes that are up-regulated in BCs when compared to endothelial cells. Additional data file 14 contains a list of genes that are up-regulated in BCs when compared to stromal cells. Additional data files 15,16,17,18,19,20