Evaluating intra- and inter-individual variation in the human placental transcriptome
© Hughes et al.; licensee BioMed Central. 2015
Received: 9 December 2014
Accepted: 10 March 2015
Published: 19 March 2015
Gene expression variation is a phenotypic trait of particular interest as it represents the initial link between genotype and other phenotypes. Analyzing how such variation apportions among and within groups allows for the evaluation of how genetic and environmental factors influence such traits. It also provides opportunities to identify genes and pathways that may have been influenced by non-neutral processes. Here we use a population genetics framework and next generation sequencing to evaluate how gene expression variation is apportioned among four human groups in a natural biological tissue, the placenta.
We estimate that on average, 33.2%, 58.9%, and 7.8% of the placental transcriptome is explained by variation within individuals, among individuals, and among human groups, respectively. Additionally, when technical and biological traits are included in models of gene expression they each account for roughly 2% of total gene expression variation. Notably, the variation that is significantly different among groups is enriched in biological pathways associated with immune response, cell signaling, and metabolism. Many biological traits demonstrate correlated changes in expression in numerous pathways of potential interest to clinicians and evolutionary biologists. Finally, we estimate that the majority of the human placental transcriptome exhibits expression profiles consistent with neutrality; the remainder are consistent with stabilizing selection, directional selection, or diversifying selection.
We apportion placental gene expression variation into individual, population, and biological trait factors and identify how each influence the transcriptome. Additionally, we advance methods to associate expression profiles with different forms of selection.
Nearly four decades ago, it was estimated that about 85% of the neutral genetic variation in humans is found within groups and only about 15% between groups , which reflects the close genetic relationship of human populations. This initial observation, using protein markers, has been substantiated by numerous additional studies and markers [2-6]. Further, these analyses provide a framework to identify genes that exhibit unusually large differences between populations and thus may have been subject to recent local positive selection [2,7-10] as responses to population-specific evolutionary forces.
In principle, the variation in phenotypic traits can also be apportioned into within-population and between-population components , which could provide insights into the relative influence of both genetic and environmental factors on such traits. However, this has been done for only a few human traits. For example, cranial variation among human populations present between-population components (0.11 to 0.14) similar to neutral genetic variation , suggesting that human cranial variation also (largely) reflects neutral genetic processes. Conversely, variation in skin pigmentation has a significantly larger between-population component (0.87) , in keeping with hypotheses that skin pigmentation variation has been subject to strong selection [13,14].
A phenotypic trait of recent considerable interest is the level of gene expression (or RNA abundance), as it represents the initial link between genotype and other phenotypes, and hence is the logical place to begin evaluating the relative influence of genotype, environment and non-neutral evolution on phenotypic variation. Previous studies [15-21] have analyzed gene expression in lymphoblastoid cell lines from up to eight global populations derived from the International HapMap Project , and estimated that between 4.5% and 29% of genes are differentially expressed among groups. Four of these studies have estimated a between-population component of expression variation [17,19-21]. Specifically, when considering CEPH European (CEU) and Yoruba from Ibadan, Nigeria (YRI), the first of these studies estimated that 15% of expression variation was observed among groups, suggesting that expression variation mirrors genetic variation and hence is largely neutral . A subsequent study  found a similar median estimate of 12% for the among-group variation in expression. However, after accounting for non-genetic factors that estimate was reduced to 5%. Another attempt to reduce non-genetic factors influencing expression variation obtained a median estimate of 0.7% between CEU and YRI samples , while the most recent study estimated 3% of the expression variation is found among groups . It may be crucial to correct for non-genetic factors for these specific samples as they were collected at various times in the past, transformed into cell lines, and maintained in culture for up to 20 years [15,22,23]. Yet given the range of estimates, the question remains, what proportion of total gene expression variation is found among groups, especially for native tissues rather than cell lines?
Here, we provide one of the first studies of among population gene expression variation in a natural tissue (namely, placentas) . We chose placentas rather than more-easily obtained blood samples because gene expression in blood is influenced by the age of the individual  and the time of the day when samples are taken , whereas all placentas were obtained at the same ‘age’ and ‘time’ - namely, birth of the child. Additionally, the placenta is an important organ due to the fetal-maternal interplay and its critical role in fetal growth and development. Placentas were obtained from a single hospital during a 6-week time period from four groups: African-Americans (AF), European-Americans (EU), South Asian Americans (SA), and East Asian Americans (EA). We emphasize that although we have tried to minimize environmental variation by sampling from a single hospital over a short time period, any differences in gene expression among these four groups will reflect both differences in genetic ancestry as well as systematic differences in their individual environments. However, we also incorporated biological and environmental factors into our model of gene expression to explicitly dissect the contributing variation that individual biology and environmental elements, such as diet, may have on expression variation.
A complication in the study of native tissues, such as placentas, is their cell type heterogeneity, and their spatio-temporal expression variability [27-29]. Thus, any one dissection of a complex tissue is but a single snapshot of the stochastic variation observed in expression abundance in that tissue space and in that moment in time. We therefore sampled each placenta twice to explicitly measure variation within a single placenta, to estimate the contribution of cell-type heterogeneity and spatial variability to inter-individual variation.
Of the 159 million high quality reads obtained, 117 million mapped to annotated exons. An average of 1.46 million exon-mapped reads were obtained for each library (sample replicate), corresponding to an average of 2.9 million exon-mapped reads for each individual (Figure S1 in Additional file 1). There was at least one mapped read for each library at 13,156 genes, including but not limited to 11,301 protein-coding genes, 801 psuedogenes, 893 long non-coding RNAs (lncRNAs), and 40 small RNAs, which includes 21 pre-miRNAs. Expression levels were normalized (variance stabilized) using protocols described in the DESeq2 package . Pearson’s correlation coefficient for each pair of sample replicates was 0.98 ± 0.005, yielding an r-squared value of 0.96 ± 0.01. Data quality was further evaluated by validating the expression profiles of three genes by rt-qPCR, a mean Pearson’s r of 0.74 ± 0.07 was observed between the expression values measured by RNA-sequencing vs. rt-qPCR (Figure S2 in Additional file 1). Thus, based on both sample replicates and an independent method of measuring expression abundance, the data we obtained provide an accurate measurement of RNA transcript abundance.
Total gene expression structure
An additional observation from the sample correlation dendrogram is the lack of clustering of individuals with the same ancestry. To further evaluate this a principle component (PC) analysis reveals that in contrast to what is commonly observed for genetic data [31-33] there is no evident structure in this cellular phenotype that corresponds to groups (Figure 1B). However, when the PC loadings for each individual are tested for correlations with other aspects of the data (Figure 1C), PC2 is correlated with fetal length at birth (r = -0.54, Bonferroni P value = 0.007), PC3 correlates with the sum of mapped reads (r = -0.62, Bonferroni P value = 0.0005), and PC4 correlates with normal maternal weight (r = 0.46, Bonferroni P value = 0.045). Additionally, analysis of genes that correlate with the loadings from the first three PCs [34,35] reveals enrichment in hundreds of gene ontology categories, particularly molecular function (GO:0003674), biological process (GO:0008150), binding (GO:0005488) and their sub-categories (Additional file 2: Table A), as well as numerous KEGG pathways (Additional file 2: Table B) highlighted by the most enriched KEGG pathway, namely 01100:Metabolic Pathways (adjusted P value = 2.9e-05). Overall, it appears that total transcriptome variation is largely influenced by factors other than group affiliation (i.e. population), and that transcript variation hence does not parallel expected patterns of genetic structure for these groups [32,36].
An apportionment of gene expression variation
Mean expression and apportionment estimates
The mean expression of each gene is significantly correlated with the residual (or intra-individual) sum of squares estimate (Pearson’s r = 0.60, P <0.001). This illustrates that as mean expression increases, variation in mRNA abundance among our sample replicates also increases. As such, we estimate that mean expression explains 36% of the variation in our error sum of squares. However, the among group (r = 0.018, P = 0.034) and among individuals within groups (r = -0.029, P = 0.001) sums of squares are more weakly correlated with mean expression. Consequently, the apportionment parameters are correlated with mean expression with coefficients of -0.446, 0.388, and 0.166 for Net, Nit, and Nst (P <0.001), respectively. The proportion of variation explained by mean expression for each apportionment parameter is thus 20%, 15%, and 2.7% for Net, Nit, and Nst, respectively. This suggests that mean expression is having a modest influence on parameter estimates, and the acquisition of more reads will not greatly influence the apportionment estimates.
Differential gene expression among individuals
The proportion of genes that vary significantly among individuals in expression levels was analyzed via a F-ratio test between inter-individual and intra-individual variance. We observed that 5,880 genes, or 44.5% of all genes (at an FDR 5%), exhibited significant among individual, within group variation. Additionally, fitting two linear models to the data (a null model and a second model that includes individuals as an explanatory variable), followed by a Chi-squared test of model fitting, results in 5,491 genes (41.7% of all genes) with significant inter-individual variance (at an FDR 5%). There is an 84% overlap between the significant genes in both analyses. We estimated the proportion of within-group variation explained by inter-individual variation with the parameter Nis (SSb/SSb + SSe; see Methods). On average 64% of the within-group variation is attributed to individuals, indicating substantial inter-individual variation. Those genes that are significantly differentially expressed (DE) among individuals, as determined by the F-ratio test, have a minimum Nis value of 0.65. To determine if there may be significant variation attributed to intra-individual variation at some loci, we inverted the F-ratio test by placing the intra-individual mean squares in the numerator and inter-individual mean squares in the denominator, but observed no significant loci after Benjamini-Hochberg correction. Overall, this illustrates that there is substantial inter-individual variation in gene expression variation.
Differential gene expression among groups
Three different methods were used to identify and quantify genes that may be differentially expressed among human groups: two published methods (DESeq  and tweeDESeq ), and a permutation of the hierarchical ANOVA. The two published methods can only compare two groups at a time, while permutations of the hierarchical ANOVA permit the analysis of two or more groups simultaneously.
Number of differentially expressed genes
tweeDESeq 5% FDR
GO and KEGG enriched pathways for pairwise union of DE genes
Adjusted P value
Adjusted P value
Cell adhesion molecules (CAMs)
Multicellular organismal process
Single-multicellular organism process
Response to stress
Single-organism cellular process
Response to stimulus
Immune system process
Complement and coagulation cascades
Multicellular organismal development
Hematopoietic cell lineage
Staphylococcus aureus infection
Single organism signaling
Arachidonic acid metabolism
Non-neutral gene expression profiles
Although it is difficult to determine if expression at a particular gene is evolving according to neutrality or under selection, we are able to identify expression profiles that conform to four specific patterns of selection: directional, balancing, stabilizing, and diversifying. Importantly, these analyses do not test for deviations from neutrality, but rather identify genes that exhibit expression profiles consistent with selection on quantitative traits [38,39]. Traits under directional selection are expected to exhibit shifts in mean expression among groups exemplified by greater among group variation relative to within group variation, and would hence be consistent with previously identified DE genes. Balancing selection is exemplified by high diversity or variation among individuals within a population but low variation among populations. Stabilizing selection results in low levels of expression variance among individuals while diversifying selection is reflected in high levels of expression variance among individuals. We identified genes that typify each selection profile using apportionment of variation estimates, estimates of total expression variance, and a series of permutations, as described in Methods.
Expression variance, genetic diversity, and network connectivity
The prevalence of genes that deviate from neutral-drift expectations, particularly those consistent with stabilizing selection, prompted us to hypothesize that inter-individual variance in gene expression must have a genetic component. Specifically, we hypothesized that genes with greater expression constraint would have greater genetic constraint. Additionally, genes exhibiting large inter-individual expression variances may allow, through relaxed constraint or by necessity, a relative excess of variation. To evaluate this hypothesis, we tested for a correlation between expression variance and pairwise genetic diversity. Pairwise genetic diversity (π) was calculated for each gene, controlling for gene length , for three populations from the 1000 Genomes data: CEU = Northern Europeans, ASW = African Americans from the southwest USA, and CHS = Han Chinese from Southern China. We chose these three populations as they are the best available proxies for our sampled individuals. When diversity is compared from each population to expression variance, we observe a significant positive correlation (ASW: r = 0.213; CEU: r = 0.189; CHS: r = 0.177, P < 2.2e-16 Figure S5 in Additional file 1). In addition, expression variance also correlates with Tajima’s D values (ASW: r = 0.179; CEU: r = 0.129; CHS: r = 0.132, P < 2.2e-16 . These observations indicate that total expression variance has a small (r-squared = 0.04) albeit significant genetic and thus heritable component.
Another factor that may influence expression variance is the number of interacting partners a gene has. Previous work on gene-networks has illustrated that the degree of connectivity (number of interactions) influences the rate of molecular evolution . Here, using data from BioGrid we tested if the number of interacting genes also influences the expression variance of a gene (Figure S6 in Additional file 1). Indeed, we observe a weak tendency for the expression variance to increase as the number of interacting genes decreases (Pearson’s r = -0.28, P < 2.2e-16 ).
To evaluate how both genetic diversity and connectivity may together influence gene expression variance we built an ANOVA model setting the coefficient of variation in gene expression as the response variable, and setting gene diversity and connectivity as explanatory variables with interaction. Each component of the model significantly influenced expression variance (diversity P < 2.2e-16; connectivity P < 2.2e-16; interaction P = 0.029) explaining an estimated 4.3%, 2.3%, and 0.07% of the total variance in expression variance, respectively.
Gene co-expresssion modules and functionality of selection categories
To determine if the sets of genes corresponding to the four non-neutral modes of evolution have a coherent biological effect, we tested for evidence of co-expression networks and enrichment in GO gene ontology terms and KEGG functional pathways. No enrichment was observed for genes consistent with a pattern of balancing selection. The results from the three other non-neutral modes are presented below.
Top 10 GO and KEGG enriched categories for stabilizing and diversifying selection
Adjusted P value
Adjusted P value
mRNA metabolic process
Cellular macromolecule metabolic process
Protein processing in endoplasmic reticulum
Cellular protein metabolic process
Macromolecule metabolic process
Cellular metabolic process
Ubiquitin mediated proteolysis
Cellular macromolecule catabolic process
mRNA surveillance pathway
Cytokine-cytokine receptor interaction
Single-multicellular organism process
Neuroactive ligand-receptor interaction
Multicellular organismal process
Cell adhesion molecules (CAMs)
Growth factor activity
Hedgehog signaling pathway
Carbohydrate derivative binding
Glycine, serine and threonine metabolism
Stabilizing genes formed four co-expression modules that, as a unit (Additional file 3: Table I), are associated with 1,245 GO ontology terms and 51 KEGG pathways at an FDR of 20% (898 GO and 39 KEGG at an FDR of 5%) and are involved with basic, largely intracellular, processes (Table 4). These include association with the splicesome, ribosomes, RNA transport, and protein processing. But they are also associated with neurological diseases such as Huntington’s, Parkinson’s, and Alzheimer’s disease. Finally, there are also associations with bacterial infection, hepatitis C, T-cell signaling, and cancer pathways. Individually, each module has a unique functional composition, but there is overlap at varying degrees for a few key pathways that include basic intracellular functions and associations with neurological diseases (Additional file 3: Table J and K).
The influence of biological traits on gene expression
Along with population ancestry, several anthropometric and dietary traits were also collected from each individual, to evaluate their association with expression variation. Starting with the model of gene expression used previously, which included technical (number of mapped reads and RNA quality) and population factors (group and individual), eight additional traits were added: sex of the child, weight of the child, length of the child, birthing manner (Cesarean or vaginal), maternal age, maternal body mass index, whether or not the mother drinks alcohol (outside of the pregnancy), and whether or not the mother is a vegetarian (see Methods for model details). Note that each new trait being modeled is a measure of inter-individual variation. The significance for each factor was determined by an F-test (FDR of 5%) using the mean square estimates of each factor over the residual (intra-individual variation).
Discussion and conclusion
Using a population genetics framework, we have demonstrated that both intra- and inter-individual variation account for the vast majority of total gene expression variation. Significantly, intra-individual variation in gene expression cannot be ignored in evaluating expression variation, consistent with studies of single cell gene expression that have illustrated the stochastic nature of expression variation . While this is particularly true for the placenta it also holds true for other tissues [46,47]. If intra-individual variation is not measured it will be erroneously attributed to inter-individual variation, thereby inflating estimates of inter-individual variation.
Gene expression profiles were dissected to evaluate the impact that non-neutral evolutionary forces may play in shaping expression variation. We observed that the majority of placental expression variation is consistent with a neutral-drift model, but an estimated 35% of the placental transcriptome is influenced by selection. Stabilizing selection plays a large role on transcriptome variation maintaining significant regulatory control over some 25% of the genes. Genes influenced by stabilizing selection are largely limited to basic intra-cellular functions. In contrast, the approximately 4% of genes influenced by diversifying selection typically encode extra-cellular proteins involved in cell signaling, metabolism, and immune pathways. Interestingly, the genes influenced by directional selection span the range of inter-individual variation observed, overlapping with profiles consistent with stabilizing and diversifying selection (Figure 3A). That is, directional selection can act on any gene regardless of the range of inter-individual variance. Therefore, measurements of fold change in gene expression that do not account for total expression variance can be misleading (Figure 4). Additionally, we find that expression diversity correlates with genetic diversity, substantiating a role for genetic selection in influencing inter-individual expression variation.
Among group variation in placental gene expression averages out to an Mst of 0.045 (Nst = 0.079), which is less than that found for human genetic variation (Fst = 0.111) . This suggests that placental transcriptome variation among groups is more similar than genetic variation alone would predict. Our estimates and thus our conclusions are certainly influenced by the accuracy with which we can measure and apportion variation for such a dynamic and quantitative trait. However, these estimates are qualitatively consistent with similar recent estimates of among group variation derived from lymphoblastoid cell lines [19,21].
Interestingly, where significant variation in expression level is manifested among human groups, it associates with genes pivotal to placental biology, fetal growth, and fetal development. This includes cell-cell interaction pathways like cell adhesion molecules , arachidonic acid metabolism , tryptophan metabolism , and immune response pathways including malaria, which is known to present serious health risks to the fetus [51,52]. While these inter-individual and inter-group observations are of potential interest for clinicians and biologists, a crucial point concerning the evolutionary consequences of these observations is heritability. In both the pairwise population analyses and k-means clustering of directional genes, individuals of South Asian ancestry appear to have experienced both the greatest change and the most biologically specific changes in placental gene expression. However, the mothers of all of these individuals were born in South Asia, and it is unknown how long they resided in the sample location prior to sampling. Thus, whether these observations are the product of a heritable, evolutionary adaptive response or the product of these particular individuals being exposed to an individually novel environment and presenting a plastic response cannot be determined. Nonetheless, those genes with expression profiles consistent with models of directional selection exemplify genes and pathways that may be most frequently targeted during adaptive responses to novel environments.
We have also demonstrated that by incorporating biological trait variation in models of gene expression, we can identify genes and pathways that have correlated changes with the modeled traits. By evaluating the direction of the correlated change and combining this information with biological and/or clinical information, this framework allows the potential influence of the trait to be dissected. For example, pre-pregnancy alcohol consumption is associated with the regulation of essential pathways like glycolysis/glycogenesis.
Finally, these observations provide a first insight into human, in vivo, gene expression variation among populations of cells within a tissue, among individuals and among continental groups. The model of gene expression variation presented here is adaptable to any system and the apportionment parameters based on the sums of squares provide a set of stable statistics that can be compared across studies. Importantly, classifying genes into selection categories is difficult as there are a number of assumptions involved. We stress that no formal tests of selection were performed in this study. Instead, we presented a framework to identify genes with expression profiles that are consistent with theoretical expectations of selection on a quantitative trait. Hopefully, this work will provide a foundation for the development of a neutral theory of gene expression in which formal tests of selection may be conducted. Further, we note that precision in the apportionments can be strengthened by increasing the number of sequencing reads, adding technical replicates, and increasing the number of both tissue replicates and individuals. Notably, only a single complex tissue was evaluated in this study - additional biological and evolutionary insight can be gained by studying other tissues or, as single-cell transcriptomic methodologies become more mature, specific cell types. In addition, sampling individuals of similar ancestry at multiple locations would allow one to estimate the influence of both environment and ancestry on expression variation. The framework and methodologies present here provide a foundation for further such studies of transcriptome variation.
All placentas were collected in October and November 2006 at Northside Hospital in Atlanta, Georgia, with the approval of the Northside Hospital Institutional Review Board (NSH #804), with the written informed consent of the donors and in accordance with the Declaration of Helsinki agreement.
A total of 66 human placentas were collected and processed within 1 h of delivery from both natural and Cesarean births. Placentas were quartered, wrapped in aluminum envelopes, and immediately snap frozen in liquid nitrogen. All samples were stored at -80°C prior to shipping on dry ice to the Max Plank Institute in Leipzig, Germany, where they were again stored at -80°C. Each contributing family completed a questionnaire which asked for self-described ancestry and birthplace going back three generations and anthropometric, health, and lifestyle questions about the mother including: height, weight, weight at full term, number of pregnancies, number of children, smoking status, alcohol intake, illness during pregnancy, chronic illnesses, medication taken during pregnancy, diet, and any other volunteered information. Finally, the sex, weight, length, and the delivery manner of the child were recorded. From this collection and the provided data, we selected 40 samples to include in the study. Samples were chosen only from those families with self-described ancestry from a single group, with no major illnesses during birth, and with the most complete questionnaires. The final 40 samples include 10 samples each of African-American (AF), European-American (EU), South Asian-American (India; SA) and East Asian-American ancestry (Korea, China, Vietnam, and Taiwan; EA). All SA individuals are first-generation immigrants, and all but one of the EA individuals are first-generation immigrants; the exception is a second-generation American.
Given the mosaic composition of the placenta, possible maternal blood/tissue contributions to any dissection, and previous observations that placental sample location influences expression variation , we produced tissue sample replicates for each individual. Tissue sample or dissection replicates were generated to quantify expression variation introduced in the dissection process. Specifically, tissue replicates quantify intra-individual variation in the form of (a) cell-type heterogeneity, (b) biological variation across a tissue, and (c) temporal and stochastic variation in gene expression, thus allowing for a more accurate estimation of the variation found among individuals. From three of the four-quarters of each placenta we dissected 100 mg of centrally located villus parenchyma tissue (taking care to avoid decidua, chorion, or amnion tissue) from five non-adjacent locations, totaling 600 dissections. The five dissections from each quarter were pooled, resulting in three sample replicates from each placenta. Five non-adjacent dissections were taken in an effort to homogenize the cell-type composition among samples. All dissections were carried out on a sterilized steel plate situated on top of dry ice, thereby keeping the samples frozen at all times. Samples for dissection were chosen at random to avoid any possible dissection processing effect that would correlate with individual ancestry.
Total RNA isolation
RNA was extracted and isolated from each of the three sample replicates from each placenta using TRIZOL reagent (Invitrogen) following manufacturer recommendations. Total RNA was purified using the Qiagen RNAeasy minElute Cleanup kits and RNA quality was determined using Agilent 6000 Nano kits and an Agilent Bioanalyzer.
Construction of indexed RNA-Seq libraries
Based on the RNA integrity numbers (RIN values) the two best sample replicates from each placenta were chosen to construct indexed Illumina RNA-Seq libraries. The indices are sample-specific and allow for the pooling and sequencing of all libraries together, thereby minimizing lane and run effects on the RNA-Seq data. Library construction followed a merging of Illumina’s RNA-Seq library preparation and an indexing protocol  which introduces barcodes for each library during an enrichment PCR step. Library construction included the following steps: two rounds of mRNA capture with oligo dT magnetic beads, mRNA fragmentation, first strand synthesis, second strand synthesis, end repair, index adapter ligation, adapter fill-in, size-selection, indexing/enrichment PCR, and finally quantification. All steps, including SPRI bead reaction clean-ups, were processed in parallel in a 96-well plate, where all samples were randomized across the plate, thereby eliminating any library processing batch effect.
Sequencing, base calling, and mapping
The 80 indexed libraries were pooled in equimolar ratios and sequenced on nine lanes over three runs on the Illumina Genome Analyzer IIx platform. Eight lanes were single-end 76 bp (base pair) reads and a ninth lane was a 76 bp paired-end run. Base calling was done using Ibis  and mapping was performed with TopHat2 , a spliced-read mapper which is built on top of the Bowtie mapper . Reads were mapped to the human reference genome build hg19 (GRCh37). Reads were annotated to known Ensembl 70 genes. All count data were normalized (variance stabilized) using protocols described in the DESeq2 package . In instances where data for individuals are used (such as in PCAs), the raw count data from each replicate for each individual were summed and data for individuals was independently normalized with the aforementioned method.
Apportioning expression variation
where y is normalized gene expression for the kth sample replicate in the jth individual in the ith group, x and z are technical explanatory variables (x is the number of mapped reads for each library and z the RIN value), and μ is mean expression for any gene g. The group (A), individual (B), and sample replicate (e) effects are assumed to be random with variance σA 2, σB 2, and σ2, respectively. After removing the variance from technical factors (x and z), the total variance in gene expression can then be apportioned as σ2 T = σA 2 + σB 2 + σ2. We summarize the amount of expression variance attributed to groups as σA 2 /σ2 T and define this correlation coefficient as Mst, the expression variance analog to the standard among-group component of the total genetic variance, Fst [1,57]. Further we can define the correlation coefficients Met and Mit as the amount of expression variance attributed to sample replicates and error (Met = σ2/σ2 T), and to individuals (Mit = σB 2 /σ2 T). Each parameter ranges in value from 0 to 1 and the sum of these parameters, for each gene, equals 1.
Similarly we also estimated a complementary (η2) statistic for each explanatory factor, using the sums of squares (SS). In this instance, total gene expression variation can be expressed as SST = SSA + SSB + SSe, and we can subsequently define the parameters Net (SSe/SST), Nit (SSB/SST), and Nst (SSA/SST), which mirror the aforementioned parameters derived from the additive components of variance (Met, Mit, and Mst, respectively). Additionally, we defined Nis as SSB/ (SSB + SSe) to quantify the amount of inter-individual variation relative to the total inter- and intra-individual variation. Finally, we defined Nig as SSB/ (SSA + SSB) to quantify the amount of inter-individual variation relative to the total inter-individual and inter-group variation. We will refer to these two sets of parameters as the apportionment of variance (using σ2) and apportionment of variation (using SS) parameters, respectively. An ANOVA table providing further details of the models can be found in Table S1 and S2 in Additional file 1.
Mode of selection permutations
To determine if variation at each gene may be consistent with a particular mode of selection, a series of permutations were performed, building on the models of Whitehead and Crawford . There are four types of selection to consider: directional, balancing, stabilizing, and diversifying. First, directional selection, or simply differential expression (DE), is typified by large variation among groups. To test for directional selection, we permuted individuals among groups 1,000 times, maintaining replicate associations, randomly sampled 100 genes for each permutation, and apportioned variation as described above. The 99th percentile of the permuted Nst distribution was taken as a cutoff for extreme Nst values and thus DE genes.
The second type of selection is balancing selection, typified by high among individual variation along with low among population variation [38,58,59]. Balancing selection was examined by permuting sample replicates among individuals within groups 1,000 times (randomizing inter-individual differences), randomly sampling 100 genes for each permutation and apportioning variation. The parameter Nig was used to identify genes with significantly more variation among individuals than among groups. The 99th percentile of the permuted Nig distribution was taken as a cutoff for extreme Nig values.
The other types of selection are stabilizing selection (characterized by low among individual variation) and diversifying selection (characterized by high individual variation). In these later two modes of selection we specifically assume that selection does not vary spatially and is thus uniform across populations. To identify profiles consistent with stabilizing or diversifying selection, we generated a random distribution of inter-individual variances as follows. Gene expression was normalized across all genes, so that all genes have the same mean expression. We then randomly selected the expression level of any one gene from each individual to create a new artificial gene. We did this 10,000 times and calculated the variance across all individuals with no regard for population association. The first percentile and 99th percentile of this distribution were taken as cutoffs for stabilizing and diversifying selection, respectively.
GO and KEGG enrichment
Enrichment in GO (Gene Ontology) categories and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways were performed using the GOSeq  R package, designed to account for read count biases in transcript length from RNA-Seq data. In all enrichment analyses we present results using two false discovery rate (FDR) cutoffs - a high confidence FDR of 5% and a moderate confidence FDR of 20%. All P values and FDRs are provided in supplementary materials.
Co-expression modules, network construction, and profile partitioning
Gene co-expression modules were identified using a weighted gene co-expression network analysis (WGCNA) . Network graphs were constructed using the graph.adjacency function from the igraph package. Interacting genes, used to build the network, were identified by using the dissimilarity values from the WGCNA analysis but limiting them to those that were additionally significant in a linear regression correlation analysis at an FDR of 1%. For our data and this analysis, an FDR of 1% corresponds roughly to a Pearson’s r > = 0.6 and a dissimilarity value < = 0.3. Gene expression profile partitioning was performed using k-means clustering (kmeans function in the stats package in R). Data from individuals was used and expression at each gene was normalized, prior to clustering, to have a mean of 0 and a standard deviation of 1.
All analyses were carried out in the programming language R  with in-house scripts and the aforementioned packages. All P values are Benjamini-Hochberg  adjusted using the p.adjust function from the R package stats, and significance is taken at a P adjust/FDR of 0.05, unless stated otherwise.
We validate three genes with extreme Mst values by performing rt-qPCR. In this analysis one of the two original sample replicates was used, along with the third RNA sample processed at the same time as the study samples but not used to create an RNA-Seq library. The Maxima SYBR Green qPCR Master Mix from Fermentas was used following the manufacturer’s instructions. Primer sequences are presented in Table S3 in Additional file 1.
All raw, tabulated, and normalized RNA-Seq data can be found in the Gene Expression Omnibus (GEO) under the accession number GSE66622.
We would like to thank all donors for their participation, and the nursing staff at Northside Hospital in Atlanta, Georgia for their assistance in the sample collection.
The Max Planck Society and a grant for visiting young scientists from the Chinese Academy of Science (2009YB1-12) funded the project. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
- Lewontin RC, Krakauer J. Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics. 1973;74:175–95.PubMed CentralPubMedGoogle Scholar
- Akey JM, Zhang G, Zhang K, Jin L, Shriver MD. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 2002;12:1805–14.View ArticlePubMed CentralPubMedGoogle Scholar
- International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–320.View ArticleGoogle Scholar
- Watkins WS, Ricker CE, Bamshad MJ, Carroll ML, Nguyen SV, Batzer MA, et al. Patterns of ancestral human diversity: an analysis of Alu-insertion and restriction-site polymorphisms. Am J Hum Genet. 2001;68:738–52.View ArticlePubMed CentralPubMedGoogle Scholar
- Watkins WS, Rogers AR, Ostler CT, Wooding S, Bamshad MJ, Brassington A-ME, et al. Genetic variation among world populations: inferences from 100 Alu insertion polymorphisms. Genome Res. 2003;13:1607–18.View ArticlePubMed CentralPubMedGoogle Scholar
- Bamshad MJ, Wooding S, Watkins WS, Ostler CT, Batzer MA, Jorde LB. Human population genetic structure and inference of group membership. Am J Hum Genet. 2003;72:578–89.View ArticlePubMed CentralPubMedGoogle Scholar
- Beaumont MA, Balding DJ. Identifying adaptive genetic divergence among populations from genome scans. Mol Ecol. 2004;13:969–80.View ArticlePubMedGoogle Scholar
- Myles S, Tang K, Somel M, Green RE, Kelso J, Stoneking M. Identification and analysis of genomic regions with large between-population differentiation in humans. Ann Hum Genet. 2008;72:99–110.PubMedGoogle Scholar
- Grossman SR, Andersen KG, Shlyakhter I, Tabrizi S, Winnicki S, Yen A, et al. Identifying recent adaptations in large-scale genomic data. Cell. 2013;152:703–13.View ArticlePubMed CentralPubMedGoogle Scholar
- Colonna V, Ayub Q, Chen Y, Pagani L, Luisi P, Pybus M, et al. Human genomic regions with exceptionally high levels of population differentiation identified from 911 whole-genome sequences. Genome Biol. 2014;15:R88.View ArticlePubMed CentralPubMedGoogle Scholar
- Whitlock MC. Evolutionary inference from QST. Mol Ecol. 2008;17:1885–96.View ArticlePubMedGoogle Scholar
- Relethford JH. Apportionment of global human genetic diversity based on craniometrics and skin color. Am J Phys Anthropol. 2002;118:393–8.View ArticlePubMedGoogle Scholar
- Myles S, Somel M, Tang K, Kelso J, Stoneking M. Identifying genes underlying skin pigmentation differences among human populations. Hum Genet. 2007;120:613–21.View ArticlePubMedGoogle Scholar
- Darwin C. The Descent of Man. 1st ed. London: John Murray; 1871.Google Scholar
- Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, et al. Population genomics of human gene expression. Nat Genet. 2007;39:1217–24.View ArticlePubMed CentralPubMedGoogle Scholar
- Spielman RS, Bastone LA, Burdick JT, Morley M, Ewens WJ, Cheung VG. Common genetic variants account for differences in gene expression among ethnic groups. Nat Genet. 2007;39:226–31.View ArticlePubMed CentralPubMedGoogle Scholar
- Storey JD, Madeoy J, Strout JL, Wurfel M, Ronald J, Akey JM. Gene-expression variation within and among human populations. Am J Hum Genet. 2007;80:502–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Zhang W, Duan S, Kistner EO, Bleibel WK, Huang RS, Clark TA, et al. Evaluation of genetic variation contributing to differences in gene expression between populations. Am J Hum Genet. 2008;82:631–40.View ArticlePubMed CentralPubMedGoogle Scholar
- Stranger BE, Montgomery SB, Dimas AS, Parts L, Stegle O, Ingle CE, et al. Patterns of Cis regulatory variation in diverse human populations. PLoS Genet. 2012;8:e1002639 EP.Google Scholar
- Price AL, Patterson N, Hancks DC, Myers S, Reich D, Cheung VG, et al. Effects of cis and trans genetic ancestry on gene expression in African Americans. PLoS Genet. 2008;4:e1000294.View ArticlePubMed CentralPubMedGoogle Scholar
- Lappalainen T, Sammeth M, Friedländer MR, THoen PAC, Monlong J, Rivas MA, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501:506–11.View ArticlePubMed CentralPubMedGoogle Scholar
- International HapMap Consortium. The International HapMap Project. Nature. 2003;426:789–96.View ArticleGoogle Scholar
- Dausset J, Cann H, Cohen D, Lathrop M, Lalouel JM, White R. Centre d’etude du polymorphisme humain (CEPH): collaborative genetic mapping of the human genome. Genomics. 1990;6:575–7.View ArticlePubMedGoogle Scholar
- Idaghdour Y, Czika W, Shianna KV, Lee SH, Visscher PM, Martin HC, et al. Geographical genomics of human leukocyte gene expression variation in southern Morocco. Nat Genet. 2010;42:62–7.View ArticlePubMed CentralPubMedGoogle Scholar
- Somel M, Khaitovich P, Bahn S, Pääbo S, Lachmann M. Gene expression becomes heterogeneous with age. Curr Biol. 2006;16:R359–60.View ArticlePubMedGoogle Scholar
- Whitney AR, Diehn M, Popper SJ, Alizadeh AA, Boldrick JC, Relman DA, et al. Individuality and variation in gene expression patterns in human blood. Proc Natl Acad Sci U S A. 2003;100:1896–901.View ArticlePubMed CentralPubMedGoogle Scholar
- Sood R, Zehnder JL, Druzin ML, Brown PO. Gene expression patterns in human placenta. Proc Natl Acad Sci U S A. 2006;103:5478–83.View ArticlePubMed CentralPubMedGoogle Scholar
- Shalek AK, Satija R, Adiconis X, Gertner RS, Gaublomme JT, Raychowdhury R, et al. Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature. 2013;498:236–40.View ArticlePubMed CentralPubMedGoogle Scholar
- Kang HJ, Kawasawa YI, Cheng F, Zhu Y, Xu X, Li M, et al. Spatio-temporal transcriptome of the human brain. Nature. 2011;478:483–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106.View ArticlePubMed CentralPubMedGoogle Scholar
- Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, et al. Genetic structure of human populations. Science. 2002;298:2381–5.View ArticlePubMedGoogle Scholar
- López Herráez D, Bauchet M, Tang K, Theunert C, Pugach I, Li J, et al. Genetic variation and recent positive selection in worldwide human populations: evidence from nearly 1 million SNPs. PLoS One. 2009;4:e7888.View ArticlePubMed CentralPubMedGoogle Scholar
- Xing J, Watkins WS, Shlien A, Walker E, Huff CD, Witherspoon DJ, et al. Toward a more uniform sampling of human genetic diversity: a survey of worldwide populations by high-density genotyping. Genomics. 2010;96:199–210.View ArticlePubMed CentralPubMedGoogle Scholar
- Roden J, King B, Trout D, Mortazavi A, Wold B, Hart C. Mining gene expression data by interpreting principal components. BMC Bioinformatics. 2006;7:1–22.View ArticleGoogle Scholar
- Goldinger A, Henders AK, McRae AF, Martin NG, Gibson G, Montgomery GW, et al. Genetic and nongenetic variation revealed for the principal components of human gene expression. Genetics. 2013;195:1117–28.View ArticlePubMed CentralPubMedGoogle Scholar
- Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319:1100–4.View ArticlePubMedGoogle Scholar
- Esnaola M, Puig P, Gonzalez D, Castelo R, Gonzalez JR. A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments. BMC Bioinformatics. 2013;14:254.View ArticlePubMed CentralPubMedGoogle Scholar
- Whitehead A, Crawford DL. Neutral and adaptive variation in gene expression. Proc Natl Acad Sci U S A. 2006;103:5425–30.View ArticlePubMed CentralPubMedGoogle Scholar
- Leinonen T, McCairns RJS, O’Hara RB, Merilä J. QST–FST comparisons: evolutionary and ecological insights from genomic heterogeneity. Nat Rev Genet. 2013;14:179–90.View ArticlePubMedGoogle Scholar
- Khaitovich P, Weiss G, Lachmann M, Hellmann I, Enard W, Muetzel B, et al. A neutral model of transcriptome evolution. PLoS Biol. 2004;2:E132.View ArticlePubMed CentralPubMedGoogle Scholar
- Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics. 1983;105:437–60.PubMed CentralPubMedGoogle Scholar
- Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW. Evolutionary rate in the protein interaction network. Science. 2002;296:750–2.View ArticlePubMedGoogle Scholar
- Gallego Romero I, Pai AA, Tung J, Gilad Y. RNA-Seq: impact of RNA degradation on transcript quantification. BMC Biol. 2014;12:42.View ArticlePubMed CentralPubMedGoogle Scholar
- Bjørge T, Sørensen HT, Grotmol T, Engeland A, Stephansson O, Gissler M, et al. Fetal growth and childhood cancer: a population-based study. Pediatrics. 2013;132:e1265–75.View ArticlePubMed CentralPubMedGoogle Scholar
- Milne E, Greenop KR, Metayer C, Schüz J, Petridou E, Pombo-de-Oliveira MS, et al. Fetal growth and childhood acute lymphoblastic leukemia: findings from the childhood leukemia international consortium. Int J Cancer. 2013;133:2968–79.View ArticlePubMed CentralPubMedGoogle Scholar
- Boedigheimer MJ, Wolfinger RD, Bass MB, Bushel PR, Chou JW, Cooper M, et al. Sources of variation in baseline gene expression levels from toxicogenomics study control animals across multiple laboratories. BMC Genomics. 2008;9:285.View ArticlePubMed CentralPubMedGoogle Scholar
- Tamura K, Ono A, Miyagishima T, Nagao T, Urushidani T. Comparison of gene expression profiles among papilla, medulla and cortex in rat kidney. J Toxicol Sci. 2006;31:449–69.View ArticlePubMedGoogle Scholar
- Summers K, Crespi B. Cadherins in maternal-foetal interactions: red queen with a green beard? Proc Biol Sci. 2005;272:643–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Crawford M. Placental delivery of arachidonic and docosahexaenoic acids: implications for the lipid nutrition of preterm infants. Am J Clin Nutr. 2000;71:275S–84.PubMedGoogle Scholar
- Munn DH, Zhou M, Attwood JT, Bondarev I, Conway SJ, Marshall B, et al. Prevention of allogeneic fetal rejection by tryptophan catabolism. Science. 1998;281:1191–3.View ArticlePubMedGoogle Scholar
- Umbers AJ, Aitken EH, Rogerson SJ. Malaria in pregnancy: small babies, big problem. Trends Parasitol. 2011;27:168–75.View ArticlePubMedGoogle Scholar
- Brabin BJ, Romagosa C, Abdelgalil S, Menéndez C, Verhoeff FH, McGready R, et al. The sick placenta-the role of malaria. Placenta. 2004;25:359–78.View ArticlePubMedGoogle Scholar
- Kircher M, Stenzel U, Kelso J. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. 2009;10:R83.View ArticlePubMed CentralPubMedGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105–11.View ArticlePubMed CentralPubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.View ArticlePubMed CentralPubMedGoogle Scholar
- Giger T, Excoffier L, Day PJR, Champigneulle A, Hansen MM, Powell R, et al. Life history shapes gene expression in salmonids. Curr Biol. 2006;16:R281–2.View ArticlePubMedGoogle Scholar
- Weir B, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution. 1984;38:1358–70.View ArticleGoogle Scholar
- Nuzhdin SV, Wayne ML, Harmon KL, McIntyre LM. Common pattern of evolution of gene expression level and protein sequence in Drosophila. Mol Biol Evol. 2004;21:1308–17.View ArticlePubMedGoogle Scholar
- Rifkin SA, Kim J, White KP. Evolution of gene expression in the Drosophila melanogaster subgroup. Nat Genet. 2003;33:138–44.View ArticlePubMedGoogle Scholar
- Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol. 2010;11:R14.View ArticlePubMed CentralPubMedGoogle Scholar
- Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559.View ArticlePubMed CentralPubMedGoogle Scholar
- R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2014. [http://www.R-project.org/]
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodological). 1995;57:289–300.Google Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.