RNA-seq sample clustering based on gene expression
Gene expression matrix
Raw cRPKM values were obtained from the study by Barbosa-Morais et al. [2]. To have a balanced design, the original matrix was restricted to species for which the same six organs were available (see below). The final matrix consisted of seven vertebrates, including human, chimpanzee, rhesus, mouse, opossum, platypus, and chicken, and six organs, including brain, cerebellum, heart, liver, kidney, and testes.
We restricted the analyses to protein-coding genes with a one-to-one orthology relationship in the seven species. We used the orthology relationships of the Barbosa-Morais et al. study [2], which include 6787 orthologous genes. Of these, we retained 6393 orthologs after checking for consistency against each annotation set in Ensembl v65, for each species (genome and annotation files from the Barbosa-Morais study can be found in Additional file 3: Table S2). Finally, we intersected this set with the list of one-to-one protein-coding orthologs between human and mouse provided by the mouse ENCODE consortium [12], to get a final matrix consisting of expression values for 6283 genes in 42 samples (Additional file 3: Tables S3 and S4).
Hierarchical clustering and PCA
In Figs. 1
a and 2
a, the samples are clustered hierarchically based on their pairwise Pearson’s correlation coefficients of gene expression values, where cRPKM are log10-normalized after adding a pseudocount of 0.01. The samples are then clustered on the vector of the correlation coefficients, with one minus Pearson’s correlation coefficient (1−|r|) as a distance metric, using the complete linkage clustering algorithm.
In Additional file 1: Figure S6A, B, the samples and genes are clustered hierarchically based on gene expression values directly. Again cRPKM are log10-normalized after adding a pseudocount of 0.01 and the complete linkage clustering algorithm is applied on Euclidean distances.
PCA, as shown in Figs. 1
b, 2
b and 3
d, e, was performed on cRPKM values normalized in the same way, but centered and scaled across all the samples for each gene. PCA, as shown in Fig. ??b, c, however, was performed after centering and scaling the normalized cRPKM for each gene across all the organs in a given species (Fig. ??b), and across all the species for a given organ (Fig. ??c), respectively.
Network modularity
The modularity of a graph with respect to some division (or vertex types) measures how good the division is, or how separated the different vertex types are from each other. In this study, we build a graph where samples are vertices (or nodes). Two vertices or samples are connected if the Pearson’s correlation coefficient between them, computed on the gene expression values, is higher than a certain threshold (excluding connections of a sample with itself). As in hierarchical clustering and PCA, gene expression values are log10-transformed cRPKM after adding a pseudocount of 0.01. The vertex types on which the modularity is computed are either the organ or the species classification. To compute the modularity, we used the function modularity() from the R package igraph v0.7.1, which implements the following definition [14]:
$$ Q = \frac{1}{2m} \times \sum_{i} \sum_{j} \left[ \left(A_{ij} - \frac{k_{i} \times k_{j}}{2m} \right) \delta(c_{i},c_{j}) \right], $$
(1)
where m is the number of edges, A
ij
is the element of the adjacency matrix A in row i and column j (corresponding to vertices i and j, respectively), k
i
is the degree of i, k
j
is the degree of j, c
i
is the type (or component) of i, c
j
that of j, the sum goes over all i and j pairs of vertices, and δ(x,y)=1 if x=y, and δ(x,y)=0 otherwise.
Finally, the modularity is plotted as a function of the network density, which is defined as the actual number of edges (based on the threshold of the correlation coefficient) over the total number of possible edges. We set self-connection to 0 in the adjacency matrix even though samples share an identity with themselves, to ensure self-connection does not inflate the modularity calculation. Conclusions are robust to setting self-connection to 1.
Projection score
The projection score is a measure of the informativeness of a subset of variables with respect to PCA visualization [13]. Here, we subset the variables, i.e., the genes, based on increasing thresholds of their variance across all samples (as a ratio to the maximum variance). For each subset of genes, the projection score is computed over 100 permutations with respect to the first three PCs (Additional file 1: Figure S3A), and the subset with the highest score is selected for further analyses. This subset includes 256 genes (Additional file 3: Table S5), and their log10-transformed cRPKM values are shown in Additional file 1: Figure S3B.
Linear models, variance decomposition, and SVG and TVG definition
The expression of each gene in a given sample is usually dependent on the identity of the sample, which here is represented by the organ and the species of origin. More formally, for an individual gene, a linear model can be built that describes its expression as the sum of the factors organ and species and a residual term:
$$ y_{ij} = \mu + \text{org}_{i} + \text{spc}_{j} + \epsilon_{ij}, $$
(2)
where y
ij
is the expression of a gene in organ i (of n
o organs) and species j (of n
s species), μ is the basal expression level of the gene, org
i
is the coefficient for organ i, spc
j
is the coefficient for species j, and ε
ij
is the residual term.
Thus, as in the ANOVA type of analysis, the total gene expression variation for each gene (or total sum of squares, SSTg) across all samples can be decomposed into three variations: variation across organs (SSOg), variation across species (SSSg), and a residual variation (SSRg):
$$ \text{SST}_{\mathrm{g}} = \text{SSO}_{\mathrm{g}} + \text{SSS}_{\mathrm{g}} + \text{SSR}_{\mathrm{g}}, $$
(3)
where
$$ \text{SST}_{\mathrm{g}} = \sum\limits_{i=1}^{n_{\mathrm{o}}}\sum\limits_{j=1}^{n_{\mathrm{s}}}(y_{ij}-\bar{y}_{\cdot\cdot})^{2}, $$
(4)
$$ \text{SSO}_{\mathrm{g}} = n_{\mathrm{s}}\sum\limits_{i=1}^{n_{\mathrm{o}}}(\bar{y}_{i{\cdot}} - \bar{y}_{\cdot\cdot})^{2}, $$
(5)
$$ \text{SSS}_{\mathrm{g}}=n_{\mathrm{o}}\sum\limits_{j=1}^{n_{\mathrm{s}}}(\bar{y}_{{\cdot}j} - \bar{y}_{\cdot\cdot})^{2}, $$
(6)
$$ \text{SSR}_{\mathrm{g}}=\sum\limits_{i=1}^{n_{\mathrm{o}}}\sum\limits_{j=1}^{n_{\mathrm{s}}}(y_{ij} - \bar{y}_{i\cdot} - \bar{y}_{{\cdot}j} + \bar{y}_{\cdot\cdot})^{2}, $$
(7)
and
$$ \bar{y}_{\cdot\cdot}=\sum\limits_{i=1}^{n_{\mathrm{o}}}\sum\limits_{j=1}^{n_{\mathrm{s}}} y_{ij}, $$
(8)
$$ \bar{y}_{i{\cdot}}=\frac{1}{n_{\mathrm{s}}} \sum\limits_{j=1}^{n_{\mathrm{s}}} {y_{ij}}, $$
(9)
$$ \bar{y}_{{\cdot}j}=\frac{1}{n_{\mathrm{o}}} \sum\limits_{i=1}^{n_{\mathrm{o}}} {y_{ij}}. $$
(10)
The relative contribution of each factor to the total gene expression variation can then be computed as the relative proportion of each variation with respect to the total. The linear model was implemented using the function lm() from basic R. A convenient in-house wrapper is available at https://github.com/abreschi/Rscripts/blob/master/anova.R.
In a two-factor linear mixed model, the factors organ and species can be considered as giving an independent additive contribution to the gene expression level with variances \(\sigma _{\mathrm {o}}^{2}\) and \(\sigma _{\mathrm {s}}^{2}\), respectively, along with an independent additive contribution of the residual term that has variance \(\sigma _{\mathrm {e}}^{2}\). In this case, the relative contribution of each factor (e.g., organ) to the gene expression variation can be thought of as the variance of that factor over the sum of the variances of both factors plus the residual variance (e.g., \(\sigma _{\mathrm {o}}^{2}/(\sigma _{\mathrm {o}}^{2} + \sigma _{\mathrm {s}}^{2} + \sigma _{\mathrm {e}}^{2})\) [23]). The linear mixed models were implemented by using the function lmer() of the R package lme4 v1.1-7.
As the correlation between the relative contributions with the linear model and with the linear mixed model is very high for both factors (Additional file 1: Figure S11A, B), we decided to use the linear model, which requires no estimation step and is more intuitive.
To remove genes with relatively low variability of expression, we filtered them based on their dynamic range, computed on cRPKM after adding a pseudocount of 0.01. The dynamic range for each gene is defined as the difference in order of magnitudes between the maximum and the minimum expression across all samples. We used a minimum threshold of 2 orders of magnitude [11], to retain only the most variable genes, which we refer to as unconstrained. Within this set of unconstrained genes, we further considered genes for which either species or organ explains at least 75 % of the variance (dashed line at x+y=0.75 on Fig. 3
a and f), and defined two sets of genes: genes whose relative variation of expression is twofold greater across species than across organs (SVGs) and genes whose relative variation of expression is twofold greater across organs than across species (TVGs). The unconstrained genes that are neither SVGs nor TVGs are referred to as others.
To find the distribution of the proportion of expression variation between human and each other species (Fig. 3
b), we built a linear model for all the organs of human and the other species. The gene expression values were log10-normalized, after adding a pseudocount of 0.01, and centered and scaled within each sample. Since gene expression is known to evolve much faster in testis [1], we performed the same analysis excluding testis. We found the same result (Additional file 1: Figure S4).
Properties of SVGs and TVGs
GO analysis
The GO term enrichment analysis in Additional file 1: Figure S8 was performed separately for each set of genes, with respect to all 6283 orthologous genes in the matrix, used as background. The enrichment is tested with the hypergeometric test implemented in the R package GOstats v2.34.0. Ensembl gene IDs are converted to entrez gene IDs via the R package org.Hs.eg.db v3.1.2, and mapped to gene ontology through the R package GO.db v3.1.2. The GO terms associated with the biological process hierarchy are sorted by their p values corrected for multiple testing (Benjamini–Hochberg correction [24]), and the top ten significantly enriched terms are shown for each group of genes.
Evolutionarily conserved genes
We computed the fraction of evolutionarily conserved genes as the proportion of genes in each class that were identified as being orthologous between human, fly (Drosophila melanogaster), and worm (Caenorhabditis elegans) as defined by the modENCODE consortium [16].
Promoter analysis
Promoter sequence conservation.
The promoter sequence conservation was computed for a window of 2000 bp upstream and 500 bp downstream of the transcription start site (TSS) of each gene. A gene TSS is defined as the most 5′ base of the gene. PhastCons scores [25] in this window are averaged from the bigwig file with the bwtool software [26] and this average is taken as a measure of promoter sequence conservation.
CpG island coverage.
For each base in a 6000-bp window around the TSS (±3000 bp) of the human genes of our set of one-to-one orthologs, we computed a binary overlap with CpG islands, as annotated in [17]. We then computed the proportion of genes in each class with at least one overlapping CpG island, as shown in Fig. 4
d.
H3K4me3 signal.
We compared the intensity of the H3K4me3 mark at the promoter of each category of gene in the five organs for which ChIP-seq experiments were available from the mouse ENCODE consortium [12] (Additional file 3: Table S6), namely cerebellum, heart, kidney, liver, and testes. As H3K4me3 is a mark known to be present at the promoter of the majority of actively transcribed genes, we restricted our comparison to a subset of TVGs specific to each organ, and to a subset of SVGs and others that are comparable to this subset of TVGs in terms of number of genes and expression values.
To select genes specific to each organ, we required the genes to be common to multiple species and shared by a limited number of organs. As shown in Additional file 1: Figure S12, five is the number of species for which we have the maximum number of genes specific to one organ and present in this number of species. We identified 1086 such genes (Additional file 1: Figure S13).
For each subset of TVGs specific to a given organ, we selected a subset of SVGs and other genes with the same number of genes and a similar expression. To select genes with comparable expression, we binned the expression values of SVGs, other genes, and TVGs specific to one organ in 50 expression bins. Then, for each bin, we randomly selected a number of genes from each class, corresponding to the minimum number of genes available in that bin.
The average signal intensity for each mark was computed around the TSS (±3000 bp) at each 10-bp bin for the three classes of genes in each organ (Fig. 4
e and Additional file 1: Figure S9).
GWAS and OMIM analyses
For each category of genes, we computed the proportion of genes with an associated disease in the OMIM database (http://omim.org/, version updated to June 2014) or a trait in the GWAS catalog (https://www.ebi.ac.uk/gwas/, version updated to June 2014). For the genes associated with a GWAS trait, we used the gene reported in the catalog, when available.
Analysis of inter-individual variation in GTEx
Gene expression values (RPKM) for the latest public GTEx release were downloaded from the GTEx portal (http://www.gtexportal.org/home/datasets/, file: GTEx_Analysis_v6_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz). To apply linear models to a balanced design matrix with organs and individuals, we retained gene expression data from the only four donors for which most of the organs in Barbosa-Morais et al. [2] were available (cerebellum, heart, kidney, liver, and testis; Additional file 3: Table S7). To remove genes with relatively low variability of expression, and for consistency with the previous analyses, we filtered them based on their dynamic range, computed on cRPKM after adding a pseudocount of 0.01 (see “Linear models, variance decomposition, and SVG and TVG definition”). To estimate the proportion of expression variation across organs and donors, we built a linear model for each individual gene that describes its expression as the sum of the organ and donor factors and a residual term (see “Linear models, variance decomposition, and SVG and TVG definition”). The relationship between the relative variation across donors and organs is shown in Additional file 1: Figure S10B.