Genetic regulation of gene expression and splicing during a 10-year period of human aging

Background Molecular and cellular changes are intrinsic to aging and age-related diseases. Prior cross-sectional studies have investigated the combined effects of age and genetics on gene expression and alternative splicing; however, there has been no long-term, longitudinal characterization of these molecular changes, especially in older age. Results We perform RNA sequencing in whole blood from the same individuals at ages 70 and 80 to quantify how gene expression, alternative splicing, and their genetic regulation are altered during this 10-year period of advanced aging at a population and individual level. We observe that individuals are more similar to their own expression profiles later in life than profiles of other individuals their own age. We identify 1291 and 294 genes differentially expressed and alternatively spliced with age, as well as 529 genes with outlying individual trajectories. Further, we observe a strong correlation of genetic effects on expression and splicing between the two ages, with a small subset of tested genes showing a reduction in genetic associations with expression and splicing in older age. Conclusions These findings demonstrate that, although the transcriptome and its genetic regulation is mostly stable late in life, a small subset of genes is dynamic and is characterized by a reduction in genetic regulation, most likely due to increasing environmental variance with age.


Expression quantification and quality control
The quality of the raw reads was assessed using FastQC (v0.11.5). The adaptors were clipped using cutadapt (v1.8.1) 2 requiring at least three bases to match (--min overlap 3) and removing processed reads shorter than 20 bases (--min length 20). RNA-Seq reads were mapped to the NCBI v37 H. sapiens reference genome using STAR (v2.4.2a) 3 . Only uniquely aligned reads were used for downstream quantification and analysis. The percentage of reads marked as PCR duplicates was computed using Picard. For the differential expression and eQTL analysis, PCR duplicates were not filtered out, since it has been shown that computational removal of duplicates does not improve power or FDR in differential expression analyses 4 , but the proportion of PCR duplicates used as a technical covariate in downstream analysis. For the differential ASE analysis, PCR duplicates were removed. Mapping statistics from the BAM files were acquired through Samtools flagstat (v1.2) 5 . The 5' and 3' coverage bias, duplication rate and insert sizes were assessed using Picard tools (v2.0.1). HTSeq was used to quantify gene expression 6 .
Expression data on the sample level were first corrected for library size using the DESeq2 R package 7 . Genes were excluded if they had less than 5 counts on average for either age groups, and zero counts in more than 20% of individuals (to minimize tails). A total of 16,087 genes were expressed. For the eQTL analysis, genes from the sex chromosomes as well as mitochondrial genes were also excluded, leaving 15,729 genes in the analysis.
To identify potential outlier individuals, we performed PCA ( Figure S1). Samples that demonstrated extreme values in the first two principal components on the expression levels were removed (more than 3 standard deviations). No samples were excluded based on this measure. To identify low quality samples, we applied several quality metrics ( Figure S1). Samples were removed if they had insufficient reads (≤20M), poor mappability (≤60%), and low correlation with other samples (D-statistic ≤ 0.85). We also checked for sample mix-ups by comparing agreement between true and RNA-Seq-inferred heterozygous SNPs for all possible pairs of RNA-Seq and genotype data. No samples were excluded based on these metrics. Three individuals with low RNA integrity number (RIN ≤ 5) were marked. These individuals were not excluded since they do not seem to appear as outliers in the PCA plots ( Figure S1C).

Measuring DNA methylation and quality control
Extracted DNA was bisulphite-converted using the Zymo bisulphite conversion kit and hybridized to the Infinium HumanMethylation450 BeadChip (450k array). Signal intensities were measured with the BeadChip scanner. The 130 samples (with available RNA-Seq data) were distributed across two periods of methylation data collection, eight 96-well plates (5 plates for the 70-year old samples and 3 plates for the 80-year old samples), and forty-one 12-sample chips (24 chips for the 70-year old samples and 17 chips for the 80-year old samples).
Quality control was performed using the R package minfi (version 1.20.0) 8 . Missing values were imputed by the k-nearest neighbor approach using the impute R package (version 1.48.0) 9 . Background correction and dye-bias normalization were done using noob 10 through minfi. Signal intensities were converted to DNA methylation β-values, i.e. the ratio of methylated probe intensity to total (methylated + un-methylated) probe intensity.
Inferring cell-type frequencies from whole blood Whole blood is a heterogeneous mixture of cell types. Since gene expression and DNA methylation vary across different cell types, correlations between the phenotype of interest (e.g. age) and the cell type composition may lead to a large number of false discoveries. False discoveries due to cell type heterogeneity can be addressed by adding the cell proportions as covariates. Since cell counts were not available for our samples, we used computational methods to estimate their composition.
To estimate blood cell composition from gene expression data, we used CIBERSORT 11 . While this tool was designed from micro-array data, it has been shown to have reasonably robust cross-platform performance. To estimate blood cell composition from DNA methylation data, we used Houseman's reference-based method 12 , including their provided reference data and signature CpG sites. This approach was used on our methylation data after adjustment with noob 10 . Results are shown in Figure S2A.
Methylation-based cell-type frequency estimates are closer to expected values for adults of similar age, compared to expression-based estimates. Moreover, while methylation-based estimates are mainly correlated with biological covariates, expression-based estimates are highly correlated with technical factors Figure S2B. For these reason, we use methylationbased estimates in downstream analyses. Methylation-based estimates of B and CD8 T cells, granulocyte, and monocytes showed a significant difference between the two ages (2-sample Wilcoxon test; P = 0.018, 0.019, 1.21 × 10 −4 , and 9.29 × 10 −3 , Figure S2C).

Background noise correction in RNA-Seq experiments
We consider analyses corrected for either measured and/or inferred determinants of gene expression variability. Below we describe the selection of the known factors and the inference of the inferred factors.

Measured determinants of gene expression variability in RNA-Seq experiments
We considered 24 measured variables as candidate components of RNA-Seq variability, listed in Table S1. In order to decide which of the variables affect gene expression, we performed a multiple linear mixed model regression on the expression of each gene using the lme4 R package 13 . We used the π 1 statistic 14 to detect technical covariates affecting a large number of genes, i.e. π 1 ≥ 5%, and only consider those covariates in subsequent analyses. Table S1 also lists the median % of gene expression variance accounted for (VAF) by each measured variable, estimated using the R package variancePartition 15 , as well as the proportion of genes each variable was associated with at 5% FDR.
Figure sfig:KnownAndInferredFactorsA and C show the correlation of the variables (that have π 1 ≤ 5%) with age and the proportion of gene expression variance they explain. Since age is moderately correlated with RIN (Spearman's ρ = −0.46, P = 5.16 × 10 −8 ) and RNA concentration (Spearman's ρ = −0.30, P = 4.21 × 10 −4 ) and RIN and RNA concentration are associated with gene expression, these variables could act as potential confounders and we thus include them in the model for differential expression analysis.

Inferred determinants of gene expression variability in RNA-Seq experiments
We used surrogate variable analysis (SVA) to infer hidden factors from the RNA-Seq data. Two different algorithms for extracting hidden factors were considered: the twostep SVA procedure 16 without setting any covariate of interest, implemented in the sva R package 17 , and the IRW-SVA algorithm, setting age as the covariate of interest 18 , implemented in the SmartSVA R package 19 . The later algorithm, to which we hereafter refer as supervised SVA, attempts to protect the effect of age by identifying a subset of genes that show strong association with the underlying sources of gene expression heterogeneity but no association with age, also referred to as negative or empirical control genes.
We use the Buja and Eyuboglu method 20 , a permutation-based selection rule for the number-of-factors problem, to estimate the number of hidden factors that explain a significant proportion of gene expression variability, larger than what would be expected by chance. Using the SVA method, we find 12 and 15 factors when performing the unsupervised and supervised algorithms, respectively.
We found that the inferred factors summarize multiple correlated measured factors (Figure sfig:KnownAndInferredFactorsB) with significant contribution to variability in the RNA-sequencing data (Figure sfig:KnownAndInferredFactorsB). Generally, we observed that the top factors largely correspond to technical factors such as RNA extraction date, RIN scores and factors specific to RNA-seq such as percent duplicated reads and others obtained from the Picard metrics and to a much lesser extent to biological factors such as estimates of cell type frequencies.

Co-localization analysis for DE genes
For each DE gene, we obtained colocalization posterior probabilities (CLPP) between GWAS summary statistics of several complex traits and GTEx 21 whole blood from the LocusCompare database (http://locuscompare.ml:3838/). We defined any locus with CLPP ≥ 0.05 to have sufficient evidence for colocalization.

Differential expression analysis in the SardiNIA study
The SardiNIA study consists of 605 individuals (56% females, average age 57) from 195 families with measured RNA-Seq 22 . From a total of 19,646 genes expressed in Sar-diNIA, 14,847 of them are also expressed in PIVUS, using the same threshold for calling a gene expressed (see above). To account for the family structure, we perform the differential expression analysis using the pedigree-based linear mixed-models implemented in the coxme R package 23 . Specifically, let E ij and Age ij denote the gene expression and age for the j th member of family i, and x ij a p-dimensional vector with known / inferred determinants of gene expression for i = 1, . . . , n and j = 1, . . . , n i . Here we correct the analysis for sex. In the mixed models framework, a set of family-specific random effects T is introduced to model the within family dependencies and E ij is modeled conditional on the n i × 1 random-effects vector b i , and covariate information Age ij and x ij as Where β 0 is the intercept term, β 1 is the fixed effect of age, and β 2 is the p-dimensional regression coefficients vector for the additional covariates. Moreover, R n i is the coefficient of relationships matrix with elements r jk = 2 −d jk with d jk denoting the distance between subjects j and k in the pedigree and σ 2 b the genetic variance parameter. σ 2 is the residual variance and I n i is an n i × n i identity matrix. Table S1: Measured covariates that can introduce variability in RNA-sequencing experiment. Table lists technical factors directly obtained from Picard QC metrics (Type "T/Picard"), factors relating to sample preparation and storage (Type "SP"), methylation-based cell type frequencies (Type "MCTF"), and factors measured in the clinic (Type "C"). For subsequent analyses, we only consider factors that affect a large proportion of genes (π 1 > 5%). Table also shows median % of expression variance accounted for (VAF) by each factor and genes each factor was associated with at 5% FDR.π 1 , VAF, and the number of associations for each covariate were estimated via a multiple linear mixed model per gene correcting for all uncorrelated covariates.      Figure S4: Gene expression within individuals is highly correlated. (A) Intra-and interindividual gene expression correlations (Spearman's ρ; across all genes) based on uncorrected data (Uncorrected), data corrected for confounders and all inferred components of gene expression variability (Hidden (protect age) and confounders) or confounder and only inferred components that are uncorrelated with age (Hidden* (protect age) and confounders    TAX1BP1  TERF2IP  TFDP1  TFR2  THEM5  TLK1  TMEM245  TMEM57  TPRG1L  TRAV13−2  TRBC2  TREML3P  TSG101  TSTA3  UBE2B  UBE2H  UBQLN1  USP12  VWCE  WBP2  WNK1  Outlier genes for the individual that showed the largest outlying expression increase with age in IGKV1-27. The same individual was an outlier for several additional immunoglobulin-related genes (in red). Y-axis shows the expression differences between two ages. (B) Outlier genes for the individual that showed the largest outlying expression decrease with age in BIRC2. The same individual was an outlier for several other genes related to proteasomal protein catabolic process (in red). (C-E) Enrichment for GO biological processes for each individual's set of genes with outlying decrease of expression with age (C) and outlier phenotypes for the individuals with significant enrichment of GO terms (E). We observed significant enrichment (FDR ≤ 5%) for known agerelated GO terms for three individuals. For two of these individuals, one of which showed a large increase in leukocyte counts between the two ages, we see enrichment for terms related to immune response 24 . For the third individual, we see enrichment for terms related to protein catabolism 25 . The same individual had a substantial increase in albumin levels between the two ages, and was diagnosed with diabetes between age 70 and 80. Y-axis in D shows the phenotype differences between two ages (δY 80−70 ). The larger symbols indicate where the outlier individual is located in the distribution of each gene/phenotype.  Figure S7: Age-specific genetic regulation across the genome. (A) Number of genes with at least one significant eQTL (eGenes) at age 70 (red) and 80 (blue) for different background noise correction strategies (shape), minor allele frequencies (MAF), and FDR thresholds used to identify eGenes and eQTLs 26 . We find the largest number of discoveries when we correct for hidden factors. For all MAFs and FDR thresholds we make more discoveries at age 70, compared to age 80. (B) Proportion of eGenes discovered at age 70 (80) that validated at age 80 (70) for different background noise correction strategies, validation FDR levels (FDRv), and MAF filters for candidate eQTL at each age. The discovery FDR was 1%. *, **, and *** indicate that the validation proportion at age 70 is lower than the one at age 80 at ≤10%, 5%, and 1% nominal significance levels, respectively.   Figure S8: Known and inferred determinants of alternative splicing variability. (A) Correlation between measured covariates and alternative splicing hidden factors. Hidden factors are correlated with several measured factors. Several hidden factors are correlated with age, e.g. Spearman's ρ f8,age = 0.53.

ID
(B) Proportion of alternative splicing variance explained (VE) by measured and hidden factors. VE for each measured or hidden factor is estimated by fitting a multiple linear mixed model with all measured or hidden factors for each gene. VE by the measured or hidden factors for alternative splicing is lower than the VE for expression. This is due to intron excision ratios being internally normalized and thus less affected by technical variability.