In this study, we provide a framework for data correction leveraging the structure of scale-free networks. We show that for scale-free networks, principal components of a gene expression matrix can consistently identify components that reflect artifacts in the data rather than network relationships. It has been shown that real-world networks including co-expression networks often have scale-free topology, i.e., the node degree distribution of these networks follow a power law [13,14,15]. Several studies have employed the assumption of scale-free topology to infer high-dimensional gene co-expression and splicing networks [6, 16].
Latent factor-based data correction has been successfully employed in many applications in genomics from genome-wide association studies, cis- and trans-eQTL mapping, to differential expression analysis [9, 17,18,19,20]. In genome-wide association studies investigating the association between genotype and complex traits, it has been shown that top principal components explain the broad correlation between genotypes which generally reflects population structure rather than a desired functional biological signal of interest [20]. Co-expression analysis is more complicated because confounders affect sets of genes in ways that resemble co-expression. Here, we show mathematically, through simulation (Fig. 1, Additional file 1: Notes 1, 2.1, and 2.2; Additional files 2 and 3) and through real data examples that similar to genetic association studies, the broad correlation between gene expression levels in uncorrected data appears to reflect artifacts. We expect that most real co-expression networks are sparse which means that most genes are only connected to a small subset of other genes. We prove that when such networks satisfy the scale-free property, the signals from the network will not be sufficiently broad across genes to influence the latent variable estimates from PCA. Thus, principal components will primarily capture latent confounders, which can then be regressed from the expression data before network reconstruction is performed (Additional file 1: Note 1).
Using a toy and scale-free simulation, we first showed that confounding can introduce false correlations between sets of genes that can mimic co-expression and can lead to false edge discovery during reconstruction of co-expression networks with graphical lasso—sometimes at the expense of losing true connections (Fig. 1d-f, Additional file 2). We corrected the confounded simulated data using our PC-based approach and reconstructed the network using the residuals. Graphical lasso correctly estimated the network structure obtained from corrected data, which was the same as the true network structure that was obtained from the original simulated data (Fig. 1a-c,g-h, Additional file 2). We also simulated multivariate Gaussian data with 350 samples and 5000 genes from an underlying scale-free network (Additional file 3). Similar to the previous simulation, we found that confounding in data can introduce a lot more false positives in reconstructed co-expression networks. We also showed that networks reconstructed with PC corrected data in this setting were more similar to original simulated data compared to confounded data (Additional file 3). Throughout our analysis, to estimate the number of principal components to be removed, we used a permutation-based scheme [21] as implemented in the sva package [22].
To demonstrate the impact of latent confounders and principal component correction on the reconstruction of co-expression networks from real large-scale human gene expression measurements, we applied our method to RNA-Seq data from the Genotype-Tissue Expression (GTEx) project v6p release. We considered data from eight diverse tissues containing between 304 and 430 samples each (Additional file 1: Table S2): Subcutaneous adipose, lung, skeletal muscle, thyroid, whole blood, tibial artery, tibial nerve, and sun-exposed skin. Using the most variable 5000 genes (Additional file 1: Notes 2 and 4), we reconstructed co-expression networks for each tissue with two popular methods: (a) weighted gene co-expression network analysis [6, 23] and (b) graphical lasso [7, 24]. Since the true underlying co-expression network structure is not known, we assessed the networks using gene pairs annotated to function in the same pathways [25, 26] as ground truth edges.
We inferred networks obtained by using (a) uncorrected expression data, the residuals after regressing out (b) RNA integrity number (RIN), (c) exonic rate—a mapping covariate that corresponds to fraction of reads mapped to exons, (d) sample-specific estimate of GC bias, all known to be common confounders in mRNA gene expression data [27,28,29], and (e) residuals from multiple regression model using covariates that explained at least 1% of expression variance (adjusted R2 ≥ 0.01, Additional file 1: Table S3–S5) [28, 30,31,32,33].
Co-expression gene modules obtained from weighted signed co-expression networks (Additional file 1: Note 2.4) were interpreted as fully connected subgraphs, as is standard. For most tissues, networks obtained from data corrected for latent confounders showed fewer false discoveries compared to those obtained from uncorrected data or from correcting for individual covariates including RIN, exonic rate (a quality metric from RNA-Seq mapping), or sample-specific GC bias (Fig. 2, Additional file 1: Figures S1, S3, and S8). Improved performance of networks obtained from PC corrected data was more evident in the whole blood, skeletal muscle, tibial artery, tibial nerve, subcutaneous adipose, and thyroid. But for some tissues such as the lung, PC correction only contributes to moderate improvement on false discovery rates in the reconstructed networks. It is possible that in these cases, the networks may violate the scale-free assumption or that true signal was already sufficiently strong in the raw data. We also observed that correcting gene expression data with multiple technical covariates (approximately 9–17 were used per tissue, Additional file 1: Table S5) sometimes improved the reconstruction of co-expression networks obtained by WGCNA (Fig. 2a–c, Additional file 1: Figure S1). Average WGCNA module size for networks with cut-height greater than 0.99 was smaller with PC-corrected data compared to uncorrected counterparts (Additional file 1: Figure S15). We also observed that the number of genes assigned to the gray (unassigned) module in WGCNA was considerably higher in PC-corrected networks (Additional file 1: Figure S15). Finally, we repeated this analysis by varying multiple settings of WGCNA and found that PC corrected showed improvement in most tissues consistently (Additional file 1: Figures S10 and S11).
In graphical lasso networks, we found that networks estimated with principal component corrected data showed fewer false discoveries compared to networks estimated with uncorrected, RIN-corrected or multiple covariates corrected data (Fig. 2d–f, Additional file 1: Figure S2). We observed that in generally improved performance on false discoveries in PC corrected networks over raw data in the whole blood, the skeletal muscle, tibial artery, and tibial nerve. Compared to raw data, jointly correcting the gene expression data for multiple technical covariates that affect expression measurements also improved reconstruction with graphical lasso in some tissues such as the whole blood, thyroid, and tibial artery, while it showed little to no improvement over uncorrected data in the lung, muscle, tibial nerve, and sun-exposed skin (Fig. 2d–f, Additional file 1: Figure S2). However, we observed that across all tissues, PC correction still shows fewer false discoveries compared to multiple technical covariate-based correction. There was no visible improvement in network reconstruction between using uncorrected data and residuals from RIN or exonic rate, thereby suggesting that RIN, exonic rate, or GC bias individually is not a sufficient alternative for the wide range of confounding variation found in gene expression data (Fig. 2, Additional file 1: Figures S2, S4, and S9). We also found that there was no improvement on false negative rates upon PC correction in networks built with WGCNA or graphical lasso (Additional file 1: Figure S14).
With both WGCNA and graphical lasso, networks inferred from principal component corrected data were much sparser than networks from uncorrected and RIN, exonic rate, or GC bias corrected counterparts (Fig. 2g–l). Further, PC corrected networks from graphical lasso also showed higher clustering coefficient and fewer hubs compared to others (Additional file 1: Figures S12 and S13).