Tensorial blind source separation for improved analysis of multi-omic data
- Andrew E. Teschendorff†^{1, 2, 3}Email author,
- Han Jing†^{1, 4},
- Dirk S. Paul^{5},
- Joni Virta^{6} and
- Klaus Nordhausen^{7}
Received: 27 March 2018
Accepted: 18 May 2018
Published: 8 June 2018
Abstract
There is an increased need for integrative analyses of multi-omic data. We present and benchmark a novel tensorial independent component analysis (tICA) algorithm against current state-of-the-art methods. We find that tICA outperforms competing methods in identifying biological sources of data variation at a reduced computational cost. On epigenetic data, tICA can identify methylation quantitative trait loci at high sensitivity. In the cancer context, tICA identifies gene modules whose expression variation across tumours is driven by copy-number or DNA methylation changes, but whose deregulation relative to normal tissue is independent of such alterations, a result we validate by direct analysis of individual data types.
Keywords
Background
Omic data is now most often generated in a multi-dimensional context. For instance, for the same individual and tissue type, one may measure different data modalities (e.g. genotype, mutations, DNA methylation or gene expression), which may help pinpoint disease-driver genes [1]. Alternatively, for the same individual, the same data type may be measured across different tissues or cell types [2, 3], which may help identify the most relevant cell types or tissues for understanding disease aetiology. We refer to all of these types of multi-dimensional data generally as multi-way or multi-omic data, and when samples and molecular features are matched, the data can be brought into the form of a multi-dimensional array, formally known as a tensor [4].
While several statistical algorithms for the analysis of multi-way or tensorial data are available [4–7], their application to real data has been challenging. There are mainly three reasons for this. First, the associated multi-way datasets are often very large and how well the algorithms perform on such large sets is currently still unclear. Second, the algorithms can be computationally demanding, compromising their benefit-to-cost ratio [4]. Third, interpreting the output of these algorithms requires an in-depth understanding of the underlying methods. Exacerbating this problem, most available software packages are not user-friendly, requiring the user to have such an in-depth understanding to extract the relevant biological information. Beyond these technical challenges, there is also a lack of comparative studies, making it difficult to choose the appropriate algorithm for the task in question.
To help address some of these outstanding challenges, we here consider and evaluate a novel data tensor decomposition algorithm [8, 9], which is based on blind source separation (BSS), and specifically independent component analysis (ICA) [10]. Although common BSS techniques such as non-negative matrix factorisation and ICA have been successfully applied to a wide range of single omic data types, including e.g. gene expression [11–16], DNA methylation [17] and mutational data [18], their application to multi-way data is largely unexplored [19]. For single-omic datasets, the improved performance of ICA over non-BSS techniques like principal component analysis (PCA) is due primarily to the non-Gaussian and often sparse nature of biological sources of variation, which means that statistical deconvolution of biological samples benefits from non-linear decorrelation measures such as statistical independence (as used in ICA) [13]. It is, therefore, natural to consider analogous ICA algorithms for multi-way data, as we do here, since these may also lead to improved inference.
To assess this, we here benchmark our novel tensorial BSS algorithm against some of the most popular and powerful algorithms for inferring sources of variation from multi-omic data, including JIVE (joint and individual variation explained) [5], PARAFAC (parallel factor analysis) [4, 6], iCluster [7] and canonical correlation analysis (CCA) [20–22]. Each of these algorithms has particular strengths and weaknesses, which render comparisons between them highly non-trivial. For instance, a limitation of CCA is that it can infer only common sources of variation between data types or tissues, in contrast to JIVE or PARAFAC, which can infer both joint as well as individual sources of variation. On the other hand, JIVE and CCA can be run on multiple data matrices with different numbers of molecular features, while PARAFAC and iCluster require matched sets of features (and samples) for each data type. Model complexity also differs substantially between methods, with PARAFAC exhibiting a much lower model complexity than an algorithm such as iCluster. Thus, a comparison of all of these methods is of paramount interest, and here we do so in a tensorial context, i.e. one where the multi-way data is defined over a matched set of molecular features (e.g. genes or CpGs) and samples across all data types, allowing the data to be brought into the form of a tensor. Specifically, we shall here consider order-3 data tensors, i.e. data which can be brought into the form of an array with three dimensions (often called modes). In our evaluation and comparison of all multi-way algorithms, we consider both simulated data as well as data from real epigenome-wide association studies (EWAS). We further illustrate potential uses of our tensorial BSS algorithm (i) to detect cell-type-independent and cell-type-specific methylation quantitative trait loci (mQTLs) in multi-cell-type or multi-tissue EWAS and (ii) to detect cancer gene modules deregulated by copy-number and DNA methylation changes.
Results
Tensorial ICA outperforms JIVE, PARAFAC, iCluster and CCA on simulated data
Tensorial PCA/ICA reduces running time compared to JIVE, PARAFAC and iCluster
Comparison of running times of multi-way algorithms
Algorithm | Number of | Runtime (s) | Runtime (s) |
---|---|---|---|
components | n_{ g }=1000 | n_{ g }=2000 | |
CCA | K=3 | 0.57±0.11 | 1.15±0.12 |
K=12 | 0.58±0.14 | 1.34±0.13 | |
JIVE | j_{ V }=1, i_{ V }=(1,1) | 28.44±4.04 | 23.84±4.18 |
j_{ V }=4, i_{ V }=(4,4) | 137.77±44.94 | 173.20±60.35 | |
tPCA | (2,2) | 0.57±0.17 | 1.35±0.24 |
(2,6) | 0.61±0.15 | 1.25±0.23 | |
tWFOBI | (2,2) | 0.65±0.16 | 1.50±0.21 |
(2,6) | 0.76±0.14 | 1.53±0.25 | |
tWJADE | (2,2) | 0.66±0.19 | 1.44±0.24 |
(2,6) | 1.23±0.21 | 2.71±0.28 | |
PARAFAC | R=6 | 22.37±2.53 | 37.32±4.51 |
R=12 | 48.11±2.98 | 100.83±7.92 | |
iCLUSTER | K=3 | 79.28±14.16 | 595.06±70.67 |
K=12 | 114.28±30.02 | 688.74±166.85 |
tICA exhibits improved power in a real multi-tissue smoking EWAS
Next, we scaled up the data matrices by combining the 62 smkDMCs with a larger set of 10 000 non-smkDMCs, recomputing the SEs (again for 100 different Monte Carlo selections of 10 000 non-smkDMCs). As expected, with an increase in the number of CpGs, the SE of all algorithms dropped, likely driven by increased confounding due to genetic variation (Fig. 3c,d). With the increase in probe number, tICA (tWFOBI and tWJADE) outperformed not only JIVE, PARAFAC and CCA/SCCA, but also tPCA (Fig. 3c,d), in line with the increased sparsity of the smoking-associated source of variation.
tICA identifies mQTLs in a multi-cell-type EWAS
Next, we validated the mQTLs found using an independent dataset. Thus, we applied tWFOBI to the blood-buccal EWAS considered earlier. We inferred a source tensor of dimension 2×26×447 259, i.e. a total of 52 ICs, defined over two tissue types and 26 components in sample-mode space. As before, we observed very strong enrichment, notably for the same chromosomes 6 and 21 (Additional file 1: Figure S5). The previously found mQTL at the PM20D1 locus was also prominent in one of the inferred ICs in this blood-buccal EWAS, confirming its validity and further supporting that this mQTL is cell type independent (Fig. 5d). Overall, from the pure blood-cell-subtype EWAS, we detected a total of 1763 mQTLs, of which 547 were also observed in the blood-buccal EWAS (odd ratio = 12.8, Fisher test P<1×10^{−50}, Fig. 5e). Thus, we can conclude that tWFOBI is able to identify components of variation across cell types and samples that capture a significant number of mQTLs, without matched genotype information.
tICA outperforms JIVE and PARAFAC in their sensitivity to detect mQTLs
Next, we repeated the same sensitivity analysis to detect mQTLs in our buccal-blood EWAS, now also including CCA and sCCA (as there are only two tissue/cell types). Confirming the previous analysis, tICA methods outperformed JIVE and PARAFAC by over 20% in terms of the overall sensitivity, whilst also attaining a better sensitivity at the individual component level (Additional file 1: Figure S7). Of note, the sensitivity of both CCA and sCCA was substantially worse, since mainly only the top canonical vector was significant.
Application of tICA to multi-omic cancer data reveals dosage-independent effects of differentially expressed genes
Discussion
Here we have assessed and benchmarked a novel suite of tensorial decomposition algorithms (tPCA, tWFOBI, and tWJADE) against a number of state-of-the-art alternatives. Specifically, while popular multi-way algorithms such as JIVE, iCluster or CCA/SCCA are in principle applicable to non-tensorial multi-way data (e.g. if the features across data types are distinct or not matched), when assessed in a tensorial context (i.e. when all dimensions are matched), these established methods are outperformed by the tensorial PCA and ICA methods considered here. This was demonstrated not only on simulated data, but also in the context of two real EWAS, where tICA methods were significantly more powerful in detecting differentially methylated CpGs associated with an epidemiological factor (smoking) and single-nucleotide polymorphisms (SNPs; mQTLs). For a real EWAS, tICA also outperformed tPCA, in line with the fact that biological sources of data variation are non-Gaussian and sparse, and therefore, more readily identified using statistical independence as a (non-linear) deconvolution criterion (as opposed to the linear decorrelation criterion used in tPCA). Thus, this extends the improvements seen for ICA over PCA on ordinary omic data matrices [13, 16] to the tensorial context. In addition, tPCA and tICA offer substantial (50–100-fold) speed advantages over methods like iCluster, JIVE and PARAFAC, which can become computationally demanding or even prohibitive. Further application of tICA to a multi-cell-type EWAS (B cells, T cells and monocytes) revealed its ability to identify loci enriched for cis-mQTLs (as cis-mQTLs make up over 90% of validated mQTLs in the Aries database [25]). Indeed, tICA achieved relatively high sensitivity values with top-ranked CpGs in components containing over 60% mQTLs. Given that here we were limited because we did not have access to matched genotype information, our results demonstrate the potential of tICA to detect mQTLs in the absence of such genotype information. For instance, it identified many cell-type-independent mQTLs, of which a substantial proportion have been validated in an independent blood-buccal EWAS study, and with several mapping to key GWAS loci for important diseases like Parkinson’s and schizophrenia. Although most of the identified mQTLs were blood cell type independent, tICA estimated that approximately 25% of mQTLs may be blood cell type specific, in line with the estimate of 20% obtained by Blueprint using a slightly different combination of blood cell subtypes (neutrophils, monocytes and T cells) [28]. We note that application of tICA to any multi-cell-type or multi-tissue EWAS is likely to have components strongly enriched for mQTLs, since for the same individuals, DNAm is being measured in at least two different tissues or cell types, and therefore, genetic effects that do not depend on cell type are bound to explain most of the inter-individual variation [24, 31]. Thus, we conclude that tICA could be an extremely versatile tool for identifying novel candidate mQTLs in multi-cell EWAS for which matched genotype information may not be available. tICA may also help to identify groups of widely separated mQTLs that are regulated by the same SNP and bound e.g. by a common transcription factor [32].
More generally, tICA can be applied to any multi-way data tensor to identify complex patterns of variation correlating with phenotypes of interest and the underlying features driving these variation patterns. This is accomplished by first correlating inferred ICs of variation in the sample-mode space with sample phenotype information (e.g. age, smoking, normal/cancer status and genotype) and subsequently selecting the features with the largest weights in these correlated components. As an illustrative example, the application of tICA to a multi-omic TCGA dataset revealed a deep novel insight: namely, that most DEGs in cancer that exhibit a CNV or DNAm dosage-dependent effect on expression across individual tumours exhibit differential expression relative to the normal tissue in a manner that does not, in fact, depend on CNV or promoter DNAm state. In other words, although CNV and DNAm variation strongly modulates expression variation of these DEGs across individuals tumours, for most of the genes exhibiting this CNV or DNAm dosage-dependent expression pattern, their deregulation relative to normal cells appears to be independent of the underlying CNV or promoter DNAm state. Although it is clear that differential expression in cancer can be the result of many mechanisms other than CNV or DNAm, our observation is significant, because we did not just select cancer DEGs, but the subset of these that exhibit a CNV or DNAm dosage-dependent effect on expression across tumours. The implications of our observation are important, given that many cancer classifications have been derived from unsupervised (clustering) analyses that were performed using only tumours, thus ignoring their patterns of variation relative to the normal reference state. Other large cancer studies, such as METABRIC [33], which did not profile normal tissue samples, identified novel candidate oncogenes and tumour suppressors solely based on CNV-dosage effects on gene expression across cancers, yet our results indicate that this could identify many false positives in the sense that their overexpression or underexpression in cancer is not dependent on the underlying CNV state. We point out that although this finding could have been obtained without application of a multi-way algorithm, that this would have required substantial prior insight. Therefore, this subtle pattern of variation across multiple data types was only discovered thanks to applying an agnostic method like tICA.
Although we have shown the value of tICA in identifying mQTLs and interesting patterns of variation across different data types in cancer-genome data, it is also important to discuss some of the limitations, which, however, also apply to all the other multi-way algorithms considered here. In particular, identifying sources of DNAm variation associated with epidemiological factors in a multi-tissue EWAS setting can be difficult due to confounding genetic variation. Indeed, in our application to a buccal-blood Illumina 450k EWAS, we found that the sensitivity of all algorithms dropped very significantly if they were applied to all ∼480 000 CpGs. Thus, it is important to devise improvements to these tensorial methods. For instance, one solution may be to first perform dimensional reduction using supervised feature selection on separate data types, and subsequently applying the tensorial methods on a reduced feature space. Alternatively, supervised tensorial methods, such as tensorial slice inverse regression [34], may help to identify sources of variation specifically associated with epidemiological variables.
Conclusion
In summary, the combined tPCA and tICA methods presented here will be an extremely valuable tool for analysis and interpretation of complex multi-way data, including multi-omic cancer data, as well as for the detection and clustering of mQTLs in multi-cell-type EWAS where genotype information may not be available.
Methods
Below we briefly describe the main tensorial BSS algorithms [8, 9, 35] as implemented here. For more technical details, see [8, 9, 35]. We also provide brief details of our implementation of JIVE, PARAFAC, iCluster, CCA and SCCA. All these implementations are available as R functions within Additional file 2.
Tensorial PCA
These ranked eigenvalues are useful for performing dimensional reduction, i.e. projecting the data onto subspaces carrying significant variation. For instance, one could use random matrix theory (RMT) [17, 36] on each of the m-mode covariance matrices above to estimate the appropriate dimensionalities d_{1},…,d_{ r }. This would lead to a tPCA decomposition of the form \(X=S\odot _{m=1}^{2}\Omega ^{(R)}_{m}\), with S a d_{1}×d_{2}×p tensor and each \(\Omega ^{(R)}_{m}\) a reduced matrix obtained from Ω_{ m } by selecting the first d_{ m } columns. We note that for any of the original dimensions p_{1},…,p_{ r } that are small, such dimensional reduction is not necessary.
In the applications considered here, our data tensor X is typically of dimension n_{ t }×n_{ s }×n_{ G }, where n_{ t } denotes the number of data or tissue types, n_{ s } the number of samples and n_{ G } the number of features (e.g. genes or CpGs). We note that the tPCA decomposition is performed on the first two dimensions (typically data type and samples), so there are two relevant covariance matrices. In the special case of a data matrix (a 2-tensor), standard PCA involves the diagonalisation of one data covariance matrix.Hence, for a 3-tensor, there are two data covariance matrices, and for an (r+1)-tensor, there are r. Here we use tPCA as implemented in the tensorBSS R package [37].
tICA: the tWFOBI and tWJADE algorithms
but now with the p_{1},…,p_{ r } random variables \(S_{k_{1} {\ldots } k_{r}}\in \mathbf {R}^{p}\) (\(S\in \mathbf {R}^{p_{1}\times {\ldots } \times p_{r}\times p}\phantom {\dot {i}\!}\)) mutually statistically independent and satisfying \(\operatorname {E}[\!S_{k_{1} {\ldots } k_{r}}]=0\) and \(\operatorname {Var}[\!S_{k_{1} {\ldots } k_{r}}]=I\). We note that X could be a suitably dimensionally reduced version X^{(R)} of X, such as that obtained using tPCA. For instance, in our applications, X^{(R)} would typically be a 3-tensor of dimension n_{ t }×d_{ S }×n_{ G } where d_{ S }<n_{ S }. This dimensional reduction, and optionally the scaling of variances, is known as whitening (W).
As with ordinary ICA, there are different algorithms for inferring mutually statistical ICs \(S_{k_{1} {\ldots } k_{r}}\). One algorithm is based on the concept of simultaneously maximising the fourth-order moments (kurtosis) of the ICs (since by the central limit theorem, linear mixtures of these are more Gaussian and therefore, have smaller kurtosis values). This approach is known as fourth-order blind identification (FOBI) [38]. Alternatively, one may attempt a joint approximate diagonalisation of higher order eigenmatrices (JADE) [35, 39]. We note that although we use the tFOBI and tJADE functions in tensorBSS, that these do not implement tPCA beforehand. Hence, in this work we implement modified versions of tFOBI and tJADE, which include a prior whitening transformation with tPCA. We call these modified versions tWFOBI and tWJADE.
Benchmarking of tPCA and tICA against other tensor decomposition algorithms
JIVE (joint and individual variation explained) [5] is a powerful decomposition algorithm that identifies both joint and individual sources of data variation, i.e. sources of variation that are common and specific to each data type. For two data types (i.e. two tissue types or two types of molecular features), three key parameters need to be specified or estimated to run JIVE. These are the number of components of joint variation (dJ) and the number of components of variation that are specific to each data type (dI_{1} and dI_{2}). On simulated data, these parameters are chosen to be equal to the true (known) values, i.e. for our simulation model, dJ=1, dI_{1}=1 and dI_{2}=1. In our real-data applications, dJ is estimated using RMT on the concatenated matrix obtaining by merging the two data type matrices together (after z-score normalising features to make them comparable), whilst dI_{ i } are estimated using RMT [17]. We note that these are likely upper bounds on the true number of individual sources of variation that are not also joint. We implemented JIVE using the r.jive R package available from http://www.r-project.org.
PARAFAC (parallel factor analysis) [4, 6] is a tensor decomposition algorithm in which a data tensor is decomposed into the sum of R terms. Each term is a factorised outer product of rank-1 tensors (i.e. vectors) over each mode. Thus, the one key parameter is R, which is the number of terms or components in the decomposition. In our simulation model, we chose R=4. Although one of the two sources of variation in each data type is common to both (hence, there are three independent sources), we nevertheless ran PARAFAC with one additional component to assess its ability to infer components of joint variation more fairly. In the real-data applications, we estimated R as \(\sum _{i=1}^{n_{t}}{dI_{i}}-dJ\) (with n_{ t } the number of tissue or cell types), since this should approximately equal the total number of independent sources of variation. We implemented PARAFAC using the multiway R package available from http://www.r-project.org.
iCluster [7] is a joint clustering algorithm for multi-way data. It models joint and individual sources of variation as latent Gaussian factors. The key parameter is K, which is the total number of clusters to infer. Although for the simulated data there were only three independent sources of variation, we chose K=4 to assess more fairly the ability of the algorithm to infer the joint variation (choosing K=3 would force the algorithm to find the source of joint variation). We implemented iCluster using the iCluster R package available from http://www.r-project.org. CCA [20] and its sparse version, sparse CCA (sCCA/SCCA) [21, 22], are methods to identify joint sources of variation (called canonical vectors) between two data matrices, where at least one of the dimensions is matched across data types. Here we implement the version of CCA and sCCA in the R package PMA available from http://www.r-project.org. One key parameter is K, the maximum number of canonical vectors to search for. Another parameter is the number of permutations used to estimate the significance of the covariance of each of the K canonical vectors. In each permutation, one of the data matrices is randomised (say by permuting the features around) and CCA/sCCA is reapplied. Since the data matrices are typically large, the distribution of covariances for the permuted cases is very tight. Thus, even 25 permutations are sufficient to estimate the number of significant canonical vectors reasonably well. The number of significant canonical vectors was defined as the number of components that exhibit observed covariances larger than the maximum value obtained over all 25 permutations, and is, thus, bounded above by K. In the non-sparse case, the two penalty parameters were chosen to be equal to 1, which means no penalty term is used. For sCCA, we estimated the best penalty parameters using an optimisation procedure, as described in [21, 22], with the number of permutations set to 25 and the number of iterations equal to 15. On the simulated data, we ran CCA with K=3, as K needs to specify only the maximum number of components to search for (the actual number of significant canonical vectors is one in our instance, as we have one source of joint variation). In the real-data applications, we chose K to be equal to dJ, as estimated using the procedure for JIVE, and used a larger number of iterations (50) per run.
Evaluation on simulated data
Here we describe the simulation model. The model first generates two data matrices of dimension 1000×100, representing two data types (e.g. DNA methylation and gene expression) where rows represent features and columns samples. We assume that the column and row labels (i.e. samples and genes) of the two matrices are identical and ordered in the same way. We assume one source of individual variation (IV) for each data matrix, each driven by 50 genes and 10 samples with the 50 genes and 10 samples unique to each data matrix. We also assume one source of common variation driven by a common set of 20 samples. The genes driving this common source of variation, however, are assumed distinct for each data matrix. In total, there are 100 genes (50 for each data matrix) associated with this joint variation (JV). For the 50 genes driving the JV in one data type and the 20 samples associated with this JV, we draw the values from a Gaussian distribution \(\mathcal {N}(e,\sigma)\), whereas for the other 50 genes in the other data type, we draw them from \(\mathcal {N}(-e,\sigma)\), all with e=3 and σ representing the noise level. Likewise for the IV, we use Gaussians \(\mathcal {N}(e,\sigma)\). The rest of the data is modelled as noise \(\mathcal {N}(0,\sigma)\). We consider a range of nine noise levels, with σ ranging from 1 to 5 in steps of 0.5. Thus, at σ=3, the SNR=e/σ=1. For each noise level, we perform 1000 Monte Carlo runs, and for each run and algorithm, we estimate SE and SP for correctly identifying the 100 genes driving the JV.
For tPCA, tWJADE and tWFOBI, SE and SP were calculated as follows. We inferred a total of 12 components over the combined data type and sample modes (2 in data type mode × 6 in sample space). We then projected the inferred components onto the original data-type dimensions, using the inferred 2×2 mixing matrix. For each data type and each of the six components, we then selected the top-ranked 50 genes by absolute weight in the component. This allowed us to compute a SE and SP value for each data type and component. For each component, we then averaged the SE and SP values over the two data types. In the last step, we select the component with the largest SE and SP value and record these values. We note that the resulting SE and SP values are not dependent on choosing 12 components. As long as the number of estimated components is larger than the total number of components of variation in the data (which for the simulated data is four), the results are invariant to the number of inferred components.
For CCA, which can only infer sources of joint variation, we ran it to infer a number of components (K=3) larger than the true number (there is only one source of JV). Pairs of canonical vectors were then selected according to whether their joint variance is larger than expected, as assessed using permutations. From hereon, the procedure to compute SE and SP proceeds as for the other algorithms, by selecting the component with the best SE and SP value. As with the other methods, the results do not depend on how we choose K as long as K is larger than or equal to 1.
For PARAFAC, we ran it to infer R=4 components. Since for PARAFAC there is only one inferred projection across features per component, for each component we rank the features according to their absolute weight, select the top-ranked 50, and then compute two separate SE (or SP) values, one for each of the two sets of 50 true positive genes driving JV. We then select for each set of JV driver genes, the component achieving the best SE (or SP). Finally, we average the SE and SP values for the two sets of true positives. As with the other algorithms, the results do not depend on the choice of R, as long as R is larger then or equal to 4 (since there are four sources of variation, two of IV and two of JV, which counts as two in the PARAFAC setting). For JIVE, we ran it to infer one source of JV and two sources of IV. Because JIVE stacks the data matrices corresponding to the two data types together, we then select the 100 top-ranked genes, ranked by absolute weight in the inferred JV matrix. SE and SP are then computed. Once again, the results are stable to choosing a larger number of inferred sources of JV, because for the simulated data there is only one source of JV. Further details for all methods can be found in Additional file 2. Finally, for each algorithm and noise level, SE and SP are averaged over all the 1000 Monte Carlo runs. Finally, the statistical significance of the SE and SP values between algorithms was assessed using paired non-parametric Wilcoxon rank sum tests. The whole analysis above was repeated for sources of variation drawn from a Laplace distribution (with the same mean and standard deviation as the Gaussians above), to capture the super-Gaussian nature of real biological data better.
Illumina 450k DNA methylation and multi-way TCGA datasets
We analysed Illumina 450k datasets from three main sources. One dataset is a multi-blood-cell subtype EWAS derived from 47 healthy individuals and three cell types (B cells, T cells and monocytes) [3]. Specifically, we used the same normalised data as used in [3], with the resulting data tensor being of dimension 3×47×388 618, after removing poor quality probes and probes with SNPs [40].
Another dataset was generated in [2]. It consists of two tissue types (whole blood and buccal), 152 women and 447 259 probes, resulting in a data tensor of dimension 2×152×447 259. After quality control, and after removing probes on the X and Y chromosomes, polymorphic CpGs, probes with SNPs at the single-base extension site and probes containing SNPs in their body as determined by Chen et al. [40], we were left with 447 259 probes.
Finally, we also analysed six datasets from TCGA. Specifically, we processed the RNA-seq, Illumina 450k DNAm and copy-number data for six different cancer types: colon adenocarcinoma (COAD), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LSCC), kidney renal cell carcinoma (KIRC), kidney papillary carcinoma (KIRP) and bladder adenocarcinoma (BLCA). All of these contained a reasonable number of normal-adjacent samples. The processing was carried out following the same procedure described by us in [29], which resulted in data tensors over three data types (mRNA, DNAm and copy number), 14 593 common genes and the following sample numbers: 273 cancers and 8 normals for LSCC, 390 cancers and 20 normals for LUAD, 292 cancers and 24 normals for KIRC, 195 cancers and 21 normals for KIRP, 194 cancers and 13 normals for BLCA, and 253 cancers and 19 normals for COAD. We note that although these numbers of normal samples are small, that these are the normal samples with data for all three data types.
Identifying smoking-associated CpGs in the multi-tissue (whole blood + buccal) EWAS
To test the algorithms on real data, we considered the matched multi-tissue (whole blood and buccal) Illumina 450k DNAm dataset for 152 women [2]. Smoking has been shown to be reproducibly associated with DNAm changes at a number of different loci [23]. We, therefore, used as a true positive set a gold-standard list of 62 smkDMCs, which have been shown to be correlated with smoking exposure in at least three independent whole blood EWAS [23]. The 62 smkCpGs were combined with 1000 randomly selected CpGs (non-smoking-associated), resulting in a data tensor of dimension 2×152×1062. Robustness was assessed by performing 1000 different Monte Carlo runs, each run with a different random selection of 1000 non-smoking associated CpGs. The whole analysis was then repeated for 10 000 randomly selected CpGs (data tensor of dimension 2×152×10 062) and for a total of 1000 different Monte Carlo runs. For the tPCA/tICA algorithms, the dimensionality parameters were chosen based on RMT as applied on the two separate matrices. Specifically, estimated unmixing matrices were of dimension 2×2 (for tissue-type mode) and d×d (for sample mode) with d the maximum of the two RMT estimates obtained from each tissue-type matrix.
SE to capture the 62 smkCpGs was calculated in two different ways. In one approach, we used the maximum SE attained by any IC, denoted SE(max), whilst in the other approach, we allowed for the possibility that different enriched ICs could capture different subsets of smkCpGs. Thus, in the second approach, the SE was estimated using the union of the selected CpGs over all enriched ICs. We note that enrichment of ICs for the smkCpGs was assessed using a simple binomial test and selecting those with a P value less than the Bonferroni corrected value (i.e. less than 0.05 per number of ICs). In both approaches, the CpGs selected per component were the 62 with the largest absolute weights in the component, i.e. the number of selected CpGs was matched to the number of true positives.
For JIVE, the number of components of joint variation was determined by applying RMT to the data matrix obtained by concatenating the features of the blood and buccal sets together with features standardised to unit variance to ensure comparability between data types. For the number of components of individual variation, we used the RMT estimates of each individual dataset, as this provides a safe upper bound. For PARAFAC, the number of components was determined by the sum of the RMT estimates for the blood and buccal sets separately minus the value estimated for the concatenated matrix, as we reasoned that this would best approximate the total number of components of variation across the two data types (joint or individual). For CCA and sCCA, the maximum number of canonical vectors to search for was set to be equal to the RMT estimate of the concatenated matrix, i.e. equal to the dimension of joint variation used in JIVE. For all methods, we selected the top-ranked 62 CpGs with the largest absolute weights in each component, and estimated SE using the same two approaches described above for tPCA/tICA.
mQTL and chromosome enrichment analysis
We applied tWFOBI to the data tensor of a multi-cell-type EWAS (Illumina 450k) for 47 healthy individuals and three cell types (B cells, T cells and monocytes) [3]. Specifically, we used the same normalised data as used in [3], i.e. a data tensor of dimension 3×47×388 618, after removing poor quality probes and probes with SNPs [40]. Using RMT [17], we estimated a total of 11 components in the sample-mode space, and so we inferred a source tensor of dimension 3×11 618, and mixing matrices of dimension 3×3 and 11×11. We also applied tWFOBI to the previous blood plus buccal DNAm dataset, but for all 447 259 probes that passed quality control. Applying RMT, we estimated 26 significant components in the sample space. Hence, we applied tWFOBI on the 2×152×447 259 data tensor to infer a source tensor of dimension 2×26×447 259 and mixing matrices of dimension 2×2 and 26×26. For both datasets, and for each inferred IC, we selected the 500 probes with the largest absolute weights and tested enrichment of mQTLs against a high-confidence mQTL list from [25] (22 245 mQTLs). This list was generated as the overlap of mQTLs (passing a stringent P value threshold of 1×10^{−14}) in blood derived from five different cohorts representing five different age groups. Odds ratios and P values of enrichment were estimated using Fisher’s exact test. For chromosome enrichment, we obtained P values using a binomial test. In selecting the top-500 probes from each component, we note that this threshold is conservative, as all inferred ICs exhibited positive kurtosis with kurtosis values that remained significantly positive after removing the top-500 ranked probes.
To obtain estimates of cell-type-independent and cell-type-specific mQTLs, we used the following approach. The first mode/dimension of the estimated source tensor was rotated back to the original cell types, using the estimated mixing matrix (of dimension 3×3, since there were three cell types). For each of the previously enriched mQTLs, we compared its weights in all three components, each component being associated with a given cell type. For instance, if S_{t,cp,∗} denotes the component cp for cell type t, thus defining a vector of weights over all CpGs, we asked if the absolute weight of the given mQTL CpG is large for all cell types or not. If it was sufficiently large (i.e. if within the top 10% quantile of the weight distribution) for all cell types, it was declared to be cell type independent. If the mQTL weight for one or two cell types fell within the lower 50% quantile of weights, we declared it a cell-type-specific mQTL.
We also performed a comparative analysis of all multi-way algorithms in terms of their sensitivity to detect mQTLs, as given by the high-confidence list of 22 245 mQTLs from the Aries database [25]. To assess the stability of the conclusions, we computed SE as described earlier, but considered a range of top selected CpGs per component, ranging from 500 up to 22 245 in units of 500. As before, we estimated the overall SE by taking into account the union of all selected CpGs from each component, as well as the maximum SE attained by any single component. Since the SE attained by any single component is bounded by the number of selected CpGs, we also considered the SE normalised for the number of selected CpGs.
Application of tICA to multi-omic cancer data
We used the same normalised integrated copy-number state (segment values), Illumina 450k DNAm and RNA-seq datasets of six cancer types from TCGA [1], as used in our previous work [29]. For the cancer types considered, see above. We initially applied tWFOBI to the colon adenocarcinoma TCGA dataset, estimating unmixing matrices of dimension 3×3 (for data type) and K×K (for sample mode) where K was the maximum RMT estimate over each of the three data-type matrices. Features driving each IC in each data-type dimension were selected using an iterative approach in which genes were ranked by absolute weight, and recursively removed until the kurtosis of the IC was less than 1, or the number of removed genes was larger than 500. Genes selected in common between the CNV and mRNA modes, or between the DNAm and mRNA modes, were declared driver genes between the respective data types. To identify components correlating with normal/cancer status, we obtained the mixing matrix of the samples and then correlated each component to normal/cancer status using Wilcoxon’s rank sum test.
Declarations
Funding
AET is supported by the Eve Appeal, the Chinese Academy of Sciences, Shanghai Institute of Biological Sciences, a Royal Society Newton Advanced Fellowship (award 164914) and the National Science Foundation of China (31571359). The Cardiovascular Epidemiology Unit is supported by the UK Medical Research Council (MR/L003120/1), British Heart Foundation (RG/13/13/30194) and the Cambridge Biomedical Research Centre of the National Institute for Health Research.
Availability of data and materials
All data analysed here have already appeared in previous publications or are publicly available. The Illumina 450k DNAm buccal and whole blood dataset from the National Survey of Health and Development, as published in [2], is available by submitting data requests to https://mrclha.swiftinfucl.ac.uk; see the full policy at http://www.nshd.mrc.ac.uk/data.aspx. Managed access is in place for this 69-year-old study to ensure that any use of the data is within the bounds of consent given previously by participants and to safeguard any potential threat to anonymity, since the participants were all born in the same week [2]. The multi-blood-cell subtype Illumina 450k EWAS is available from the European Genome-phenome Archive with the accession code EGAS00001001598 (https://www.ebi.ac.uk/ega/studies/EGAS00001001598) [3]. All TCGA data analysed here were downloaded and are publicly available from the TCGA data portal website (https://portal.gdc.cancer.gov/). The Aries mQTL database is available from http://www.mqtldb.org[25].
All software is available from the corresponding R packages as described in ‘Methods’, and their specific implementations as used in this manuscript are available in Additional file 2 as well as from GitHub (https://github.com/jinghan1018/tensor_decomp) and Zenodo (https://zenodo.org/record/1208040or https://doi.org/10.5281/zenodo.1208040) [41]. The tICA algorithms, as implemented here, are available under a GNU General Public License version 3.
Authors’ contributions
AET devised the study, performed the analyses and wrote the manuscript. HJ performed the statistical analyses. JV and KN contributed to the methodology and to the writing of the manuscript. DSP contributed and prepared the EWAS datasets for statistical analyses.
Ethics approval and consent to participate
No ethics approval was necessary for this publication, as all data analysed are already freely available in the public domain.
Competing interests
All authors have read and approved the manuscript. The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- TCGA. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012; 487(7407):330–7.View ArticleGoogle Scholar
- Teschendorff AE, Yang Z, Wong A, Pipinikas CP, Jiao Y, Jones A, et al.Correlation of smoking-associated DNA methylation changes in buccal cells with DNA methylation changes in epithelial cancer. JAMA Oncol. 2015; 1(4):476–85.View ArticlePubMedGoogle Scholar
- Paul DS, Teschendorff AE, Dang MA, Lowe R, Hawa MI, Ecker S, et al.Increased DNA methylation variability in type 1 diabetes across three immune effector cell types. Nat Commun. 2016; 7:13555.View ArticlePubMedPubMed CentralGoogle Scholar
- Hore V, Vinuela A, Buil A, Knight J, McCarthy MI, Small K, et al.Tensor decomposition for multiple-tissue gene expression experiments. Nat Genet. 2016; 48(9):1094–100.View ArticlePubMedPubMed CentralGoogle Scholar
- Lock EF, Hoadley KA, Marron JS, Nobel AB. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat. 2013; 7(1):523–42.View ArticlePubMedPubMed CentralGoogle Scholar
- Bro R. Parafac. Tutorial and applications. Chem Intel Lab Syst. 1997; 38:149–71.View ArticleGoogle Scholar
- Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009; 25(22):2906–12.View ArticlePubMedPubMed CentralGoogle Scholar
- Virta J, Taskinen S, Nordhausen K. Applying fully tensorial ICA to fMRI data. In: Signal Processing in Medicine and Biology Symposium (SPMB). Philadelphia: IEEE: 2016. p. 1–6.Google Scholar
- Virta J, Li B, Nordhausen K, Oja H. Independent component analysis for tensor-valued data. J Multivar Anal. 2017; 162:172–92.View ArticleGoogle Scholar
- Comon P. Independent component analysis, a new concept?Signal Process. 1994; 36(3):287–314.View ArticleGoogle Scholar
- Liebermeister W. Linear modes of gene expression determined by independent component analysis. Bioinformatics. 2002; 18(1):51–60.View ArticlePubMedGoogle Scholar
- Martoglio AM, Miskin JW, Smith SK, MacKay DJ. A decomposition model to track gene expression signatures: preview on observer-independent classification of ovarian cancer. Bioinformatics. 2002; 18(12):1617–24.View ArticlePubMedGoogle Scholar
- Teschendorff AE, Journée M, Absil PA, Sepulchre R, Caldas C. Elucidating the altered transcriptional programs in breast cancer using independent component analysis. PLoS Comput Biol. 2007; 3(8):161.View ArticleGoogle Scholar
- Kowarsch A, Blochl F, Bohl S, Saile M, Gretz N, Klingmuller U, et al.Knowledge-based matrix factorization temporally resolves the cellular responses to il-6 stimulation. BMC Bioinform. 2010; 11:585.View ArticleGoogle Scholar
- Illner K, Fuchs C, Theis FJ. Bayesian blind source separation for data with network structure. J Comput Biol. 2014; 21(11):855–65.View ArticlePubMedPubMed CentralGoogle Scholar
- Biton A, Bernard-Pierrot I, Lou Y, Krucker C, Chapeaublanc E, Rubio-Perez C, et al.Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep. 2014; 9(4):1235–45.View ArticlePubMedGoogle Scholar
- Teschendorff AE, Zhuang J, Widschwendter M. Independent surrogate variable analysis to deconvolve confounding factors in large-scale microarray profiling studies. Bioinformatics. 2011; 27(11):1496–505.View ArticlePubMedGoogle Scholar
- Alexandrov LB, Nik-Zainal S, Wedge DC, Campbell PJ, Stratton MR. Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 2013; 3(1):246–59.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhang S, Liu CC, Li W, Shen H, Laird PW, Zhou XJ. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res. 2012; 40(19):9379–91.View ArticlePubMedPubMed CentralGoogle Scholar
- Hotelling H. Relations between two sets of variates. Biometrika. 1936; 28(3-4):321–77.View ArticleGoogle Scholar
- Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol. 2009; 8:28.View ArticlePubMed CentralGoogle Scholar
- Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009; 10(3):515–34.View ArticlePubMedPubMed CentralGoogle Scholar
- Gao X, Jia M, Zhang Y, Breitling LP, Brenner H. DNA methylation changes of whole blood cells in response to active smoking exposure in adults: a systematic review of DNA methylation studies. Clin Epigenetics. 2015; 7:113.View ArticlePubMedPubMed CentralGoogle Scholar
- van Dongen J, Nivard MG, Willemsen G, Hottenga JJ, Helmer Q, Dolan CV, et al.Genetic and environmental influences interact with age and sex in shaping the human methylome. Nat Commun. 2016; 7:11115.View ArticlePubMedPubMed CentralGoogle Scholar
- Gaunt TR, Shihab HA, Hemani G, Min JL, Woodward G, Lyttleton O, et al.Systematic identification of genetic influences on methylation across the human life course. Genome Biol. 2016; 17:61.View ArticlePubMedPubMed CentralGoogle Scholar
- Satake W, Nakabayashi Y, Mizuta I, Hirota Y, Ito C, Kubo M, et al.Genome-wide association study identifies common variants at four loci as genetic risk factors for Parkinson’s disease. Nat Genet. 2009; 41(12):1303–7.View ArticlePubMedGoogle Scholar
- Kahler AK, Djurovic S, Kulle B, Jonsson EG, Agartz I, Hall H, et al.Association analysis of schizophrenia on 18 genes involved in neuronal migration: MDGA1 as a new susceptibility gene. Am J Med Genet B Neuropsychiatr Genet. 2008; 7:1089–100.View ArticleGoogle Scholar
- Chen L, Ge B, Casale FP, Vasquez L, Kwan T, Garrido-Martin D, et al.Genetic drivers of epigenetic and transcriptional variation in human immune cells. Cell. 2016; 167(5):1398–414.View ArticlePubMedPubMed CentralGoogle Scholar
- Teschendorff AE, Zheng SC, Feber A, Yang Z, Beck S, Widschwendter M. The multi-omic landscape of transcription factor inactivation in cancer. Genome Med. 2016; 8(1):89.View ArticlePubMedPubMed CentralGoogle Scholar
- Mayrhofer M, Kultima HG, Birgisson H, Sundstrom M, Mathot L, Edlund K, et al.1p36 deletion is a marker for tumour dissemination in microsatellite stable stage II–III colon cancer. BMC Cancer. 2014; 14:872.View ArticlePubMedPubMed CentralGoogle Scholar
- Teschendorff AE, Relton CL. Statistical and integrative system-level analysis of DNA methylation data. Nat Rev Genet. 2018; 19(3):129–47.View ArticlePubMedGoogle Scholar
- Bonder MJ, Luijk R, Zhernakova DV, Moed M, Deelen P, Vermaat M, et al.Disease variants alter transcription factor levels and methylation of their binding sites. Nat Genet. 2017; 49(1):131–8.View ArticlePubMedGoogle Scholar
- Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, et al.The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012; 486(7403):346–52.View ArticlePubMedPubMed CentralGoogle Scholar
- Ding S, Cook RD. Tensor sliced inverse regression. J Multivar Anal. 2015; 133:216–31.View ArticleGoogle Scholar
- Virta J, Li B, Nordhausen K, Oja H. JADE for tensor-valued observations. Accepted J Comput Graph Stat. https://doi.org/10.1080/10618600.2017.1407324. preprint arXiv:1603.05406.
- Plerou V, Gopikrishnan P, Rosenow B, Amaral LA, Guhr T, Stanley HE. Random matrix approach to cross correlations in financial data. Phys Rev E; 65(6):066126.Google Scholar
- Virta J, Li B, Nordhausen K, Oja H. tensorBSS: blind source separation methods for tensor-valued observations. 2017. R package version 0.3. https://CRAN.R-project.org/package=tensorBSS.
- Cardoso JF. Source separation using higher order moments. In: International Conference on Acoustics, Speech, and Signal Processing, ICASSP-89. Glasgow: IEEE: 1989. p. 2109–12.Google Scholar
- Cardoso JF, Souloumiac A. Blind beamforming for non Gaussian signals. IEEE Proc F. 1993; 140:362–70.Google Scholar
- Chen YA, Lemire M, Choufani S, Butcher DT, Grafodatskaya D, Zanke BW, et al.Discovery of cross-reactive probes and polymorphic CpGs in the illumina infinium humanmethylation450 microarray. Epigenetics. 2013; 8(2):203–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Jing H, Teschendorff EA. R-scripts for implementing tensor decomposition methods. 2018. https://doi.org/10.5281/zenodo.1208040.