Below we briefly describe the main tensorial BSS algorithms [8, 9, 35] as implemented here. For more technical details, see [8, 9, 35]. We also provide brief details of our implementation of JIVE, PARAFAC, iCluster, CCA and SCCA. All these implementations are available as R functions within Additional file 2.
Tensorial PCA
We assume that we have i=1,…,p independent and identically distributed realisations of a matrix \(X_{i}\in \mathbf {R}^{p_{1}\times p_{2}}\), which can be structured as an order-3 data tensor X of dimension p1×p2×p. Then, tPCA decomposes X as follows:
$$ X=S\odot_{m=1}^{2}\Omega_{m}, $$
(1)
where S is also a 3-tensor of dimension p1×p2×p and \(\Omega_{m}\) (m=1,2) are orthogonal \(p_{m}\times p_{m}\) matrices, i.e. \(\Omega _{m}^{T}\Omega _{m}=I_{p_{m}}\). Here, ⊙ denotes the tensor contraction operator. For instance, for Z an r-tensor of dimension \(p_{1}\times \cdots \times p_{r}\) and A a matrix of dimension \(p_{m}\times p_{m}\), \(Z\odot_{m}A\) describes the r-tensor with entries \((Z\odot _{m}A)_{i_{1} \ldots i_{m} \ldots i_{r}}=Z_{i_{1} \ldots j_{m} \ldots i_{r}}A_{i_{m}j_{m}}\), where the Einstein summation convention is assumed (i.e. indices appearing twice are summed over, e.g. \(M_{ik}M_{in}=\sum _{i}{M_{ik}M_{in}}=\left(M^{T}M\right)_{kn}\)). Thus, \(S\odot _{m=1}^{2}\Omega _{m}\) is a 3-tensor with entries
$$ \left(S\odot_{m=1}^{2}\Omega_{m}\right)_{i_{1}i_{2}i}=S_{k_{1}k_{2}i}(\Omega_{1})_{i_{1}k_{1}}(\Omega_{2})_{i_{2}k_{2}}. $$
(2)
In the above tPCA decomposition, the entries \(S_{k_{1}k_{2}}\) are assumed to be linearly uncorrelated. Introducing the operator ⊙−m, which for general r is defined in entry form by
$$ (X\odot_{-m}X)_{uv}=X_{i_{1} {\ldots} i_{m-1}ui_{m+1} {\ldots} i_{r}i}X_{i_{1} {\ldots} i_{m-1}vi_{m+1} {\ldots} i_{r}i} $$
(3)
the requirement of linearly uncorrelated components means that the covariance matrix \(S\odot_{-m}S=\Lambda_{m}\) is diagonal, of dimension \(p_{m}\times p_{m}\). Its entries are the ranked eigenvalues of the m-mode covariance matrix \((X\odot_{-m}X)\), which can be expressed as
$$ (X\odot_{-m}X)=\Omega_{m}\Lambda_{m}\Omega_{m}^{T}. $$
(4)
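To make the construction concrete, the following base-R sketch (illustrative, not the paper's own code) computes the m-mode covariance matrices of Eq. (3) and their eigendecompositions (Eq. (4)) for a toy order-3 tensor:

```r
## Illustrative base-R sketch: m-mode covariance and its eigendecomposition
## for a 3-tensor X of dimension p1 x p2 x p (last mode = observations).
mmode_cov <- function(X, m) {
  d <- dim(X)
  Xm <- apply(X, m, identity)  # unfold: prod(d[-m]) x d[m] matrix
  crossprod(Xm)                # d[m] x d[m] m-mode covariance (Eq. 3)
}

X <- array(rnorm(10 * 5 * 50), dim = c(10, 5, 50))  # toy p1 x p2 x p tensor
eig1 <- eigen(mmode_cov(X, 1))  # Omega_1 = eig1$vectors, Lambda_1 = diag(eig1$values)
eig2 <- eigen(mmode_cov(X, 2))  # as in Eq. (4)
```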
These ranked eigenvalues are useful for performing dimensional reduction, i.e. projecting the data onto subspaces carrying significant variation. For instance, one could use random matrix theory (RMT) [17, 36] on each of the m-mode covariance matrices above to estimate the appropriate dimensionalities \(d_{1},\ldots,d_{r}\). This would lead to a tPCA decomposition of the form \(X=S\odot _{m=1}^{2}\Omega ^{(R)}_{m}\), with S a d1×d2×p tensor and each \(\Omega ^{(R)}_{m}\) a reduced matrix obtained from \(\Omega_{m}\) by selecting the first \(d_{m}\) columns. We note that for any of the original dimensions \(p_{1},\ldots,p_{r}\) that are small, such dimensional reduction is not necessary.
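Continuing the sketch above, the reduced rotations \(\Omega^{(R)}_{m}\) and the core tensor S can be formed as follows. In practice d1 and d2 would come from an RMT estimate (e.g. isva::EstDimRMT); fixed values are used here purely for illustration:

```r
## Illustrative reduction step: keep the first d_m eigenvectors per mode and
## contract, giving S of dimension d1 x d2 x p with
## S_{k1 k2 i} = X_{i1 i2 i} (Omega1R)_{i1 k1} (Omega2R)_{i2 k2}.
d1 <- 3; d2 <- 2                  # assumed RMT estimates (illustrative)
Omega1R <- eig1$vectors[, 1:d1]   # p1 x d1 reduced rotation
Omega2R <- eig2$vectors[, 1:d2]   # p2 x d2 reduced rotation
S <- array(0, dim = c(d1, d2, dim(X)[3]))
for (i in seq_len(dim(X)[3])) {
  S[, , i] <- t(Omega1R) %*% X[, , i] %*% Omega2R
}
```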
In the applications considered here, our data tensor X is typically of dimension \(n_{t}\times n_{s}\times n_{G}\), where \(n_{t}\) denotes the number of data or tissue types, \(n_{s}\) the number of samples and \(n_{G}\) the number of features (e.g. genes or CpGs). We note that the tPCA decomposition is performed on the first two dimensions (typically data type and samples), so there are two relevant covariance matrices. In the special case of a data matrix (a 2-tensor), standard PCA involves the diagonalisation of one data covariance matrix. Hence, for a 3-tensor, there are two data covariance matrices, and for an (r+1)-tensor, there are r. Here we use tPCA as implemented in the tensorBSS R package [37].
tICA: the tWFOBI and tWJADE algorithms
For a data tensor \(X\in \mathbf {R}^{p_{1}\times \dots \times p_{r}\times p}\), the tICA model is
$$ X = S\odot_{m=1}^{r}\Omega_{m}, $$
(5)
but now with the \(p_{1}\cdots p_{r}\) random variables \(S_{k_{1} \ldots k_{r}}\in \mathbf {R}^{p}\) (\(S\in \mathbf {R}^{p_{1}\times \cdots \times p_{r}\times p}\)) mutually statistically independent and satisfying \(\operatorname {E}[S_{k_{1} \ldots k_{r}}]=0\) and \(\operatorname {Var}[S_{k_{1} \ldots k_{r}}]=I\). We note that X could be a suitably dimensionally reduced version X(R) of X, such as that obtained using tPCA. For instance, in our applications, X(R) would typically be a 3-tensor of dimension \(n_{t}\times d_{s}\times n_{G}\) with \(d_{s}<n_{s}\). This dimensional reduction, and optionally the scaling of variances, is known as whitening (W).
As with ordinary ICA, there are different algorithms for inferring the mutually statistically independent components \(S_{k_{1} \ldots k_{r}}\). One algorithm is based on simultaneously maximising the fourth-order moments (kurtosis) of the ICs (since, by the central limit theorem, linear mixtures of ICs are more Gaussian and, therefore, have smaller kurtosis values). This approach is known as fourth-order blind identification (FOBI) [38]. Alternatively, one may attempt a joint approximate diagonalisation of higher-order eigenmatrices (JADE) [35, 39]. We note that although we use the tFOBI and tJADE functions in tensorBSS, these do not implement tPCA beforehand. Hence, in this work we implement modified versions of tFOBI and tJADE, which include a prior whitening transformation with tPCA. We call these modified versions tWFOBI and tWJADE.
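A minimal sketch of the resulting tWFOBI workflow, under the assumption (as in tensorBSS) that the last array mode indexes observations; tWJADE is analogous, with tJADE in place of tFOBI:

```r
## Sketch of tWFOBI: tPCA-based whitening followed by tensorial FOBI.
## S is the tPCA-reduced core tensor from the sketch above; an additional
## scaling of variances could be applied here as part of the whitening (W).
library(tensorBSS)
res <- tFOBI(S)   # tensorial FOBI on the whitened tensor
str(res$S)        # estimated independent components (see ?tFOBI)
```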
Benchmarking of tPCA and tICA against other tensor decomposition algorithms
JIVE (joint and individual variation explained) [5] is a powerful decomposition algorithm that identifies both joint and individual sources of data variation, i.e. sources of variation that are common and specific to each data type. For two data types (i.e. two tissue types or two types of molecular features), three key parameters need to be specified or estimated to run JIVE: the number of components of joint variation (dJ) and the numbers of components of variation that are specific to each data type (dI1 and dI2). On simulated data, these parameters are chosen to be equal to the true (known) values, i.e. for our simulation model, dJ=1, dI1=1 and dI2=1. In our real-data applications, dJ is estimated using RMT on the concatenated matrix obtained by merging the two data-type matrices together (after z-score normalising features to make them comparable), whilst the dIi are estimated using RMT on each individual data-type matrix [17]. We note that these are likely upper bounds on the true number of individual sources of variation that are not also joint. We implemented JIVE using the r.jive R package available from http://www.r-project.org.
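As an illustration, a hedged sketch of such a JIVE call on toy data (argument names as documented in the r.jive package; see ?jive for the exact interface):

```r
## Sketch of a JIVE run: two feature x sample matrices with matched columns.
library(r.jive)
datX <- matrix(rnorm(1000 * 100), 1000, 100)  # toy data type 1
datY <- matrix(rnorm(1000 * 100), 1000, 100)  # toy data type 2
## rankJ = dJ, rankA = (dI1, dI2); method = "given" uses the supplied ranks
res <- jive(list(datX, datY), rankJ = 1, rankA = c(1, 1), method = "given")
```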
PARAFAC (parallel factor analysis) [4, 6] is a tensor decomposition algorithm in which a data tensor is decomposed into the sum of R terms. Each term is a factorised outer product of rank-1 tensors (i.e. vectors) over each mode. Thus, the one key parameter is R, which is the number of terms or components in the decomposition. In our simulation model, we chose R=4. Although one of the two sources of variation in each data type is common to both (hence, there are three independent sources), we nevertheless ran PARAFAC with one additional component to assess its ability to infer components of joint variation more fairly. In the real-data applications, we estimated R as \(\sum _{i=1}^{n_{t}}{dI_{i}}-dJ\) (with \(n_{t}\) the number of tissue or cell types), since this should approximately equal the total number of independent sources of variation. We implemented PARAFAC using the multiway R package available from http://www.r-project.org.
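A hedged usage sketch via the multiway package (toy dimensions; see ?parafac for the full interface):

```r
## Sketch of a PARAFAC fit with R = 4 components, as used on the simulated data.
library(multiway)
Xarr <- array(rnorm(50 * 20 * 10), dim = c(50, 20, 10))  # toy 3-way array
fit <- parafac(Xarr, nfac = 4)
## the mode-wise factor matrices are returned in fit$A, fit$B, fit$C
```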
iCluster [7] is a joint clustering algorithm for multi-way data. It models joint and individual sources of variation as latent Gaussian factors. The key parameter is K, the total number of clusters to infer. Although for the simulated data there were only three independent sources of variation, we chose K=4 to assess more fairly the ability of the algorithm to infer the joint variation (choosing K=3 would force the algorithm to find the source of joint variation). We implemented iCluster using the iCluster R package available from http://www.r-project.org.

CCA [20] and its sparse version, sparse CCA (sCCA/SCCA) [21, 22], are methods to identify joint sources of variation (called canonical vectors) between two data matrices, where at least one of the dimensions is matched across data types. Here we use the implementations of CCA and sCCA in the R package PMA available from http://www.r-project.org. One key parameter is K, the maximum number of canonical vectors to search for. Another parameter is the number of permutations used to estimate the significance of the covariance of each of the K canonical vectors. In each permutation, one of the data matrices is randomised (e.g. by permuting the features) and CCA/sCCA is reapplied. Since the data matrices are typically large, the distribution of covariances in the permuted cases is very tight, so even 25 permutations are sufficient to estimate the number of significant canonical vectors reasonably well. The number of significant canonical vectors was defined as the number of components exhibiting observed covariances larger than the maximum value obtained over all 25 permutations, and is thus bounded above by K. In the non-sparse case, the two penalty parameters were set equal to 1, which means that no penalty term is used. For sCCA, we estimated the best penalty parameters using an optimisation procedure, as described in [21, 22], with the number of permutations set to 25 and the number of iterations equal to 15. On the simulated data, we ran CCA with K=3, as K specifies only the maximum number of components to search for (the actual number of significant canonical vectors is one in our case, as we have one source of joint variation). In the real-data applications, we chose K equal to dJ, as estimated using the procedure for JIVE, and used a larger number of iterations (50) per run.
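For reference, a hedged sketch of the non-sparse CCA call via PMA (penalties of 1, i.e. no penalty, as described above; toy matrices with matched rows/samples):

```r
## Sketch of the CCA run; for sCCA, the penalties would instead be tuned,
## e.g. via CCA.permute (see ?CCA and ?CCA.permute in the PMA package).
library(PMA)
x <- matrix(rnorm(100 * 500), 100, 500)  # toy data type 1 (samples x features)
z <- matrix(rnorm(100 * 400), 100, 400)  # toy data type 2 (samples x features)
res <- CCA(x, z, typex = "standard", typez = "standard",
           penaltyx = 1, penaltyz = 1, K = 3, niter = 15)
```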
Evaluation on simulated data
Here we describe the simulation model. The model first generates two data matrices of dimension 1000×100, representing two data types (e.g. DNA methylation and gene expression), where rows represent features and columns samples. We assume that the column and row labels (i.e. samples and genes) of the two matrices are identical and ordered in the same way. We assume one source of individual variation (IV) for each data matrix, each driven by 50 genes and 10 samples, with these 50 genes and 10 samples unique to each data matrix. We also assume one source of common variation, driven by a common set of 20 samples. The genes driving this common source of variation, however, are assumed distinct for each data matrix, so that in total 100 genes (50 for each data matrix) are associated with this joint variation (JV). For the 50 genes driving the JV in one data type and the 20 samples associated with this JV, we draw the values from a Gaussian distribution \(\mathcal {N}(e,\sigma)\), whereas for the 50 JV genes in the other data type, we draw them from \(\mathcal {N}(-e,\sigma)\), in both cases with e=3 and with σ representing the noise level. Likewise, for the IV we use Gaussians \(\mathcal {N}(e,\sigma)\). The rest of the data is modelled as noise \(\mathcal {N}(0,\sigma)\). We consider a range of nine noise levels, with σ ranging from 1 to 5 in steps of 0.5; thus, at σ=3, the SNR=e/σ=1. For each noise level, we perform 1000 Monte Carlo runs, and for each run and algorithm, we estimate the sensitivity (SE) and specificity (SP) for correctly identifying the 100 genes driving the JV.
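A base-R sketch of this generative model is given below; which genes and samples drive each source is arbitrary, so the index choices are illustrative:

```r
## Sketch of the simulation model: two 1000 x 100 data matrices with one
## source of individual variation (IV) each and one source of joint
## variation (JV) over 20 shared samples, with opposite signs per data type.
simulate_pair <- function(e = 3, sigma = 1, nG = 1000, nS = 100) {
  X1 <- matrix(rnorm(nG * nS, 0, sigma), nG, nS)
  X2 <- matrix(rnorm(nG * nS, 0, sigma), nG, nS)
  X1[1:50, 1:10]    <- rnorm(50 * 10, e, sigma)   # IV, data type 1
  X2[51:100, 11:20] <- rnorm(50 * 10, e, sigma)   # IV, data type 2
  jvS <- 21:40                                    # 20 shared JV samples
  X1[101:150, jvS] <- rnorm(50 * 20,  e, sigma)   # JV genes, data type 1
  X2[151:200, jvS] <- rnorm(50 * 20, -e, sigma)   # JV genes, data type 2
  list(X1 = X1, X2 = X2, jvGenes = list(101:150, 151:200))
}
sim <- simulate_pair(sigma = 3)   # SNR = e/sigma = 1 at this noise level
```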
For tPCA, tWJADE and tWFOBI, SE and SP were calculated as follows. We inferred a total of 12 components over the combined data-type and sample modes (2 in data-type mode × 6 in sample space). We then projected the inferred components onto the original data-type dimensions, using the inferred 2×2 mixing matrix. For each data type and each of the six components, we then selected the top-ranked 50 genes by absolute weight in the component. This allowed us to compute an SE and SP value for each data type and component. For each component, we then averaged the SE and SP values over the two data types. In the last step, we selected the component with the largest SE and SP values and recorded these values. We note that the resulting SE and SP values do not depend on the choice of 12 components: as long as the number of estimated components is larger than the total number of components of variation in the data (which for the simulated data is four), the results are invariant to the number of inferred components.
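As a concrete illustration of this computation, SE and SP for a single component in a single data type can be obtained with a helper along these lines (hypothetical names; base R):

```r
## SE = TP/(TP+FN) and SP = TN/(TN+FP) for the top-50 genes by absolute weight.
se_sp <- function(weights, trueGenes, nSel = 50) {
  sel <- order(abs(weights), decreasing = TRUE)[1:nSel]
  TP  <- length(intersect(sel, trueGenes))
  neg <- setdiff(seq_along(weights), trueGenes)
  TN  <- length(setdiff(neg, sel))
  c(SE = TP / length(trueGenes), SP = TN / length(neg))
}
```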
For CCA, which can only infer sources of joint variation, we ran it to infer a number of components (K=3) larger than the true number (there is only one source of JV). Pairs of canonical vectors were then selected according to whether their covariance is larger than expected, as assessed using permutations. From here on, the procedure to compute SE and SP proceeds as for the other algorithms, by selecting the component with the best SE and SP values. As with the other methods, the results do not depend on how we choose K, as long as K is larger than or equal to 1.
For PARAFAC, we ran it to infer R=4 components. Since for PARAFAC there is only one inferred projection across features per component, for each component we ranked the features according to their absolute weight, selected the top-ranked 50, and then computed two separate SE (or SP) values, one for each of the two sets of 50 true positive genes driving JV. We then selected, for each set of JV driver genes, the component achieving the best SE (or SP). Finally, we averaged the SE and SP values for the two sets of true positives. As with the other algorithms, the results do not depend on the choice of R, as long as R is larger than or equal to 4 (since there are four sources of variation, two of IV and two of JV, the latter counting as two in the PARAFAC setting).

For JIVE, we ran it to infer one source of JV and two sources of IV. Because JIVE stacks the data matrices corresponding to the two data types together, we then selected the 100 top-ranked genes, ranked by absolute weight in the inferred JV matrix, from which SE and SP were computed. Once again, the results are stable to choosing a larger number of inferred sources of JV, because for the simulated data there is only one source of JV. Further details for all methods can be found in Additional file 2. Finally, for each algorithm and noise level, SE and SP were averaged over all 1000 Monte Carlo runs, and the statistical significance of the SE and SP differences between algorithms was assessed using paired non-parametric Wilcoxon tests. The whole analysis above was repeated for sources of variation drawn from a Laplace distribution (with the same mean and standard deviation as the Gaussians above), to better capture the super-Gaussian nature of real biological data.
Illumina 450k DNA methylation and multi-way TCGA datasets
We analysed Illumina 450k datasets from three main sources. One dataset is a multi-blood-cell-subtype EWAS comprising 47 healthy individuals and three cell types (B cells, T cells and monocytes) [3]. Specifically, we used the same normalised data as used in [3], with the resulting data tensor being of dimension 3×47×388 618, after removing poor quality probes and probes with SNPs [40].
Another dataset was generated in [2]. It consists of two tissue types (whole blood and buccal) and 152 women. After quality control, and after removing probes on the X and Y chromosomes, polymorphic CpGs, probes with SNPs at the single-base extension site and probes containing SNPs in their body, as determined by Chen et al. [40], we were left with 447 259 probes, resulting in a data tensor of dimension 2×152×447 259.
Finally, we also analysed six datasets from TCGA. Specifically, we processed the RNA-seq, Illumina 450k DNAm and copy-number data for six different cancer types: colon adenocarcinoma (COAD), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LSCC), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP) and bladder urothelial carcinoma (BLCA). All of these contained a reasonable number of normal-adjacent samples. The processing was carried out following the same procedure described by us in [29], which resulted in data tensors over three data types (mRNA, DNAm and copy number), 14 593 common genes and the following sample numbers: 273 cancers and 8 normals for LSCC, 390 cancers and 20 normals for LUAD, 292 cancers and 24 normals for KIRC, 195 cancers and 21 normals for KIRP, 194 cancers and 13 normals for BLCA, and 253 cancers and 19 normals for COAD. We note that although these numbers of normal samples are small, they are the normal samples with data for all three data types.
Identifying smoking-associated CpGs in the multi-tissue (whole blood + buccal) EWAS
To test the algorithms on real data, we considered the matched multi-tissue (whole blood and buccal) Illumina 450k DNAm dataset for 152 women [2]. Smoking has been shown to be reproducibly associated with DNAm changes at a number of different loci [23]. We, therefore, used as a true positive set a gold-standard list of 62 smoking-associated CpGs (smkCpGs), which have been shown to be correlated with smoking exposure in at least three independent whole blood EWAS [23]. The 62 smkCpGs were combined with 1000 randomly selected (non-smoking-associated) CpGs, resulting in a data tensor of dimension 2×152×1062. Robustness was assessed by performing 1000 different Monte Carlo runs, each run with a different random selection of 1000 non-smoking-associated CpGs. The whole analysis was then repeated for 10 000 randomly selected CpGs (data tensor of dimension 2×152×10 062), again for a total of 1000 different Monte Carlo runs. For the tPCA/tICA algorithms, the dimensionality parameters were chosen based on RMT as applied to the two separate tissue-type matrices. Specifically, the estimated unmixing matrices were of dimension 2×2 (for tissue-type mode) and d×d (for sample mode), with d the maximum of the two RMT estimates obtained from each tissue-type matrix.
SE to capture the 62 smkCpGs was calculated in two different ways. In one approach, we used the maximum SE attained by any IC, denoted SE(max), whilst in the other approach, we allowed for the possibility that different enriched ICs could capture different subsets of smkCpGs. Thus, in the second approach, the SE was estimated using the union of the selected CpGs over all enriched ICs. We note that enrichment of ICs for the smkCpGs was assessed using a simple binomial test, selecting those ICs with a P value less than the Bonferroni-corrected threshold (i.e. less than 0.05 divided by the number of ICs). In both approaches, the CpGs selected per component were the 62 with the largest absolute weights in the component, i.e. the number of selected CpGs was matched to the number of true positives.
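A sketch of the binomial enrichment test used to flag ICs (hypothetical helper names; the union-based SE then pools the selected CpGs over all flagged ICs):

```r
## Is an IC enriched for smkCpGs? Binomial test on the overlap between the
## selected CpGs and the 62 true positives, Bonferroni-corrected over ICs.
ic_enriched <- function(selCpGs, smkCpGs, nTotal, nICs) {
  k <- length(intersect(selCpGs, smkCpGs))
  pv <- binom.test(k, length(selCpGs),
                   p = length(smkCpGs) / nTotal,
                   alternative = "greater")$p.value
  pv < 0.05 / nICs
}
## SE(union) = |(union of selected CpGs over enriched ICs) intersect smkCpGs| / 62
```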
For JIVE, the number of components of joint variation was determined by applying RMT to the data matrix obtained by concatenating the features of the blood and buccal sets, with features standardised to unit variance to ensure comparability between data types. For the number of components of individual variation, we used the RMT estimates of each individual dataset, as this provides a safe upper bound. For PARAFAC, the number of components was determined as the sum of the RMT estimates for the blood and buccal sets separately minus the value estimated for the concatenated matrix, as we reasoned that this would best approximate the total number of components of variation across the two data types (joint or individual). For CCA and sCCA, the maximum number of canonical vectors to search for was set equal to the RMT estimate of the concatenated matrix, i.e. equal to the dimension of joint variation used in JIVE. For all methods, we selected the top-ranked 62 CpGs with the largest absolute weights in each component, and estimated SE using the same two approaches described above for tPCA/tICA.
mQTL and chromosome enrichment analysis
We applied tWFOBI to the data tensor of a multi-cell-type EWAS (Illumina 450k) for 47 healthy individuals and three cell types (B cells, T cells and monocytes) [3]. Specifically, we used the same normalised data as used in [3], i.e. a data tensor of dimension 3×47×388 618, after removing poor quality probes and probes with SNPs [40]. Using RMT [17], we estimated a total of 11 components in the sample-mode space, and so we inferred a source tensor of dimension 3×11×388 618, and mixing matrices of dimension 3×3 and 11×11. We also applied tWFOBI to the previous blood plus buccal DNAm dataset, but for all 447 259 probes that passed quality control. Applying RMT, we estimated 26 significant components in the sample space. Hence, we applied tWFOBI to the 2×152×447 259 data tensor to infer a source tensor of dimension 2×26×447 259 and mixing matrices of dimension 2×2 and 26×26. For both datasets, and for each inferred IC, we selected the 500 probes with the largest absolute weights and tested these for enrichment against a high-confidence mQTL list from [25] (22 245 mQTLs). This list was generated as the overlap of mQTLs (passing a stringent P value threshold of 1×10−14) in blood derived from five different cohorts representing five different age groups. Odds ratios and P values of enrichment were estimated using Fisher’s exact test. For chromosome enrichment, we obtained P values using a binomial test. We note that selecting the top-500 probes from each component is conservative, as all inferred ICs exhibited positive kurtosis, with kurtosis values that remained significantly positive after removing the top-500 ranked probes.
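The enrichment test amounts to Fisher's exact test on a 2×2 table (selected vs not selected × mQTL vs not mQTL); a hedged base-R sketch with illustrative names:

```r
## mQTL enrichment for one IC: 2x2 table of top-500 selection status against
## membership in the high-confidence mQTL list.
mqtl_enrich <- function(selCpGs, mqtlCpGs, allCpGs) {
  tab <- table(selected = factor(allCpGs %in% selCpGs,  levels = c(TRUE, FALSE)),
               mqtl     = factor(allCpGs %in% mqtlCpGs, levels = c(TRUE, FALSE)))
  fisher.test(tab, alternative = "greater")   # odds ratio and P value
}
```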
To obtain estimates of cell-type-independent and cell-type-specific mQTLs, we used the following approach. The first mode/dimension of the estimated source tensor was rotated back to the original cell types, using the estimated mixing matrix (of dimension 3×3, since there were three cell types). For each of the previously enriched mQTLs, we then compared its weights across the three components, each component being associated with a given cell type. For instance, if \(S_{t,cp,*}\) denotes component cp for cell type t, thus defining a vector of weights over all CpGs, we asked whether the absolute weight of the given mQTL CpG is large for all cell types or not. If it was sufficiently large (i.e. within the top 10% of the weight distribution) for all cell types, the mQTL was declared cell type independent. If the mQTL weight for one or two cell types fell within the lower 50% of weights, we declared it a cell-type-specific mQTL.
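In code, the quantile rule could look as follows (illustrative helper; w holds the absolute weights of one mQTL CpG across cell types, Wall the full CpG × cell-type weight matrix):

```r
## Quantile-based classification of an mQTL CpG as cell-type independent
## (top 10% in all cell types) or cell-type specific (lower 50% in at least
## one cell type); thresholds as described above.
classify_mqtl <- function(w, Wall) {
  hi <- sapply(seq_along(w), function(t) w[t] >= quantile(abs(Wall[, t]), 0.90))
  lo <- sapply(seq_along(w), function(t) w[t] <= quantile(abs(Wall[, t]), 0.50))
  if (all(hi)) "cell-type independent"
  else if (any(lo)) "cell-type specific"
  else "unclassified"
}
```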
We also performed a comparative analysis of all multi-way algorithms in terms of their sensitivity to detect mQTLs, as given by the high-confidence list of 22 245 mQTLs from the ARIES database [25]. To assess the stability of the conclusions, we computed SE as described earlier, but considered a range of numbers of top selected CpGs per component, ranging from 500 up to 22 245 in steps of 500. As before, we estimated the overall SE by taking into account the union of all selected CpGs from each component, as well as the maximum SE attained by any single component. Since the SE attained by any single component is bounded by the number of selected CpGs, we also considered the SE normalised for the number of selected CpGs.
Application of tICA to multi-omic cancer data
We used the same normalised integrated copy-number state (segment values), Illumina 450k DNAm and RNA-seq datasets of six cancer types from TCGA [1], as used in our previous work [29]; for the cancer types considered, see above. We initially applied tWFOBI to the colon adenocarcinoma TCGA dataset, estimating unmixing matrices of dimension 3×3 (for data-type mode) and K×K (for sample mode), where K was the maximum RMT estimate over the three data-type matrices. Features driving each IC in each data-type dimension were selected using an iterative approach in which genes were ranked by absolute weight and recursively removed until the kurtosis of the IC was less than 1, or the number of removed genes exceeded 500. Genes selected in common between the CNV and mRNA modes, or between the DNAm and mRNA modes, were declared driver genes between the respective data types. To identify components correlating with normal/cancer status, we obtained the mixing matrix of the samples and then correlated each component to normal/cancer status using the Wilcoxon rank-sum test.
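A sketch of this recursive selection rule (excess kurtosis computed by hand to avoid extra dependencies; names illustrative):

```r
## Rank genes by absolute weight and remove them one at a time until the
## (excess) kurtosis of the remaining weights falls below 1 or 500 genes
## have been removed; the removed genes are the selected drivers.
exkurt <- function(x) mean((x - mean(x))^4) / var(x)^2 - 3  # approximate

select_drivers <- function(w, maxSel = 500) {
  sel <- integer(0)
  repeat {
    remaining <- setdiff(seq_along(w), sel)
    if (exkurt(w[remaining]) < 1 || length(sel) >= maxSel) break
    sel <- c(sel, remaining[which.max(abs(w[remaining]))])  # remove top gene
  }
  sel
}
```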