CONFINED: distinguishing biological from technical sources of variation by leveraging multiple methylation datasets

Methylation datasets are affected by innumerable sources of variability, both biological (cell-type composition, genetics) and technical (batch effects). Here, we propose a reference-free method based on sparse canonical correlation analysis to separate the biological from technical sources of variability. We show through simulations and real data that our method, CONFINED, is not only more accurate than the state-of-the-art reference-free methods for capturing known, replicable biological variability, but it is also considerably more robust to dataset-specific technical variability than previous approaches. CONFINED is available as an R package as detailed at https://github.com/cozygene/CONFINED. Electronic supplementary material The online version of this article (10.1186/s13059-019-1743-y) contains supplementary material, which is available to authorized users.


S3 Single matrix decomposition on the union of two matrices
In this analysis, we evaluated the best-performing single-matrix method for capturing cell-type composition-ReFACTor-against CONFINED, using the union of the two input matrices to CONFINED as the input for the ReFACTor. To do this, we simply column-wise concatenated the individuals from one dataset to the other, so that the dimension of the new input matrix was m⇥ (n 1 +n 2 ). We ran ReFACTor while also supplying it with a covariate as the covfile argument a vector that indicated from which dataset the individual originated (e.g. 0 for dataset 1 and 1 for dataset 2 ). We ran CONFINED using the learned rule for selecting the sparsity parameter since the experiment was evaluating each method's ability to capture cell-type accuracy. Notably, in this analysis, using ReFACTor on the concatenated datasets is analogous to CONFINED, only that the basis of ReFACTor is PCA, whereas the basis of CONFINED is CCA. In both methods, a feature-selection step based on either PCA or CCA is performed on the entirety of the input matrices, which is then followed by performing PCA or CCA on the selected features to generate the components of interest.

S4 Measured sources of variability across methods
In this section, we compare the performance of CONFINED to previous methods designed to capture cell-type composition in methylation datasets. Notably, the aim of several of the previous methods (ReFACTor, NNMF) is simply to capture cell-type composition accuracy, and they do not have an emphasis on other global sources of variation such as age, sex, or environment. For CONFINED, we used as input datasets from Hannum et al. [38] and Liu et al. [39]. All other single-matrix decomposition methods used just the dataset from Liu et al. as input. We used 10 components from each method to predict each source of variability in the Liu et al. [39] dataset. CONFINED greatly outperformed other methods when capturing both age and sex and performed best for these factors using a large number of features (i.e. low sparsity). This may be an indicator that changes in observed methylation signal due to age or sex are rather subtle, and require many sites to capture. Interestingly, when CONFINED used 5000 sites, age was captured with much greater accuracy than sex.

S5 Enrichment Analysis
In this section, we extend the enrichment analysis presented in the main text. We first verified that CONFINED selects for immune-related CpG sites in a separate pair of whole-blood datasets (GSE80417 and GSE84727 of Hannon et al. [55], Figure S1). Notably, neither of the Hannon datasets studied a healthy phenotype whereas in the other pair of datasets we used a dataset studying aging. The immune-related pathways were the most greatly enriched pathways when using CONFINED's feature selection for this pair of datasets, however, they were less significantly enriched than in the analysis of the datasets of Liu et al. and Hannum et al.. We also examined the enrichment in a brain-brain pair of datasets (Horvath et al. [65], Ja↵e et al. [73]) as well as an adipose-adipose pair of datasets (Bonder et al. [68]). For these two tissue types, we saw enrichment for myelin, neuron and axon-related functions (Table S3) as well as angiogenic functions (Table S2) respectively. The enrichments recapitulated tissue-related functions, however they were not statistically significant using our permutation testing.
We also performed enrichment analysis when there was simulated technical variation added to the more variable probes in the datasets. In this experiment, we added noise to the sites whose standard deviation was in the top p th percent of standard deviations, where we varied p from 1 to 50. For each site, we added normal noise with mean zero and standard deviation equal to the square of the observed standard deviation of the probe. We show the results for the enrichment when pairing two whole-blood datasets (Liu et al. [39],Hannum et al. [38]) and using t = 2072 (following the rule we learned from cross-validation and as we did in Table 1 of the manuscript). Notably, the p-value of the enrichment for the top three sites was relatively consistent, and we continued to observe enrichment for immune-related functions ( Figure S19).

S6 Cross validation for cell-type composition
Notably, our method has two hyper-parameters, t the number of sites to include, and l the rank of the transformation used to obtain the informative sites. In this section, we explain how we chose values for both of the hyper-parameters. As a reminder, we pick the sites most correlated between their original data matrix and their low-rank approximation, e.g. X i andX i (wherẽ X i =Ũ iŨ T i X i ). We will first explain how we choose l. i Perform CCA on the input matrices ii Define to be some threshold [0,1] iii Set l to the number of canonical variables of U 1 and U 2 that have correlation , i.e. the number of pairs of columns of U 1 and U 2 whose correlation is greater than or equal to iv If no canonical variables have correlation , set l = 1.
We next detail our cross-validation process. We will assume we have l and a ranked list of our features: 1 First, we store a partition of the data for validation purposes. i Hold out one third of the sites of each matrix, storing X 1validate and X 2validate , size m 3 ⇥ n 1 and m 3 ⇥ n 2 respectively 2 Using the remaining two thirds of the data, X 1 2 3 and X 2 2 3 , we will perform training and testing procedures.
i Randomly split the input matrices X 1 2 3 and X 2 2 3 into two halves: X 1train , and X 1test , and X 2train , and X 2test such that each matrix has m 3 sites and their corresponding n i individuals.
ii On the train partitions of the data, run CONFINED using the top t features to obtain A 1train and A 2train , the canonical loadings for X 1train and X 2train respectively. iii Find the top t sites of the test data partitions and subset the test data partitions to size t ⇥ n i where n i is the number of individuals in that dataset. iv Using A 1train and A 2train , obtain the t ⇥ n i canonical variables U 1test and U 2test : X 1test A 1train and X 2test A 2train respectively. v Use X T 1test U 1test and X T 2test U 2test to predict cell-type composition for each individual in the test partition of the dataset. 3 After learning the optimal parameters t ⇤ and l ⇤ , perform our method on the validation partition of the datasets.
In this setting, we essentially learn the axes of the most correlated space for the sites of the train datasets, and then leverage this space on the test datasets to estimate cell-type composition. The canonical weights used for each test partition were obtained without using data from any of the samples in the test partitions. We performed our method on train and test partitions of the data (2 above) while varying both the value of t and the threshold . For each combination of t and , we randomly split the data 10 times, then took the average of the R 2 value when using the first 10 components to capture cell-type composition as a metric of accuracy. Regressing t of the best performing set of hyperparameters onto the number of individuals in the datasets, we learned a rule for selecting t in the case of predicting cell-composition in a pair of whole-blood datasets.
When performing the cross-validation of CONFINED, we also recorded the runtime as a function of the number of individuals in the datasets. For 100 to 500 individuals (with a step size of 100) in both datasets, the average runtime of CONFINED across ten iterations was 35.35, 45.18, 54.44, 73.94, 118.89 seconds respectively. The bottleneck of the computation is a result of calculating the correlation between the input matrices' CpG sites and their lowrank approximations (Methods section) as this operation is typically done on several hundred thousands of sites. The value of generally will not a↵ect run-time, as typically < 10 CCA components will have a high correlation. We show the runtime of our CCA implementation in Figure S18 and note that the runtime of CONFINED is dominated by the first performance of CCA as t will typically be smaller than the number of sites in the input datasets. We tested CONFINED on our cluster (Linux CentOS 6.10) using a single node and R 3.3.3.

S7 Permutation testing
To validate the enrichment results reported by missMethyl, we performed permutation testing.
missMethyl takes as input a set (i.e. sample) of CpG sites used to test for enrichment of gene ontology pathways, along with the population from which the sample of CpG sites was chosen. For the purpose of the permutation tests, our sample of CpG sites consisted of the top t sites reported by CONFINED, and the population of CpG sites was made up of the m sites in the input matrices. In this context, we varied t, the number of features to use, and compared the enrichment p-values when using the top t features sorted by CONFINED and t randomly selected features. We specifically compare the p-values of the top three most enriched pathways when using t CONFINED sites and the single most enriched pathway when using t randomly selected sites from the size m CpG population. The number of features we tested ranged from 1000 to 10000 with a step size of 1000, and we performed 1000 permutations at each number of features. In this experiment, we focused on a blood-blood pair of datasets (Liu et al. [39], Hannum et al. [38]).

S8 Batch e↵ect simulations
Consider one model of principal components analysis: Where X is a data matrix of size n ⇥ p, W is an p ⇥ k matrix containing the k principal components of X (eigenvectors of the covariance matrix of X), Z is a n ⇥ k matrix of scores for each principal component, and ✏ is an n-length vector containing noise. Intuitively, by finding the eigenvectors corresponding to the top k eigenvalues of the covariance matrix of X, we are finding the directions that explain the most variance in the data. While we might expect that cell counts or some other phenotype might be driving the variance of methylation data, variance in biological data is often confounded by di↵erent measurement protocols or human error-in other words, batch e↵ects [9]. Therefore, the top k directions of variance in a dataset may correspond to batch e↵ects, or the observed variance in the data may simply by due to di↵erent protocols used to produce the data. For our simulations, we generated noise for each dataset X i following the previously described structure: Where Z i is a random matrix of "scores" of size m ⇥ r with every entry z jk drawn from the standard normal distribution and W i is a matrix of "weights" of size n i ⇥ r where every entry w jk is drawn from the standard uniform distribution and each column w (k) i is standardized to have norm 1.
In doing so, we add some structured, normally distributed noise that is specific to each dataset. By varying the number and length of the weight vectors w (k) i , we can also control the rank and magnitude of the structured noise. Intuitively, this noise emulates technical variation, as each dataset will have its own unique set of weight vectors.
To elucidate the consequences of the simulated batch e↵ects on PCA-based methods, we examined the correlation of ReFACTor's components and the simulated weight vectors. Regressing the artificial noise vectors onto the ReFACTor components, we observe high R 2 values.
Additionally, if the noise we introduced had rank k and large strength (norm), it was captured by exactly the top k ReFACTor components. In cases where the norm of the weight vector was relatively low, ReFACTor's components still captured some of the signal corresponding to the batch e↵ects. These results emphasize that single-matrix decomposition methods may produce components whose signal also includes noise from technical variation.

S9 Preprocessing using Removal of Unwanted Variation (RUV)
In these experiments, we compared previous reference-free methods to CONFINED after using RUV as a preprocessing step for the previous methods. RUV uses a set of control probes (either known a priori or learned empirically) to generate components that are unrelated to the factor or phenotype of interest [2,9]. In this case, we used the dataset of Liu et al. [39] considering as the factor of interest the phenotype they studied, Rheumatoid Arthritis. Prior to running any of the other reference-free methods, we used RUV to remove unwanted variation, or residual technical e↵ects. In the case of simulated technical noise, we supplied RUV with the specific number of batch vectors at each iteration in the experiment. Here, the performance of earlier reference-free methods improved quite dramatically ( Figure S14). We also used this preprocessing step for the real data setting the number of unwanted factors k = 2, following the default arguments for similar packages in the minfi [57] R package. The performance in this case was not improved as much as in the scenario with simulated technical noise, and CONFINED outperformed previous methods ( Figure S14). This may be because in attempting to remove unwanted signal and keep the signal related to the phenotype of interest, RUV is perhaps adjusting away variability associated with cell-type composition.

S10 Using CONFINED on a single dataset split into halves
In these experiments, we considered the performance of CONFINED when using as input two halves of a single dataset (Hannum et al. [38]). Notably, this violates the key assumption of CONFINED-the two input matrices are not independent datasets, therefore the technical variability will be shared across both datasets. In this scenario, we posit that our method will perform similarly to previous methods based on decomposition of a single matrix. Here, we randomly divided the dataset into two halves 10 separate times, and took the average performance of each of the splits. We also evaluated CONFINED when simulated technical noise was added to highlight that dataset-specific variability will be an issue in this scenario. S11 Extending CONFINED to more than two datasets Here, we show some exploratory analysis for extended CONFINED to operate on more than two datasets. We considered two cases: (1) when we simple column-wise concatenate an additional dataset to one of the datasets in the original pair and (2) using the iterative sparse CCA algorithm from Witten et al. [36]. Using the concatenated data as input to CONFINED performed much better than when using separate matrices as input into the sparse multiple CCA implementation. Further, the results using the concatenated datasets with CONFINED were consistent to those reported when using the datasets separately with CONFINED .

S12 Feature selection
In this section, we compare the feature selection steps taken by CONFINED (a type of sparse CCA) and ReFACTor (a type of sparse PCA). Given input matrices of size m ⇥ n 1 and m ⇥ n 2 , or more generally for the single-matrix decomposition case m ⇥ n.
1 Our features are selected in the following manner: i Obtain U 1 and U 2 both of size m ⇥ min{n 1 , n 2 } following Equations (1) and (2). ii ConstructŨ 1 andŨ 2 from the first l columns of U 1 and U 2 respectively. iii Generate a low-rank approximation of each dataset: iv For each site j in dataset i compute a score based on its correlation between itself and its low-rank approximation: v Rank the sites with the highest inter-dataset score: vi Use t sites with the top t scores when performing CCA. 2 ReFACTor selects CpG sites in the following way: i Compute the singular value decomposition (SVD) of a matrix X and to obtain V , the left singular vectors of X. ii ConstructṼ by taking the first l columns of V . iii Construct a low-rank approximation of X: e X =ṼṼ T X iv Find the sites that are most correlated between the original dataset and the low-rank approximation of the dataset. v Use the top t most correlated sites for X when performing PCA.
Notably, the features selected by our method are important to both datasets. In both datasets, the features are well-represented in a low-dimensional, correlated subspace. In ReFACTor, the features are selected if they are well-approximated by the first few principal components in one specific dataset's variable subspace. Performing feature selection based on a single-matrix decomposition method does not consider that some of the first few principal components in a dataset may be driven by batch e↵ects. Age correlation Sex correlation Age correlation  S1. Evaluation of the ability of CONFINED to capture measured sources of variability in brain datasets. We paired a two brain datasets (Horvath et al. [65], Ja↵e et al. [73]) to capture sources of variability in each dataset across a range of sparsity. Above, we included the sex chromosomes to show that the accuracy of CONFINED to predict biological sex can be improved by the sex chromosomes' site inclusion. Below, we remove the sites along the sex chromosomes.  Age correlation  B cell prediction accuracy  CD8 T cell prediction accuracy B cell prediction accuracy  B cell prediction accuracy  Table S1. Gene Ontology Enrichment of sites ranked by CONFINED. We tested enrichment of the highest-ranked sites by CONFINED in a blood-blood pair of datasets (GSE80417 and GSE84727,Hannon et al. [55]). Here, we set the sparsity parameter based on a rule learned through cross-validation (Additional File 1: Section S6), however we observed similar results across a range of sparsity parameters (Additional File 1: Section S7).  Table S2. Gene Ontology Enrichment of sites ranked by CONFINED. We tested enrichment of the highest-ranked sites by CONFINED in an adipose-adipose pair of datasets (Bonder et al. [68]). Here, we set the sparsity parameter to 2000 following the blood-blood enrichment experiments (Table 1, Additional File 1: Figure S11).  Table S3. Gene Ontology Enrichment of sites ranked by CONFINED. We tested enrichment of the highest-ranked sites by CONFINED in a brain-brain pair of datasets (Horvath et al. [65], Ja↵e et al. [73]). Here, we set the sparsity parameter to 2000 following the blood-blood enrichment experiments (Table 1, Additional File 1: Figure S11).  . We tried this approach on the dataset both with simulated technical noise (dark green) and without simulated technical noise (light green).