We completed a series of benchmark studies to assess the performance of MIRTH in different imputation tasks. We evaluated the performance of MIRTH on a collection of nine previously-published mass-spectrometry metabolomics datasets, consisting of original raw ion counts before any preprocessing, e.g., normalization and imputation [9]. Metabolite names were harmonized by maximizing consistency across multiple metabolite identifiers, as previously described [9]. Details on the number of samples and features in each dataset are reported in Additional file 1: Table S1.
For each benchmark, we measured the concordance between the true and imputed ranks of metabolite samples in held-out data. Across experiments, we found MIRTH performed well in high sample-to-metabolite scenarios. In contrast, MIRTH performed poorly when there were insufficient samples to train on and when samples were highly censored. Furthermore, we found that a subset of metabolites were reproducibly well-imputed across different datasets and imputation tasks, ascribing a quantitative metric of confidence to MIRTH’s predictions.
MIRTH recovers missing metabolites within metabolomics datasets
We first verified that MIRTH accurately imputed missing measurements within a single dataset. This represented the most straightforward imputation task because there were no batch effects associated with merging of data from two or more distinct datasets. We performed an in silico experiment, simulating a scenario where a set of metabolites was not measured in a subset of samples in the dataset. First, we randomly select 50% of the samples to serve as hold-out samples. Next, we randomly selected 10% of all the metabolites to serve as hold-out metabolites (Fig. 2a). We masked the hold-out metabolites in the hold-out samples to simulate that they were not measured in half the dataset. This effectively split the dataset into two pseudo-datasets, where a smaller set of features was measured in the hold-out pseudo-dataset.
For each of the 9 benchmarking datasets (Additional file 1: Table S1), we split the data as described above and applied MIRTH to impute the held-out values (Fig. 2a and “Methods”). Since the held-out values in the dataset were actually known, the performance of MIRTH was assessable by comparing the actual and predicted ranks in the simulated-missing features. We repeated this experiment 200 times for each of the 9 datasets, randomly selecting a set of metabolites in a random set of samples to mask in each iteration. The optimal number of embedding dimensions, chosen through cross-validation separately for each dataset, ranged between 3 and 37 (Additional file 2: Fig. S1a). We also evaluated the performance of MIRTH and its sensitivity to dataset size and feature missingness using simulated metabolomics data. The method and results of testing single-set performance on simulated data are available in the Supplementary Information section.
When 10% of features were simulated missing in each trial, MIRTH successfully predicted the abundance of approximately 89% of metabolites (significant positive correlation with the true ranks in at least 90% of trials, \(\rho >0, q<0.05\), BH-corrected, Fig. 2b). This proportion ranged from 11% in BrCa2 (likely due to its small number of samples) to 100% in COAD (Additional file 1: Table S2). Variation in MIRTH performance across datasets was partly explained by dataset size: median imputation performance was better for the five datasets with the most samples, compared to the other four (Fig. 2c). Using the same metric of predictive success, the proportion of well-predicted metabolites in each dataset likewise increased as the sample size of the dataset increased (Fig. 2d). This suggested that poor predictions may be the result of an insufficient quantity of data from which MIRTH could learn. Similarly, when the proportion of features simulated as missing was increased, imputation performance worsened (Additional file 2: Fig. S1b). This result was expected since higher proportions of masked metabolites leave less data from which MIRTH can learn. We also investigated factors associated with each metabolite feature which might influence imputation accuracy. We observed that metabolite features with larger coefficients of variation (in the raw, non-rank-transformed data) tended to have lower imputation accuracy (Additional file 2: Fig. S1c). Similarly, predictions of features with a greater number of censored measurements (i.e., those where more measurements were below the detection threshold) scored lower. This likely arose from a poor correlation between censored values’ tied-for-last ranks at the input and their uniformly-mapped ranks at the output (Additional file 2: Fig. S1d).
A total of 306 metabolites were reproducibly well-predicted by MIRTH across multiple datasets, meaning that they were measured in at least 4 datasets and well-predicted in at least three-quarters of the datasets in which they were measured (Fig. 2e). Among these well-predicted metabolites were 84 amino acids, 22 carbohydrates, 16 cofactors and vitamins, 5 energy carriers, 120 lipids, 28 nucleotides, 21 peptides, 8 xenobiotics, and 2 uncharacterized metabolites. Well-predicted metabolites were enriched in specific metabolite classes, including dipeptides, proteinogenic amino acids and various lipid subsets (Additional file 1: Table S3). For example, palmitate and methionine, which were measured in 8 and 9 datasets respectively, were both well-predicted in 8 experiments (Fig. 2f). While these benchmarking experiments were conducted in a setting where the ground truth was known, the observation that certain metabolites were reproducibly well-imputed in different settings suggests that imputation of their abundance by MIRTH in settings where the true abundance is unknown should be associated with additional confidence relative to all other metabolites. Furthermore, the ability of MIRTH to recover ranks of subsets of samples of metabolites in a single dataset motivated the use of MIRTH to impute entirely missing metabolites in a single dataset by learning from a matrix of aggregate datasets.
MIRTH recovers missing metabolites by transferring knowledge across datasets
To determine if imputation of entirely unmeasured features could produce biologically sound predictions of missing metabolites, we applied MIRTH to two independent kidney cancer datasets, RC12 and RC3, each consisting of both tumor and adjacent normal tissue samples. We reasoned that metabolites distinguishing tumor and normal tissues should be highly concordant across both datasets. To test this assumption, we calculated the differential abundance of all metabolites (tumor vs. normal) in RC12 and RC3 (Additional file 2: Fig. S2). Of the 169 metabolites measured in both RC12 and RC3 that showed statistically significant differences between tumor and normal in both datasets, we observed that 159 (94\(\%\)) showed identical changes, including canonical metabolites such as glutathione (GSSG and GSH), lactate, NAD+, and fructose (\(q<0.05\), Wilcoxon test, Fig. 3a). This demonstrated that the metabolite differences between tumor and normal samples in these datasets were comparable. Next, we repeated the analysis above considering metabolites measured in RC12, but entirely unmeasured in RC3, which we called test metabolites. After joint imputation of RC12 and RC3 with MIRTH, we compared the differential abundance of test metabolites in RC12 (where they were measured) and RC3 (where their true abundance is unmeasured, but where they have been imputed by MIRTH, Additional file 2: Fig. S2). Doing so, we again observed a strong correlation between differential abundance (tumor vs. normal) in test metabolites in RC12 and RC3. Out of 252 test metabolites, there were 235 (93%) significant and consistently differentially abundant metabolites, including glucose-1-phosphate (G1P), fructose-6-phosphate (F6P), fructose-1-phosphate (F1P), and gamma-aminobutyric acid (GABA). Only 17 metabolites inconsistently distinguished tumor and normal samples (\(q<0.05\), Wilcoxon test, Fig. 3b). This analysis confirms that MIRTH preserves relationships between sample types and biologically important metabolites when imputing data across datasets, and suggests that MIRTH can be successfully applied to impute the ranks of entirely unmeasured metabolites in metabolomic data.
To further assess the ability of MIRTH to accurately impute missing features across datasets, we designated one of the nine datasets under analysis as the target, from which we completely masked a set of features to simulate as unmeasured. MIRTH was then applied to impute these unmeasured features, using data from the remaining eight datasets (and therefore testing the performance of MIRTH in the presence of a dataset-specific batch effect). We conducted nine such experiments, where each dataset was the target for one experiment (Fig. 3c). We repeated each experiment for 200 trials for each target dataset, randomly selecting 10% of features to simulate as missing each time. The optimal number of embedding dimensions ranged between 26 and 48, but equivalent-to-optimal performance could be achieved with approximately 30 dimensions (Additional file 2: Fig. S3a). Performance was evaluated similarly to the within-dataset imputation, comparing imputed and true ranks of the features simulated as missing.
Across the nine datasets under analysis, between 38% and 85% of the simulated-missing metabolites entirely masked from a target dataset were well-predicted with the MIRTH approach (\(\rho >0, q<0.05\) in >90% of trials, Fig. 3d, Additional file 1: Table S4). Similar to the within-dataset imputation, performance degraded as a larger proportion of features was simulated as missing (Additional file 2: Fig. S3b). Properties of the raw data, such as the variance of the feature in the target dataset or the number of samples in other datasets where the feature was measured, partially explained why some features were better predicted than others (Additional file 2: Figs. S3c,d). Similar to the within-set imputation, MIRTH reliably predicted the ranks of certain metabolites regardless of the target dataset (Fig. 3e). There were 218 reproducibly well-predicted metabolites, consisting of 56 amino acids, 13 carbohydrates, 10 cofactors and vitamins, 94 lipids, 15 nucleotides, 18 peptides, 10 xenobiotics, and 2 uncharacterized metabolites (Fig. 3e, Additional file 1: Table S5). Similar to the within-dataset imputation, reproducibly well-predicted metabolites were enriched for lipid subsets and amino acids (Fig. 3e). Tyrosine and palmitate, for instance, were reproducibly well-predicted with a median \(\rho\) of 0.892 and 0.845 respectively (Fig. 3f). These results outline a set of metabolites that are likely to be reliably imputed in a new target dataset if one were to be added to our existing aggregate set.
We also evaluated MIRTH’s across-set imputation performance on simulated metabolomics data and on data from the Cancer Cell Line Encyclopedia (CCLE) [21], the full results of which can be found in the Supplementary Information section. To summarize, applying MIRTH to the CCLE dataset resulted in comparable or better imputation of missing metabolites than with our 9 benchmarking datasets. For within- and across-dataset treatments, 100% and, typically, 94.2% of metabolites were well-predicted.
MIRTH embeddings separate tissue of origin and metabolite class
MIRTH involves a matrix factorization that associates both metabolite features and samples with a small number of embedding dimensions. In other contexts, analysis of features and samples in embedding space can be used to interpret the similarity between samples or the covariation of different features. We therefore applied MIRTH jointly to all data available, factorizing the complete aggregate set (\(\varvec{X}\)) of all nine datasets. The optimal number of dimensions for the factorization of the aggregate data matrix into embedding matrices \(\varvec{W}\) and \(\varvec{H}\) was 30 (Additional file 2: Fig. S4a). Weights were mostly small and right-skewed (Additional file 2: Fig. S4b). We used UMAP to visualize the sample and feature embedding spaces [22].
In sample embedding space, some clustering occurred by tissue of origin, with samples from the three kidney cancer datasets, RC12, RC18, and RC3, overlapping (Fig. 4a). COAD samples also separated from other tissues of origin. Interestingly, PrCa samples separated into three distinct clusters, raising the possibility of a latent batch effect, i.e., that the PrCa dataset consists of three sub-datasets that are not preprocessed individually by MIRTH. Definition between other tissue types, i.e. between BrCa1 & BrCa2, PaCa, and HCC, was less discernible. Tumor and normal samples from the same dataset also separated in embedding space along UMAP axis 2 (Fig. 4b).
Dimensionality reduction of the feature embedding matrix also revealed separation between certain metabolite classes, in particular of peptides and lipids from the rest of the measured features (Fig. 4c). The outlying peptide features predominantly represented dipeptides (Additional file 2: Fig. S4c). To determine whether individual embedding vectors were associated with functionally-related groups of metabolites, we performed a Fisher’s exact test for enrichment of a given metabolic pathway in each embedding vector after setting a cutoff value above which a feature was considered to be appreciably weighted (here, \(weight=0.2\)). This analysis was limited by statistical power, due to the relatively small number of metabolites in each annotated pathway. Nevertheless, the analysis identified enrichment of certain metabolite classes across multiple embedding dimensions, including dipeptides, sphingomyleins, diacylglycerols, and lysolipids (Fig. 4d).
MIRTH imputes metabolites across MS ionization modes
Mass spectrometry-based metabolomics can be conducted in positive or negative ionization modes, which allows for quantification of metabolites that are more amenable to acquiring a positive or negative charge [23,24,25]. We devised an experiment to assess the viability of predicting positive-mode measurements from negative-mode ones (or vice-versa), using a dataset where samples were profiled in both modes [24].
This dataset consisted of 448 features across 638 samples. Of the 241 quantified metabolites, 191 were measured in both positive and negative modes (accounting for 398 features due to redundancies). Of the remaining metabolites, 24 were measured only in positive mode and 16 were measured only in negative mode. We devised a test scenario for MIRTH whereby positive- or negative-mode measurements were completely masked from half the samples. This simulated a scenario where half of the samples were measured in both modes and the remaining samples were measured in just one mode (Fig. 5a). Imputation performance was assessed on metabolites only measured in one mode across 200 trials with a different set of samples chosen for masking each time. All non-overlapping metabolites were well-predicted (\(\rho >0, q<0.05\) in \(> 90\%\) of trials). Overall, negative-mode features were predicted with a higher \(\rho\) than positive-mode features (Fig. 5b), perhaps due to the greater reproducibility of the positive-mode measurements on which negative-mode predictions were based [24]. Predictions for glyceraldehyde-3-phosphate, cadaverine, and putrescine - features measured only in positive mode - were notably accurate, with median \(\rho\) values of 0.765, 0.788, and 0.777, respectively (Fig. 5c, Additional file 1: Table S6). Likewise, aconitate, carbamoyl phosphate, and riboflavin were among the best-predicted negative-mode features, with median \(\rho\) values of 0.901, 0.888, and 0.887, respectively (Fig. 5c, Additional file 1: Table S6). Once again, these results indicate that MIRTH can impute the ranks of metabolites that were entirely unmeasured by leveraging latent information in metabolomics data—in this case, in one ionization mode.