 Method
 Open access
 Published:
SCADIE: simultaneous estimation of cell type proportions and cell typespecific gene expressions using SCADbased iterative estimating procedure
Genome Biology volume 23, Article number: 129 (2022)
Abstract
A challenge in bulk gene differential expression analysis is to differentiate changes due to cell typespecific gene expression and cell type proportions. SCADIE is an iterative algorithm that simultaneously estimates cell typespecific gene expression profiles and cell type proportions, and performs cell typespecific differential expression analysis at the group level. Through its unique penalty and objective function, SCADIE more accurately identifies cell typespecific differentially expressed genes than existing methods, including those that may be missed from single cell RNASeq data. SCADIE has robust performance with respect to the choice of deconvolution methods and the sources and quality of input data.
Background
The past three decades have seen rapid development in gene expression analysis using microarray and sequencing technologies, where bulk samples are analyzed to answer specific biological questions, e.g., to identify genes with different expression levels between cancer samples versus controls. Because bulk samples contain many distinct cell types, such analyses only provide limited granularity. Most differential expression analyses on bulk samples often assume that the measured gene expression is from the primary cell type, e.g., tumor cell in bulk tumor sample.
Recent progresses in singlecell RNAsequencing (scRNAseq) techniques have demonstrated substantial heterogeneity in bulk samples. However due to the high cost and complexity for scRNAseq, most available data have remained to be from bulk samples. To make better use of bulk sample data, many in silico deconvolution methods have been proposed to infer cell type proportions from bulk data. Most deconvolution methods assume that the observed bulk gene expression profile is a convex mixture of celltype specific gene expression profiles, i.e.,
Here Y is the bulk gene expression matrix with m genes and n samples, W is the cell typespecific gene expression matrix of the k component cell types, and each column in H represents the cell type proportions for the corresponding bulk sample. All entries in these matrices are nonnegative, which is indicated by the “+” sign in the notations.
The principle behind the designs of most existing deconvolution methods is to utilize genes that have distinct expression levels across cell types to infer cell type proportions. To this end, some methods curate a signature matrix \(\underline {W} \in R^{m_{sub}\times k}\) with only a subset of cell typespecific genes and gather their expression profiles either from pure cell types [24, 29, 41] or scRNAseq data [4]; others use all genes but assign higher weights to genes with more differentiating power to produce a weighted version \(\tilde {\bar {W}} \in R^{m \times k}\) [42]. Both genres of methods then solve the constraint regression problem specified in the following Eqs. 2(3) with a variety of techniques [24, 37, 40, 42].
Although enormous insights on cell type proportion changes have been drawn from the applications of these deconvolution methods, most of these downstream analyses were performed under the scheme of single signature matrix, i.e., the same signature matrix was used for different groups of bulk data. In real data analyses, a more appropriate model would be that the observed differences in the bulk samples result from not only cell type compositional changes, but also from changes in cell typespecific gene expression profiles. In mathematical terms, it is Y_{1}=W_{1}H_{1} and Y_{2}=W_{2}H_{2}, when an analysis is performed assuming W_{1}=W_{2}, it intrinsically overattributes changes to cell type proportion changes.
In this article, we aim to simultaneously estimate groupspecific Ws and Hs in a twogroup comparison setting, thus to accurately infer cell typespecific differentially expressed genes (DEGs) as well as cell type proportion changes. To this end, we present a smoothly clipped absolute deviationbased (SCAD) iterative estimation (SCADIE) framework that can address this challenging problem.
The SCAD penalty and weighted ℓ_{1} penalty using the derivative of SCAD are widely used in the penalization methods [7, 21]. SCAD is defined as
where a,ζ_{n}>0 are parameters to be tuned. SCAD can be viewed as a hybrid of ℓ_{0} and ℓ_{1} regularizers in the sense that it resembles the ℓ_{1} norm in a neighborhood of the origin, but stabilizes to constant at larger values [21]. Although the nonconvexity of the SCAD leads to the nonconvex optimization problem, various empirical studies have shown that it often produces estimators with smaller estimation error than the estimators via the convex ℓ_{1} penalty.
At high level, SCADIE is built on existing supervised deconvolution methods. It takes bulk gene expression along with a common signature matrix or initial cell type proportions as input and then estimates group specific Ws and Hs. Its underlying assumption is that the cell typespecific W_{1},W_{2} are reasonably similar but not exactly the same, thus it is possible to initialize with the same W and use an iterative algorithm to search for optimal groupspecific Ws. Through comprehensive simulation and real data analyses, we demonstrate that SCADIE is capable of identifying cell typespecific DEGs between Ws while maintaining high accuracy in estimating Hs.
Results
Overview of SCADIE
The goal of the SCADIE framework is to estimate matched W_{1},H_{1} and W_{2},H_{2} from bulk data Y_{1},Y_{2} and then perform hypothesis tests to identify cell typespecific DEGs. In the most common deconvolution scenario, only Y_{1},Y_{2} and a shared signature matrix \(\underline {W}\) are provided, going from shared W_{sub} with only signature genes to groupspecific W_{1},W_{2} containing all genes remains to be a challenge.
In view of this challenge, we assume the following conditions hold in our estimation setting: First, most entries in W_{1} and W_{2} are not differentially expressed. This should especially hold for signature genes in \(\underline {W}_{1}\) and \(\underline {W}_{2}\). In the case where systematic changes in expression profile occur across all cell types, neither should we use shared W_{sub} for initialization, nor is SCADIE applicable. Second, the compositional cell types in W_{1},W_{2} should remain the same; otherwise, different models should be used for group 1 and group 2. Common applicable scenarios include tumor microenvironments between different responding groups, or different subtypes of the same disease.
With the above assumptions, we propose a smoothly clipped absolute deviation (SCAD) penaltybased iterative estimation procedure (SCADIE) that consists of the following steps:

1
Jointly estimate cell type proportions for both groups, obtaining \(H_{1}^{(0)}\) and \(H_{2}^{(0)}\). Any deconvolution method can be used in this step, and by joint estimation, we assume that both groups share the same signature gene matrix \(\underline {W}\) (Fig. 1 (a)). Users can also directly input \(H_{1}^{(0)}\) and \(H_{2}^{(0)}\) from other methods.

2
Obtain the initial estimates of \(W_{1}^{(0)}\) and \(W_{2}^{(0)}\) separately, then iteratively update Ws and Hs for a few rounds using nonnegative least squares (NNLS).

3
Derive the weight matrix E used for the main estimation procedure with the derivative of SCAD function [7] (Fig. 1 (b), section “Warmup run and weight matrix derivation”). We justify the choice of the SCAD derivative based penalty in the “Rationale behind SCAD penalty” section.

4
After completing steps 1 to 3, the main SCADIE estimation procedure consists of iteratively updating H_{1} and H_{2} using NNLS, respectively, and jointly updating W_{1} and W_{2} with a SCADbased matrix factorization. A parallel leaveoneout jackknife procedure is also run to obtain entrylevel standard error for all entries in W_{1}−W_{2}, and these standard errors can be summarized in a matrix \(\Sigma _{W_{1}W_{2}}\) (Fig. 1 (c) and section “Update W and H”).

5
For each pair of entries \(\hat {W}^{ij}_{1}\) and \(\hat {W}^{ij}_{2}\), we can calculate their zscore based on the standard errors of their difference \(\Sigma _{W_{1}W_{2}}^{ij}\) and then obtain a pvalue for testing differential expression.
The above procedure outputs \(\hat {H}_{1}, \hat {H}_{2},\hat {W}_{1},\hat {W}_{2}\), and \(\Sigma _{W_{1}W_{2}}\). Among these, \(\hat {H}_{1}\) and \(\hat {H}_{2}\) can be used for cell type proportion comparison; \(\hat {W}_{1},\hat {W}_{2}\), in combination with \(\Sigma _{W_{1}W_{2}}\), can be used to perform hypothesis testing for cell typespecific differential expression analysis between the two groups.
Simulation results
SCADIE maintains high cell proportion estimation accuracy
One concern of SCADIE’s algorithm is that its iterative procedure and full Wbased H update might result in reduced H estimation accuracy. To evaluate SCADIE’s performance on cell type proportion estimates, we benchmarked SCADIE against four deconvolution algorithms, including DWLS[40], CIBERSORTx[25], MuSiC[42], and a naive version of SCADIE using NNLS in updating W. We tested these four methods on a simulated data set, a pseudobulk data set[43], and a bulk microarray data with known cell type proportions[34]. We used two metrics to evaluate the accuracy of the estimated Hs: KL divergence and rootmeansquared error (RMSE). KL divergence is a suitable measure because it measures the distance between two sumto1 discrete distributions, which are the same format as cell proportions, while RMSE is widely used by previous deconvolution methods.
Additional File 1: Supplementary Fig. S3ab shows the final output H accuracy compared to CIBERSORTx, DWLS, MuSiC, and NNLSiteration, measured by KL Divergence and RMSE, respectively. The overall result patterns between these two metrics are very consistent. In terms of performance, SCADIE showed equal or better accuracies than the other four methods except in the mouse ISC pseudo bulk dataset, where MuSiC substantially outperformed all other methods. MuSiC is specifically tailored for single cell count data, and its significantly better performance in the single cell data suggests the same. Besides, although the NNLS iteration only differs from SCADIE in its Wupdate step, the results were substantially inferior. This was due to the uncontrolled changes in Ws over iteration and it highlights the importance of using the SCAD penalty (this will be discussed in detail in the section “Rationale behind SCAD penalty”). Further, the accuracy of the estimated H was stable over iterations (Additional File 1: Supplementary Fig. S3cde). Specifically, the results for the true bulk data were flat because there was only one group of samples; thus, there was no separate updating.
These results suggest that although our iterative procedure uses full W to update H, it would not negatively impact cell type proportion estimation. However, to accommodate potentially different needs, we also made signatureonly updates as an option in the SCADIE package.
SCADIE can better identify DEGs
One of SCADIE’s key features is to estimate conditionspecific W matrices. To demonstrate the efficacy of SCADIE’s framework for this feature, we next compared its performance with four other methods with similar functions.
The most straightforward way to estimate W is by solving Y^{T}=H^{T}W^{T} using NNLS. We implemented this method in our SCADIE package and labeled it as “NNLS”. A similar technique using ordinary regression was adopted in a microarray deconvolution method called csSAM from [34]. A recently proposed statistical framework named TOAST that aims at performing hypothesis testing for cell typespecific gene expression [16] is also included. Finally, CIBERSORTx has a high resolution mode for samplespecific W estimate, whose results can also be used for DEG analysis.
Here we benchmarked SCADIE against the above four methods on one simulated dataset and two pseudobulk datasets generated from scRNAseq data. Because these were simulated data with known DEG statuses, we were able to measure the truepositive and falsepositive rates (see the “Methods” section for more details). We plot the truepositive rates against falsepositive rates over a range of pvalues (from 10^{−12} to 0.5) for each method on each dataset in Fig. 2.
As shown in Fig. 2, SCADIE performed the best for all three datasets in identifying both upDEGs and downDEGs. Specifically, SCADIE outperformed CIBERSORTx, csSAM, and TOAST in both truepositive identification and falsepositive control. It should be noted that since CIBERSORTx’ high resolution mode is not designed to impute groupspecific gene expression levels for whole transcriptome, only a small fraction of Ws was output from it, and the missing results for most genes were reflected in the overall low truepositive and falsepositive rates. When compared to the NNLS Wupdate, SCADIE showed similar power in truepositive identification, but better falsepositive controls. This is consistent with our understanding of SCADpenalty’s advantage, which is also shown in simulation results from Fig. S2 (where false positives measured by PPV): as separate NNLS Wupdate may cause too much divergence between W_{1} and W_{2} over iterations, leading to more false positive results.
SCADIE can improve the estimates from other methods
We next asked the question whether the initial results from CIBERSOTx and csSAM can be improved via SCADIE’s iterative procedure. To investigate this, we initialized the SCADIE algorithm with output from CIBERSORTx and csSAM on the same three datasets above and used SCADIE to iteratively estimate Ws and Hs, followed by DEG analysis with SCADIE’s framework. TOAST was not included in this analysis because it only performs hypothesis test without providing point estimates for Ws.
As can be seen from Fig. 3, the accuracy of DEG improved in 11 out of 12 cases (including both down and upDEGs) through this scheme. This demonstrates the efficacy of SCADIE’s iterative procedure in improving DEG identification accuracy.
Robustness with respect to ζ _{n}
Since the parameter ζ_{n} of the SCAD as in (6) (see the “Methods” section at the end) plays a crucial role in defining similarity penalty via SCAD, it is important to examine the robustness of SCADIE with respect to the choice of ζ_{n}. To this end, we performed comprehensive sensitivity analyses on both simulation and real data for a wide range of ζ_{n} (Additional File 1: Supplementary section S3.2). The results suggest that when ζ_{n} is within a reasonable range (from 1 to 8), SCADIE’s output is highly robust in terms of H estimates, W estimates, and the DEG identifications (see Additional File 1: Supplementary section S3.2 for more details).
SCADIE’s performance on real data
In the previous sections, we examined SCADIE’s performance extensively through simulation and pseudobulk data. Compared to simulated data, real datasets rarely come with ground truth cell type proportions nor cell typespecific gene expression. In addition, biopsy heterogeneity and platform/technical variation present huge uncertainty in estimation outcome.
To evaluate SCADIE’s performance on real datasets, we applied SCADIE on four bulk datasets with distinct features, from microarray to postmortem RNASeq. Due to the lack of groundtruth, we primarily evaluated from the following four aspects: 1. Can SCADIE identify biologically meaningful cell type proportion changes? 2. Can SCADIE identify known cell typespecific DEGs? 3. For DEGs identified from SCADIE, are they associated with known biological processes? and 4. Can the iterative procedure improve estimation accuracy?
SCADIE accurately infers cell type proportions and cell typespecific genes in chronic obstructive pulmonary disease
Chronic obstructive pulmonary disease (COPD) is a chronic lung disease and a leading cause of death. Many efforts have been made to profile the transcriptomes from COPD patients [1, 14, 28, 32]. To assess SCAIDE’s performance on COPD data, we derived signature matrix from a COPD single cell dataset [1] and performed deconvolution on an independent bulk data set [14] with both COPD and control samples (98 COPD samples; 91 control samples), for five major cell types (stromal, myeloid, lymphoid, epithelial, and endothelial) as clustered in [32].
Although COPD causes pathological changes in several myeloid, epithelial, and endothelial cell types, previous studies did not find any systematic changes in cell type proportions associated with COPD [28, 32]. Reasons for this include the high heterogeneity of disease [26], as well as high variability in cell type compositions across specimens, which makes it difficult to identify consistent patterns. Cell type proportions estimated from SCADIE suggest similar pattern, where the mean cell type proportions are consistent between the COPD and control groups (Fig. 4a), while individual compositions varied across samples (Fig. 4b). The epithelial proportion varied most, as it is not only associated with biopsy spatial location, but also with disease severity [39].
We next evaluated SCADIE’s performance in identifying cell typespecific DEGs from bulk data. “Ground truth” cell typespecific DEGs were first obtained by performing differential expression analysis on the COPDControl single cell data cohort[32] for the five major cell types. Their log2 fold changes as well as p values are shown in volcano plots in Fig. 5a to e. Across all cell types, more than 60% of the single cellidentified DEGs showed concordant directional changes from SCADIE output. In terms of significant DEGs, 9–33% of single cell identified DEGs were also inferred to be significant DEGs by SCADIE from bulk data (Additional File 1: Supplementary Fig. S4). Since there were only fewer than half of single cell DEGs replicated in bulk data, we next asked the question of whether it was due to data heterogeneity or method deficiency. To this end, we compared SCADIE with TOAST in terms of correct direction percentage (percentage of single cell derived cell typespecific DEGs that have same directional change from SCADIE or TOAST) and correct significant DEGs percentage (percentage of single cell derived cell typespecific DEGs that are also identified significant from SCADIE or TOAST). SCADIE consistently outperformed TOAST by significant margins in both aspects (see Fig. 5f). This result suggests that there is indeed significant heterogeneity between these two unrelated single cell and bulk data cohorts, and SCADIE outperformed existing method even under this noisy circumstance. Here we only included TOAST in comparisons because csSAM only works on microarray data, while CIBERSORTx does not infer cell typespecific DEGs at the whole transcriptome level.
Note that from Supplementary Fig. S4 there is a substantial number of SCADIEinferred DEGs that were not present in single cell DEGs; it is important to find out whether the large proportion of nonoverlapping DEGs was due to false positives. To this regard, we ran Gene Set Enrichment Analysis (GSEA) on single cell and SCADIEderived DEGs separately and compared their top enriched pathways. Results from Additional File 1: Supplementary Fig. S5 show that the top pathways from single cell and SCADIE were significantly more overlapped than that expected by chance, indicating that the underlying biological signals do share consistent patterns even the actual DEG sets differ.
Finally, we asked the question whether SCADIE can infer DEGs that were not identified in single cell data. To answer this question, we looked into the top 10 SCADIEinferred down DEGs in each cell type. Among the 50 genes, only 8 of them were also identified from single cell DE analysis. Among the 42 DEGs unique to SCADIE, 39 were also present in the single cell dataset, and 33 showed concordant cell typespecific directional changes as SCADIE. This finding suggests that the cell typespecific expression changes from SCADIE were likely true signals, and the main reason they were missed in single cell data was the limited power due to data sparsity (See Additional File 2: Supplementary Table S1 for details). In addition, literature search showed that 19 out of the 42 SCADIEunique genes were associated with COPD from previous studies (See Additional File 2: Supplementary Table S1 for details). This suggests that SCADIE is capable of mining DEG information that is too sparse to be identified in single cell data.
In summary, SCADIE is not only capable of identifying known cell type proportion patterns and cell typespecific DEGs, it can also infer DEGs that may be missed by single cell data due to the high noise and dropout in single cell data.
SCADIE reveals biologically meaningful composition and expression differences in DLBLC subtypes
Diffuse large B cell lymphoma (DLBCL) is the most common type of nonHodgkin lymphoma and can be classified into two main subtypes based on gene expression: germinal centerlike (GCB) and activated B celllike (ABC) DLBCL [2, 19]. Traditionally DLBCL patients were treated with cyclophosphamide, doxorubicin, vincristine, and prednisone (CHOP), and in recent years CHOP in combination with immunotherapeutic drug rituximab (RCHOP) has gained more popularity due to its benefit in clinical outcome [33].
Here we analyzed a dataset consisting of 414 samples from both subtypes who received either CHOP or RCHOP [15]. Inspired by a previous analysis from [25], we first asked the question whether the DEGs between GCB and ABC can be attributed mostly to B cells. To this regard, we applied SCADIE to all samples and compared the cell typespecific DEGs by treatment groups using the same cell type characterization and signature matrix from [25]. From Fig. 6ab, we can observe distinct DEG composition patterns between CHOP and RCHOP patients: in the CHOP group, DEGs of GCB are dominantly by genes from B cells, and those of ABC are also from B cells or activated B cells (plasma cell), while in the RCHOP group, there is a substantial reduction in B cell DEGs and most DEGs are instead from T cells. While this may seem counterintuitive at first, given the rituximab’s nature as an antibody against B cells, it is consistent with several lines of previous studies [18, 30] that rituximab reduces B cell proportion substantially in the GCB group (Fig. 6d) and alters T cells gene expressions more than B cells.
Next, we compiled a list of validated markers for GCB and ABC from published studies [2, 10] (see a list of all genes at [38]) and examined if SCADIE could accurately infer their distinct expressions. The log2 fold changes in the estimated B cell expressions between ABC and GCB are shown in Fig.6c, where we can see that not only all markers show higher expression levels in their corresponding cell types; their estimated fold changes in B cell are also higher compared to those in the bulk data.
Unlike previous analyses, although in real data we do not have comprehensive differential expression ground truth for all genes, we are able to demonstrate that the cell type proportions and cell typespecific DEGs inferred from SCADIE align well with the literature.
SCADIE improves ADassociated cell typespecific DEG estimation
Alzheimer’s disease (AD) is a leading threat to global elder population and has been under intensive research over decades. However, gene expression analyses remain challenging due to the difficulty in accessing samples and the low quality of postmortem RNASeq samples. To examine SCADIE’s performance on these challenging data, we applied SCADIE to an AD bulk RNASeq dataset where the cell type proportions had been measured by immunohistochemistry (IHC) [27]. The dataset contains RNASeq samples from 31 healthy individuals and 18 AD patients. We initialized Ws with the IHCestimated proportions and ran the iterative procedure subsequently. Due to different cell type categorizations between the IHC study [27] and DEG study [22], we only examined the three overlapped cell types: astrocyte, oligodendrocyte, and microglia (see the section “Methods—Real data processing and analyses” for details.)
To evaluate SCADIE’s performance, we obtained a list of cell typespecific DEGs from [22] and tested if SCADIE could correctly identify them. Figure 7 shows the estimated expression levels for those known up and downDEGs in the initial and final Ws, respectively. In the initial Ws, 34%, 20%, and 14% of known DEGs were estimated to have foldchanges in the opposite directions (i.e., up(down)DEGs were estimated to be down(up)regulated) with  log2FC>1) in astrocyte, oligodendrocyte, and microglia. After SCADIE procedure, these ratios reduced to 5%, 9%, and 7%, respectively (Fig. 7).
These results suggest that, although low initialization quality might limit the performance of SCADIE, its iterative procedure could still improve and recover the DEGs’ directional signals.
DEG identification under poor initialization and limited sample size
Although SCADIE can be well tuned to identify DEGs, it may run the risk of increasing error when the initial W and H deviate from ground truth. In addition, the jackknife method might perform poorly when sample size is relatively small compared to the number of cell types. To study the potential impacts of these issues, we first benchmarked SCADIE’s performances under different initial H accuracy levels and sample sizes. SCADIE’s accuracy decreases with both the initial H’s accuracy (Additional File 1: Supplementary Fig. S6) and the reduction in sample size (Additional File 1: Supplementary section S3.4), while H accuracy has a larger effect. This is in line with our expectation because H directly affects point estimates of Ws while sample size’s effect is more on standard error.
On real data performance, we benchmarked SCADIE on a previously studied follicular lymphoma (FL) bulk dataset with both issues present. FL is a common type of B cell lymphoma and can be classified into two subtypes by the presence of CREBBP mutation [9]. We reanalyzed a cohort of 26 samples whose genotypes are known (16 CREBBPmutant and 10 CREBBPwild type) from [25] and benchmarked against a list of known DEGs identified by [9].
We inferred the initial H via deconvolution using CIBERSORTx’s LM22 signature matrix. Since this H contains 22 cell types while our sample number and leaveoneout jackknife require the number of cell type not exceeding 9, we merged H into eight main cell types based on their abundances and similarities. Then, the initial Ws were obtained for CREBBPmutant and CREBBPwild type groups respectively. To measure the qualities of the initial Ws, we made the volcano plot for known DEGs in initial Ws. The results show few correctly identified downDEGs with some false positives (Fig. 8a “SCADIE initial W”) in the initial Ws. After iteration, the final Ws correctly identified more DEGs in both directions (upregulated and downregulated, see Fig. 8a “SCADIE final W”). This is consistent with the table in Fig. 8b.
We next compared SCADIE with CIBERSORTx’s high resolution mode and TOAST. Since CIBERSORTx cannot perform wholetranscriptome imputation, we only input the 467 known DEGs for imputation. The results suggest that although SCADIE’s estimations had more directional errors, its overall power (Fig. 8a) and number of correctly identified DEGs (Fig. 8b) exceeded CIBERSORTx. Outcome from TOAST is very similar to SCADIE’s initial outcome (Fig. 8a), but SCADIE’s final Ws can identify more DEGs in both directions, though at a cost of higher false positives.
The above results suggest that incorrect initialization and limited sample size do lower the estimation quality of SCADIE; however, SCADIE can still improve and maintain competitive performance in these scenarios. It is recommended to consider SCADIE’s applicability since poor initialization quality or limited sample size is common in real applications. Though it is usually not possible to evaluate initial H’s quality, we recommend having sample size at least 1.5x the number of cell types when performing DEG identification with SCADIE.
Discussion
Recent years have seen rapid developments of technologies and computational methods to enable researchers to identify genes with different expression levels across conditions, either through bulk samples or single cells. When a gene is differentially expressed between two groups of bulk samples, should we attribute it to cell type proportions changes, or cell typespecific gene expression changes?
To this end, we have developed SCADIE, an estimation framework to simultaneously estimate both cell type proportions H and cell typespecific gene expressions W. SCADIE features an iterative update procedure of W and H, a SCADbased penalty for similarity control, as well as a jackknifebased standard error estimation. This framework enjoys several advantages over existing methods. First of all, compared to scRNAseq, SCADIE’s cell type proportion estimate is not affected by technical dropouts. Second, SCADIE has shown better sensitivity and specificity compared to existing methods. In addition, SCADIE also has certain advantages in functionality compared to other methods mentioned in this paper: when compared to CIBERSORTx, SCADIE’s design can estimate and test all the genes to identify DEGs (while CIBERSORTx only imputes a subset of given genes); it can also accommodate more than microarray data and output matched cell proportions (while csSAM only works on microarray data); when compared to TOAST, SCADIE can not only perform hypothesis testing for DEGs, but produce matched point estimates in the meantime. It is worth mentioning that recently two additional methods, CellR [5] and CARseq [12], have been proposed to specifically perform cell typespecific DEG identification based on RNASeq raw counts data. We were unable to include them in our benchmarking because most of the data in our study were only available either as normalized RNASeq or microarray, but they might have superior performance in heavily normalized brain tissue data where SCADIE’s performance was not ideal.
Through extensive comparisons using simulated and real datasets, we demonstrated that SCADIE can not only maintain high cell proportion estimate accuracy, it can also effectively identify cell typespecific DEGs. In the cases where initialization quality is poor or with barely sufficient sample size, SCADIE’s performance will be affected; however, its iteration procedure can still recover signals to certain extent. We have provided guidelines regarding SCADIE’s applicability under these extreme situations. Besides, its performance is highly robust with respect to a range of parameter values. These features all ensure that SCADIE can be broadly applicable to different settings.
Although SCADIE enjoys several advantages over existing methods, there are several aspects that can be further explored. First, the proposed NMF estimate may involve bias due to the nonnegative constraint and the penalization term. One can consider a debiased estimate to fix the potential bias of the proposed penalization method to improve inference. Second, although SCADIE can take initial input from any deconvolution algorithm, it only uses NNLS in its iterative Hupdate step. We may expand its modularity and enable it to fully take advantage of other deconvolution methods (e.g., DWLS for cases with rare cell types, and weighted NNLS [42] for single cell counts data). It shall also be noted that although SCADIE is built as a supervised deconvolution tool, it is also compatible with all the unsupervised deconvolution methods that only require bulk gene expression data [13, 31, 44] as long as they could provide initial W and H. Unsupervised methods are useful in situations of cell type discovery or lack of supervising information, but as there is no guarantee that their inferred cell types have onetoone mapping to actual cell types, annotating cell types remains a challenge. Third, we have shown that using the full W along with NNLS has provided robust and accurate H estimates over iteration, further investigation into the mechanism behind this may enlighten the simplification of deconvolution. Finally, we can improve the DE hypothesis testing procedure by incorporating more advanced DE techniques and better false discovery control into the framework.
Conclusions
Simultaneous estimation of cell type proportions and cell typespecific gene expressions from bulk gene expression data remains a challenge due to its nonidentifiable nature. In this article, through our proposed method SCADIE, we demonstrated that with reasonable assumptions on the similarity between group level cell typespecific gene expression profiles, proper design of objective function, and reasonable initial deconvolution accuracy, it is possible to infer cell type proportions along with cell typespecific gene expressions with robustness and high accuracy. Despite this progress, technical challenges including multigroup comparison, limited sample size, and poor initialization quality still remain to be further addressed in the future.
Methods
For a full list of notations, refer to Additional File 1: Supplementary section S1.
Rationale behind SCAD penalty
A key consideration in our proposed method is to maintain a proper dissimilarity level between W_{1} and W_{2}, where the true DEGs can be identified without introducing many falsepositive DEGs. In our simulation analyses, we found that keeping W_{1} and W_{2} similar by a ridge penalty did increase the accuracy in Hs (Additional File 1: Supplementary Fig. S1cd). However, the W accuracy and DEG identifying power are reduced (Additional File 1: Supplementary Figs. S1ab, S2). An intuitive explanation to this is that forcing W_{1} and W_{2} prevents them from being too divergent with each other, thus increasing H accuracy; however, this penalty also makes W_{1} and W_{2} oversimilar, thus reducing its sensitivity substantially (Fig. S2).
To combine the advantages with and without ridge penalty, we adopt the SCADbased penalty that imposes entryspecific dissimilarity penalty based on the prior difference between W_{1} and W_{2}: if the separately estimated \(\bar {W}_{1}\) and \(\bar {W}_{2}\) have similar entries for the (k,j) component, we put a high but bounded penalty on their difference in our procedure, whereas if the components are quite different, we penalize less on the difference. Specifically, we penalize \(\sum _{k,j} E_{jk} \left ([W_{1}]^{T}_{jk}  [W_{2}]^{T}_{jk}\right)^{2}\), where \(E_{jk} = P'_{\zeta _{n}}\left \{ \left [\bar {W}_{1}^{T}  \bar {W}^{T}_{2}\right ]^{2}_{jk}\right \}\) and \(P^{\prime }_{\zeta _{n}}(\cdot)\) is the derivative of the SCAD penalty function as defined in (6). To the best of our knowledge, weighted ℓ_{2} penalty using the derivative of SCAD has not been investigated. By doing so, we can incorporate the structure pattern in \(\bar {W}_{1}  \bar {W}_{2}\) when estimating W_{1} and W_{2} in a more adaptive manner (see the sections “Warmup run and weight matrix derivation” and “Update W and H” for more details).
Theoretical analyses suggest that this novel penalty structure can achieve high accuracy in the estimation of W and H (Additional File 1: Supplementary section S3.1). In addition, we performed simulations comparing sensitivity, specificity, and positive predictive rate (PPV) by using the following: (1) independent W_{1},W_{2} updates with NNLS, (2) ridge regression imposing similarity between W_{1} and W_{2}, and (3) SCADpenalty. The results suggest that SCADpenalty can keep sensitivity high while in the meantime better control false positives through its precise penalty (Figs. S1 and S2).
Initialization
For SCADIE’s initialization, users can either input bulk matrices along with corresponding Hs, or input bulk and signature matrices to perform generic deconvolution using NNLS or DWLS[40]. Given that many deconvolution methods have been proposed to accommodate various data and conditions, we recommend users provide bulk matrix along with the best initial Hs available.
In real applications, obtaining accurate deconvolution results is often difficult. We have shown that although poor initialization does affect SCADIE’s performance, its iterative procedure could recover the signals and produce decent results (see the section “SCADIE improves ADassociated cell typespecific DEG estimation” and Fig. 8).
Warmup run and weight matrix derivation
The proposed penalty requires a prior weight matrix E of the same dimension as W, which provides prior information on how likely certain entries differ between two Ws.
To obtain E, we perform a few steps of “warmup” iterations:

1
We first obtain full \(W_{1}^{(0)}\) and \(W_{2}^{(0)}\) from solving the NNLS problem \(\min _{W_{i} \in \mathbb {R}_{+}^{m \times k}} \left \Y_{i}^{T}  H_{i}^{T}W_{i}^{T} \right \_{F}^{2}, i = 1,2\).

2
The Hs are subsequently updated by NNLS using full Ys and Ws.

3
Repeat steps 1–2 for a few rounds (default 5 rounds, but can be manually changed), and plug in the output \(\hat {W}_{1}^{NNLS},\hat {W}_{2}^{NNLS}\) to Eq. 6 in the section “Update W and H” to obtain the weight matrix E.
Update W and H
In this subsection, we provide details of updating W and H, which corresponds to step (b) in Fig. 1.
Note that for the simplicity of notation, we intrinsically assume the sample sizes of group 1 and group 2 are both n throughout analyses in this paper, i.e., Y_{1},Y_{2}∈R^{(m×n)+} and H_{1},H_{2}∈R^{(k×n)+}. Although this is not the case in most real applications, making this assumption does not affect either our theoretic derivation or most implementations. For scenarios where n_{1}≠n_{2} makes a difference (e.g., jackknife estimation), we will discuss our handling of this issue specifically.
For the update of H in the main iterative procedure, we simply use NNLS to solve for the problems below for the two groups separately:
Note that here we use the full Ws instead of signature genes only, and this alteration still produces good estimation accuracy (see the section “SCADIE maintains high cell proportion estimation accuracy”).
To update W_{1} and W_{2} simultaneously, we consider the following weightedregressionbased optimization:
where \(\W_{1}^{T}W_{2}^{T}\_{E,F}^{2} = \sum _{j,k} E_{jk} \left (W_{1}^{T}W_{2}^{T} \right)_{jk}^{2}\). Note that (5) is not separable with respect to W_{1} and W_{2} due to the third term, from which the information is shared between the two groups.
Let \(\tilde {W} = [{W}_{1}, {W}_{2}]^{T} \in \mathbb {R}^{(2k \times m)+}\). We update each column of \(\tilde {W}\) separately. Specifically, updating the jth column of \(\tilde {W}\) is equivalent to solving
where
where A_{j} represents the jth column of a matrix A. Then the minimizer \(\hat {x}^{(j)}\) corresponds to the jth column of \(\tilde {W}\).
For the weight matrix \(E \in \mathbb {R}^{k \times m}\), we set \(E_{jk} = P'_{\zeta _{n}}\left \{ \left [\bar {W}^{T}_{1}  \bar {W}_{2}^{T}\right ]^{2}_{jk}\right \}\), where \(P^{\prime }_{\zeta _{n}}(\cdot)\) is the derivative of the SCAD penalty function, i.e.,
with a regularization parameter ζ_{n}≥0, and \(\bar {W}_{1}\) and \(\bar {W}_{2}\) are the separate estimates obtained from the previous step. We set a=3.7 as suggested by [8], which is known to be optimal based on crossvalidated empirical studies [20]. For the choice of parameter ζ_{n}, we keep ζ_{n}=4 throughout all our analyses. We also demonstrate that SCADIE output is robust with respect to ζ_{n} in terms of Hs, Ws, and DEG identification, if ζ_{n} is in an appropriate range. See the section “Robustness with respect to ζ_{n}” and Additional File 1: Supplementary section S3.2 for more details.
The proposed weightedregressionbased optimization (5) with the weight based on SCAD derivative function can be understood as the onestep local linear approximation of the following SCAD penalty [45]:
Hence, the proposed weighted regressionbased optimization penalizes differently based on the range of \([W_{1}^{T}  W_{2}^{T}]_{jk}\). Compared with the SCAD penalty, the proposed method can be efficiently computed and enjoys the same theoretical properties known as “oracle properties” [45].
The SCAD penalty and weighted ℓ_{1} penalty using the derivative of SCAD are widely used in the penalization methods [7, 21]. SCAD enjoys variable selection consistency and unbiasedness property by imposing no weights on the signal that is beyond 3.7ζ_{n}, which may reduce bias. By using the weighted Frobenius penalization, we may obtain less biased estimate compared to that of Frobenius penalization. In theory, we derive the estimation error bound of the proposed method under certain regularity conditions if ζ_{n} is in an appropriate range; see Additional File 1: Supplementary section S3.1 for more details.
Jackknife standard error estimation
The above iterative procedure only provides us with point estimates of W_{1},W_{2},H_{1}, and H_{2}. For differential expression analysis between W_{1} and W_{2}, we use a leaveoneout jackknife procedure to estimate standard error for W_{1}−W_{2}.
For standard error estimation in most real data, Y_{1} and Y_{2} have different sample sizes (number of columns). We denote n_{1} and n_{2} as column numbers of Y_{1},Y_{2}, respectively, and let n_{0}= min{n_{1},n_{2}}. Then we run the iterative procedure n_{0} times—each time leaving one sample out from Y_{1},H_{1} and Y_{2},H_{2}, respectively; this will give us n_{0} different \(\hat {W}_{1}\hat {W}_{2}\)s, and elementwise jackknife standard error estimates can be obtained by: \(\sigma ^{jackknife}_{ij} = \frac {n_{0}1}{\sqrt {n_{0} }}\hat {\sigma }_{ij}, i=1,...,m; j = 1,...,k\), where \(\hat {\sigma }_{ij}\) is the sample standard deviation for the ijth entry from the \(n_{0} \hat {W}_{1}\hat {W}_{2}\)s [23]. With these, we can then conduct elementwise hypothesis testing to identify DEGs between W_{1} and W_{2}.
We compared the jackknife standard error estimates with those from bootstrap, and both led to highly consistent results (Additional File 1: Supplementary section S3.6). In the R package implementation of SCADIE, bootstrap is also available for standard error estimation. However, we recommend using jackknife for general purposes to avoid the potential singularity issue arising from bootstrap’s sampling with replacement; see Additional File 1: Supplementary section S3.6 for a detailed explanation.
Simulation models and benchmarking
Simulation datasets
The simulation data used in the section “SCADIE maintains high cell proportion estimation accuracy” to section “SCADIE can improve the estimates from other methods” were generated as follows: first, W_{1}∈R^{5000×5} was generated with all its entries following the lognormal distribution with mean 8 and standard deviation 3; then to generate W_{2}, 2.5% of the entries in W_{1} were upregulated to 1.5× or 2×, with another 2.5% downregulated to 0.67× or 0.5×; H_{1} and H_{2} were generated using two distinct Dirichlet distributions, each group consisting of 20 samples; bulk expression matrices Y were generated by W·H+ε, where ε is a Gaussian white noise matrix with sd=4, if negative entries were present after adding noise, these entries were reset back to 0.
For signature matrix generation, we first obtained \(\bar {W}=\frac {W_{1}+W_{2}}{2}\) and used the top 5% rows in terms of the largestentry/secondlargest entry ratio as signature gene rows. In H’s benchmarking, all methods support H estimation with bulk gene expression and signature matrix as input; since MuSiC only supports count data, we rounded the data before input into MuSiC.
Mouse ISC pseudobulk data set
The pseudobulk data used in the section “SCADIE maintains high cell proportion estimation accuracy” to section “SCADIE can improve the estimates from other methods” were generated as follows. We downloaded the original ISC scRNAseq data through GEO using accession number GSE92865. We clustered its 14 cell types into four major cell types: ISC, TA, Ent, and other, which was based on the tSNE results of the paper [43]. We then separated the scRNAseq data by treatment status, where the Fc treatment was group 1, the scFvDKK1c treatment was group 2, and the RSPO2 treatment was group 3. The corresponding W_{1},W_{2},W_{3} matrices were generated by averaging over all cells in each major cell type of the same treatment group. H_{1} to H_{3} were generated the same way as previous section, also with 20 samples in each group. Finally, Y_{1} to Y_{3} were generated by Y_{i}=W_{i}·H_{i}+ε,i=1,2,3, where ε is a Gaussian white noise matrix with sd=4.
Signature matrix was derived using the buildSignatureMatrixUsingSeurat function from the DWLS package [32] using all the single cells regardless of their treatment status.
Ground truth DEGs were derived by performing differential expression analysis for each cell type between group 2 and group 1 and group 3 and group 1, using the DEAnalysis function in the DWLS package [40].
In the sections “SCADIE can better identify DEGs” and “SCADIE can improve the estimates from other methods,” mouse ISC pseudobulk 1 dataset consists of group 1 and group 2 data, while the pseudobulk 2 dataset consists of group1 and group3 data.
Mouse bulk data set
In thesection “SCADIE maintains high cell proportion estimation accuracy,” we used a mouse brainliverlung mixture microarray dataset (referred as mouse bulk); the data were accessed through GEO using accession number GSE19830. Raw data were preprocessed with the affy package in R and normalized using the rma method. We used rma normalization to keep the data comparable to [34]. The signature matrix was generated using the DWLS package [40]. In H’s benchmarking, since MuSiC only supports count level data, we rounded the bulk matrices and signature gene matrices before input into MuSiC.
Real data processing and analyses
COPD single cell and bulk data
Raw scRNAseq data were obtained from [1]. Data preprocessing, quality control, and normalization were done using Seurat V3 [36] in R. The original data contained samples of control, COPD, and idiopathic pulmonary fibrosis (IPF), where cells from IPF samples were excluded in our analysis. There were 37 distinct cell types originally; we used the five major cell type clusters according to UMAP clustering from [32], and we did not further subdivide because (1) there are limited cell numbers in some clusters and (2) the correlated expression profiles of cell types within each cluster might introduce unwanted collinearity in W. “Groundtruth” DEGs were identified for each cell type between control and COPD using the DWLS package[32].
Signature matrices were generated using the CIBERSORTx [25] and DWLS packages [40]. The cell type proportion results from DWLS were considered better, and we proceeded with its signature matrix.
Bulk RNASeq data in FPKM was obtained through GEO with accession code GSE57148, the FPKM matrix was further transformed into log scale to accommodate the scale of signature matrix.
For benchmarking with TOAST, we input the bulk gene expression and the same initial Hs as SCADIE; the control and COPD groups were denoted as groups 0 and 1, respectively. We used the output effect size as indication of DEGs’ direction, and pvalue to identify significant DEGs.
For GSEA analysis, we first transformed the single cell and SCADIEderived DEGs into zscores. For single cell DEGs, the zscores were obtained from normal distribution quantiles of their pvalues, while for SCADIE output, since we know the point estimates and standard error estimates for all genes, we directly calculated their zscores. The zscore lists for both groups were sorted and input for preranked GSEA analysis using R’s fgsea package, and Molecular Signatures Database v7.4 database [17]. For the top enrichment analysis, pathways of each group were first ranked by their GSEA p values, then the top 5% or 10% were chosen and compared accordingly. The overlapping pvalues were calculated using a binomial distribution, where the model parameter N equals the total number of shared pathways, while p=0.01 for top 10% overlapping and p=0.0025 for top 5% overlapping under the null hypothesis.
DLBLC data
Raw bulk microarray data were obtained through GEO with accession number GSE10846. Raw data were preprocessed with affy package in R and normalized using mas5 method. Sample treatment and subtype information was retrieved using the GEOquery package. LM22 matrix from [25] was used for initial deconvolution.
Alzheimer’s disease data
Bulk postmortem RNASeq samples of prefrontal cortex were downloaded from the ROSMAP cohort [3]. We only kept the subset of 49 samples whose cell type proportion results were measured in [27]. The IHC results measured four major cell types (neuron, astrocyte, oligodendrocyte and microglia) without differentiating between excitatory and inhibitory neurons, while the cell typespecific DEGs from [22] did separate these two cell types. In this regard, we only included results for the three overlapping cell types.
Follicular lymphoma data
The raw bulk microarray data were accessed through GEO with accession number GSE127462; preprocessing and normalizing were performed the same as the above procedure for DLBLC. The initial deconvolution used LM22 matrix from [25] and we merged the cell types into the following 8 major groups (B cell, CD8 T CELL, CD4 T cell, NK, Monocyte/Macrophage, DC cell, Mast, Neutrophil) before inputting this updated H into SCADIE. For the DEGs comparison, unlike SCADIE’s whole transcriptome approach, we ran CIBERSORTx’s high resolution mode only inputting those known DEGs. For TOAST run, we input bulk gene expression and initial the same initial Hs as SCADIE and CIBERSORTx; WT group was denoted as group 0, and MT group was denoted as group 1. The output DEGs’ directions were determined by their effect sizes; pvalues were directly from output. Noted that in our analysis for COPD, we kept its original effect sizes as fold change measure, instead of trying to transform to log2 fold change.
Availability of data and materials
The code for SCADIE is available at https://github.com/tdw1221/SCADIE under MIT license and the source code used in this paper is available at Zenodo [38] under Creative Commons Attribution 4.0 International license. The original scRNAseq data used for generating pseudobulk data used in the section “SCADIE maintains high cell proportion estimation accuracy” to section “SCADIE can improve the estimates from other methods” can be accessed through GEO with accession number GSE92865. The original microarray data used in the section “SCADIE maintains high cell proportion estimation accuracy” can be accessed through GEO with accession number GSE19830. For COPD analyses, raw scRNAseq data can be obtained from [1], and bulk RNASeq data in FPKM can be obtained through GEO with accession number GSE57148. DLBLC data can be accessed through GEO with accession number GSE10846. For analyses performed in the section “SCADIE improves ADassociated cell typespecific DEG estimation,” bulk postmortem RNASeq samples can be downloaded from the ROSMAP cohort [3]. The raw bulk microarray data of follicular lymphoma in the section “DEG identification under poor initialization and limited sample size” can be accessed through GEO with accession number GSE12746.
References
Adams TS, Schupp JC, Poli S, Ayaub EA, Neumark N, Ahangari F, Chu SG, Raby BA, DeIuliis G, Januszyk M, et al., Vol. 6. Singlecell rnaseq reveals ectopic and aberrant lungresident cell populations in idiopathic pulmonary fibrosis; 2020, p. eaba1983.
Blenk S, Engelmann J, Weniger M, Schultz J, Dittrich M, Rosenwald A, MüllerHermelink HK, Müller T, Dandekar T. Germinal center b celllike (gcb) and activated b celllike (abc) type of diffuse large b cell lymphoma (dlbcl): analysis of molecular predictors, signatures, cell cycle state and patient survival. Cancer Informat. 2007; 3(117693510700300):004.
De Jager PL, Ma Y, McCabe C, Xu J, Vardarajan BN, Felsky D, Klein HU, White CC, Peters MA, Lodgson B, et al.A multiomic atlas of the human frontal cortex for aging and alzheimer’s disease research. Sci Data. 2018; 5(1):1–13.
Dong M, Thennavan A, Urrutia E, Li Y, Perou CM, Zou F, Jiang Y. Scdc: bulk gene expression deconvolution by multiple singlecell rna sequencing references. Brief Bioinform. 2021; 22(1):416–27.
Doostparast Torshizi A, Duan J, Wang K. A computational method for direct imputation of cell typespecific expression profiles and cellular compositions from bulktissue rnaseq in brain disorders. NAR Genomics Bioinforma. 2021; 3(2):lqab056.
Efron B. The jackknife, the bootstrap and other resampling plans. CBMSNSF Regional Conference Series in Applied Mathematics, Monograph 38. Philadelphia: SIAM; 1982.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001; 96(456):1348–60.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001; 96:1348–60.
Green MR, Kihira S, Liu CL, Nair RV, Salari R, Gentles AJ, Irish J, Stehr H, VicenteDueñas C, RomeroCamarero I, et al.Mutations in early follicular lymphoma progenitors are associated with suppressed antigen presentation. Proc Natl Acad Sci. 2015; 112(10):E1116–25.
Hardee J, Ouyang Z, Zhang Y, Kundaje A, Lacroute P, Snyder M. Stat3 targets suggest mechanisms of aggressive tumorigenesis in diffuse large bcell lymphoma. G3: Genes, Genomes. Genetics. 2013; 3(12):2173–85.
Ingram JM, Marsh MM. Projections onto convex cones in hilbert space. J Approx Theory. 1991; 64(3):343–50.
Jin C, Chen M, Lin DY, Sun W. Celltypeaware analysis of rnaseq data. Nat Comput Sci. 2021; 1(4):253–61.
Kang K, Meng Q, Shats I, Umbach DM, Li M, Li Y, Li X, Li L, Vol. 15. Cdseq: A novel complete deconvolution method for dissecting heterogeneous samples using gene expression data; 2019, p. 1007510.
Kim WJ, Lim JH, Lee JS, Lee SD, Kim JH, Oh YM. Comprehensive analysis of transcriptome sequencing data in the lung tissues of copd subjects. Int J Genomics. 2015; 2015.
Lenz G, Wright G, Dave S, Xiao W, Powell J, Zhao H, Xu W, Tan B, Goldschmidt N, Iqbal J, et al.Stromal gene signatures in largebcell lymphomas. N Engl J Med. 2008; 359(22):2313–23.
Li Z, Wu Z, Jin P, Wu H. Dissecting differential signals in highthroughput data from complex tissues. Bioinformatics. 2019; 35(20):3898–905.
Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The molecular signatures database hallmark gene set collection. Cell Syst. 2015; 1(6):417–25.
Linsley PS, Greenbaum CJ, Rosasco M, Presnell S, Herold KC, Dufort MJ. Elevated t cell levels in peripheral blood predict poor clinical response following rituximab treatment in newonset type 1 diabetes. Genes Immun. 2019; 20(4):293–307.
Liu R, Chen Z, Wang S, Zhao G, Gu Y, Han Q, Chen B. Screening of key genes associated with rchop immunochemotherapy and construction of a prognostic risk model in diffuse large bcell lymphoma. Mol Med Rep. 2019; 20(4):3679–90.
Loh P, Wainwright MJ. Regularized mestimators with nonconvexity: statistical and algorithmic theory for local optima. J Mach Learn Res. 2014; 1:1–56.
Loh PL, Wainwright MJ. Support recovery without incoherence: a case for nonconvex regularization. Ann Stat. 2017; 45(6):2455–82.
Mathys H, DavilaVelderrain J, Peng Z, Gao F, Mohammadi S, Young JZ, Menon M, He L, Abdurrob F, Jiang X, et al.Singlecell transcriptomic analysis of Alzheimer’s disease. Nature. 2019; 570(7761):332–37.
McIntosh A. The jackknife estimation method. arXiv preprint arXiv. 2016; 1606:00497.
Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, Hoang CD, Diehn M, Alizadeh AA. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods. 2015; 12(5):453–57.
Newman AM, Steen CB, Liu CL, Gentles AJ, Chaudhuri AA, Scherer F, Khodadoust MS, Esfahani MS, Luca BA, Steiner D, et al.Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat Biotechnol. 2019; 37(7):773–82.
O’donnell R, Breen D, Wilson S, Djukanovic R. Inflammatory cells in the airways in copd. Thorax. 2006; 61(5):448–54.
Patrick E, Taga M, Ergun A, Ng B, Casazza W, Cimpean M, Yung C, Schneider JA, Bennett DA, Gaiteri C, et al.Deconvolving the contributions of celltype heterogeneity on cortical gene expression. PLoS Comput Biol. 2020; 16(8):e1008,120.
Polverino F, Celli BR, Owen CA. Copd as an endothelial disorder: endothelial injury linking lesions in the lungs and other organs?(2017 grover conference series). Pulm Circ. 2018; 8(1):2045894018758,528.
Racle J, de Jonge K, Baumgaertner P, Speiser DE, Gfeller D. Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data. elife. 2017; 6:e26,476.
Ramwadhdoebe TH, van Baarsen LG, Boumans MJ, Bruijnen ST, Safy M, Berger FH, Semmelink JF, van der Laken CJ, Gerlag DM, Thurlings RM, et al.Effect of rituximab treatment on t and b cell subsets in lymph node biopsies of patients with rheumatoid arthritis. Rheumatology. 2019; 58(6):1075–85.
Repsilber D, Kern S, Telaar A, Walzl G, Black GF, Selbig J, Parida SK, Kaufmann SH, Jacobsen M. Biomarker discovery in heterogeneous tissue samplestaking the insilico deconfounding approach. BMC Bioinformatics. 2010; 1(1):1–15.
Sauler M, McDonough JE, Adams TS, Kothapalli N, Schupp JS, Nouws J, Chioccioli M, Omote N, Cosme C, Poli S, et al.Singlecell rna sequencing identifies aberrant transcriptional profiles of cellular populations and altered alveolar niche signalling networks in chronic obstructive pulmonary disease (copd); 2020.
Savage KJ, Yenson PR, Shenkier T, Klasa R, Villa D, Goktepe O, Steidl C, Slack GW, Gascoyne RD, Connors JM, et al.The outcome of primary mediastinal large bcell lymphoma (pmbcl) in the rchop treatment era. 2012.
ShenOrr SS, Tibshirani R, Khatri P, Bodian DL, Staedtler F, Perry NM, Hastie T, Sarwal MM, Davis MM, Butte AJ. Cell type–specific gene expression differences in complex tissues. Nat Methods. 2010; 7(4):287–89.
Sinharay S. Jackknife methods. International Encyclopedia of Education, Third Edition: Elsevier; 2010.
Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck III WM, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive integration of singlecell data. Cell. 2019; 177:1888–902. https://doi.org/10.1016/j.cell.2019.05.031.
Tang D, Park S, Zhao H. Nitumid: nonnegative matrix factorizationbased immunetumor microenvironment deconvolution. Bioinformatics. 2020; 36(5):1344–50.
Tang D, Park S, Zhao H. SCADIE: simultaneous estimation of cell type proportions and cell typespecific gene expressions using SCADbased iterative estimating procedure. https://doi.org/10.5281/zenodo.6509668.
Tetley TD. Inflammatory cells and chronic obstructive pulmonary disease. Curr Drug TargetsInflamm Allergy. 2005; 4(6):607–18.
Tsoucas D, Dong R, Chen H, Zhu Q, Guo G, Yuan GC. Accurate estimation of celltype composition from gene expression data. Nat Commun. 2019; 10(1):1–9.
Vallania F, Tam A, Lofgren S, Schaffert S, Azad TD, Bongen E, Haynes W, Alsup M, Alonso M, Davis M, et al.Leveraging heterogeneity across multiple datasets increases cellmixture deconvolution accuracy and reduces biological and technical biases. Nat Commun. 2018; 9(1):1–8.
Wang X, Park J, Susztak K, Zhang NR, Li M. Bulk tissue cell type deconvolution with multisubject singlecell expression reference. Nat Commun. 2019; 10(1):1–9.
Yan KS, Janda CY, Chang J, Zheng GX, Larkin KA, Luca VC, Chia LA, Mah AT, Han A, Terry JM, et al.Nonequivalence of wnt and rspondin ligands during lgr5+ intestinal stemcell selfrenewal. Nature. 2017; 545(7653):238–42.
Zaitsev K, Bambouskova M, Swain A, Artyomov M. N. Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures. Nat Commun. 2019; 10(1):1–16.
Zou H, Li R. Onestep sparse estimates in nonconcave penalized likelihood models. Ann Stat. 2008; 36(4):1509–33.
Acknowledgements
The authors would like to thank Jiawei Wang, Wenxuan Deng, and Chang Su from Yale University for their suggestions and help on data of this paper. And we would also like to thank Prof. Katerina Politi and Prof. Steven Kleinstein from Yale University for their suggestions on the results presented in this paper.
Peer review information
Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Review history
The review history is available as Additional file 3.
Funding
DT and HZ were supported in part by NIH P50 CA196530 and R56 AG074015 grants. SP was supported by a National Research Foundation of Korea grant funded by the Korea government (MSIT) (No. NRF2022R1A2C4002150).
Author information
Authors and Affiliations
Contributions
DT collected the data, designed the computational algorithm, conducted the real and simulated data analysis, and wrote the manuscript. SP designed the computational algorithm, conducted the theoretical analysis, and wrote the manuscript. HZ supervised the research and wrote the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Ethics approvals were not needed for the study.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
Includes notations, additional figures, theoretical properties of the proposed method, simulation models, details of the simulation and real data processing.
Additional file 2
Supplementary Table S1 contains details on the DEGs exclusively identified by SCADIE as discussed in Section ‘SCADIE accurately infers cell proportions and cell typespecific genes in Chronic Obstructive Pulmonary Disease’.
Additional file 3
Review history.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Tang, D., Park, S. & Zhao, H. SCADIE: simultaneous estimation of cell type proportions and cell typespecific gene expressions using SCADbased iterative estimating procedure. Genome Biol 23, 129 (2022). https://doi.org/10.1186/s1305902202688w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1305902202688w