Robustness and applicability of functional genomics tools on scRNA-seq data

Many tools have been developed to extract functional and mechanistic insight from bulk transcriptome profiling data. With the advent of single-cell RNA sequencing (scRNA-seq), it is in principle possible to do such an analysis for single cells. However, scRNA-seq data has characteristics such as drop-out events, low library sizes and a comparatively large number of samples/cells. It is thus not clear if functional genomics tools established for bulk sequencing can be applied to scRNA-seq in a meaningful way. To address this question, we performed benchmark studies on in silico and in vitro single-cell RNA-seq data. We included the bulk-RNA tools PROGENy, GO enrichment and DoRothEA that estimate pathway and transcription factor (TF) activities, respectively, and compared them against the tools AUCell and metaVIPER, designed for scRNA-seq. For the in silico study we simulated single cells from TF/pathway perturbation bulk RNA-seq experiments. Our simulation strategy guarantees that the information of the original perturbation is preserved while resembling the characteristics of scRNA-seq data. We complemented the in silico data with in vitro scRNA-seq data upon CRISPR-mediated knock-out. Our benchmarks on both the simulated and real data revealed comparable performance to the original bulk data. Additionally, we showed that the TF and pathway activities preserve cell-type specific variability by analysing a mixture sample sequenced with 13 scRNA-seq different protocols. Our analyses suggest that bulk functional genomics tools can be applied to scRNA-seq data, outperforming dedicated single cell tools. Furthermore we provide a benchmark for further methods development by the community.


Abstract 31
Many tools have been developed to extract functional and mechanistic insight from bulk 32 transcriptome profiling data. With the advent of single-cell RNA sequencing (scRNA-seq), it is 33 in principle possible to do such an analysis for single cells. However, scRNA-seq data has 34 characteristics such as drop-out events, low library sizes and a comparatively large number 35 of samples/cells. It is thus not clear if functional genomics tools established for bulk 36 sequencing can be applied to scRNA-seq in a meaningful way. To address this question, we 37 performed benchmark studies on in silico and in vitro single-cell RNA-seq data. We included 38 the bulk-RNA tools PROGENy, GO enrichment and DoRothEA that estimate pathway and 39 transcription factor (TF) activities, respectively, and compared them against the tools AUCell 40 and metaVIPER, designed for scRNA-seq. For the in silico study we simulated single cells 41 from TF/pathway perturbation bulk RNA-seq experiments. Our simulation strategy guarantees 42 that the information of the original perturbation is preserved while resembling the 43 characteristics of scRNA-seq data. We complemented the in silico data with in vitro scRNA-44 seq data upon CRISPR-mediated knock-out. Our benchmarks on both the simulated and real 45 data revealed comparable performance to the original bulk data. Additionally, we showed that 46 the TF and pathway activities preserve cell-type specific variability by analysing a mixture 47 sample sequenced with 13 scRNA-seq different protocols. Our analyses suggest that bulk 48 functional genomics tools can be applied to scRNA-seq data, outperforming dedicated single 49 cell tools. Furthermore we provide a benchmark for further methods development by the 50 community. 51 3 Background 52 Gene expression profiles provide a blueprint of the status of cells. Thanks to diverse high-53 throughput techniques, such as microarrays and RNA-seq, expression profiles can be 54 collected relatively easily, and are hence very common. To extract functional and mechanistic 55 information from these profiles, many tools have been developed, that can, for example, 56 estimate the status of molecular processes such as the activity of pathways or transcription 57 factors (TFs). These functional genomics tools are broadly used and belong to the standard 58 toolkit to analyze expression data [1][2][3]. 59 60 Functional genomics tools typically combine prior knowledge with a statistical method to gain 61 functional and mechanistic insights from omics data. In the case of transcriptomics, prior 62 knowledge is typically rendered as gene sets containing genes belonging to, e.g., the same 63 biological process or to the same Gene Ontology (GO) annotation. The Molecular Signature 64 Database (MSigDB) is one of the largest collections of curated and annotated gene sets [4]. 65 Statistical methods are as abundant as the different types of gene sets. Among them, the most 66 commonly used are over-representation analysis (ORA) [5] and Gene Set Enrichment 67 Analysis (GSEA) [6]. Still, there is a growing number of statistical methods spanning from 68 simple linear models to advanced machine learning methods [7,8]. 69 70 Recent technological advances in single-cell RNA-seq (scRNA-seq) enable the profiling of 71 gene expression at the individual cell level [9]. Multiple methods have been developed, and 72 they have experienced a dramatic improvement over recent years. However, they still have a 73 number of limitations and biases, including low library size, and drop-outs. Bulk RNA-seq tools 74 that focus on cell type identification and characterization as well as on inferring regulatory 75 networks can be readily applied to scRNA-seq data [10]. This suggests that functional 76 genomics tools should in principle be applicable to scRNA-seq data as well. However, it has 77 not been investigated yet whether these limitations could distort and confound the results, 78 rendering the tools not applicable to single-cell data. 79 80 In this paper, we benchmarked the robustness and applicability of different functional 81 genomics methods on simulated and real scRNA-seq data. We focused on three tools for bulk 82 and two for single cell RNA data. published data profiling a complex sample with 13 different scRNA-seq technologies [21]. 109 Here, we showed that summarizing gene expression into TF and pathway activities preserves 110 cell type specific information. Collectively, our results suggest that the bulk based functional 111 analysis tools DoRothEA and PROGENy outperform the single cell tools AUCell and 112 metaVIPER. Although on scRNA-seq data DoRothEA and PROGENy were less accurate than 113 on bulk RNA-seq, we were still able to extract relevant functional insights from scRNA-seq 114 data. 115

Results 116
Robustness of bulk RNA based functional genomics tools against low gene coverage 117 Single-cell RNA-seq profiling is hampered by low gene coverage due to drop-out events [22]. 118 In our first analysis we focused solely on the low gene coverage aspect and whether tools 119 designed for bulk can deal with it. Specifically, We aimed to explore how DoRothEA, 120 PROGENy and GO gene sets combined with GSEA (GO-GSEA) can handle low gene 121 coverage in general, independently of other artefacts and characteristics from scRNA-seq 122 protocols. Thus, we conducted this benchmark using bulk transcriptome benchmark data. In 123 these studies, TFs and pathways are perturbed experimentally, and the transcriptome profile 124 is measured before and after the perturbation. These experiments can be used to benchmark 125 tools for TF/pathway activity estimation, as they should estimate correctly the change in the 126 perturbed TF or pathway. The use of these datasets allowed us to systematically control for 127 the gene coverage (see Methods). The workflow consisted in four steps (Fig. S1a). In the first 128 step we summarized all perturbation experiments into a matrix of contrasts (with genes in rows 129 and contrasts in columns) by differential gene expression analysis. Subsequently, we 130 randomly replaced, independently for each contrast, logFC values with 0 so that we obtain a 131 predefined number of "covered" genes with a logFC unequal to zero. Accordingly, a gene with 132 a logFC = 0 was considered as missing/not covered. and GO-GSEA we used only those 14 GO terms that match the 14 PROGENy pathways (Fig. 163 S1b). In general GO-GSEA showed weaker performance than PROGENy. The decrease in 164 performance was more prominent as gene coverage decreased (from AUROC of 0. GSEA showed a worse performance than PROGENy, especially in the low gene coverage 173 range. Since DoRothEA and PROGENy showed promising performance in low gene coverage 174 ranges, we decided to explore them on scRNA-seq data. Due to its poor performance, we did 175 not include GO-GSEA in the subsequent analyses. 176  Methods). We repeated the simulation of numerous single cells from each bulk sample 220 template to account for the stochasticity of the simulation procedure. We tested our simulation 221 strategy by comparing the characteristics of the simulated cells to real single cells. We 222 compared the count distribution ( Fig. S3a and  Unlike in our first benchmark, we applied the functional genomics tools directly on single 228 samples/cells and built the contrasts at the level of pathway and TF activities (see Methods). 229 We compared the performance of all tools to recover the perturbed TFs/pathways. We also 230 considered the performance of the bulk based tools DoRothEA and PROGENy on the 231 template bulk data as a baseline for comparison to their respective performance on the single 232 cell data. 233 234 We show, as an example, the workflow of the performance evaluation for DoRothEA (Fig. 2b). 235

Benchmark of bulk and single-cell functional genomics tools on simulated scRNA-seq
As a first step we applied DoRothEA to single cells generated for one specific parameter 236 combination (number of cells = 10, mean library size = 5000) and bulk samples, performed 237 differential activity analysis (see Methods), and evaluated the performance with ROC and PR 238 curves including only TFs with confidence level A. Each repetition of the simulation is depicted 239 by an individual ROC curve, which shows the variance in performance of DoRothEA on 240 simulated single cell data ( the library size and the number of cells improved the performance, but not to the extent of its 295 performance on bulk (Fig. S6e). For most parameter combinations P-AUCell with 100 or 200 296 footprint genes per pathway yielded the best performance. 297 298 In summary, both PROGENy and P-AUCell performed well on the simulated single cells, and 299 PROGENy performed slightly better. For the pathway analysis the P-AUCell did not perform 300 better on scRNA-seq than on bulk data. We then went on to perform a benchmark analysis on 301 real scRNA-seq datasets. 302    S7a). The datasets showed a variation in terms of drop-out rate, number of cells and 343 sequencing depths (Fig. S7b). We evaluated the performance of DoRothEA, D-AUCell and metaVIPER on each benchmark 355 dataset individually. We found that DoRothEA outperformed D-AUCell and metaVIPER across 356 different combinations of DoRothEA confidence levels on Perturb-seq (7d) and CRISPRi 357 dataset (Fig. 3a). metaVIPER did not perform better than a random model for these datasets. 358 Interestingly, the performance of all three methods on the Perturb-seq (13d) dataset was very 359 weak independently of DoRothEA's confidence level (Fig. 3a). The captured trends are also 360 reported in PR-space (Fig. S7e). 361 In summary, these analyses suggested that DoRothEA is the best-performing method to 362 recover TF perturbation at the single cell level on in vitro data. 363 In our last analysis we wanted to test the performance of all tested tools in a more 370 heterogeneous system that would illustrate a typical scRNA-seq data analysis scenario where 371 multiple cell types are present. We used a dataset from the Human Cell Atlas project [23] that 372 contains scRNA-seq profiles of human Peripheral blood mononuclear cells (PBMCs) and 373 HEK293T with annotated cell types [21]. This dataset was analysed with 13 different scRNA-374 seq protocols (see Methods). In this study no ground truth (in contrast to the previous 375 perturbation experiments) for TF and pathway activities were available. To evaluate the 376 performance of all methods, we assessed the potential of TF and pathway activity estimations 377 to cluster cells from the same cell type together based on a priori annotated cell types. We 378 performed our analysis for each scRNA-seq technology separately to identify protocol-specific 379 and protocol-independent trends. We assumed that the cell-type information should be 380 preserved also on the reduced dimension space of TF / pathway activities, if these 381 meaningfully capture the corresponding functional processes. Hence, we assessed how well 382 the individual clusters correspond to the annotated cell types by a two-step approach. First we 383 applied UMAP on different input matrices e.g. TF/pathway activities or gene expression and 384 then we evaluated how well cells from the same cell type cluster together. We considered 385 silhouette widths as a metric of cluster purity (see Methods). Silhouette widths derived from a 386 set of highly variable genes (HVGs) set the baseline for the silhouette widths derived from 387 pathway/TF activities. We identified the top 2000 HVGs with Seurat [24] using the selection 388 method "vst" as it worked the best in our hands (Fig. S8) we decided to evaluate the cluster purity at different levels of the cell-type hierarchy from fine-401 grained to coarse-grained. We start with the hierarchy level 0 where every cell type forms a 402 distinct cluster and end with hierarchy level 4, where all PBMC cell types and all cell lines form 403 a distinct cluster (Fig. 4a). 404 405 To evaluate the performance of the TF activity inference methods and the utility of TF activity 406 scores, we determined the cluster purity derived from TF activities (using only DoRothEA TFs 407 with confidence level A and B), TF expression, positive and negative control. Both scRNA-seq 408 protocols and matrices used for dimensionality reduction affected cluster purity significantly 409 (2-way ANOVA p-values <2.2e-16 and 1.4e-10, respectively, p-values and estimations for 410 corresponding linear model coefficients in Fig. S9a; see Methods). The cluster purity based 411 on TF activities inferred using DoRothEA and D-AUCell did not differ significantly (Fig. 4b,  412 corresponding plots for all hierarchy levels in Fig. S9b). In addition the cluster purity of both 413 tools were not significantly worse that the purity based on all 2000 HVGs, though we observed 414 a slight trend indicating a better cluster purity based on HVGs. This trend is expected due to 415 the large difference of available features for dimensionality reduction. Instead a comparison 416 to the positive and negative control is more appropriate. Both DoRothEA and D-AUCell 417 performed comparably to the positive control but significantly better than the negative control 418 across all scRNA-seq protocols (TukeyHSD post hoc test, adj. p-value of 1.05e-4 for 419 DoRothEA and 5.7e-4 for D-AUCell). The cluster purity derived from metaVIPER was 420 significantly worse than for DoRothEA (adj. p-value of 0.0423) and tend to be worse than D-421 AUCell (TukeyHSD post hoc test, adj. p-value of 0.130) as well. Also metaVIPER wasn't better 422 than than the negative control. Regardless of the underlying TF activity inference method, the 423 cluster purity derived from TF activities outperformed the purity derived from TF expression 424 (adj. p-value of 5.42e-6 for DoRothEA, 3.33-e5 for D-AUCell and 0.146 for metaVIPER). This 425 underlines the advantage and relevance of using TF activities over the expression of the TF 426 itself (Fig. 4c). With a comparable performance to a similar number of HVG and also to 2000 427 HVGs, we concluded that TF activities serve -independently of the underlying scRNA-seq 428 protocol -as a complementary approach for cluster analysis that is based on generally more 429 interpretable cell type marker. 430 431 To evaluate the performance of pathway inference methods and the utility of pathway activity 432 scores we determined cluster purity with pathway matrices generated with different methods. 433 We used 200 and 500 footprint genes per pathway for PROGENy and P-AUCell, respectively, 434 since they provided the best performance in the previous analyses. As observed already 435 before, both scRNA-seq protocols and matrices used for dimensionality reduction affected 436 cluster purity significantly (2-way ANOVA p-values of 2.84e-7 and 1.13e-13, respectively, p 437 values and estimations for corresponding linear model coefficients in Fig.S10b; see Methods). 438 The cluster purity derived from pathway activity matrices is not significantly different 439 PROGENy and P-AUCell, while worse than all HVGs (adj. p-value of 4.07e-10 for PROGENy 440 and 4.59e-9 for P-AUCell; Fig. 4d, corresponding plots for all hierarchy levels in Fig. S9b). 441 This is expected due to the large difference in the number of available features for 442 dimensionality reduction (2000 HVGs vs 14 pathways). The cluster purity of both approaches 443 was comparable to the positive control but significantly better than the negative control (adj. 444 p-value of 0.077 for PROGENy and 0.013 for P-AUCell vs. negative control). In summary, this 445 study indicated that the pathway activities contain relevant and cell-type specific information, 446 even though they do not capture enough functional differences to be used for effective 447 clustering analysis. Overall, the cluster purity of cells represented by the estimated pathway 448 activities is worse than the cluster purity of cells represented by the estimated TF activities 449 [21]. In addition we observed that input matrices derived from Quartz-Seq2 protocol yielded 450 for hierarchy level 2 in significantly better cluster purity than all other protocols which is in 451 agreement with the original study of of the PBMC + HEK293T data ( Fig. S9a and S10a) [21]. 452 453 TF and pathway activity scores are more interpretable than expression of single genes. Hence, 454 we were interested to explore whether we could recover known cell-type specific TF and 455 pathway activities from the PBMC data. We decided to focus on the dataset measured with 456 Quartz-Seq2 as this protocol showed superior performance over all other platforms [21]. We 457 calculated mean TF and pathway activity scores for each cell type using DoRothEA, D-AUCell, 458 metaVIPER (all using only TFs with confidence levels A and B, Figure 4e, 4f and 459 Supplementary Figure S11, respectively), PROGENy with 500 and P-AUCell with 200 footprint 460 genes per pathway (Figure 4e-  In this paper we tested the robustness and applicability of functional genomics tools on 496 scRNA-seq data. We included both bulk-and single-cell-based functional genomics tools that 497 estimate either TF or pathway activities from gene expression data and for which well-defined 498 benchmark data exist. The bulk based tools were DoRothEA, PROGENy and GO gene sets 499 analysed with GSEA (GO-GSEA). The functional genomics tools specifically designed for the 500 application in single cells were the statistical method AUCell combined with DoRothEA (D-501 AUCell) and PROGENy (P-AUCell) gene sets and metaVIPER. 502 503 We first explored the effect of low gene coverage in bulk data on the performance of the bulk 504 based tools DoRothEA, PROGENy and GO-GSEA. We found that for all tools the performance 505 dropped with decreasing gene coverage but at a different rate. While PROGENy was robust 506 down to 500 covered genes, DoRothEA's performance dropped markedly at 2000 covered 507 genes. In addition, the results related to PROGENy suggested that increasing the number of 508 footprint genes per pathway counteracted low gene coverage. GO-GSEA showed the 509 strongest drop and did not perform better than a random guess below 2000 covered genes. 510 Comparing the performance of both pathway analysis tools suggests that footprint based gene 511 sets are superior over gene sets containing pathway members (e.g. GO gene sets) in 512 recovering perturbed pathways. This observation is in agreement with previous studies 513 conducted by us and others [11,30]. Given this fact and that GO-GSEA cannot handle low 514 gene coverage (in our hands) we concluded that this approach is not suitable for scRNA-seq 515 analysis. Hence, we decided to focus only on PROGENy as bulk based pathway analysis tool 516 for the following analyses. 517 518 Afterwards, we benchmarked DoRothEA, PROGENy, D-AUCell, P-AUCell and metaVIPER 519 on simulated single cells which we sampled from bulk pathway/TF perturbation samples. We 520 showed that our simulated single cells possess characteristics comparable to real single cell 521 data, supporting the relevance of this strategy. Different combinations of simulation 522 parameters can be related to different scRNA-seq technologies. For each combination we 523 provide a recommendation of how to use DoRothEA's and PROGENy's gene sets (in terms 524 of confidence level combination or number of footprint genes per pathway) to yield the best 525 performance. It should be noted that our simulation approach, as it is now, allows only the 526 simulation of a homogenous cell population. This would correspond to a single cell experiment 527 where the transcriptome of a cell line is profiled. In future work this simulation strategy could 528 be adapted to account for a heterogeneous dataset which would resemble more realistic single 529 cell datasets [31]. 530 531 In terms of TF activity inference, DoRothEA performed best on the simulated single cells 532 followed by D-AUCell and then metaVIPER. Both DoRothEA and D-AUCell shared 533 DoRothEA's gene set collection but applied different statistics. Thus, we concluded that, in our 534 data, VIPER is more suitable to analyse scRNA-seq data than AUCell. The tool metaVIPER 535 performed only slightly better than a random model and since it uses VIPER like DoRothEA 536 the weak performance must be caused by the selection of the gene set resource. DoRothEA's 537 resources with in silico predicted resources. Moreover, we hypothesize based on the pairwise 544 comparison that for functional genomics analysis the choice of gene sets is of higher relevance 545 than the choice of the underlying statistical method. 546 547 Related to pathway analysis, both PROGENy and P-AUCell performed well on the simulated 548 single cells. The original framework of PROGENy uses a linear model that incorporates 549 individual weights of the footprint genes, denoting the importance and also the sign of the 550 contribution (positive/negative) to the pathway activity score. Those weights cannot be 551 considered when applying AUCell with PROGENy gene sets. The slightly higher performance 552 of PROGENy suggests that individual weights assigned to gene set members can improve 553 the activity estimation of biological processes. 554 Especially in the benchmark of both TF analysis methods we observed that the D-AUCell and 555 metaVIPER performed better on single cells than on the original bulk samples. This trend 556 becomes more pronounced with increasing library size and number of cells. However, the bulk 557 based tools perform better on single cells than the scRNA specific tools for both benchmarks. 558 559 Subsequently, we aimed to validate the aforementioned tools on real single cell data. While 560 we could not find suitable benchmark data of pathway perturbations, we exploited two 561 independent datasets of TF perturbations to benchmark the TF activity inference methods. 562 These datasets combined CRISPR-mediated TF knock-out/knock-down (Perturb-Seq and 563 CRISPRi) with scRNA-seq. It should be noted that pooled screenings of gene knock-outs with 564 Perturb-seq suffer from an often faulty assignment of guide-RNA and single cell [34]. Those 565 mislabeled data confound the benchmark as the groundtruth is not reliable. Nevertheless, we 566 showed that DoRothEA's gene sets were globally effective in inferring TF activity from single 567 cell data with varying performance dependent on the used statistical method. As already 568 shown in the in silico benchmark D-AUCell showed a weaker performance than DoRothEA, 569 supporting that VIPER performs better than AUCell. Interestingly, metaVIPER's performance 570 was no better than random across all datasets. metaVIPER used the same statistical method 571 as DoRothEA but different gene set resources. This further supports our hypothesis that the 572 selection of gene sets is more important than the statistical method for functional genomics 573 analysis. 574 575 Furthermore, the perturbation time had a profound effect on the performance of the tools: while 576 DoRothEA and D-AUCell worked well for a perturbation duration of 6 (CRISPRi) and 7 days 577 (Perturb-Seq (7d)), the performance dropped markedly for 13 days. 578 We reasoned that, within 13 days of perturbation, compensation effects are taking place at 579 the molecular level that confound the prediction of TF activities. In addition, it is possible that 580 cells without a gene edit outgrow cells with a successful knock-out after 13 days as the knock-581 out typically yield in a lower fitness and thus proliferation rate. 582 583 In summary, DoRothEA subsetted to confidence levels A and B performed the best on real 584 scRNA-seq data but at the cost of the TF coverage. The results of the in silico and in vitro 585 benchmark are in agreement. Accordingly, we believe that it is reasonable to assume that also 586 PROGENy works on real data given the positive benchmark results on simulated data. 587 588 Finally, we applied our tools of interest to a mixture sample of PBMCs and HEK cells profiled 589 with 13 different scRNA-seq protocols. We investigated to which extent pathway and TF 590 matrices retain cell-type specific information, by evaluating how well cells belonging to the 591 same cell type or cell type family cluster together in reduced dimensionality space. Given the 592 lower numbers of features available for dimensionality reduction using TF and pathway 593 activities, cell types could be recovered equally well as when using the same number of the 594 top highly variable genes. In addition, we showed that cell types could be recovered more 595 precisely using TF activities than TF expression, which is in agreement with previous studies 596 [18]. This suggests that summarising gene expression as TF and pathway activities can lead 597 to noise filtering, particularly relevant for scRNA-seq data. Though, TF activities performed 598 better than pathway activities which is again attributed to the even lower number of pathways. 599 600 Our analysis suggested at different points that the performance of functional genomics tools 601 is more sensitive to the selection of gene sets than the statistical methods. This hypothesis 602 could be tested in future by decoupling functional genomics tools into gene sets and statistics. 603 Benchmarking all possible combinations of gene sets and statistic (i.e. DoRothEA gene sets 604 with a linear model or PROGENy gene sets with VIPER) would shed light on this question 605 which we believe if of high relevance for the community. 606

Conclusions 607
Our systematic and comprehensive benchmark study suggests that DoRothEA and 608 PROGENy are effective in inferring TF and pathway activity from scRNA-seq data, 609 outperforming tools specifically designed for scRNA-seq analysis. We showed the limits of 610 both tools with respect to low gene coverage and also provided as part of the in silico 611 benchmark recommendations on how to use DoRothEA's and PROGENy's gene sets in the 612 best way dependent on the number of cells and mean library size. These two parameters are 613 technology specific, so that our recommendations are transferable to various scRNA-seq 614 protocols. Furthermore, we showed that TF and pathway activities are rich in cell type specific 615 information with reduced amount of noise and provide an intuitive way of interpretation and 616 hypothesis generation. We provide our benchmarks and code to the community for further 617 assessment of methods for functional analysis. 618

Methods 619
PROGENy 620 Contrast matrices were constructed by comparing cells originating from one of the perturbation 660 bulk samples vs cells originating from one of the control bulk samples. 661

Induction of artificial low gene coverage in bulk microarray data 662
We induce the reduction of gene coverage with inserting zeros on the contrast level. In detail 663 we insert for each contrast separately randomly zeros until we obtained a predefined number 664 of genes with a logFC unequal zero which we consider as "covered"/"measured" genes. We 665 perform this analysis for a gene coverage of 500, 1000, 2000, 3000, 5000, 7000, 8000 and as 666 reference all available genes. To account for stochasticity effects during inserting randomly 667 zero we repeat this analysis 10 times for each gene coverage value. 668

669
We applied PROGENy on matrices of single samples (genes in rows and either bulk samples 670 or single cells in columns) containing normalized gene expression scores or on contrast 671 matrices (genes in rows and summarized perturbation experiments into contrasts in columns) 672 containing log fold changes. In case of single sample analysis the contrasts were built based 673 on pathway activity matrices yielding the change in pathway activity (perturbed samples -674 control sample) summarized as logFC. Independent of input matrix we scaled each pathway 675 to have a mean activity of 0 and a standard deviation of 1. 676 We build different PROGENy versions by varying the number of footprint genes per pathway 677 (100, 200, 300, 500, 1000 or all which corresponds to ~29,000 genes). 678

Application of VIPER on single samples and contrasts 679
We applied VIPER with DoRothEA as regulatory network resource on matrices of single 680 samples (genes in rows and either bulk samples or single cells in columns) containing 681 normalized gene expression scores scaled gene-wise to a mean value of 0 and standard 682 deviation of 1 or on contrast matrices (genes in rows and summarized perturbation 683 experiments into contrasts in columns) containing log fold changes. In case of single sample 684 analysis the contrasts were built based on TF activity matrices yielding the change in TF 685 activity (perturbed samples -control sample) summarized as logFC. TFs with less than 4 686 targets listed in the corresponding input matrix were discarded from the analysis. VIPER 687 provides a NES enrichment score for each TF which we consider as a metric for the activity. 688 We TFs/pathways belong to a binary class either deregulated or not regulated and that the 732 perturbed pathway/TF has in the ideal case the highest activity. 733 We performed the ROC and PR analysis with the R package yardstick (version 0.0.3; 734 https://github.com/tidymodels/yardstick). For the construction of ROC and PR curves we 735 calculated for each perturbation experiment pathway (or TF) activities using PROGENy (or 736 VIPER). As each perturbation experiment targets either a single pathway (or TF) only the 737 activity score of the perturbed pathway (or TF) is associated with the positive class (e.g. EGFR 738 pathway activity score in an experiment where EGFR was perturbed). Accordingly the activity 739 scores of all non-perturbed pathways (or TFs) belong to the negative class (e.g. EGFR 740 pathway activity score in an experiment where JAK-STAT pathway was perturbed). Using 741 these positive and negative classes Sensitivity / (1-Specificity) or Precision / Recall values 742 were calculated at different thresholds of activity, producing the ROC / PRC curves. 743 Collecting, curating and processing of microarray data 744 We extracted single pathway and single TF perturbation data profiled with classical 745 microarrays from a previous study conducted by us [35]. We followed the same procedure of 746 collection, curating and processing the data as described in the previous study. 747

FCGR3A+ Monocytes, Dendritic cells, Megakaryocytes, HEK cells). Cells without annotation 792
were discarded for this analysis. 793 We normalized the count matrices for each technology separately by using the R package 794 scran (version 1.11.27) [36]. 795

Dimensionality reduction with UMAP and assessment of cluster quality 796
We used the R package umap (version 0.2.0.0) calling the Python implementation of Uniform 797 Manifold Approximation and Projection (UMAP) with the argument "method = 'umap-learn'" to 798 perform dimensionality reduction on various input input matrices (gene expression matrix, 799 pathway/TF activity matrix, etc.). We assume that the dimensionality reduction will result in 800 clustering of cells that corresponds well to the cell type/cell type family. To assess the validity 801 of this assumption, we assigned a cell-type/cell family specific cluster id to each point in the 802 low-dimensional space. We then defined a global cluster purity measure based on silhouette 803 widths [43], which is a well known clustering quality measure. 804 Given the cluster assignments, in the low-dimensional space, for each cell the average 805 distance (a) to the cells that belong to the same cluster is calculated. Then the smallest 806 average distance (b) to all cells belonging to the newest foreign cluster is calculated. The 807 difference, between the latter and the former indicates the width of the silhouette for that cell, 808 i.e. how well the cell is embedded in the assigned cluster. To make the silhouette widths 809 comparable, they are normalized by dividing the difference with the larger of the two average 810 distances = ./0 102(0,.) . Therefore, the possible values for the silhouette widths lie in the range 811 -1 to 1, where higher values indicate good cluster assignment, while lower values close to 0 812 indicate poor cluster assignment. Finally, the average silhouette width for every cluster is 813 calculated, and averages are aggregated to obtain a measure of the global purity of clusters. 814 For the silhouette analysis we used the R package cluster (version 2.0.8). 815 For statistical analysis of cluster quality, we fitted a linear model score=f(scRNA-seq protocol 816 + input matrix), where score corresponds to average silhouette width for a given scRNA-seq 817 protocol -input matrix pair. Protocol and input matrix are factors, with reference level Quartz-818 Seq2 and positive control, respectively. We fitted two separate linear model for transcription 819 factor and pathway activity inference methods. We report the estimates and p values for the 820 different coefficients of these linear models. Based on these linear models, we performed a 2-821 way ANOVA, and pairwise comparisons using