Method overview
We aim to identify genes that display spatial expression patterns, commonly referred to as SE genes, in large-scale spatially resolved transcriptomic studies. These large-scale studies rely on various high-throughput spatial transcriptomic technologies [13,14,15] and collect gene expression measurements on tens or hundreds of thousands of spatial locations. Gene expression measurements in these large-scale spatial transcriptomic studies are often in the form of low counts with a large fraction of zero values. For SE analysis, we examine one gene at a time and consider its expression measurements collected on n different spatial locations. We refer to the spatial locations as samples in the present study. Depending on the technology, a sample may be a single cell (in the case of STARmap technology) or a cell-sized local region (in the case of HDST technology) or a local region that consists of dozens of cells (in the case of Slide-seq and Visium technologies). The sampled locations have known spatial coordinates recorded during the experiment. We denote si as the d-vector of spatial coordinates for ith sample, with i ∈ (1, …, n), and \( \boldsymbol{S}={\left({\boldsymbol{s}}_1^T,\cdots, {\boldsymbol{s}}_n^T\right)}^T \) as the corresponding n × d matrix of spatial coordinates. These spatial coordinates vary continuously over a two-dimensional space (d = 2; si = (si1, si2) ∈ R2) or a three-dimensional space (d = 3; si = (si1, si2, si3) ∈ R3) depending on the technology. We denote yi(si) as the gene-expression measurement for the ith sample and y = (y1(s1), ⋯, yn(sn))T as the n-vector of gene expression across all samples. We assume that both y and each coordinate of S has been centered and scaled to have mean 0 and standard deviation of 1. Centering and scaling do not influence type I error control but can affect statistical power. Here, our goal is to test whether the expression level of the gene of focus display any spatial expression pattern. Equivalently, we aim to test whether y is dependent on the spatial coordinates S. We rely on a general class of covariance tests [24,25,26,27], which includes the Hilber-Schmidt independence criteria test [24] and the distance covariance test [25] as special cases, to perform SE analysis in a non-parametric fashion. Non-parametric testing ensures robust performance and wide applicability of our method to spatial transcriptomic data that are collected from various technologies with potentially different data features and different data generating mechanisms. Our method builds upon the following intuition: if y is independent of S, then the spatial distance between two locations i and j would also be independent of the gene-expression difference between the two locations. Consequently, we can construct two sample by sample relationship matrices, one based on gene expression and one based on spatial coordinates, to examine whether these two matrices are more similar to each other than expected by chance alone.
Technically, we construct an expression covariance matrix based on the gene-expression levels as an n by n matrix E = y(yTy)−1yT. We also construct a distance covariance matrix for all samples based on spatial locations as an n by n matrix Σ = S(STS)−1ST. We refer to both matrices as the covariance matrices as they are generated from the projection covariance function and possess the two key covariance matrix properties including being symmetric and positive semi-definite. A covariance matrix is also known as a kernel matrix and a covariance function is also known as a kernel function. The projection covariance function has been widely used in many applications in genetics [67,68,69]. For both matrices, we center them as EC = HEH and ΣC = HΣH, where \( \boldsymbol{H}=\left(\boldsymbol{I}-{\mathbf{1}}_n{\mathbf{1}}_n^{\boldsymbol{T}}/n\right) \) with I being an n by n identity matrix and 1n being an n-vector of 1s. Centering does not alter results here as we have already centered both y and S before constructing these covariance matrices. We then construct the following test statistic:
$$ T= trace\left({\boldsymbol{E}}_C{\boldsymbol{\varSigma}}_C\right)/n. $$
Intuitively, each element in either covariance matrix measures the similarity between pairs of locations in terms of their coordinated deviation from the mean. When y and S are independent of each other, the similarity measurement between a location pair in terms of gene expression will not be correlated with the similarity measurement between the location pair in terms of distance. Consequently, the test statistics T, which is effectively a summation of the products between the two similarity measurements across all location pairs, will be small. When y and S are not independent of each other, the similarity measurement in terms of gene expression will be correlated with the similarity measurement in terms of the distance across location pairs, thus leading to a large test statistics T. Formally, under the null hypothesis that y and S are independent of each other, T asymptotically follows a mixture of chi-square distributions [27]
$$ \frac{1}{n^2}{\sum}_{i,j}{\lambda}_{E,i}{\lambda}_{\varSigma, j}{z}_{ij}^2, $$
where λE, i is the ith ordered non-zero eigenvalue of EC; λΣ, j is the jth ordered non-zero eigenvalue of ΣC; and \( {z}_{ij}^2 \) are independent and identically distributed \( {\chi}_1^2 \) variables. An extremely large T that is rare under the above null distribution constitutes evidence against the null hypothesis. Consequently, we can compute a P value to measure the probability of encountering the same or a larger T as observed in the data based on the null distribution. The P value for testing the null hypothesis can be calculated using Davies’ exact method [70].
We employ several important algebraic manipulations to ensure that both computational complexity and memory requirement of our method are linear with respect to the number of spatial locations. First, we note that the eigenvalues of EC and ΣC are equivalent to the eigenvalues of (yTy )−1yTHy (a scalar) and (STS)−1STHS (a d × d matrix) [71], respectively. The computational cost for obtaining these eigenvalues based on the later forms is only O(nd2). Second, we note that
$$ Tr\left({\boldsymbol{E}}_{\boldsymbol{c}}{\boldsymbol{\Sigma}}_{\boldsymbol{c}}\right)= Tr\left(\boldsymbol{y}{\left({\boldsymbol{y}}^{\boldsymbol{T}}\boldsymbol{y}\right)}^{-\mathbf{1}}{\boldsymbol{y}}^{\boldsymbol{T}}\boldsymbol{H}\boldsymbol{\Sigma } \boldsymbol{H}\right)={\left({\boldsymbol{y}}^{\boldsymbol{T}}\boldsymbol{y}\right)}^{-\mathbf{1}} Tr\left({\boldsymbol{y}}^{\boldsymbol{T}}\boldsymbol{H}\boldsymbol{\Sigma } \boldsymbol{H}\boldsymbol{y}\right). $$
Consequently, we never need to compute E, Σ and their centered versions EC and ΣC throughout the algorithm. Instead, we only need to compute the key quantities yTy, yTHy, STS, STHS, and yTHΣHy, all of which require at most O(nd2) computational complexity and O(nd) memory requirement. Specifically, these key quantities can be computed efficiently in the following forms:
$$ {\boldsymbol{y}}^{\boldsymbol{T}}\boldsymbol{Hy}={\left(\boldsymbol{Hy}\right)}^{\boldsymbol{T}}\left(\boldsymbol{Hy}\right)={\left(\boldsymbol{y}-\overline{y}\right)}^{\boldsymbol{T}}\left(\boldsymbol{y}-\overline{y}\right), $$
$$ {\boldsymbol{S}}^{\boldsymbol{T}}\boldsymbol{HS}={\left(\boldsymbol{HS}\right)}^{\boldsymbol{T}}\left(\boldsymbol{HS}\right)={\left(\boldsymbol{S}-\frac{\mathbf{1}{\mathbf{1}}^{\boldsymbol{T}}}{\boldsymbol{n}}\boldsymbol{S}\right)}^{\boldsymbol{T}}\left(\boldsymbol{S}-\frac{\mathbf{1}{\mathbf{1}}^{\boldsymbol{T}}}{\boldsymbol{n}}\boldsymbol{S}\right), $$
$$ {\boldsymbol{y}}^{\boldsymbol{T}}\boldsymbol{H}\boldsymbol{\Sigma } \boldsymbol{H}\boldsymbol{y}={\boldsymbol{y}}^{\boldsymbol{T}}\boldsymbol{H}\boldsymbol{S}{\left({\boldsymbol{S}}^{\boldsymbol{T}}\boldsymbol{S}\right)}^{-\mathbf{1}}{\boldsymbol{S}}^{\boldsymbol{T}}\boldsymbol{H}\boldsymbol{y}={\left(\boldsymbol{y}-\overline{y}\right)}^{\boldsymbol{T}}\boldsymbol{S}{\left({\boldsymbol{S}}^{\boldsymbol{T}}\boldsymbol{S}\right)}^{-\mathbf{1}}{\boldsymbol{S}}^{\boldsymbol{T}}\left(\boldsymbol{y}-\overline{y}\right). $$
Finally, we note that the quantities involving S, including the computation of STS and STHS as well as the eigen decomposition of (STS)−1STHS, only need to be performed once at the beginning and need not to be re-computed for every gene in turn. The quantities involving y, including yTy, yTHy, and yTHΣHy, would vary across genes but could be computed efficiently relying on the sparsity of y. Indeed, these three quantities can be computed with a computational complexity that scales linearly with respect to the number of samples with non-zero values, resulting in substantial computational savings for large sparse data. Therefore, our method has an overall computational complexity of O(nd2 + pn ′ d) and memory requirement of O(nd2), where p is the number of analyzed genes and n′ is the number of spatial locations with non-zero counts averaged across genes.
The statistical power of the above covariance test will inevitably depend on how the distance covariance matrix Σ is constructed and how it matches the true underlying spatial pattern displayed by the gene of interest. While the above projection kernel construction allows us to achieve orders of magnitude of computational gains as compared to other kernels such as the Gaussian and periodic kernels used in SPARK, it is likely not optimal in detecting every possible expression patterns encountered in real data. For example, the projection kernel is likely suboptimal in detecting focal expression patterns that are targeted by Gaussian kernels or periodical expression patterns that are targeted by periodic kernels. To ensure robust identification of SE genes across various possible spatial expression patterns, we consider different transformations of the spatial coordinates si and subsequent construction of different distance covariance matrices. Specifically, we applied five Gaussian transformations on the coordinates si = (si1, si2) to obtain five sets of transformed coordinates s′i = (s′i1, s′i2) , with \( {s}_{i1}^{\prime }=\mathit{\exp}\left(\frac{-{s}_{i1}^2}{2{\sigma}_1^2}\right) \) being the transformed x-coordinate and \( {s}_{i2}^{\prime }=\mathit{\exp}\left(\frac{-{s}_{i2}^2}{2{\sigma}_2^2}\right) \) being the transformed y-coordinate in each set. In the transformation, we used different smoothness parameters σ1 and σ2 in each set to cover a range of possible local covariance patterns. In addition, we applied five cosine transformations on si to obtain another five sets of transformed coordinates s′i, with \( {s}_{i1}^{\prime }=\cos \left(\frac{2\pi {s}_{i1}}{\phi_1}\right) \) being the transformed x-coordinate and \( {s}_{i2}^{\prime }=\cos \left(\frac{2\pi {s}_{i2}}{\phi_2}\right) \) being the transformed y-coordinate in each set. We also used different periodicity parameters ϕ1 and ϕ2 in each set to cover a range of possible periodic patterns. The transformation parameters σ1, σ2, ϕ1, and ϕ2 are predetermined using the 20%, 40%, 60%, 80%, and 100% quantiles of the absolute values of the x and y coordinates in the data. Using the empirical quantiles of the data to construct different covariance matrices follows the main ideas of [5, 6]. Compared to the alternative approach of fixing the transformation parameters to some predetermined values, using the quantiles of the data for transformation has the benefits of being invariant to any scale transformation of the original data and allows us to construct the distance covariance matrices in a data-dependent fashion.
We used each transformed \( {\boldsymbol{s}}_{\boldsymbol{i}}^{\prime } \) to construct a distance covariance matrix as described above, resulting in a total of ten transformed distance covariance matrices in addition to the untransformed distance covariance matrix (Additional file 1: Figure S17). Intuitively, the kernel constructed based on the untransformed coordinates is likely useful to detect linear expression pattern across the coordinates. The kernels constructed based on the cosine transformed coordinates are likely useful to detect periodic expression patterns on the tissue. While the kernels constructed based on the Gaussian transformed coordinates are likely useful to detect focal expression patterns on the tissue. Therefore, combining the original and transformed distance covariance matrices would allow us to detect a wide variety of spatial expression patterns encountered in real data. To do so, we computed a P value using Davies’ method for each distance covariance matrix. We then combined all eleven P values into a single P value through the Cauchy P value combination rule [72, 73].
We have so far described the method in the absence of covariates. In the presence of covariates, we can replace the n-vector 1n in the H matrix with a corresponding covariate matrix X of dimensionality n by q. The covariate matrix contains a column of 1’s that represents the intercept, with the remaining columns representing the measurements for the q-1 covariates. Therefore, the centering matrix becomes H = (I − X(XTX)−1XT). Despite this change, the other steps remain the same.
We refer to the above method as SPARK-X (SPARK-eXpedited), which is implemented in the SPARK R package with underlying efficient C/C++ code linked through Rcpp and with multiple threads computing capability. The software, together with all the analysis code for reproducing the results presented in the present study, are freely available at www.xzlab.org/software.html.
Simulation designs
We performed extensive simulations to comprehensively evaluate the performance of SPARK-X along with several other existing methods. To make simulations as realistic as possible, we simulated data based on parameters inferred from two published data sets that include a spatial transcriptomic (ST) data set [11] and a Slide-seq data set (Additional file 1: Figure S16). The two data sets represent two different gene-expression data structures, with the ST data representing a moderately sparse data with 60% of zero values and the Slide-seq data representing a highly sparse data with 99.4% of zero values (Additional file 1: Table S1). In the simulations, we first randomly simulated the coordinates for a fixed number of spatial locations (n) through a random-point-pattern Poisson process. On these spatial locations, we simulated expression levels for 10,000 genes based on a negative binomial distribution with details provided below. These 10,000 genes were all non-SE genes in the null simulations and consisted of 1000 SE genes and 9000 non-SE genes in the power simulations. For both non-SE genes and SE genes, we varied the dispersion parameter of the negative binomial distribution to be either 0.1, 0.2, or 1 for the moderately sparse setting and to be either 1, 2.5, or 5 for the highly sparse setting. These values were selected to match the scale of dispersion parameter estimated in the two real data sets. For the non-SE genes, we varied the mean parameter of the negative binomial distribution to be either 0.005 or 0.5. The low value of 0.005 corresponds to the median mean estimate in the Slide-seq data and represents a highly sparse gene-expression setting. The high value of 0.5 corresponds to the median mean estimate in the ST data and represents a moderately sparse gene-expression setting. For the SE genes, we simulated their expression levels to display three distinct spatial patterns (hotspot, streak, and gradient patterns, Fig. 1D).
Specifically, for the first two spatial patterns, we created either a circle (for hotspot pattern) or a band (for streak pattern) in the middle of the panel and marked spatial locations residing in these areas. The size of the circle and the size of the band were designed so that the marked spatial locations inside these areas represent a fixed proportion of all spatial locations, with the proportion set to be either 10%, 20%, or 30%. The expression measurements of the non-marked spatial locations were randomly generated from a negative binomial distribution with the mean parameter set to be 0.005 or 0.5. For the moderately sparse setting, the expression measurements of the marked spatial locations were generated from a negative binomial distribution with a mean parameter being either 1.5, 2, or 3 times higher than that in the non-marked spatial locations (for 500 SE genes) or 2/3, 1/2, or 1/3 of that in the non-marked spatial locations (for 500 SE genes), representing low, moderate, or high SE signal strength, respectively. For the highly sparse setting, the expression measurements of the marked spatial locations were generated from a negative binomial distribution with a mean parameter being either 2, 3, or 4 higher than that in the non-marked spatial locations (for 500 SE genes) or 1/2, 1/3, or 1/4 of that in the in the non-marked spatial locations (for 500 SE genes), representing low, moderate, or high SE signal strength, respectively.
For the gradient pattern, the expression levels of a fraction of spatial locations (=20%, 30%, or 40%) were set either in an increasing order (for 500 SE genes) or a decreasing order (for the other 500 SE genes) along the x-axis. The three fractions used correspond to low, moderate, or high SE signal strength in this setting, respectively. In particular, we generated the expression measurements for all spatial locations from a negative binomial distribution. For each SE gene, we randomly selected a fraction of spatial locations where we assigned their gene-expression values in either increasing or decreasing order back to them based on their x-axis coordinates. In contrast, the expression measurements for the non-SE genes were randomly assigned to all spatial locations, regardless of their spatial locations.
In all these simulations, we varied the number of spatial locations (n = 300, 500, 1000, 2000, or 3000 for the moderate sparsity setting and n = 3000, 10,000, 20,000, 30,000, or 50,000 for the highly sparse setting), the expression sparsity level (moderate or high, as measured by the mean parameter in the negative binomial distribution), the noise level (low, moderate or high noise, as measured by the dispersion parameter in the negative binomial distribution), the SE strength (weak, moderate, or strong, as measured by fold change in the mean parameter for the first two spatial patterns and by the fraction of spatial locations displaying expression gradient for the third spatial pattern), as well as the fraction of spatial locations in the focal/streak area for the first two spatial patterns.
Real data analysis
Slide-seq data
Slide-seq is a technology which enables transcriptome-wide measurements with 10-micron spatial resolution by transferring RNA from tissue sections onto a surface covered in DNA-barcoded beads with known positions and inferring the locations of RNA using a sequencing-by-ligation strategy. We obtained the Slide-seq dataset collected on the mouse cerebellum from Broad Institute’s single-cell repository (https://singlecell.broadinstitute.org/single_cell/) with ID SCP354. We used the file “Puck_180430_6” which contains 18,671 genes measured on 25,551 beads with known spatial location information. The bead size is approaching the size of mammalian cells (10 microns), though each bead may overlap with multiple cells. After filtering out mitochondrial genes and genes that are not expressed on any bead, we analyzed a final set of 17,729 genes on 25,551 beads. The data is highly sparse with 99.46% entries being 0 (Additional file 1: Table S1).
Slide-seqV2 data
Slide-seqV2 is a technology that builds upon on Slide-seq with modifications to library generation, bead synthesis, and array-indexing, thereby markedly improving the mRNA capture sensitivity. We obtained the Slide-seqV2 dataset collected on the mouse cerebellum from Broad Institute’s single-cell repository with ID SCP948. The data contains 23,096 genes measured on 39,496 beads with known spatial location information. The bead size is the same as that in the Slide-seq data. Following [36], we first cropped the region of interest and filtered out beads with UMIs less than 100. After filtering out mitochondrial genes and genes that are not expressed on any bead, we analyzed a final set of 20,117 genes on 11,626 beads. While the capture sensitivity is improved in the Slide-seqV2 as compared to Slide-seqV1, the Slide-seqV2 data is still highly sparse with 98.35% entries being zero (Additional file 1: Table S1).
We performed cell decomposition and conditional SE analysis on the Slide-seqV2 data. Specifically, we used the recently developed RCTD software [36] (v.1.0.0) to infer cell type composition on each spatial location. Following the original RCTD paper, we used a single-nucleus RNA-seq data [74] to serve as the reference panel for RCTD fitting, which contains 19 cell types. In the analysis, RCTD rejected 1494 beads and assigned cell type labels to 11,061 cells confidently from 9554 beads. We converted the inferred cell types into binary indicators and used them as covariates in SE analysis. Because RCTD produced confident cell type assignment using a set of 3338 genes (after RCTD filtering) on 11,061 cells (inferred from 9554 beads), we performed analysis on these genes and locations in the covariate adjusted SE analysis.
Besides conditional SE analysis, we also performed the cell type-specific SE analysis in the data. Specifically, we rely on the cell type composition estimates from RCTD to extract the locations that are dominated by Purkinje cells or granular cells. The Purkinje cells are primarily located in the thin Purkinje cell layer while the granule cells are primarily located in the thick granular layer; both layers are of highly irregular shapes. After removing genes with no expression on any of the selected cells, we performed SE analysis using SPARK-X for 3006 genes on 652 Purkinje cells and 3288 genes on 5891 granule cells.
High-definition spatial transcriptomics data
High-definition spatial transcriptomics is a method to capture RNA from tissue sections on a dense, spatially barcoded bead array, allowing transcriptome-wide measurements with 2-micron resolution. We obtained the HDST dataset collected on the mouse olfactory bulb from Broad Institute’s single-cell repository with ID SCP420. We used the file “CN24_D1” which contains 19,951 genes measured in 181,380 spots with known spatial location information. Each spot is a 2-micron well, approaching one fifth of the size of mammalian cells. We filtered out mitochondrial genes and genes that are not expressed on any spot and we removed spots with no gene-expression count. We analyzed a final set of 19,913 genes on 177,455 spots. The HDST data is extremely sparse with 99.96% of entries being 0 (Additional file 1: Table S1). We performed clustering analysis on the detected SE genes. To do so, for each gene in turn, we log transformed the raw count and scaled the transformed value further to have a mean of zero and standard deviation of one across all spots. We then used the hierarchical agglomerative clustering algorithm in the R package amap (v.0.8–18) to cluster identified SE genes into five gene groups.
We performed conditional SE analysis on a subset of the HDST data to examine the extent to which SE genes display spatial expression pattern beyond those explained by spatial distribution of cell types. To do so, we first extracted the most likely cell type for each spot based on the original publication and kept the spots with confident cell type assignment (P-adjust< 0.05). After filtering out mitochondrial genes and genes that are not expressed on any spots, we analyzed a final set of 17,121 genes on 103,602 spots. There are 63 cell types including non-neuronal cell types and multiple neuronal subtypes clustered in the original publication. We treated the assigned cell types as covariates for SE analysis on these spots.
We also performed cell type-specific SE analysis on a subtype of inhibitory neurons, the olfactory bulb inner horizontal cells (OBINH2). We selected these inhibitory neurons as they have the largest cell numbers among all cell types in the data. In the analysis, we extracted spatial locations that are labeled as OBINH2 neurons and removed genes that are not expressed on any of the extracted locations. In total, we analyzed 11,504 genes measured on 15,650 spatial locations in the cell type-specific SE analysis.
10X Visium data
The 10X Visium is a platform that builds upon on the original Spatial Transcriptomics technology with improvements on both resolution (55-micron resolution, with smaller distance between barcoded regions) and experimental time. We obtained a Visium dataset collected on the human heart tissue from the 10X Visium spatial gene-expression repository (https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Human_Heart). The data contains 36,601 genes measured on 4247 spots with known spatial location information. Each spot is a 55-micron well. We filtered out mitochondrial genes and genes that are not expressed in any spot. We analyzed a final set of 20,904 genes on 4247 spots. The 10X Visium human heart data is relatively sparse with 90.81% of entries being 0 (Additional file 1: Table S1). In addition, we also obtained a Visium dataset collected on the human ovarian cancer tissue from the 10X Visium spatial gene-expression repository (https://support.10xgenomics.com/spatial-gene-expression/datasets/1.2.0/Targeted_Visium_Human_OvarianCancer_Pan_Cancer). The data contains 1198 genes measured on the 3493 spots. After removing spot with no gene-expression count, we analyzed a final set of 1198 genes on 3492 spots. Since this data is generated with an enriched library prepared using the Human Pan-Cancer Panel, only 73.78% of entries are 0.
STARmap data
Spatially resolved Transcript Amplicon Readout Mapping (STARmap) is a technology for 3D intact-tissue RNA sequencing. STARmap integrates hydrogel-tissue chemistry, targeted signal amplification, and in situ sequencing. We obtained the STARmap dataset collected on the mouse visual cortex from STARmap resources (https://www.starmapresources.com/data). We used the data collected in the sequentially encoded experiment, which contains 28 genes measured in 33,598 cells with known 3D spatial location information. These 28 genes include 23 cell type markers and 5 activity-regulated genes. After filtering out cells with no gene-expression count, we analyzed a final set of 28 genes on 32,845 cells. The STARmap data are of high counts with almost no zero values (1.2% zeros; Additional file 1: Table S1). We followed the procedures in the original paper [75] for cell type clustering. Specifically, we first applied log transformation to the raw count data and obtained the relative gene-expression levels through adjusting for the log-scale total read counts. We then clustered cells into inhibitory neurons, excitatory neurons, and non-neuronal cells using Gad1, Slc17a7, and four non-neuronal genes (Flt1, Mbp, Ctss, Gja1) using the K-means clustering algorithm.
For each of the above datasets, we performed permutations to construct an empirical null distribution of P values for each method by permuting the bead/spot/cell coordinates either ten times (for Slide-seq, Slide-seqV2, HDST, and 10X Visium data) or a thousand times (for STARmap data). Afterwards, we examined control of type I errors by the different methods on the basis of the empirical null distribution of P values. We declared an SE gene as significant based on an empirical FDR threshold of 0.01. We note that standard P value cutoffs such as Bonferroni-corrected P value threshold can also be used for SPARK-X due to its calibrated type I error control.
SE gene validation and functional gene set enrichment analysis
For the Slide-seq data, we obtained lists of genes that can be used to serve as unbiased validation for the SE genes identified by different methods. Specifically, we obtained the cerebellum gene list from the Harmonizome database [29], which consists of 2632 cerebellum-related genes identified in two datasets (Allen Brain Atlas adult mouse brain tissue gene-expression profiles; and TISSUES curated tissue protein expression evidence scores). In addition, we obtained a gene list from Wizeman et al. [30], which contains 4152 cell type markers genes in cerebellum. We used the two gene lists to validate the SE genes identified by different methods.
We also performed functional gene set enrichment analysis on the significant SE genes identified by SPARK-X and SPARK-G. We performed enrichment analyses using the R package clusterProfiler [76] (v.3.12.0) with GO terms and Reactome pathways. In the package, we used the default “BH” method for multiple-testing correction and set the default number of permutations to be 1000. We declared enrichment significance based on an FDR of 0.05.
Compared methods
We compared SPARK-X with three existing methods for detecting genes with spatial expression patterns: SPARK (v.1.1.0) [6], SPARK-G (the Gaussian version of SPARK), and SpatialDE (v.1.1.3) [5]. We did not include the trendsceek in the comparison due to its high computational burden. For example, it takes trendsceek 40 h to analyze a simulated data with 10,000 genes measured on 300 locations, which is 5000 times slower than SPARK-X. The computational burden of trendsceek becomes even heavier on larger datasets. We applied all four methods to the simulated data with SPARK being restricted to the settings that have a sample size ≤ 3000 due to its heavy computational burden. We applied SPARK-X, SPARK-G, and SpatialDE to the Slide-seq and Slide-seqV2 data. The SPARK-G and SpatialDE are not scalable when the sample size is over approximately 30,000. Therefore, we did not apply these two methods to the HDST data. The SpatialDE gave out error when we applied it to the STARmap data; thus, we only present the results from SPARK-X and SPARK-G there. For SPARK and SpatialDE, we adopted their default settings to filter data. Specifically, for SPARK, we filtered out genes that are expressed in less than 10% of the spatial locations and selected spatial locations with at least ten total read counts; for SpatialDE, we filtered out genes with aggregate expression count less than three and selected spatial locations with at least ten total read counts. For the SPARK-G, we did not perform any additional filtering. The number of analyzed gene for each method is provided in Additional file 1: Table S1.