simATAC: a single-cell ATAC-seq simulation framework

Single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq) identifies regulated chromatin accessibility modules at the single-cell resolution. Robust evaluation is critical to the development of scATAC-seq pipelines, which calls for reproducible datasets for benchmarking. We hereby present the simATAC framework, an R package that generates scATAC-seq count matrices that highly resemble real scATAC-seq datasets in library size, sparsity, and chromatin accessibility signals. simATAC deploys statistical models derived from analyzing 90 real scATAC-seq cell groups. simATAC provides a robust and systematic approach to generate in silico scATAC-seq samples with known cell labels for assessing analytical pipelines. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-021-02270-w).


Background
Single-cell sequencing has revolutionized and expedited our understanding of the structure and function of cells at unprecedented resolution. This technology resolves a fundamental limitation of bulk sequencing, which averages signals over a large number of cells resulting in obscured biological heterogeneity among individual cells [1]. The assay for transposase-accessible chromatin sequencing (ATAC-seq) measures chromatin's openness, a proxy for the activity of DNA binding proteins [2][3][4]. Single-cell ATAC-seq (scATAC-seq) has opened up vast fields of applications, including extracting accessibility and co-accessibility patterns of genomic regions to identify cell-type-specific enhancers, chromatin heterogeneity, and transcription factor activities.
The rapid advancement of scATAC-seq technology has given rise to the development of computational tools for scATAC-seq data and the integrative analysis of transcriptomic and epigenomic profiles [5][6][7]. Though simulated datasets with known cell labels have been the most common approach to benchmark the performance of analytical pipelines, there is no existing standard practice or simulation tool available to generate synthetic scATAC-seq datasetsfrom real single-cell samples. Some previous studies generated in silico data by downsampling reads from bulk or existing scATAC-seq data or deploying simple sampling algorithms [8][9][10][11][12][13][14][15][16][17][18]. However, these simulation methods were implemented as part of scATAC-seq analytical tool development and were usually incompletely documented, resulting in a lack of reproducibility. Further, due to the sparsity and noisy nature of scATAC-seq data, generating synthetic samples that closely resemble real datasets remains challenging.
Most scATAC-seq analytical pipelines consist of pre-processing (read quality control and alignment), core analysis (feature matrix generation), and downstream analyses (e.g., cell type clustering). A feature matrix summarizes the filtered reads from BAM files by counting the number of aligned reads that overlap within the defined genomic regions. The features represent a subset of genomic regions with specified genomic positions, nucleotide patterns, or biological functions [8,19,20]. A commonly used feature matrix for scATAC-seq is the peak-by-cell matrix that captures the highest accessibility signals from genomic regions (peaks) obtained from bulk ATAC-seq or single cells. However, a sufficient number of cells are required to identify such peaks, and consequently, the peak-by-cell feature matrix usually fails to recognize the rare cell type regulatory patterns [11]. Alternatively, the bin-by-cell feature matrix is generated by segmenting the whole genome into uniformly-sized non-overlapping "bins" and mapping the read counts to each bin [11]. Unlike the peak-by-cell matrices, the uniform segmentation of the genome does not screen out any genomic region, and thus has the potential to detect rare cell groups.
We hereby propose simATAC, a scATAC-seq simulation framework that generates simulated samples resembling real scATAC-seq data. Given a real scATAC-seq feature matrix as input, simATAC estimates the statistical parameters of the mapped read distributions by cell type and generates a synthetic count array that captures the unique regulatory landscape of cells with similar biological characteristics. We demonstrate that the synthetic samples generated by simATAC highly resemble real scATAC-seq datasets in library size, sparsity (proportion of zero entries), and averaged chromatin accessibility signals.

simATAC framework
simATAC deploys statistical distributions to model the properties of a bin-by-cell count matrix for a group of cells with similar biological characteristics. The main modeling parameters include read coverage of cells (library size), non-zero cell proportion in each bin, and the average of read counts per bin (bin mean). Bin-by-cell matrix quantifies the number of open chromatin read fragments overlapping with the fixed-length bins (5 kbp windows) across the whole genome. For each user-input real scATAC-seq dataset, simATAC performs two core simulation steps: (i) estimating the model parameters based on the input bin-by-cell matrix, including the library sizes of the cells, the non-zero cell proportion, and the read count average of each bin; (ii) generating a bin-by-cell matrix that resembles the original input scATAC-seq data by sampling from Gaussian mixture and polynomial models with the estimated parameters. simATAC outputs a count matrix as a SingleCellExperiment (SCE) object [21], with additional options to convert it to other formats of feature matrices. Figure 1 summarizes the simulation architecture of simATAC. We discuss the statistical modeling of simATAC in the next sections and the "Methods" section. The simATAC simulation framework. The red circles represent values directly extracted from the user-input bin-by-cell matrix, the white squares represent estimated parameters, and the brown circles represent the simulated values. simATAC initially estimates the library sizes, the non-zero cell proportions, and the bin means from the input cell group (including cells having similar biological characteristics). simATAC generates the library sizes with a Gaussian mixture distribution, the zero and non-zero status with a Bernoulli distribution, and the bin means with a polynomial regression model linking to the non-zero cell proportion. The synthetic counts are sampled from a Poisson distribution whose mean is a factor of the cell library size adjusted by the bin mean. simATAC offers optional sparsity adjustment factor γ and noise parameters to adjust the sparsity and noise level of the synthetic counts generated from the Poisson distribution. The simATAC framework simulates synthetic scATAC-seq data using the default values for all the parameters if no user-input real data or parameters are given

Library size
Library size refers to the number of aligned reads per cell. simATAC models cells' log-transformed library sizes through a Gaussian mixture model (GMM) with two components whose parameters are estimated based on the user-input real scATAC-seq data. With the estimated parameters, simATAC randomly samples library sizes of C single cells based on where l i is the simulated library size for the ith simulated single-cell. See Table 1 for the detailed definition of the Gaussian mixture model parameters. Table 1 Input parameters to the simATAC simulation step

Parameter Symbol Description
Library size mean μ 1 , μ 2 The estimated means of two Gaussian modals of library size.
Library size standard deviation σ 1 , σ 2 The estimated standard deviations of two Gaussian modals of library size.
Library size weight w The estimated weight parameter of the first Gaussian modal.
Non-zero cell proportion P The proportion of non-zero cells in real bin-by-cell matrix.

Polynomial coefficient β
The estimated coefficients of polynomial model fitted to the relation between bin non-zero cell proportions and bin means.
Previous studies have shown that the library sizes of cells significantly affect the identification of cell types [11-13, 15, 17]. Higher sequencing coverage usually results in a larger library size, thus providing more accurate chromatin accessibility information. Since the library sizes of scATAC-seq usually vary across different experiments, simATAC offers users the flexibility to adjust the library sizes from low to high coverage based on their needs.

Bin non-zero cell proportion
Sparsity is the inherent nature of scATAC-seq data [10,11,17], which results in a large proportion of zero entries in the bin-by-cell matrix. Let M j,i denote the number of reads that fall into the bin j of cell i for B bins. If M j,i > 0, it is considered as a non-zero entry. The number of cells with non-zero entries within a bin is associated with the chromatin accessibility in the corresponding genomic region. Based on the user-input real scATACseq bin-by-cell matrix, simATAC first estimates the proportion of cells with non-zero entries for the jth bin, p j , and then determines whether an entry in the simulated count matrix is zero or not based on a Bernoulli distribution, If X j,i = 1, the read count of cell i at bin j is non-zero, i.e. M j,i > 0. If X j,i = 0, the read count of cell i at bin j is set to zero, i.e. M j,i = 0. The non-zero cell proportion of bin j, p j , of the simulated bin-by-cell matrix is then defined as

Bin mean
The extent of genome accessibility leads to the variations in the number of sequenced reads falling into the fixed-length bins, and consequently, the variations in the average of the reads in each bin. More accessible regions potentially have larger bin means, that is, more cells with non-zero entries are mapped to that region. Based on the modeling scATAC-seq datasets, we observed a polynomial regression relationship between the non-zero cell proportions and the bin means for each cell group. simATAC simulates the average of the read counts at bin j, m j , by where β 0 , β 1 , and β 2 are estimated based on the input real scATAC-seq dataset.

Bin-by-cell count matrix
simATAC generates the final count of cell i at bin j, c j,i , using a Poisson distribution with γ c j,i as the mean parameter, where c j,i is the library size of cell i scaled by the bin mean m j at bin j, and γ denotes the sparsity adjustment factor with a default value of 1. When γ < 1, the simulated scATAC-seq data tend to be more sparse, and vice versa.
To reproduce the high noise level of scATAC-seq data, simATAC offers an optional step to include additional noise to the final simulated counts by High noise level blurs the difference in the read distributions between different cell types, which mimics real sequencing artifacts. However, as the noise level increases, the distribution of the library sizes and the sparsity of the simulated data may differ from the input real scATAC-seq data. The default setting of simATAC omits the optional adding noise step, but leaves users the flexibility to set their desirable noise level.
simATAC outputs the final simulated bin-by-cell matrix as a SCE object, a container for single-cell genomics data, from the SingleCellExperiment R package [21].

Simulating feature matrices in other formats
simATAC can also generate synthetic scATAC-seq data in other formats of feature matrices. The default bin-by-cell output can be converted to a peak-by-cell matrix by filtering bins enriched in aligned reads (top bins by measuring the bin means). simATAC generates a synthetic peak-by-cell count matrix with the "simATACgetPeakByCell" function, given a simulated bin-by-cell matrix by simATAC and a user-specified number of peak bins. simATAC also offers additional functionality to extract other feature matrices given a user-input list of regions (in a BED format) with the "simATACgetFeatureByCell" function. Another commonly used feature matrix format is the binary version, which is provided by "simATACgetBinary" function.

Evaluation
In this section, we demonstrate the resemblance of the simulated samples generated by simATAC to the input real scATAC-seq datasets. The simulated samples are compared to the real samples on the distributions of library size, sparsity, and bin means. We also evaluate the clustering performance of the simulated matrices. The evaluations are performed on each cell group (or cell type) from the annotated benchmark scATAC-seq datasets, Buenrostro2018 [22], Cusanovich2018 [23], and PBMCs [24], representing a wide range of platforms, cell types, and species. See the "Methods" section and Additional file 1: Table  S1 for the description of cell types and platforms of these datasets.

Statistical evaluation
With each of the three real scATAC-seq datasets as input, we simulate bin-by-cell matrices for each cell group with the same number of cells as in the real datasets. We then compare the distribution of library size, bin means, and sparsity of the simulated datasets to the real samples, by cell group. We present four cell groups from each benchmark dataset to demonstrate the similarity. Figure 2 depicts the library size distributions of the scATAC-seq data simulated by simATAC using 12 cell groups from the benchmark datasets as input. The library size distributions of the simulated data highly resemble those of the real datasets. See Additional file 1: Figure S1 for the complete comparison between the real and simulated data for all the cell groups in the three benchmark datasets. simATAC synthesized bin-by-cell matrices preserve the accessibility of genomic regions with its input real scATAC-seq dataset. Table 2 summarizes the Pearson correlation of the bin means and non-zero cell proportions between the simulated and the input real scATAC-seq datasets. The high correlations demonstrate that simATAC retains the genomic region accessibility characteristics of the input real data. See Additional file 1: Table S2 for the complete comparison for all cell groups from the benchmark datasets. All the reported Pearson correlations are averaged over 20 simulation runs. Figure 3 illustrates the sparsity of the simulated bin-by-cell matrices in 12 cell groups of the benchmark datasets, which demonstrates that the synthetic samples generated by simATAC retains the sparsity of the real samples across bins and cells. See Additional file 1: Figures S2-S3 for other cell groups comparison. To demonstrate the bin sparsity resemblance of the simulated to the real data, we provide the bin sparsity QQ-plots of 12 benchmark sample groups for sparsity adjustment factor γ = {0.8, 0.9, 1} in Additional file 1: Figures S4-S6.
We further investigate the impact of the bin sparsity adjustment parameter γ on the simulated bin-by-cell matrices. We observe that mild changes in the sparsity adjustment  factor do not significantly affect the distributions of cell sparsity, library sizes, or bin sparsity. See Additional file 1: Figure S7 for a comparison of these distributions under different sparsity adjustment factor values γ = {0.8, 0.9, 1, 1.1, 1.2} using the LMPP cell type from the Buenrostro2018 dataset. We also report the median absolute deviation (MAD), mean absolute error (MAE), and root mean square error (RMSE) of the sorted library sizes, bin means, and non-zero Fig. 3 Comparison of the bin sparsity and cell sparsity distributions. Bin sparsity QQ-plots and cell sparsity box plots of the simulated (in green) and real (in purple) scATAC-seq data are illustrated for the three benchmark datasets: Buenrostro2018, Cusanovich2018, and PBMCs. The sparsity of the synthetic data generated by simATAC (without additional Gaussian noise) closely resembles that of the corresponding real scATAC-seq input data cell proportion of each bin between the real and the simulated datasets in Table 3. The reported metrics are averaged over 20 simulation runs. The small values of MAD, MAE, and RMSE suggest that the synthesized bin-by-cell matrices' properties closely resemble those of the real input data.
The main functionality of simATAC is to simulate bin-by-cell scATAC-seq count matrices, yet it also offers additional functionalities such as converting the simulated bin-by-cell matrices to peak-by-cell feature matrices using the "simATACgetPeakByCell" function. We show that the extracted peak-by-cell matrices preserve the chromatin accessibility information of the real input data. We compare the clustering performance and peak overlaps of the simulated data with those extracted from two commonly used non-segmenting peak callers, MACS2 [25] and Genrich [26]. See Additional File 1: Figure S8, Tables S3-S4, and Note S2 for the detailed comparison.

Clustering evaluation
The ability to cluster cells with similar biological characteristics is one of the major evaluation aspects of many scATAC-seq analytical tools. Many previous studies reported close to perfect cell clustering performance when evaluating scATAC-seq pipelines using simulated data [11,12,17], which is not ideal in comparing the performance across different computational tools.
We here show that the simulated data by simATAC produce realistic downstream analysis results. Further, if the data is simulated based on an input real scATAC-seq dataset, the clustering performance based on the simulated data closely resemble those based on the real input data. Table 4 summarizes the clustering metrics, normalized mutual information (NMI), adjusted mutual information (AMI), and adjusted Rand index (ARI) evaluated on the simulated datasets by simATAC. We adopt the SnapATAC graph-based clustering algorithm, and all the reported metrics are averaged over 20 simulation runs [11].
We compare the clustering metrics using the simulated data with those using the input real scATAC-seq data, which is listed on the "Real-input data" row. The clustering results in Table 4 show that the simulated data by simATAC maintain the chromatin accessibility information of each cell group reasonably close to the real data. The clustering metrics of the synthetic data can get very close to the real data when adding Gaussian noise and achieve realistic clustering performance in contrary to existing simulation methods. We further assess the distributions of the simulated counts under these three recommended Table 3 Median absolute deviation (MAD), mean absolute error (MAE), and root mean square error (RMSE) average for sorted real and sorted simulated library sizes, real and simulated bin means, and real and simulated non-zero cell proportions across all cell groups in the associated dataset. The reported values are the averages of these metrics ± the corresponding standard deviations based on 20 simulation runs. The unit of the values is 0.1%

Buenrostro2018
Cusanovich2018  noise levels in Additional file 1: Figures S9-S14 and Tables S5-S6, which suggests that there is a trade-off between the closeness in clustering performance and the closeness in the distributions of library size and the sparsity parameters between the simulated and real samples. We also note that the impact of Gaussian noise level on clustering metrics may vary across different cell types, and thus simATAC offers the flexibility for users to adjust their desirable noise level.
In Table 4, we assume the sparsity parameter γ = 1 when simulating all the data. We show in Additional file 1: Table S7 the impact of γ on the downstream clustering performance of the simulated data. With all other parameters fixed, the clustering results are generally consistent across different values of γ . Hence, we set the default value of γ to 1 in simATAC.

Discussion and conclusions
The rapid development of scATAC-seq technology led to a surge of scATAC-seq analytical tools. However, the lack of systematic simulation frameworks hinders the consistent evaluation of the computational tools and reproducibility of the analytical results. To meet this need, we developed simATAC, a systematic scATAC-seq simulator that generates synthetic samples that closely resemble real scATAC-seq data. simATAC builds upon Gaussian mixture distribution to model cells' library size, and polynomial regression model to represent the relationship between the average of bin counts and the non-zero cell proportion of bins. Moreover, simATAC grants users the flexibility to adjust parameters manually. simATAC generates a synthetic bin-by-cell matrix given a real scATAC-seq dataset as input. If there are no user-specified count matrix or parameters available, simATAC simulates samples using the default parameters derived from real scATAC-seq data. A list of estimated values for library size Gaussian mixture distribution and polynomial function parameters are provided in Additional file 2. simATAC also offers additional functions to transform the bin-by-cell matrix into feature matrices in other formats.
The statistical modeling framework of simATAC is built upon 90 real scATAC-seq cell groups from various sequencing platforms, species, and cell types. We demonstrated the distributions of the library sizes, bin means, and sparsity parameters of the simATAC synthetic datasets resembling those of the real input examples. simATAC offers additional options to modify the noise levels to mimic the artifacts in real scATACseq data and generate samples with various difficulty levels for downstream analyses assessment.
Quality control steps in most scATAC-seq pipelines apply filters on the raw feature matrices, such as removing cells with a library size less than a fixed threshold or filtering out regions based on the number of mapped reads. As the denoising thresholds may affect the downstream analyses, simATAC offers users the flexibility to manually set the quality control filtration thresholds on the simulated raw count matrix.
The feature matrices generated by simATAC cover regions spanning the whole genome without discarding the off-peak reads, enabling the identification of rare cell types in complex tissues. While the peak-by-cell matrix has been the common version of scATAC-seq feature matrix to be analyzed, recent studies challenged this strategy and proposed the bin-by-cell version for downstream analyses. Peak calling pipelines perform differently in defining accessible genomic regions based on the approach they deploy. Therefore, we believe genome binning is an optimal approach to simulate scATAC-seq data, providing a standard representation for samples from different sources. simATAC generates count matrices using a 5 kbp bin window, and at this resolution, each human cell spans ∼ 600,000 bins. However, there is no standard agreement on the optimal bin size for all samples, and bin size selection induces a trade-off between the ability to capture chromatin accessibility signals and computational cost. We assessed the runtime of simATAC on a desktop workstation (Intel(R) Xeon(R) CPU @ 3.60GHz processor). Simulating 1,000 human cells at 5 kbp window size on average took 43 seconds in five simulation runs. See Additional file 1: Table S8 for the running time of all benchmark datasets.
To our best knowledge, simATAC is the first scATAC-seq simulator that directly simulates bin-by-cell count matrices that are reproducible and closely resemble real data. Though the availability of real scATAC-seq data has been increasing, real scATAC-seq data with annotated labels ("ground truth") remains lacking. simATAC offers users the convenience to generate scATAC-seq data with known cell types and desirable number of single cells, yet closely resemble the real data. Further, simATAC provides users the flexibility to adjust the library size, bin sparsity, and noise level of the simulated data. We envision that simATAC empowers users to develop scATAC-seq analytical tools effectively and reproducibly.

simATAC statistical modeling
We compiled and processed each of the 90 scATAC-seq cell groups as well as each of the eight datasets (considering all cell groups together) to model the library size parameter. We conducted the Kolmogorov-Smirnov test and the chi-squared test to test the goodness of fit of the log-transformed library sizes to the Gaussian probability distribution, using the stats and fitdistrplus R packages [27][28][29][30]. The p-values of the goodness of fitness tests showed that non-10xG samples generally follow a Gaussian distribution.
Our preliminary statistical analysis of the 10xG scATAC-seq data showed that many of them are sampled from a mixture of probability distributions. We tested the null hypothesis if the 10xG samples' library sizes are sampled from a unimodal probability distribution using Hartigan's dip test from the diptest R package [31,32]. In most of the modeling datasets, the null hypothesis is rejected at a significance level of α = 0.05. Considering the probability density function and the cumulative distribution function plots, we modeled the log-transformed library sizes with a Gaussian mixture model with two modes and estimated the parameters using the mixtools R package [33]. Statistical parameters of the aforementioned tests for library size modeling are provided in Additional file 3.
We observed a significant difference in the distribution of library sizes between the real scATAC-seq data generated by the 10xG platform and other platforms. Library sizes of the non-10xG samples generally fit a unimodal Gaussian model, while those of the 10xG samples fit a bimodal GMM. All the statistical analyses results are provided in the Additional file 3. simATAC simulates the library size using a bimodal GMM for samples from all platforms, and for non-10xG samples, the weight of the second Gaussian distribution can be set to zero.
To recover the scATAC-seq data sparsity, simATAC first assigns zero or non-zero labels to the bin-by-cell matrix using a Bernoulli distribution for each bin. The probability that a cell at bin j is non-zero is the estimated non-zero cell proportion at the corresponding bin of the input real scATAC-seq dataset.
Based on the 90 real scATAC-seq modeling cell groups, we observed a polynomial relationship between the non-zero cell proportions and the bin means in the normalized real bin-by-cell arrays. The input matrix is normalized by dividing primary counts by the cells' library size and multiplying by the median of library sizes. The quadratic relation between bin means and non-zero cell proportions for the 12 sampled cell groups of benchmark datasets are provided in Additional file 1: Figure S15. simATAC estimates the regression parameters using the lm function from the stats package [29], and calculates bin means based on Eq. 4. Note that the parameters in Eq. 4 are estimated by cell types/groups, as the chromatin accessibility patterns of different cell types vary biologically.

Evaluation metrics
We assessed the simulated feature matrices by calculating the absolute differences between sorted real and sorted simulated library size vectors, real and simulated bin means, and non-zero cell proportion vectors of original and synthetic count matrices. The MAD, MAE, and RMSE of these vectors are computed by the following equations, where R are the real values and S are the simulated values: We used three metrics to evaluate the clustering performance, normalized mutual information (NMI), adjusted mutual information (AMI), and adjusted Rand index (ARI).
Considering gt and pred as the ground truth and predicted labels, NMI is calculated using

NMI(gt, pred) = MI max(H(gt), H(pred))
, (10) where MI = MI(gt, pred) is the mutual information (MI) between gt and pred, and H is the entropy. AMI is defined as where E(·) is the expectation function.
Using the same notations, ARI is defined as where the Rand index (RI) is a similarity measure between two lists of labels. See more details of NMI, AMI, and ARI in Additional file 1: Note S2 [34].
• The PBMCs dataset is produced by the 10x Genomics Chromium (10xG) dropletbased platform and comprises 5335 cells from human peripheral blood mononuclear cells (PBMCs) [24]. There are no true cell type labels for PBMCs cells. However, we used 10x Genomics Cell Ranger ATAC's [24,41] clustering labels as ground truth and performed simulation on each cluster. Although existing labels are not perfect, we included this data to evaluate how simATAC mimics features of a group of cells with similar biological characteristics from a droplet-based platform [17].
All the aforementioned datasets are publicly available. The detailed information of all samples with cell groups and numbers are provided in the Additional file 1: Table S1. We also provide a table of abbreviations in Additional file 1: Table S9.

Dataset pre-processing
The raw FASTQ or BAM files were downloaded from the links provided, and bin-by-cell matrices used in simATAC development were generated using the SnapTools (version: 1.4.6) [11,42]. SnapTools is a Python module that pre-processes scATAC-seq data. Snap-Tools aligns raw FASTQ files to a reference genome using the Burrows-Wheeler aligner. Reads that were properly paired according to the SAM flag value, uniquely mapped with mapping quality > 30, and had a length less than 1,000 base pairs were filtered for further analyses. SnapTools groups the reads with the same barcode and removes PCR duplicate reads in each group. SnapTools outputs a .snap file, which is an hdf5 file that stores the input scATAC-seq data, including cell-by-bin matrix used in the development and analyses of simATAC modeling [11]. For 10xG samples, we started from the fragment.tsv file provided by 10x website, which is a barcoded and aligned fragment file processed, with an implemented option by SnapTools for 10xG samples. The rest of the samples were processed from FASTQ or provided BAM files, and unique randomly generated barcodes were added to the samples that did not have barcodes themselves.
We used samtools (version 1.10) for some of our pre-processing [43]. Bedtools (version 2.27.1) was used for generating peak-by-cell matrices [44] and Picard tool (version 2.23.3) for removing duplicate reads [45]. SnapATAC (version 1.0.0) R package was used for data pre-processing, loading bin-by-cell matrices, and clustering analysis [42]. We also used Signac R package (version 1.0.0), which is an extension of Seurat for the clustering analysis of peak-by-cell matrices [46].