Overview of scATAC-pro workflow
scATAC-pro consists of two units, the data processing unit and the downstream analysis unit (Fig. 1). The data processing unit takes raw fastq files for reads and barcodes as the input and outputs peak-by-cell count matrix, QC report, and genome track files. It consists of the following modules: demultiplexing, adaptor trimming, read mapping, peak calling, cell calling, genome track file generation, and quality control assessment. The downstream analysis unit consists of the following modules: dimension reduction, cell clustering, differential accessibility analysis, Gene Ontology analysis, TF motif enrichment analysis, TF footprinting analysis, prediction of chromatin interactions, and integration of multiple data sets. We provide flexible options for all modules. Details about each module are summarized in Additional file 2: Table S1. We designed the scATAC-pro to be user friendly. In each run, users just need to specify the input file (“--input” flag), the module name (“--step” flag), and a configuration file (“--config” flag) in which users provide parameters and options for the analysis modules. Users can choose to run the entire or partial workflow. By default, all results are saved in the “output” directory under the current directory (--output_dir “./output”).
scATAC-pro provides flexible choices of methods for many analysis tasks
We provide at least two methods for most data processing and analysis modules. There are several reasons to have multiple methods for a given data processing or analysis task. First, for certain tasks such as read mapping, many tools such as BWA [21], Bowtie [22], and Bowtie2 [23] exist that perform equally well but have different levels of trade-off between mapping accuracy and mapping speed [23, 24]. Users can choose among those aligners based on different goals. For cell calling, cells can be called by filtering low-quality barcodes [1, 2, 6, 9] or by using model-based approaches (e.g., cellranger-atac). Each category of methods has its pros and cons that can be tailored towards the goals of the analysis.
Second, in cases where a general-purpose and top-performing method exists, such as MACS2 [19] for peak calling, there are other methods that are more suitable for specific tasks. For example, the genome-wide event finding and motif discovery algorithm, GEM [25], was shown to have better performance in identifying peaks overlapping with transcription factor binding sites (TFBS) [26]. In this case, the users might prefer GEM over MACS2.
Third, often times, there is a need for methods that can address data set-specific characteristics. For example, using both real and simulated data, we found that the clustering performance using binarized/non-binarized data varies among different methods. For instance, cisTopic performs better using binarized data whereas SCRAT performs better using non-binarized data (Additional file 1: Fig. S1A, Fig. S2D). Even for the same clustering algorithm, we found that most of the time, both Louvain (Additional file 1: Fig. S1B) and K-mean (Additional file 1: Fig. S1C) clustering algorithms work better with non-binarized data although at higher noise levels, the performance difference becomes smaller, and in some cases using binarized data is more accurate. Given this observation, we provide methods that work on either binarized or non-binarized data or both (see “Methods” for details). Another example is clustering, a critical task for understanding heterogeneity in a cell population. To help select better clustering methods, we conducted benchmarking studies using simulated data and real data. The compared methods include scABC, chromVAR, cisTopic, latent semantic indexing or LSI [2], SCRAT, and Louvain algorithm implemented in Seurat v3 [27]. Based on the benchmarking result (Additional file 1: Fig. S2), we recommend cisTopic, SCRAT, and Seurat as the top-performance methods although all above methods are available in scATAC-pro.
scATAC-pro provides carefully evaluated default settings for all modules
Method and parameter choice makes a big difference in the result of several analysis modules. We therefore provide a set of carefully evaluated default settings for each analysis module. We discuss the settings for the major modules as follows (see “Methods” for details).
Read mapping
Because of its balance between mapping speed and accuracy [28], especially for paired-end sequencing data, we choose BWA (specifically bwa-mem) as the default read aligner.
Peak calling
Peak calling is usually done on aggregated data across all barcodes. Such an approach fails to identify peaks that only appear in rare cell populations. We implemented a two-step strategy, similar to the idea used by Cusanovich et al. [6]. We first segment the genome into 5-kb bins and generate a bin-by-barcode count matrix, removing barcodes with fewer than 1000 unique fragments. We then cluster the barcodes using the graph-based Louvain algorithm using principal components as the input. Finally, we use MACS2 to call peaks on the aggregated data for each cell cluster. The final set of peaks are generated by merging peaks less than 200 bp apart identified from different cell clusters.
Cell calling
As default, we use a filtering strategy to distinguish cell barcodes from non-cell barcodes, because the method is intuitive, easy to interpret, and widely used among published studies [1, 2, 4, 6, 9, 17]. We define a barcode as a cell if its total number of unique fragments is greater than 5000 and the fraction of such fragments in peaks is greater than 50%. Users can use different thresholds for the fraction of fragments in enhancers, promoters, or mitochondrial genome to filter barcodes as well.
Normalization
We provide two normalization methods. The term frequency-inverse document frequency (TF-IDF) method [2, 6] treats count data as binary and normalizes data by sequencing depth per cell and total number of unique fragments per peak. In the second method, count data is first log-transformed, followed by a linear regression to remove the confounding factor due to varying sequencing depth per cell for every peak. This method enables users to work with non-binarized count data. The TF-IDF method is set as the default normalization method in scATAC-pro.
Dimension reduction and data visualization
We use principal component analysis (PCA) as the default dimension reduction method because it is the most widely used method for scCAS data and easy to interpret. Note that if the PCA is conducted on the TF-IDF normalized data, such dimension reduction is also referred as latent semantic indexing or LSI [2]. We use the PCA implementation in Seurat v3 with some modifications. In Seurat v3, the raw count matrix is log-transformed followed by a regression to remove confounding factors (the total number of unique fragments per cell). PCA is then performed on the transformed features. Because there are usually hundreds of thousand peaks in a scCAS data set, this process takes about a couple of hours to finish for a typical data set. In our implementation, we first perform PCA on the normalized peak-by-cell count matrix, followed by a regression analysis on each principal component. This procedure substantially reduces the computation time and produces very similar clustering results as the original Seurat implementation (Additional file 1: Fig. S3). Uniform Manifold Approximation and Projection (UMAP) [29] is used as the default visualization method.
Clustering
We provide the graph-based Louvain algorithm implemented in Seurat v3 as the default clustering method. Shared neighbor network (SNN) graph is constructed based on the first 30 principal components. Louvain algorithm is then performed on the SNN graph with default setting. We found that the Louvain clustering algorithm has a better balance between accuracy and running speed among several popular clustering methods (Additional file 1: Fig. S2).
scATAC-pro provides summary reports and interface to visualization tools
scATAC-pro generates quality assessment metrics for both aggregated and single-cell data. Two types of metrics are generated. The first type of metrics evaluates data quality internally, including read mapping rate, duplicate rate, high-confidence mapped fragments (MAPQ score greater than 30), and library complexity. The second type of metrics evaluates data quality using external annotations of genomic features, including fraction of fragments in mitochondrial genome and fraction of fragments overlapping with peaks and other annotated genomic regions, such as enhancers and promoters. The quality assessment summary reports are generated in html format. These statistics can be used to filter low-quality barcodes.
Besides quality assessment metrics, scATAC-pro also generates summary reports for downstream analyses, including dimension reduction, clustering, differential chromatin accessibility analysis, TF motif enrichment analysis and footprinting analysis, gene ontology analysis, and prediction of chromatin interactions.
To enable interactive exploration of data in scATAC-pro, we provide an interface to VisCello [20], a data explorer and visualization tool for single-cell omics data. To do this, we annotate the peaks with its nearest gene and mark genes with their TSSs located within the peak. Users can then visualize the chromatin accessibility signal of each peak or gene in all cells and identify differential accessible peaks among arbitrary groups of cells.
scATAC-pro provides utility functions to facilitate downstream analyses
Generation of input files for genome browser tracks for each cluster
In addition to visualizing scCAS signal in VisCello, it is also common to visualize scCAS signal for each cell population on a genome browser. To generate a normalized signal track file, bam file of cells from each cell cluster is first split from the bam file of all barcodes. Reads per cluster are then normalized as reads per kilobase per million mapped reads. scATAC-pro outputs normalized chromatin accessibility for each cluster in bigWig and bedGraph file formats, which can be directly uploaded to a genome browser for visualization.
Transcription factor footprinting analysis
ATAC-seq and related technologies use the Tn5 enzyme to cleavage nucleosome-free DNA while keeping the transcription factor binding sites intact due to protection by the bound TF. As a result, a small region, referred to as the footprint, exhibits reduced Tn5 cleavage rate at the ATAC-seq peak locus. Unlike TF motif enrichment analysis, TF footprinting analysis provides direct evidence of TF binding to the chromatin [30]. With Hint-ATAC [31], scATAC-pro enables footprinting analysis of either one cell cluster or differential TF binding between two groups of cell clusters.
Integration of multiple scCAS data sets
To integrate multiple scCAS data sets, assuming each data set is processed by scATAC-pro, we first merge peaks identified in each data set. Using this merged set of peaks, scATAC-pro reconstructs the peak-by-cell matrix for each data set. Because we generate the peak-by-cell matrix using the same set of peaks for all samples, it is straightforward to integrate the data sets using existing tools, such as Seurat v3 or Harmony [32]. scATAC-pro uses Seurat v3 as the default for this task.
Peak annotation and gene ontology analysis
To facilitate Gene Ontology analysis of genes associated with differential accessibility peaks, scATAC-pro first annotates each peak with its nearest gene. Gene Ontology analysis for those genes can then be performed using the runGO module. The background gene set is comprised of all genes associated with the differential accessibility peaks in all cell groups resulted from the differential accessibility analysis. This analysis helps users to further explore the identity of each cell cluster.
Predicting chromatin interactions by Cicero
Connecting regulatory DNA elements to target genes is a prerequisite to understanding transcriptional regulation. Cicero [33] predicts the interactions between cis-regulatory elements and the target genes using scCAS data. scATAC-pro generates the predicted interactions by running the runCicero module. The resulting interactions can be viewed through the UCSC genome browser.
Case study
We used three real data sets generated with different experimental protocols to demonstrate the utility of scATAC-pro: a data set of sorted human bone marrow hematopoietic cells generated using the Fluidigm protocol [8, 9] (Buenrostro2018), a data set of human peripheral blood mononuclear cells (PBMCs) generated using the 10x Genomics protocol [34], and a data set of 13 adult mouse tissues generated using the sci-ATAC-seq protocol [6] (Cusanovich2018). For the sake of brevity, we present the result based on the 10x Genomics data set in the main figures. Similar figures based on the two other data sets are presented as supplementary figures.
Starting from the fastq files, scATAC-pro first demultiplexed sequencing reads by adding the cell barcodes (R2.fastq.gz) information to the paired-end reads (R1.fastq.gz, R3.fastq.gz). Adaptor sequences were then trimmed off, mapped to the GRCh38 reference genome using scATAC-pro default settings. Fragments with mapping quality score (MAPQ score) less than 30 were removed. A summary report for mapping statistics and library complexity is provided for all reads (Fig. 2A) and reads belonging to called cells (Fig. 2B, C). Summary reports for the other two data sets are shown in Additional file 1: Fig. S4 and Fig. S5, respectively.
Using the default peak caller, scATAC-pro called 129,049 peaks after removing peaks overlapping with ENCODE blacklisted genomic regions [35]. Cell barcodes were selected by filtering out barcodes with fewer than 5000 total unique fragments and the fraction of unique fragments in peak less than 50% (Fig. 3A). Quality assessment report for each barcode was generated using various metrics, including distribution of insert size, transcriptional start site (TSS) enrichment profile, distribution of the total number of unique fragments for cell and non-cell barcodes, and fractions of unique fragments overlapping with annotated genomic regions (Fig. 3B–E). Overall statistics of data aggregated from all called cells was also computed (Fig. 3F). Figures showing quality assessment metrics for called single cells for the other two data sets are shown in Additional file 1: Fig. S6 and Fig. S7, respectively.
Downstream analyses including clustering, TF motif enrichment analysis, TF footprinting analysis, GO analysis, and cis-element interaction prediction were conducted using default scATAC-pro methods and settings (Fig. 4). In total, we found 10 cell types. The top 10 enriched TFs for each cluster are shown in Fig. 4B, which provides a means for identifying cell type associated with each cluster. For example, binding motifs of PU.1 (encoded by SPI1), IRF4, CEBPA, and CEBPB are highly enriched in clusters 0, 6, 7, and 8, suggesting those clusters are monocytes [36]. Motifs of EOMES and TBX5 were enriched in clusters 1, 2, and 5, suggesting those clusters are T cells. Enrichment of EBF1 [37] and BCL11A motifs [38] suggests cluster 3 represents B cells. The differential footprinting analysis between cluster 0 and the rest of clusters further suggests that cluster 0 represents monocytes, because the monocytic TFs PU.1, JUNB, JUN, CEBPA, and CEBPE [36, 39, 40] all have significant higher binding probability in cluster 0 cells (Fig. 4C).
Summary report for downstream analyses for the other two data sets is shown in Additional file 1: Fig. S8 and Fig. S9, respectively.
Using VisCello [20], we can display chromatin accessibility values of TSS regions of several marker genes across cell clusters (Fig. 5A and Additional file 1: Fig. S10A), such as MS4A1 (CD20) for B cells, GNLY and NKG7 for natural killer (NK) cells, CD3E for T cells, CD14, LYZ, and FCGR3A (CD16) for monocyte cells, and CST3 for dendritic cells (DC) [41]. We also displayed UCSC genome browser tracks for two example genes, CD14 (Fig. 5B) and the FCER1A (Additional file 1: Fig. S11). Taken together, based on the chromatin accessibility profile of known cell-type-specific marker genes, we annotated cell clusters as T cells, B cells, CD14+ monocytes, CD16+ monocytes, dendritic cells, and natural killer cells (Fig. 5C). VisCello can also conduct differential chromatin accessibility analysis between any two arbitrary groups of cell clusters (Additional file 1: Fig. S10B).