scATAC-pro: a comprehensive workbench for single-cell chromatin accessibility sequencing data

Yu, Wenbao; Uzun, Yasin; Zhu, Qin; Chen, Changya; Tan, Kai

doi:10.1186/s13059-020-02008-0

Software
Open access
Published: 20 April 2020

scATAC-pro: a comprehensive workbench for single-cell chromatin accessibility sequencing data

Wenbao Yu^1,2,
Yasin Uzun^1,2,
Qin Zhu³,
Changya Chen^1,2 &
…
Kai Tan^1,2,3,4

Genome Biology volume 21, Article number: 94 (2020) Cite this article

16k Accesses
20 Citations
18 Altmetric
Metrics details

Abstract

Single-cell chromatin accessibility sequencing has become a powerful technology for understanding epigenetic heterogeneity of complex tissues. However, there is a lack of open-source software for comprehensive processing, analysis, and visualization of such data generated using all existing experimental protocols. Here, we present scATAC-pro for quality assessment, analysis, and visualization of single-cell chromatin accessibility sequencing data. scATAC-pro computes a range of quality control metrics for several key steps of experimental protocols, with a flexible choice of methods. It generates summary reports for both quality assessment and downstream analysis. scATAC-pro is available at https://github.com/tanlabcode/scATAC-pro.

Background

Chromatin accessibility is a strong indicator of the activities of functional DNA sequences. Recently, multiple experimental protocols have been developed to profile genome-wide chromatin accessibility in single cells, including the Assay of Transposase Accessible Chromatin with high-throughput sequencing (scATAC-seq) [1], single-cell combinatorial indexing ATAC-seq (sci-ATAC-seq) [2], single-cell transposome hypersensitivity site sequencing (scTHS-seq) [3], plate-based scATAC-seq protocol [4], and droplet-based single-cell combinatorial indexing ATAC-seq (dsci-ATAC-seq) [5]. In this paper, we collectively define data generated with these experiment protocols as single-cell chromatin accessibility sequencing data, or scCAS data. Application of these protocols has helped to understand the epigenetic heterogeneity across cell populations in complex tissues during normal development and pathogenesis, including adult mouse tissues [6], forebrain development [7], hematopoietic differentiation and leukemia evolution [8, 9], and T cell development and exhaustion [10].

In contrast to the rapid growth of scCAS data, bioinformatics tools for scCAS data analysis are critically lacking. The majority of existing tools lack comprehensiveness in their ability to process scCAS data. Both chromVar [11] and single-cell regulome analysis toolbox (SCRAT) [12] work with preprocessed data and only report loss or gain of chromatin accessibility on a set of predefined genomic regions, which ignores a large amount of information in the data. Detection of cell-type-specific difference in chromatin accessibility (Detin) [13], single-cell accessibility-based clustering (scABC) [14], and cisTopic [15] focuses on identifying cell populations and/or differential accessible regions given the processed data such as bam files or in peak-by-cell count matrix.

To our knowledge, Scasat [16] and scitools [17] are the only published software for comprehensive analysis of scCAS data. However, Scasat is developed in the Jupyter notebook environment. Although it is interactive, the programming codes are hard to standardize and reuse and users need to customize the analysis step by step. Furthermore, Scasat binarizes raw peak-by-cell count matrix, which ignores the differences among accessible regions and thus may lead to loss of valuable information for downstream analysis. Scasat does not provide summary reports for either data quality assessment or downstream analysis. scitools is designed for sci-ATAC-seq data only. Another tool SnapATAC [18] also binarizes the raw count data and cluster cells based on bin-by-cell count matrix. Cellranger-atac by 10x Genomics is another comprehensive tool but only works with data generated using the 10x Genomics platform, and the software code is not open source. Additionally, some key analysis modules of cellranger-atac are not flexible and do not use state-of-the-art algorithms. For example, the peak calling module does not use state-of-the-art algorithms, such as model-based analysis for ChIP-Seq 2 (MACS2) [19], resulting in many problematic peaks.

Here, we present a comprehensive and open-source software package for quality assessment and analysis of single-cell chromatin accessibility data, scATAC-pro. It provides flexible options for most of the analysis modules with carefully curated default settings. Summary reports for both quality assessment and downstream analysis are automatically generated. Interface to an interactive single-cell data exploration and visualization tool VisCello [20] is also provided.

Results

Overview of scATAC-pro workflow

scATAC-pro consists of two units, the data processing unit and the downstream analysis unit (Fig. 1). The data processing unit takes raw fastq files for reads and barcodes as the input and outputs peak-by-cell count matrix, QC report, and genome track files. It consists of the following modules: demultiplexing, adaptor trimming, read mapping, peak calling, cell calling, genome track file generation, and quality control assessment. The downstream analysis unit consists of the following modules: dimension reduction, cell clustering, differential accessibility analysis, Gene Ontology analysis, TF motif enrichment analysis, TF footprinting analysis, prediction of chromatin interactions, and integration of multiple data sets. We provide flexible options for all modules. Details about each module are summarized in Additional file 2: Table S1. We designed the scATAC-pro to be user friendly. In each run, users just need to specify the input file (“--input” flag), the module name (“--step” flag), and a configuration file (“--config” flag) in which users provide parameters and options for the analysis modules. Users can choose to run the entire or partial workflow. By default, all results are saved in the “output” directory under the current directory (--output_dir “./output”).

scATAC-pro provides flexible choices of methods for many analysis tasks

We provide at least two methods for most data processing and analysis modules. There are several reasons to have multiple methods for a given data processing or analysis task. First, for certain tasks such as read mapping, many tools such as BWA [21], Bowtie [22], and Bowtie2 [23] exist that perform equally well but have different levels of trade-off between mapping accuracy and mapping speed [23, 24]. Users can choose among those aligners based on different goals. For cell calling, cells can be called by filtering low-quality barcodes [1, 2, 6, 9] or by using model-based approaches (e.g., cellranger-atac). Each category of methods has its pros and cons that can be tailored towards the goals of the analysis.

Second, in cases where a general-purpose and top-performing method exists, such as MACS2 [19] for peak calling, there are other methods that are more suitable for specific tasks. For example, the genome-wide event finding and motif discovery algorithm, GEM [25], was shown to have better performance in identifying peaks overlapping with transcription factor binding sites (TFBS) [26]. In this case, the users might prefer GEM over MACS2.

Third, often times, there is a need for methods that can address data set-specific characteristics. For example, using both real and simulated data, we found that the clustering performance using binarized/non-binarized data varies among different methods. For instance, cisTopic performs better using binarized data whereas SCRAT performs better using non-binarized data (Additional file 1: Fig. S1A, Fig. S2D). Even for the same clustering algorithm, we found that most of the time, both Louvain (Additional file 1: Fig. S1B) and K-mean (Additional file 1: Fig. S1C) clustering algorithms work better with non-binarized data although at higher noise levels, the performance difference becomes smaller, and in some cases using binarized data is more accurate. Given this observation, we provide methods that work on either binarized or non-binarized data or both (see “Methods” for details). Another example is clustering, a critical task for understanding heterogeneity in a cell population. To help select better clustering methods, we conducted benchmarking studies using simulated data and real data. The compared methods include scABC, chromVAR, cisTopic, latent semantic indexing or LSI [2], SCRAT, and Louvain algorithm implemented in Seurat v3 [27]. Based on the benchmarking result (Additional file 1: Fig. S2), we recommend cisTopic, SCRAT, and Seurat as the top-performance methods although all above methods are available in scATAC-pro.

scATAC-pro provides carefully evaluated default settings for all modules

Method and parameter choice makes a big difference in the result of several analysis modules. We therefore provide a set of carefully evaluated default settings for each analysis module. We discuss the settings for the major modules as follows (see “Methods” for details).

Read mapping

Because of its balance between mapping speed and accuracy [28], especially for paired-end sequencing data, we choose BWA (specifically bwa-mem) as the default read aligner.

Peak calling

Peak calling is usually done on aggregated data across all barcodes. Such an approach fails to identify peaks that only appear in rare cell populations. We implemented a two-step strategy, similar to the idea used by Cusanovich et al. [6]. We first segment the genome into 5-kb bins and generate a bin-by-barcode count matrix, removing barcodes with fewer than 1000 unique fragments. We then cluster the barcodes using the graph-based Louvain algorithm using principal components as the input. Finally, we use MACS2 to call peaks on the aggregated data for each cell cluster. The final set of peaks are generated by merging peaks less than 200 bp apart identified from different cell clusters.

Cell calling

As default, we use a filtering strategy to distinguish cell barcodes from non-cell barcodes, because the method is intuitive, easy to interpret, and widely used among published studies [1, 2, 4, 6, 9, 17]. We define a barcode as a cell if its total number of unique fragments is greater than 5000 and the fraction of such fragments in peaks is greater than 50%. Users can use different thresholds for the fraction of fragments in enhancers, promoters, or mitochondrial genome to filter barcodes as well.

Normalization

We provide two normalization methods. The term frequency-inverse document frequency (TF-IDF) method [2, 6] treats count data as binary and normalizes data by sequencing depth per cell and total number of unique fragments per peak. In the second method, count data is first log-transformed, followed by a linear regression to remove the confounding factor due to varying sequencing depth per cell for every peak. This method enables users to work with non-binarized count data. The TF-IDF method is set as the default normalization method in scATAC-pro.

Dimension reduction and data visualization

We use principal component analysis (PCA) as the default dimension reduction method because it is the most widely used method for scCAS data and easy to interpret. Note that if the PCA is conducted on the TF-IDF normalized data, such dimension reduction is also referred as latent semantic indexing or LSI [2]. We use the PCA implementation in Seurat v3 with some modifications. In Seurat v3, the raw count matrix is log-transformed followed by a regression to remove confounding factors (the total number of unique fragments per cell). PCA is then performed on the transformed features. Because there are usually hundreds of thousand peaks in a scCAS data set, this process takes about a couple of hours to finish for a typical data set. In our implementation, we first perform PCA on the normalized peak-by-cell count matrix, followed by a regression analysis on each principal component. This procedure substantially reduces the computation time and produces very similar clustering results as the original Seurat implementation (Additional file 1: Fig. S3). Uniform Manifold Approximation and Projection (UMAP) [29] is used as the default visualization method.

Clustering

We provide the graph-based Louvain algorithm implemented in Seurat v3 as the default clustering method. Shared neighbor network (SNN) graph is constructed based on the first 30 principal components. Louvain algorithm is then performed on the SNN graph with default setting. We found that the Louvain clustering algorithm has a better balance between accuracy and running speed among several popular clustering methods (Additional file 1: Fig. S2).

scATAC-pro provides summary reports and interface to visualization tools

scATAC-pro generates quality assessment metrics for both aggregated and single-cell data. Two types of metrics are generated. The first type of metrics evaluates data quality internally, including read mapping rate, duplicate rate, high-confidence mapped fragments (MAPQ score greater than 30), and library complexity. The second type of metrics evaluates data quality using external annotations of genomic features, including fraction of fragments in mitochondrial genome and fraction of fragments overlapping with peaks and other annotated genomic regions, such as enhancers and promoters. The quality assessment summary reports are generated in html format. These statistics can be used to filter low-quality barcodes.

Besides quality assessment metrics, scATAC-pro also generates summary reports for downstream analyses, including dimension reduction, clustering, differential chromatin accessibility analysis, TF motif enrichment analysis and footprinting analysis, gene ontology analysis, and prediction of chromatin interactions.

To enable interactive exploration of data in scATAC-pro, we provide an interface to VisCello [20], a data explorer and visualization tool for single-cell omics data. To do this, we annotate the peaks with its nearest gene and mark genes with their TSSs located within the peak. Users can then visualize the chromatin accessibility signal of each peak or gene in all cells and identify differential accessible peaks among arbitrary groups of cells.

scATAC-pro provides utility functions to facilitate downstream analyses

Generation of input files for genome browser tracks for each cluster

In addition to visualizing scCAS signal in VisCello, it is also common to visualize scCAS signal for each cell population on a genome browser. To generate a normalized signal track file, bam file of cells from each cell cluster is first split from the bam file of all barcodes. Reads per cluster are then normalized as reads per kilobase per million mapped reads. scATAC-pro outputs normalized chromatin accessibility for each cluster in bigWig and bedGraph file formats, which can be directly uploaded to a genome browser for visualization.

Transcription factor footprinting analysis

ATAC-seq and related technologies use the Tn5 enzyme to cleavage nucleosome-free DNA while keeping the transcription factor binding sites intact due to protection by the bound TF. As a result, a small region, referred to as the footprint, exhibits reduced Tn5 cleavage rate at the ATAC-seq peak locus. Unlike TF motif enrichment analysis, TF footprinting analysis provides direct evidence of TF binding to the chromatin [30]. With Hint-ATAC [31], scATAC-pro enables footprinting analysis of either one cell cluster or differential TF binding between two groups of cell clusters.

Integration of multiple scCAS data sets

To integrate multiple scCAS data sets, assuming each data set is processed by scATAC-pro, we first merge peaks identified in each data set. Using this merged set of peaks, scATAC-pro reconstructs the peak-by-cell matrix for each data set. Because we generate the peak-by-cell matrix using the same set of peaks for all samples, it is straightforward to integrate the data sets using existing tools, such as Seurat v3 or Harmony [32]. scATAC-pro uses Seurat v3 as the default for this task.

Peak annotation and gene ontology analysis

To facilitate Gene Ontology analysis of genes associated with differential accessibility peaks, scATAC-pro first annotates each peak with its nearest gene. Gene Ontology analysis for those genes can then be performed using the runGO module. The background gene set is comprised of all genes associated with the differential accessibility peaks in all cell groups resulted from the differential accessibility analysis. This analysis helps users to further explore the identity of each cell cluster.

Predicting chromatin interactions by Cicero

Connecting regulatory DNA elements to target genes is a prerequisite to understanding transcriptional regulation. Cicero [33] predicts the interactions between cis-regulatory elements and the target genes using scCAS data. scATAC-pro generates the predicted interactions by running the runCicero module. The resulting interactions can be viewed through the UCSC genome browser.

Case study

We used three real data sets generated with different experimental protocols to demonstrate the utility of scATAC-pro: a data set of sorted human bone marrow hematopoietic cells generated using the Fluidigm protocol [8, 9] (Buenrostro2018), a data set of human peripheral blood mononuclear cells (PBMCs) generated using the 10x Genomics protocol [34], and a data set of 13 adult mouse tissues generated using the sci-ATAC-seq protocol [6] (Cusanovich2018). For the sake of brevity, we present the result based on the 10x Genomics data set in the main figures. Similar figures based on the two other data sets are presented as supplementary figures.

Starting from the fastq files, scATAC-pro first demultiplexed sequencing reads by adding the cell barcodes (R2.fastq.gz) information to the paired-end reads (R1.fastq.gz, R3.fastq.gz). Adaptor sequences were then trimmed off, mapped to the GRCh38 reference genome using scATAC-pro default settings. Fragments with mapping quality score (MAPQ score) less than 30 were removed. A summary report for mapping statistics and library complexity is provided for all reads (Fig. 2A) and reads belonging to called cells (Fig. 2B, C). Summary reports for the other two data sets are shown in Additional file 1: Fig. S4 and Fig. S5, respectively.

Using the default peak caller, scATAC-pro called 129,049 peaks after removing peaks overlapping with ENCODE blacklisted genomic regions [35]. Cell barcodes were selected by filtering out barcodes with fewer than 5000 total unique fragments and the fraction of unique fragments in peak less than 50% (Fig. 3A). Quality assessment report for each barcode was generated using various metrics, including distribution of insert size, transcriptional start site (TSS) enrichment profile, distribution of the total number of unique fragments for cell and non-cell barcodes, and fractions of unique fragments overlapping with annotated genomic regions (Fig. 3B–E). Overall statistics of data aggregated from all called cells was also computed (Fig. 3F). Figures showing quality assessment metrics for called single cells for the other two data sets are shown in Additional file 1: Fig. S6 and Fig. S7, respectively.

Downstream analyses including clustering, TF motif enrichment analysis, TF footprinting analysis, GO analysis, and cis-element interaction prediction were conducted using default scATAC-pro methods and settings (Fig. 4). In total, we found 10 cell types. The top 10 enriched TFs for each cluster are shown in Fig. 4B, which provides a means for identifying cell type associated with each cluster. For example, binding motifs of PU.1 (encoded by SPI1), IRF4, CEBPA, and CEBPB are highly enriched in clusters 0, 6, 7, and 8, suggesting those clusters are monocytes [36]. Motifs of EOMES and TBX5 were enriched in clusters 1, 2, and 5, suggesting those clusters are T cells. Enrichment of EBF1 [37] and BCL11A motifs [38] suggests cluster 3 represents B cells. The differential footprinting analysis between cluster 0 and the rest of clusters further suggests that cluster 0 represents monocytes, because the monocytic TFs PU.1, JUNB, JUN, CEBPA, and CEBPE [36, 39, 40] all have significant higher binding probability in cluster 0 cells (Fig. 4C).

Summary report for downstream analyses for the other two data sets is shown in Additional file 1: Fig. S8 and Fig. S9, respectively.

Using VisCello [20], we can display chromatin accessibility values of TSS regions of several marker genes across cell clusters (Fig. 5A and Additional file 1: Fig. S10A), such as MS4A1 (CD20) for B cells, GNLY and NKG7 for natural killer (NK) cells, CD3E for T cells, CD14, LYZ, and FCGR3A (CD16) for monocyte cells, and CST3 for dendritic cells (DC) [41]. We also displayed UCSC genome browser tracks for two example genes, CD14 (Fig. 5B) and the FCER1A (Additional file 1: Fig. S11). Taken together, based on the chromatin accessibility profile of known cell-type-specific marker genes, we annotated cell clusters as T cells, B cells, CD14⁺ monocytes, CD16⁺ monocytes, dendritic cells, and natural killer cells (Fig. 5C). VisCello can also conduct differential chromatin accessibility analysis between any two arbitrary groups of cell clusters (Additional file 1: Fig. S10B).

Discussion

scATAC-pro provides a comprehensive solution for scCAS data QC and analysis. It reports a number of QC metrics for both aggregated data of all barcodes and barcodes of called cells. These metrics evaluate multiple steps of the experimental protocol, including the quality of nuclei preparation, transposase reaction, and cell encapsulation (insert size, fraction of unique reads, cell vs non-cell reads, mitochondrial reads), and library preparation and sequencing (duplicate rate, fraction of unique reads and reads with MAP score > 30). Although there is no universally optimal QC metric for all types of scCAS data, the fraction of fragments in peak per cell is the most widely used in the literature [1, 2, 4, 6, 9, 17]. Alternative metric such as the TSS enrichment score per cell is introduced recently [18], but its utility may be limited for cell types that have a large fraction of active TSS-distal peaks. Having a comprehensive annotation of cis-regulatory elements across all human cell types will facilitate the task of evaluating quality of gene-distal ATAC-seq peaks.

Because there is no clear optimal method for many analysis tasks, scATAC-pro provides multiple methods that allow users to tailor their analyses and to address data set-specific characteristics. To guide the users, we have provided carefully evaluated default settings for each analysis task, including both method choice and parameter setting of the selected method(s).

The open-source and modular design of scATAC-pro facilitates the maintenance and future development of the software. Several experimental protocols exist for generating scCAS data. Data generated using these protocols have different characteristics and qualities. scATAC-pro is the first software tool that enables analysis of all types of scCAS data. By doing so, scATAC-pro facilitates the integration of rapidly growing scCAS data.

The current version of scATAC-pro generates a static summary report. It can be enhanced by generating dynamic summary report in future versions of the software. For example, for downstream analysis, the results can be updated in real time based on the cell clusters compared. Interactively display of cis-interactions within an arbitrary genomic region will be another very useful feature to add.

Conclusions

scATAC-pro is a comprehensive open-source software for processing, analyzing, and visualizing single-cell chromatin accessibility sequence data. With the rapid accumulation of single-cell chromatin accessibility sequencing data, application of scATAC-pro will facilitate a better understanding of epigenomic heterogeneity in healthy and diseased tissues.

Methods

Details of each scATAC-pro module

Demultiplexing

Given the reads fastq files and the barcode fastq files, the barcode sequences are written into the name of each read sequence (in the format of @BARCODE:ORIGINAL_READ_NAME) to facilitate the tasks of downstream modules, such as generating peak-by-cell matrix and quality assessment of single cells. For data generated using 10x Genomics, sci-ATAC-seq, and dsci-ATAC-seq protocols, users need to provide the paired-end read fastq files and the index fastq file (also supports multiple index fastq files). For scTHS-seq data, users need to specify the parameter isSingleEnd = TRUE in the configuration file because scTHS-seq data are single-end reads. This module is skipped if the barcode for each read is recorded in the required format. For example, in the mouse sci-ATAC-seq atlas data [2, 6], the barcode for each read is saved in the name of each read.

Adaptor trimming

To map sequencing reads confidently to the reference genome, scATAC-pro first trims off the adaptor sequence and primer oligo sequence from raw reads using trim_galore (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) as the default, which can automatically detect and trim the adaptor and primer sequences. Alternatively, users can also use Trimmomatic [42], which is faster but users need to specify the sequence of the adaptor in the configuration file.