An integrated strategy for identification of both sharp and broad peaks from next-generation sequencing data
© BioMed Central Ltd 2011
Published: 25 July 2011
A novel integrative approach has been developed by Lieb and colleagues for analyzing genome-wide datasets of different chromatin-binding factors and epigenetic states that exhibit both sharp and diffuse signals on the genome.
See research article: http://genomebiology.com/2011/12/7/r67
Transcriptional regulation plays a central role in essential biological processes, including cellular differentiation and responses to cellular and environmental signals. It is becoming increasingly clear that transcription of a large number of genes is coordinately regulated during these processes through transcriptional regulatory networks consisting of three major components: (1) cis regulatory elements, including promoters and enhancers; (2) trans factors, including transcription factors (TFs) and chromatin modifying enzymes; and (3) chromatin modification states at regulatory regions of genes. Recent development of assays, including ChIP-Seq, DNase-Seq and FAIRE-Seq, which utilize next-generation high-throughput sequencing, has propelled the advance in our understanding of these three components of transcriptional regulatory networks. Critical for application of these assays are a number of different algorithms that have been developed to identify signal-enriched regions (or peaks) from these genome-wide datasets, including the latest program reported by Rashid et al. in this issue of Genome Biology.
Genome-wide profiles of epigenetic modifications and chromatin-binding proteins are being generated in ever greater numbers. Judged by the distribution of sequencing reads on the genome, there are broadly two kinds of signal: sharp and diffuse. While TFs generally recognize specific target motifs, either in enhancers or promoters, and exhibit localized strong signals, the distribution of histone-modification signals ranges from a few nucleosomes to large chromatin domains spanning hundreds of kilobases. For example, H3K4me2 and H3K4me3, which are usually associated with enhancers and promoters, tend to exhibit localized, sharp peaks, whereas H3K27me3, associated with gene silencing, may cover entire chromatin domains [2, 3]. On yet larger scales, it is known that H3K9me3 marks heterochromatic domains. In addition to chromatin modifications, some histone-modifying enzymes, chromatin remodeling complexes and RNA polymerase II (RNA Pol II) also exhibit extended domains of enrichment. As the cost of sequencing continues to decrease and throughput continues to increase, both at a breathtaking pace, bioinformatic and statistical tools for analyzing genome-wide datasets have become the proverbial bottleneck in garnering results from these assays. Correct and robust delineation of the variety of enrichment patterns from high-throughput sequencing data is essential in all downstream analysis, from annotating the genome and identifying novel target genetic elements or biomarkers to shedding mechanistic light on specific biological processes.
Although transcriptional regulation involves coordination of a variety of factors that exhibit different binding patterns on the genome, efforts in algorithm development have largely been focused on finding peaks in ChIP-Seq data for identification of TF-binding sites. The first generation of ChIP-Seq peak-finding programs has been summarized and evaluated by Wilbanks and Facciotti. Common to many of these peak-finding programs is a coverage threshold approach based on a variety of statistics, which vary from algorithm to algorithm. Novel peak-finding programs continue to be developed, among which is a topology-based approach that takes into account the shape of the peaks and uses a tree-based statistic for significance determination. Compared with programs suitable for sharp peaks of TF-binding events, computational tools for identifying the diffuse signals spanning large chromatin domains in ChIP-Seq data have been rather limited, owing to high noise levels, insufficient sequencing coverage, and a lack of objective standards for evaluation. The first program designed for identifying such signals was SICER. Motivated by the known mechanisms of domain formation of histone modifications, SICER uses a spatial clustering approach backed by novel statistics for cluster extension and considers the context of enrichment explicitly. It has been successfully applied to ChIP-Seq for histone modifications and chromatin-binding enzymes. Another program, RSEG, employs a hidden Markov model approach that takes read mappability into account and provides a statistical approach for domain boundary determination. Although most of these programs provide satisfactory results when applied to datasets with a specific type of peak pattern, particular attention is needed to match the design features of the algorithm with the peak pattern of the dataset.
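As a rough illustration of the two ideas contrasted above, the following Python sketch flags windows whose read count exceeds a coverage threshold and then joins flagged windows that are separated by small gaps. It is not the algorithm of any program named in this article: the Poisson background with a genome-wide mean rate, the function names, and the parameters alpha and max_gap are all assumptions made for illustration.

```python
import math

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam) by summing the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def window_counts(read_starts, genome_length, window):
    """Count reads in non-overlapping windows (0-based read start positions)."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    return counts

def enriched_windows(counts, alpha=1e-3):
    """Flag windows whose count is unlikely under a uniform Poisson background.

    Assumption: the background rate is the genome-wide mean count per window.
    """
    lam = sum(counts) / len(counts)
    return [poisson_sf(c, lam) < alpha for c in counts]

def merge_with_gaps(flags, max_gap=1):
    """Join flagged windows separated by at most max_gap unflagged windows.

    Returns (first_window, last_window) index pairs, both inclusive.
    """
    islands = []
    start = last = None
    for i, flagged in enumerate(flags):
        if not flagged:
            continue
        if start is None:
            start = last = i
        elif i - last - 1 <= max_gap:
            last = i
        else:
            islands.append((start, last))
            start = last = i
    if start is not None:
        islands.append((start, last))
    return islands
```

With max_gap=0 each flagged window is reported on its own, which suits punctate signals; allowing gaps lets fragmented enrichment coalesce into a single domain, which is the behaviour needed for diffuse marks.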
When peak-finding programs designed primarily for TF-binding site identification are used on dispersed signals, sensitivity is low and integrity of the domains is compromised. When domain-finding approaches are used on data with sharp peaks, the information on the precise location of peaks may not be fully recovered. Thus, integrative strategies for analyzing the genome-wide datasets of different factors and epigenetic modifications that exhibit both sharp and diffuse signals are needed. ZINBA is among the first to address this need.
ZINBA uses a novel mixture regression approach for its statistical framework. It partitions the genome into non-overlapping windows and each window is probabilistically classified into three components: background, enrichment and zero. The component zero is introduced to account for the many windows with zero tags due to insufficient sequencing coverage. The probability attributed to each component in each window is assigned via an expectation-maximization algorithm, using a negative binomial distribution to parameterize the enrichment and background components. The parameters of this distribution include several covariates, which serve to assist modeling of each component. Known important covariates include read mappability, copy number variations and, in some situations, GC content. As this list suggests, systematic biases of the ChIP-Seq assay are among the most relevant covariates. Indeed, the authors showed using simulated data that inclusion of relevant covariates improves the performance of the algorithm.
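The three-component classification can be sketched in a few lines of Python. This is a deliberately reduced illustration and not the ZINBA implementation: it omits the covariates entirely, holds the negative binomial dispersion fixed (the size parameter, an assumed value) so that each mean has a closed-form update, and uses made-up starting values. All function names are hypothetical.

```python
import math

def nb_logpmf(k, mu, size):
    """Log-probability of count k under a negative binomial with mean mu.

    'size' is the dispersion parameter; the variance is mu + mu**2 / size.
    """
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(size / (size + mu))
            + k * math.log(mu / (size + mu)))

def responsibilities(k, pi, mu_bg, mu_en, size):
    """Posterior probability that a window with count k is zero, background or enriched."""
    zero = pi[0] if k == 0 else 0.0  # point mass at zero
    bg = pi[1] * math.exp(nb_logpmf(k, mu_bg, size))
    en = pi[2] * math.exp(nb_logpmf(k, mu_en, size))
    total = zero + bg + en
    return (zero / total, bg / total, en / total)

def fit_three_component_mixture(counts, size=5.0, n_iter=200):
    """Expectation-maximization for a zero / background / enrichment mixture.

    Assumptions for this sketch: no covariates, fixed dispersion, and
    ad hoc starting values derived from the overall mean count.
    """
    mean = sum(counts) / len(counts)
    mu_bg, mu_en = 0.5 * mean, 4.0 * mean
    pi = [0.1, 0.8, 0.1]
    for _ in range(n_iter):
        # E-step: soft assignment of every window to the three components.
        resp = [responsibilities(k, pi, mu_bg, mu_en, size) for k in counts]
        # M-step: update mixing weights and component means.
        totals = [sum(r[j] for r in resp) for j in range(3)]
        pi = [t / len(counts) for t in totals]
        mu_bg = max(sum(r[1] * k for r, k in zip(resp, counts)) / max(totals[1], 1e-12), 1e-9)
        mu_en = max(sum(r[2] * k for r, k in zip(resp, counts)) / max(totals[2], 1e-12), 1e-9)
    return pi, mu_bg, mu_en
```

After fitting, calling responsibilities on a window's count gives its probability of belonging to each component; a window would be called enriched when the third value exceeds a chosen cut-off. In the full method, as described above, the component parameters depend on covariates rather than being single constants.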
This distribution-plus-covariate approach enables incorporation of systematic biases in enrichment identification, a feature especially appealing when a matching control library is not available (for example, FAIRE-seq), or when the control library has poor sequencing coverage. The authors demonstrated that, by considering the non-input covariates, the performance of ZINBA in the absence of an input control library rivals that of the algorithm with an input control library. Additionally, these covariates provide important information for systematic understanding of the compositions of enrichment and background. In general, the ZINBA framework allows any covariate to be included and its relevance estimated, making it easily adaptable to additional covariates or new systems.
Rashid et al. evaluated the performance of ZINBA using datasets that cover enrichment length-scales ranging from punctate signals produced by sequence-specific binding events of insulator binding protein CTCF to diffuse signals of the histone modification mark H3K36me3, associated with elongation of RNA Pol II. For punctate signals, ZINBA performs as well as MACS, which is one of the most popular peak finders. For diffuse signals such as H3K36me3, ZINBA-designated domains are shown to correlate well with expression. As an example for integrative analysis involving both punctate and diffuse signals, the authors examined the stalled and elongating RNA Pol II. Interestingly, by using ZINBA and a refined definition of stalling score that adjusts for lengths, a significant improvement of the anti-correlation between the stalling score and gene expression is achieved.
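The exact definition of the refined stalling score is given in the research article and is not reproduced here. Purely to illustrate what adjusting for lengths can mean, the hypothetical Python sketch below compares read density (reads per base pair) at the promoter-proximal peak with read density over the gene body; the function name, the pseudocount and the choice of a simple ratio are assumptions, not the authors' formula.

```python
def stalling_score(promoter_reads, promoter_len, body_reads, body_len, pseudocount=1.0):
    """Hypothetical length-adjusted stalling score.

    Ratio of promoter-proximal read density to gene-body read density.
    The pseudocount (an assumed value) avoids division by zero for
    genes with no reads in the body.
    """
    promoter_density = (promoter_reads + pseudocount) / promoter_len
    body_density = (body_reads + pseudocount) / body_len
    return promoter_density / body_density
```

Dividing by region length keeps a long diffuse domain over the gene body from appearing more occupied than a short promoter peak simply because it spans more base pairs.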
ZINBA is a much welcomed addition to the arsenal of bioinformaticians and computational biologists working alongside biologists. Future refinements may focus on resolving or improving the following issues with the method: (1) the assumption of independence of read counts of each window, which is mitigated at an empirical level when merging of enrichment is enforced; and (2) the intensive computational resources needed in model selection, which presumably is due to the need to carry out this operation for each window.
Understanding the diverse modes of functional organization of the genome has been a major theme of genome biology and continues to evolve given the recent developments in what we learn about the three-dimensional structure of genomes. We expect to see even more efforts in development of algorithms like ZINBA, allowing multiple resolutions and integrating diverse data types. As the authors mentioned in the paper, a major factor affecting all ChIP-Seq analysis tools, especially those dealing with diffuse signals, remains the sequencing depth. Most of the libraries displaying dispersed signals for large genomes, such as human and mouse, are far from saturated. They are likely to remain so in the near future, even with the continuing increase in sequencing throughput. Yet there is only limited evaluation of methods from the perspective of sequencing coverage. The scaling behavior of analysis methods (that is, how robust the analysis results are as the sequencing depth is changed) should receive more attention in development and evaluation of new methods.
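One simple way to probe this scaling behavior is to subsample the reads to several depths, rerun the same enrichment caller, and measure how much of the full-depth result is recovered. The Python sketch below is an assumed evaluation scheme, not one taken from the research article; toy_caller is a hypothetical stand-in for a real peak or domain finder, and the Jaccard index is only one possible agreement measure.

```python
import random

def downsample(reads, fraction, seed=0):
    """Draw a random subset containing roughly 'fraction' of the reads (a list)."""
    rng = random.Random(seed)
    return rng.sample(reads, round(len(reads) * fraction))

def jaccard(a, b):
    """Jaccard index of two collections of enriched window indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def toy_caller(reads, window=10, min_reads=3):
    """Hypothetical stand-in caller: windows holding at least min_reads reads."""
    counts = {}
    for pos in reads:
        counts[pos // window] = counts.get(pos // window, 0) + 1
    return {w for w, c in counts.items() if c >= min_reads}

def depth_robustness(reads, call_enriched, fractions=(0.25, 0.5, 0.75), seed=0):
    """Agreement between full-depth calls and calls made on subsampled reads."""
    full = call_enriched(reads)
    return {f: jaccard(full, call_enriched(downsample(reads, f, seed)))
            for f in fractions}
```

A method whose agreement stays high as the fraction falls is robust to depth; a curve that is still rising steeply at full depth suggests the library is far from saturated for that method.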
We thank Brian Abraham for critical reading of the manuscript. This work was supported by the Division of Intramural Research Program of National Heart, Lung and Blood Institute, NIH (KZ).