ChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing
- Guoliang Li†1,
- Melissa J Fullwood†2,
- Han Xu†1,
- Fabianus Hendriyan Mulawadi1,
- Stoyan Velkov1,
- Vinsensius Vega1,
- Pramila Nuwantha Ariyaratne1,
- Yusoff Bin Mohamed1,
- Hong-Sain Ooi1,
- Chandana Tennakoon3,
- Chia-Lin Wei2,
- Yijun Ruan2, 4Email author and
- Wing-Kin Sung1, 3Email author
© Li et al.; licensee BioMed Central Ltd. 2010
Received: 3 September 2009
Accepted: 25 February 2010
Published: 25 February 2010
Chromatin interaction analysis with paired-end tag sequencing (ChIA-PET) is a new technology to study genome-wide long-range chromatin interactions bound by protein factors. Here we present ChIA-PET Tool, a software package for automatic processing of ChIA-PET sequence data, including linker filtering, mapping tags to reference genomes, identifying protein binding sites and chromatin interactions, and displaying the results on a graphical genome browser. ChIA-PET Tool is fast, accurate, comprehensive, user-friendly, and open source (available at http://chiapet.gis.a-star.edu.sg).
Transcription factors and their three-dimensional interactions are crucial to gene regulation [1, 2]. Many distal transcription factor binding sites have been identified by genome-wide chromatin experiments, such as chromatin immunoprecipitation (ChIP)-chip , ChIP-paired-end tag (PET) , and ChIP-Seq , but it is not clear which of these distal transcription factor binding sites are real and functional in gene regulation, and which are non-functional 'parking spots'. Three-dimensional chromatin interactions have been shown to bring distal transcription factor binding sites into close spatial proximity to gene promoters , but global analysis of three-dimensional chromatin interactions has been limited by the lack of techniques for high-resolution and whole-genome analysis.
The ChIA-PET approach is very efficient in generating large volumes of PET sequence data for long-range chromatin interactions with different protein factors in complex genomes. Since the detection of long-range chromatin interactions involves high levels of background noise due to the complexity of chromatin structures in nuclear space and the nature of proximity ligation [7, 8], a meaningful analysis requires a comprehensive, efficient pipeline. The immense challenges in the setup of an efficient pipeline to process the huge body of ChIA-PET sequence data include: how to accurately filter the linker sequences from the raw reads; how to accurately and efficiently map the tag sequences to reference genomes; how to evaluate the noise level in the data; how to identify bona fide binding sites and chromatin interactions; how to organize the datasets; and how to effectively visualize the long-range chromatin interactions identified by ChIA-PET analysis. Many of the bioinformatics challenges faced in the ChIA-PET analysis are unprecedented.
In developing the ChIA-PET data analysis algorithms, we assembled a package of sophisticated bioinformatics solutions called 'ChIA-PET Tool' for processing, analyzing, visualizing, and managing ChIA-PET data quickly, accurately, and automatically. In this report, we describe the design and implementation of ChIA-PET Tool, and demonstrate its efficiency and effectiveness through processing and analyzing an estrogen receptor α (ERα) ChIA-PET library dataset from the MCF-7 cell-line.
The architecture design of ChIA-PET Tool
Statistics of ChIA-PET data analysis
4,269,610 (30.8% of IHH015A)
6,157,038 (44.4% of IHH015A)
Unique PETs with unique mapping
Different orientation PETs
Intra-chromosomal inter-ligation PETs (excludes different orientation PETs)
Inter-chromosomal inter-ligation PETs
Binding sites (FDR <0.01, filteredd)
Intra-chromosomal interactions (FDR <0.05; PET cluster size ≥ 3; filteredd)
Inter-chromosomal interactions (FDR <0.05; PET cluster size ≥ 3; filteredd)
A ChIA-PET input sequence has a pair of reads from the two opposite directions of the same ChIA-PET template. Both reads are 36 nucleotides long, and contain 20-nucleotide tag information and 16-nucleotide linker information (Figure 1b). To determine the linker type, we aligned the linker part of the reads (the last 16 nucleotide sequence of the 36 nucleotides) to the half-linker nucleotide sequence A (or B). If both the paired reads have good alignment with half-linker A (or B) and the specific nucleotide positions 9 and 10 (the nucleotide barcode) are CG (or AT), we classify the PET as having a homogenous full linker composition with AA (or BB). If one read in a PET has a good alignment with the half-linker A while the other one has a good alignment with the half-linker B, we classify this PET as having a heterogeneous full linker composition with AB (or BA). If there is no good alignment with any of the half-linkers (could also indicate low sequence quality), the PETs are classified as ambiguous PETs and will be discarded from further analyses. A PET sequence with a full linker AB indicates that this PET is derived from a ligation product formed between two different ChIP complexes from the two separate half-linker aliquots (Figure 1a). Therefore, the corresponding PETs with linker composition AB are most likely derived from random and non-specific ligations between two different ChIP complexes. Hence, we classified the PETs with linker AB as the chimeric PET dataset, and the PETs with linkers AA and BB as non-chimeric PET dataset (note that the PETs with linkers AA and BB may still have certain chimeric PETs). After the linker alignment, the linker parts in the short sequence reads will be trimmed and the tag sequences will proceed for further analyses.
With linker filtering, ChIA-PET library IHH015A data were divided into two pools: IHH015M (mix of PETs with AA and BB linkers) and IHH015C (chimeric PETs with AB linkers). The IHH015M dataset has 4,269,610 PETs (30.8% of total input PET sequences) and IHH015C has 6,157,038 PETs (44.4%). The remaining PETs were classified as PETs with ambiguous linkers.
PET mapping to reference genome
After linker trimming, the tags are mapped to the corresponding reference genome using the Batman package (C Tennakoon et al., manuscript submitted) with at most one mismatch. Batman is a Burrows-Wheeler-transform-based method  that maps short sequences to a genome with very high speed. For each tag, Batman first considers the exact matches to the reference genome. If a single exact match is obtained, that location is taken as the mapping location of the tag, and the tag is classified as 'unique mapping'. If multiple exact matches are obtained, the tag is classified as 'multiple mappings'. If no exact match is obtained for the tag, a mapping is done with one mismatch allowed, and the result is similarly labeled as 'unique mapping' or 'multiple mappings'. If there is still no mapping for a tag with one mismatch, the tag is finally classified as 'non-mappable'.
After mapping the tags to the human reference genome (hg18) with Batman, only those PETs from IHH015M and IHH015C with both tags uniquely mapped to the reference genome were considered for further analyses. The remaining PETs with tags multiply mapped or unmapped to the reference genome were filtered out. We obtained 2,016,907 PETs in IHH015M and 2,707,860 PETs in IHH015C with unique mappings. To further avoid miscalling of clonal amplifications by PCR involved in sample preparation as enrichment of ChIP fragments, we merged all similarly mapped PETs (within ± 1 bp) into one unique PET. In this way, we reduced false positive calls resulting from the same PCR clonal amplification. Finally, we obtained 1,501,442 unique PETs with unique mapping from IHH015M and 1,710,335 unique PETs with unique mapping from IHH015C.
The ChIA-PET sequences can be classified into two categories: self-ligation and inter-ligation PETs (Figure 1c). 'Self-ligation PETs' are obtained from self-circularization ligation of the same chromatin fragments, and result in ChIA-PET sequences with both tags mapped within a short genomic distance of each other on the same chromosome in a head-to-tail orientation. 'Inter-ligation PETs' are derived from inter-ligation between two different DNA fragments, and can be partitioned into three different sub-categories: 'inter-chromosomal inter-ligation PETs', 'intra-chromosomal inter-ligation PETs', and 'different-orientation ligation PETs'. 'Inter-chromosomal inter-ligation PETs' are PETs with two tags mapped to two different chromosomes. 'Intra-chromosomal inter-ligation PETs' are PETs with both tags mapped to the same chromosome with a long genomic span, since PETs with long genomic span cannot arise from individual short chromatin fragments. 'Different-orientation ligation PETs' are PETs with both tags mapped to the same chromosome within a short genomic span, but with the wrong orientation or on different strands.
Accordingly, the PET datasets in IHH015M and IHH015C were classified into different categories. In IHH015M, 491,750 PETs were classified as self-ligation PETs. By contrast, IHH015C only yielded 12,155 self-ligation PETs. IHH015M contains 91,519 intra-chromosomal inter-ligation PETs and 893,213 inter-chromosomal inter-ligation PETs. IHH015C contains 94,743 intra-chromosomal inter-ligation PETs and 1,602,889 inter-chromosomal inter-ligation PETs. We note that the number of self-ligation PETs in IHH015M (with homogenous linker AA and BB; non-chimeric) is more than 40 times (491,750/12,155 = 40.5) that of the self-ligation PETs in IHH015C (with heterogeneous linker AB; chimeric). This is expected because self-ligations of the same DNA fragments are supposed to have the same linker types based on the experimental design (Figure 1a, b).
To check whether the inter-ligation PETs in IHH015M and IHH015C arise due to random ligation, the ratio of the intra-chromosomal inter-ligation PETs and the inter-chromosomal inter-ligation PETs was analyzed with a random model. From the PET datasets IHH015M and IHH015C, the numbers of DNA fragments from each chromosome were counted. Assuming that the ligation of the fragments occurs in a random manner and one fragment has an equal chance of ligating to any other fragments, the expected rate of finding an intra-chromosomal inter-ligation PET in a specific chromosome is proportional to the square of the number of DNA fragments in this chromosome. The total rate of intra-chromosomal inter-ligation PETs is proportional to the sum of the square of the numbers of DNA fragments over all individual chromosomes. Therefore, based on a random model, the expected rate of intra-chromosomal inter-ligation PETs is 0.0552 for IHH015C and 0.0546 for IHH015M. By contrast, the observed rate was 0.0558 for IHH015C and 0.0929 for IHH015M. The fold change between the observed rate and the expected rate for IHH015C (0.0558/0.0552 = 1.01; P-value 5.2E-4 from binomial test) was insignificant, validating the notion that the chimeric inter-ligation PETs are derived from random ligation. By contrast, the difference between the observed rate and the expected rate for IHH015M (0.0929/0.0546 = 1.70; P-value < 2.97E-323) was very significant, suggesting that the non-chimeric inter-ligation PETs are not random and probably enriched for specific chromatin interactions.
In summary, our analyses showed that non-chimeric PET dataset IHH015M is significantly different from the random model and the data can be used for further analyses.
Binding site analysis
Binding sites can be identified by looking for clusters of overlapping self-ligation PETs. As background noise is random, noisy PETs should distribute randomly throughout the genome. Only ChIP-enriched fragments would form clusters of overlapping self-ligation PETs, and are considered putative binding sites (Figure 1d). The probability that a cluster contains k self-ligation PETs by random chance can be calculated by a Monte Carlo simulation, allowing false discovery rates to be assigned to different clusters. The clusters with false discovery rates below a particular threshold, such as 0.01, are putative binding sites. A similar method was previously applied in ChIP-PET data analysis . In considering the ChIP enrichment score of a binding site, the number of ChIP fragments at a binding site is counted. This is similar to the existing ChIP enrichment calculation protocols in ChIP-Seq data .
After predicting binding sites, a post-processing step can be applied to remove suspicious binding sites, which can arise from different sources. For example, a library from female cells, such as MCF-7, is not expected to have any binding sites in chromosome Y. Binding sites in satellite repeat regions are also likely to be attributable to non-specific mapping and should be dropped. Further, certain binding sites in cancer genomes may be the results of genome amplifications [8, 14], and should be flagged, such that caution can be exercised in using them.
In our analysis, after calling putative binding sites and removing binding sites likely to be due to non-specific mapping, 4,035 binding sites were called from the IHH015M dataset at a false discovery rate <0.01, which covered 47,735 (9.7%) of the 491,750 self-ligation PETs in IHH015M. By contrast, only 24 binding sites were called from the chimeric PET dataset IHH015C at the same false discovery rate, which covered only 130 (1.1%) of 12,155 self-ligation PETs in IHH015C. This indicates that most of the binding sites of IHH015M are bona fide. Indeed, many of the ERα binding sites identified in IHH015 library were validated using ChIP-qPCR .
Chromatin interaction analysis
Chromatin interaction identification is done in two steps: first, define the ChIP enriched interaction anchor regions from inter-ligation PETs and then quantify the interaction frequency by counting the number of inter-ligation PETs between the two connected anchor regions. The identification of ChIP-enriched interaction anchors from inter-ligation PETs is performed by finding peaks and valleys from the overlapped tags from the inter-ligation PETs, in a similar manner employed by ChIP-Seq . The tag length of each inter-ligation PET is extended in a 5' to 3' manner by the 'tag extension length' to represent a ChIP DNA fragment. The 'tag extension length' is equivalent to the most frequently detected span of the self-ligation PETs, which is around 200 bp for the IHH015A library. Most of the interaction anchor regions identified from inter-ligation PETs should also be overlapped with the protein binding sites identified from self-ligation PETs. After defining the enriched anchor regions from inter-ligation PETs, the number of overlapping inter-ligation PETs between any two anchors is counted to reflect the relative interaction frequency. As each interaction involves two anchors and one loop, it is called a 'duplex interaction'. Similar to the binding site analysis, a real interaction is expected to involve multiple overlapping inter-ligation PETs connecting two anchors.
The expected interaction frequencies between any two genomic loci and the false discovery rate for each interaction were calculated. Hence, both inter-ligation PET frequency and ChIP enrichment of the anchors are taken into account by this analysis.
From the predicted interactions with three or more inter-ligation PETs between anchors and false discovery rate < 0.05, we filtered the interactions that could be a result of mis-mapping or noise, as we did with the binding sites. We filtered out any interactions wherein the anchors were present in chromosome Y or the mitochondria, and satellite repeat regions. We also filtered out a specific kind of noise in the interaction clusters from genome structural variations in cancer cell-lines. The cancer cell-lines like MCF-7 have intensive genome rearrangements: genome insertion, deletion, inversion, and translocation, which can be identified with DNA-PET data [8, 14]. Self-ligation PETs around the breakpoints of genome rearrangement from cancer cell-lines will be mapped as inter-ligation PETs in the reference genome, and interaction clusters related to such genome rearrangements should be removed.
ChIA-PET data visualization
Implementation and performance
ChIA-PET Tool is implemented with various software written in C, Java and scripting languages (PERL and Python), and has been tested with the Linux operating system. The components in the pipeline are linked up together to behave as a singular processing pipeline. The pipeline requires hardware support; for example, 32 Gbytes RAM and 1,024 Gigabytes hard disk are recommended on a duo-core server. mySQL (5.0.67 or higher) is used as the database engine. Other required software packages are Apache, Perl and Bioperl modules, and PHP (refer to the ChIA-PET Tool installation guide on the ChIA-PET website  for details).
We have tested the ChIA-PET Tool pipeline with two hardware configurations: one with 32 Gbytes RAM and 8 CPUs, and another with 64 Gbytes RAM and 16 CPUs. The input to the pipeline is the original paired-end sequence data from the Illumina sequencing output file. A ChIA-PET library from a single lane requires approximately 1 Gigabyte space for storage, which includes intermediate files generated throughout the processing. On average, it takes 1 hour for ChIA-PET Tool to process a single lane ChIA-PET dataset and 1 hour to upload the GFF file to the G-browser database (the upload time is subject to existing database size and server load).
We developed a comprehensive computational package, ChIA-PET Tool, to accommodate large amounts of ChIA-PET data, process the linkers, map the short tags to the reference genomes, classify the PETs into different categories, and identify statistically significant binding sites and chromatin interactions. We demonstrate the effectiveness of ChIA-PET Tool by analyzing a ChIA-PET library IHH015A with statistical results, and show that ChIA-PET Tool is a convenient, user-friendly, accurate bioinformatics solution, and an integral component of the ChIA-PET process for chromatin interaction analysis. Although we have only reported the use of ChIA-PET Tool in ERα ChIA-PET analysis, it is obvious that this tool can be applied to different ChIA-PET libraries bound by different transcription factors in different genomes for chromatin interaction analysis.
ChIA-PET Tool is open-source and free for non-commercial use. The complete package of ChIA-PET Tool is downloadable from the ChIA-PET website , together with the ChIA-PET Tool file format, the ChIA-PET Tool installation guide, the ChIA-PET Tool user manual, and the ChIA-PET browser user manual. The raw sequences and the processed data are also available from ChIA-PET website  (username, guest; password, guest). More related data for ChIA-PET analysis are accessible from NCBI's Gene Expression Omnibus with accession number [GEO:GSE18046].
chromatin interaction analysis with paired-end tag sequencing
estrogen receptor α
generic genome browser
The authors thank the anonymous reviewers for their valuable comments. We acknowledge the Genome Technology and Biology Group at the Genome Institute of Singapore for technical support; Mr Atif Shahab and Mr Chan Chee Seng for computing support. GL is supported by a GIS post-doctoral fellowship. MJF is supported by an A*STAR scholarship and a L'Oreal-UNESCO For Women In Science National Fellowship. YR and C-L W are supported by A*STAR of Singapore and NIH ENCODE grants (R01 HG004456-01, R01HG003521-01, and part of 1U54HG004557-01).
- Farnham PJ: Insights from genomic profiling of transcription factors. Nat Rev Genet. 2009, 10: 605-616. 10.1038/nrg2636.PubMedPubMed CentralView ArticleGoogle Scholar
- Lanctot C, Cheutin T, Cremer M, Cavalli G, Cremer T: Dynamic genome architecture in the nuclear space: regulation of gene expression in three dimensions. Nat Rev Genet. 2007, 8: 104-115. 10.1038/nrg2041.PubMedView ArticleGoogle Scholar
- Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA: Genome-wide location and function of DNA binding proteins. Science. 2000, 290: 2306-2309. 10.1126/science.290.5500.2306.PubMedView ArticleGoogle Scholar
- Wei CL, Wu Q, Vega VB, Chiu KP, Ng P, Zhang T, Shahab A, Yong HC, Fu Y, Weng Z, Liu J, Zhao XD, Chew JL, Lee YL, Kuznetsov VA, Sung WK, Miller LD, Lim B, Liu ET, Yu Q, Ng HH, Ruan Y: A global map of p53 transcription-factor binding sites in the human genome. Cell. 2006, 124: 207-219. 10.1016/j.cell.2005.10.043.PubMedView ArticleGoogle Scholar
- Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007, 316: 1497-1502. 10.1126/science.1141319.PubMedView ArticleGoogle Scholar
- Carroll JS, Liu XS, Brodsky AS, Li W, Meyer CA, Szary AJ, Eeckhoute J, Shao W, Hestermann EV, Geistlinger TR, Fox EA, Silver PA, Brown M: Chromosome-wide mapping of estrogen receptor binding reveals long-range regulation requiring the forkhead protein FoxA1. Cell. 2005, 122: 33-43. 10.1016/j.cell.2005.05.008.PubMedView ArticleGoogle Scholar
- Fullwood MJ, Ruan Y: ChIP-based methods for the identification of long-range chromatin interactions. J Cell Biochem. 2009, 107: 30-39. 10.1002/jcb.22116.PubMedPubMed CentralView ArticleGoogle Scholar
- Fullwood MJ, Wei CL, Liu ET, Ruan Y: Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res. 2009, 19: 521-532. 10.1101/gr.074906.107.PubMedPubMed CentralView ArticleGoogle Scholar
- Fullwood MJ, Liu MH, Pan YF, Liu J, Xu H, Mohamed YB, Orlov YL, Velkov S, Ho A, Mei PH, Chew EG, Huang PY, Welboren WJ, Han Y, Ooi HS, Ariyaratne PN, Vega VB, Luo Y, Tan PY, Choy PY, Wansa KD, Zhao B, Lim KS, Leow SC, Yow JS, Joseph R, Li H, Desai KV, Thomsen JS, Lee YK, et al: An oestrogen-receptor-alpha-bound human chromatin interactome. Nature. 2009, 462: 58-64. 10.1038/nature08497.PubMedPubMed CentralView ArticleGoogle Scholar
- Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S: The generic genome browser: a building block for a model organism system database. Genome Res. 2002, 12: 1599-1610. 10.1101/gr.403602.PubMedPubMed CentralView ArticleGoogle Scholar
- Burrows M, Wheeler D: A Block Sorting Lossless Data Compression Algorithm. Technical Report 124. 1994, Palo Alto, CA: Digital Equipment CorporationGoogle Scholar
- Ishwaran H, James LF: Gibbs sampling methods for stick-breaking priors. J Am Stat Assoc. 2001, 96: 161-173. 10.1198/016214501750332758.View ArticleGoogle Scholar
- Dekker J, Rippe K, Dekker M, Kleckner N: Capturing chromosome conformation. Science. 2002, 295: 1306-1311. 10.1126/science.1067799.PubMedView ArticleGoogle Scholar
- Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L, Taillon BE, Chen Z, Tanzer A, Saunders AC, Chi J, Yang F, Carter NP, Hurles ME, Weissman SM, Harkins TT, Gerstein MB, Egholm M, Snyder M: Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007, 318: 420-426. 10.1126/science.1149504.PubMedPubMed CentralView ArticleGoogle Scholar
- Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008, 133: 1106-1117. 10.1016/j.cell.2008.04.043.PubMedView ArticleGoogle Scholar
- Simonis M, Kooren J, de Laat W: An evaluation of 3C-based methods to capture DNA interactions. Nat Methods. 2007, 4: 895-901. 10.1038/nmeth1114.PubMedView ArticleGoogle Scholar
- Simonis M, Klous P, Splinter E, Moshkin Y, Willemsen R, de Wit E, van Steensel B, de Laat W: Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nat Genet. 2006, 38: 1348-1354. 10.1038/ng1896.PubMedView ArticleGoogle Scholar
- ChIA-PET Tool for Comprehensive Chromatin Interaction Analysis with Paired-End Tag Sequencing. [http://chiapet.gis.a-star.edu.sg]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.