PIPE-CLIP: a comprehensive online tool for CLIP-seq data analysis
© Chen et al.; licensee BioMed Central Ltd. 2014
Received: 6 November 2013
Accepted: 22 January 2014
Published: 22 January 2014
CLIP-seq is widely used to study genome-wide interactions between RNA-binding proteins and RNAs. However, there are few tools available to analyze CLIP-seq data, thus creating a bottleneck to the implementation of this methodology. Here, we present PIPE-CLIP, a Galaxy framework-based comprehensive online pipeline for reliable analysis of data generated by three types of CLIP-seq protocol: HITS-CLIP, PAR-CLIP and iCLIP. PIPE-CLIP provides both data processing and statistical analysis to determine candidate cross-linking regions, which are comparable to those regions identified from the original studies or using existing computational tools. PIPE-CLIP is available at http://pipeclip.qbrc.org/.
RNA’s diversity in sequence and structure endows it with crucial roles in cell biology. Recent technological developments, especially the technique of cross-linking immunoprecipitation coupled with high-throughput sequencing (CLIP-seq), have provided powerful tools for studying the roles of RNA regulation in the control of gene expression and the generation of phenotypic complexity. For example, high-throughput sequencing of RNA isolated by cross-linking immunoprecipitation (HITS-CLIP) was used to identify regions of approximately 30 to 60 nucleotides around the peaks of CLIP read clusters that represent binding sites of RNA-binding proteins (RBPs). To increase detection sensitivity, photoactivatable-ribonucleoside-enhanced CLIP (PAR-CLIP) [1, 3] was also developed. PAR-CLIP introduces photoactivatable ribonucleoside analogs, such as 4-thiouridine (4SU) and 6-thioguanosine (6SG), into the RNA of cultured cells to enhance cross-linking efficiency. This cross-linking process usually introduces mutations in sequence tags at RBP binding sites. For example, HITS-CLIP utilizes UV cross-linking of proteins with RNA, which introduces insertions, deletions, or substitutions, depending on the RBPs [1, 4]. PAR-CLIP introduces a distinct spectrum of substitutions (T-to-C for 4SU and G-to-A for 6SG). These cross-linking-induced mutations in HITS-CLIP and PAR-CLIP can be used as markers to identify the precise RBP binding sites. In addition, individual-nucleotide resolution CLIP (iCLIP) was developed to identify cross-linking sites independently of experimentally induced mutations. Instead, cDNA is circularized and then linearized at specific restriction sites, so that the truncation positions are used to locate candidate RBP binding positions [2, 5].
Although several tools have been developed recently, a comprehensive, publicly available pipeline for analyzing CLIP-seq data is still lacking. Piranha is a tool that focuses mainly on peak calling, without considering cross-linking-induced mutations. PARalyzer and wavClusterR are available as R packages for PAR-CLIP data analysis. PARalyzer estimates the likelihood of specific cross-linking-induced mutations, while wavClusterR uses wavelet transformation to distinguish between non-experimentally and experimentally induced transitions. Both tools, however, were developed only for PAR-CLIP data, and R packages may be inconvenient for experimentalists. A newly published tool, RIPseeker, is an R package based on a hidden Markov model for general RIP-seq experiment data analysis. It can process CLIP-seq data, but it does not utilize the specific characteristics of CLIP-seq data. Unlike the tools mentioned above, CLIPZ is an online web tool for analyzing CLIP-seq data with visualization functions. However, CLIPZ does not allow users to specify any analysis parameters. More importantly, it does not provide measurements of the statistical significance associated with specifically identified binding regions.
The aim of PIPE-CLIP is to provide a public web-based resource to process and analyze CLIP-seq data. It provides a unified pipeline for PAR-CLIP, HITS-CLIP and iCLIP, with the following features: (1) user-specified parameters for customized analysis; (2) statistical methods to reduce the number of false positive cross-linking sites; (3) statistical significance levels for each binding site to facilitate planning of future experimental follow-ups; and (4) a user-friendly interface and reproducibility features. Compared to the candidate cross-linking regions identified in the original studies for HITS-CLIP, PAR-CLIP and iCLIP, those identified by PIPE-CLIP are similar (using the cutoff-based method) or slightly more reliable (using the statistics-based method). Furthermore, we demonstrate how different false discovery rate (FDR) cutoffs affect the number of identified candidate binding regions. Finally, we show that PIPE-CLIP performs similarly to existing computational algorithms when identifying cross-linking regions from CLIP-seq data. This empirical study provides some guidance for users to select appropriate cutoff values for the analysis of novel datasets. In summary, PIPE-CLIP provides a user-friendly, web-based, ‘one-stop’ resource for the analysis of various types of CLIP-seq data.
Materials and methods
The PIPE-CLIP analysis pipeline accepts inputs in Sequence Alignment/Map (SAM) format or its binary equivalent (BAM). It preprocesses the data by filtering mapped reads and handling PCR duplicates. The main criteria for read filtering are the minimum matched length and the maximum number of mismatches for each read, and both parameters can be specified by users. Reads that meet both criteria are kept for further analysis. After the filtering step, users have different options to handle PCR duplicates. Based on the current literature for CLIP-seq experiments [13–16], PCR duplicates are usually removed to avoid PCR artifacts, which in turn reduces the false positive rate in the identified cross-linking regions. However, removing duplicates may discard potentially good alignments and affect the results when the sequencing coverage is low. Therefore, PIPE-CLIP allows users to decide whether to keep or remove PCR duplicates from the alignment file.
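The two filtering criteria can be illustrated with a short Python sketch that works directly on SAM text lines. It is only an illustration under stated assumptions, not the PIPE-CLIP code: the function names and the default thresholds (20 and 2) are hypothetical, the matched length is taken as the sum of the M, = and X operations in the CIGAR string, and the mismatch count is read from the optional NM tag, which is an edit distance and therefore also counts inserted and deleted bases.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def matched_length(cigar):
    """Sum of aligned bases (M, = and X operations) in a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X")


def edit_distance(fields):
    """Value of the optional NM:i tag, or None when the tag is absent."""
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None


def passes_filter(sam_line, min_matched=20, max_mismatch=2):
    """Keep a read only if it meets both criteria (thresholds are hypothetical)."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    if flag & 0x4 or cigar == "*":  # unmapped read
        return False
    nm = edit_distance(fields)
    if nm is None:  # assumption: reads without an NM tag are dropped
        return False
    return matched_length(cigar) >= min_matched and nm <= max_mismatch
```

Applying `passes_filter` to every alignment line of a SAM file, and keeping the header lines unchanged, would give the filtered input for the later steps.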
PIPE-CLIP offers two methods for removing PCR duplicates. The first method is based on the read start position and orientation, as described in Zhang et al., while the second method takes sequence into account, along with mapping information. Specifically, the first method chooses a representative read from the cluster of reads that share the same starting genomic position, using the following sequential steps: (1) find the reads with the longest matched lengths; (2) find the reads with the fewest mismatches; (3) find the reads with the highest quality scores; (4) choose one read randomly.
For the second approach, since the reads that map to the same position can still have different mutations, the reads are placed into groups by their sequences and steps 3 and 4 described above are executed to find the representative sequence for each group. For iCLIP data it is important to note that, since PCR duplicates are removed according to random barcodes before mapping, identical sequences in the SAM/BAM file represent real cDNA counts, and will not be removed in this step.
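The two duplicate-removal strategies can be sketched in Python as follows. This is a hedged illustration rather than the PIPE-CLIP implementation: the `Read` record, its field names and the helper functions are hypothetical, the quality score is assumed to be a single number per read, and the random choice in step 4 is seeded only so that the example is repeatable.

```python
import random
from collections import defaultdict, namedtuple

# Hypothetical read record; "quality" stands for one summary score per read.
Read = namedtuple("Read", "name chrom start strand seq matched mismatches quality")


def _best(reads, key):
    """Keep only the reads that reach the maximum of key."""
    top = max(key(r) for r in reads)
    return [r for r in reads if key(r) == top]


def pick_representative(reads, rng, use_alignment_steps=True):
    candidates = list(reads)
    if use_alignment_steps:
        candidates = _best(candidates, lambda r: r.matched)      # step 1
        candidates = _best(candidates, lambda r: -r.mismatches)  # step 2
    candidates = _best(candidates, lambda r: r.quality)          # step 3
    return rng.choice(candidates)                                # step 4


def remove_duplicates(reads, by_sequence=False, seed=0):
    """Method 1 groups by start and orientation; method 2 also groups by sequence."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in reads:
        key = (r.chrom, r.start, r.strand)
        if by_sequence:
            key += (r.seq,)
        groups[key].append(r)
    return [
        pick_representative(g, rng, use_alignment_steps=not by_sequence)
        for g in groups.values()
    ]
```

With `by_sequence=True`, reads that start at the same position but carry different mutations fall into different groups, so each distinct sequence keeps one representative.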
Identifying enriched clusters
To identify enriched peaks, adjacent mapped reads are clustered together if they overlap each other by at least one nucleotide, similar to ChIP-seq processing. The clusters are used for further analysis. Let r_i denote the total number of reads within the ith cluster of length s_i. Longer clusters tend to have greater read counts, so the variable s_i is needed to adjust for the effect of length when modeling r_i. Given that all clusters receive at least one read, we propose a model equipped with the zero-truncated negative binomial (ZTNB) likelihoods.
where f(s) is used as an explanatory variable that represents the functional dependence of the read count on the cluster length. The link functions are slightly different from what has been typically used for the ZTNB regression model. In our model, we use f(s) instead of s as a predictor, so that the model is more general in the sense that the mean and variance function for r is allowed to be non-linear with respect to s. This model allows us to test whether a cluster is significantly enriched by reads, while adjusting for the span of the cluster. For a cluster of length s_i and read count r_i, the P-value is defined as the probability of observing read counts ≥ r_i. That is, the P-value = P(r ≥ r_i | s = s_i), where the probability law is derived from Equation 1.
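The clustering rule and the P-value definition above can be sketched in Python as follows. This is a minimal illustration, not the PIPE-CLIP implementation: the function names are hypothetical, reads are assumed to be half-open (start, end) intervals on one chromosome and strand, and the negative binomial mean `mu` and size `theta` are assumed to have been fitted already for a cluster of the given length.

```python
import math


def cluster_reads(intervals):
    """Merge half-open (start, end) intervals that overlap by at least 1 nt.

    Returns a list of (start, end, read_count) tuples.
    """
    clusters = []
    for start, end in sorted(intervals):
        if clusters and start < clusters[-1][1]:  # shares at least one base
            s, e, n = clusters[-1]
            clusters[-1] = (s, max(e, end), n + 1)
        else:
            clusters.append((start, end, 1))
    return clusters


def nb_pmf(k, mu, theta):
    """Negative binomial probability of count k with mean mu and size theta."""
    log_p = (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def ztnb_upper_tail(r, mu, theta):
    """P(R >= r) when R follows a negative binomial truncated at zero."""
    if r <= 1:
        return 1.0  # every cluster has at least one read
    p0 = nb_pmf(0, mu, theta)
    below = sum(nb_pmf(k, mu, theta) for k in range(1, r))
    return max(0.0, 1.0 - p0 - below) / (1.0 - p0)
```

The subtraction in `ztnb_upper_tail` loses precision for very small tail probabilities, so a production version would more likely sum the tail directly or work in log space.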
For the model inference, we first estimate f(s) using the local linear regression of r on s. The estimate is then plugged into the ZTNB regression as a predictor. To obtain maximum likelihood estimates (MLEs) of α and β, the conditional maximization method is implemented along with Fisher’s scoring method for α and the Newton-Raphson method for β. For more details about the model inference, please check the source code. FDRs are calculated using the Benjamini-Hochberg procedure. PIPE-CLIP reports the enriched clusters based on a user-specified FDR cutoff (the default is 0.01).
Selecting reliable mutation/truncation sites
The identified cross-linking-induced mutations (for PAR-CLIP and HITS-CLIP) or cDNA truncations (for iCLIP) are clustered at each genomic location. For PAR-CLIP, only the characteristic mutations specified by users are included in the analysis. For HITS-CLIP, since cross-linking-induced mutations depend on the protein of interest, PIPE-CLIP processes substitutions, deletions and insertions separately, to allow the users to choose the type of cross-linking-induced mutation. For iCLIP, all of the cDNA truncations are included. Each location (one nucleotide) is characterized by two parameters (k_i, m_i), where k_i is the total number of mapped reads covering that location, and m_i is the number of specific mutations/truncations at location i. At each genomic location, m_i is modeled by a binomial distribution with size k_i and a success rate equal to the read coverage (that is, the sum of matched lengths of all reads that passed the filtering criteria in the data preprocessing step, divided by the genome size), and a P-value is calculated to assess the statistical significance of the mutation rate. Finally, FDRs are calculated from the P-values using the Benjamini-Hochberg method, and the locations with FDRs less than a user-specified cutoff are reported as reliable mutation/truncation sites.
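A per-location binomial test of this kind can be sketched in Python as follows. The function names are hypothetical, and two points are assumptions rather than statements from the text: the P-value is taken as the upper tail P(X ≥ m_i), and the success rate is capped at 1 in case the total matched length exceeds the genome size.

```python
import math


def background_rate(matched_lengths, genome_size):
    """Success rate: total matched length of filtered reads over genome size."""
    return min(1.0, sum(matched_lengths) / genome_size)  # cap is an assumption


def binomial_upper_tail(m, k, p):
    """P(X >= m) for X ~ Binomial(k, p); the upper tail is an assumption."""
    return sum(
        math.comb(k, j) * p ** j * (1.0 - p) ** (k - j)
        for j in range(m, k + 1)
    )


# Example with invented numbers: 6 mutated reads out of 20 covering reads.
rate = background_rate([25, 25, 50], genome_size=1000)
p_value = binomial_upper_tail(6, 20, rate)
```

The resulting P-values, one per location, would then go through the same Benjamini-Hochberg adjustment used for the enriched clusters.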
Identifying candidate cross-linking regions
where χ²₄ is a chi-square random variable with four degrees of freedom.
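A chi-square reference distribution with four degrees of freedom is what Fisher's method yields when two independent P-values are combined, and Fisher's method appears in the reference list. Assuming that the two P-values being combined are the cluster enrichment P-value and the mutation/truncation P-value, the calculation can be sketched in Python as below. The function name and the pairing of inputs are assumptions; the closed form used for the chi-square tail holds specifically for four degrees of freedom.

```python
import math


def fisher_combine_two(p1, p2):
    """Combine two independent P-values (both > 0) with Fisher's method.

    Under the null, X = -2 * (ln p1 + ln p2) follows a chi-square
    distribution with 4 degrees of freedom, whose upper tail is
    exp(-x / 2) * (1 + x / 2).
    """
    x = -2.0 * (math.log(p1) + math.log(p2))
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


# Example with invented inputs: an enrichment P-value and a mutation P-value.
combined = fisher_combine_two(0.05, 0.05)
```

Two individually borderline P-values give a smaller combined value, which is the intended effect of requiring both lines of evidence.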
PIPE-CLIP generates one BED file, containing the candidate cross-linking regions for the characteristic mutations/truncation sites for PAR-CLIP and iCLIP data, while it also generates a BED file for each mutation type (substitution, deletion or insertion) separately for HITS-CLIP data.
Annotating candidate cross-linking regions
Finally, the candidate cross-linking regions are annotated using HOMER, a suite of tools for motif discovery and next-generation sequencing analysis. Annotation is available for the human (hg19/GRCh37.67) and mouse (mm10/GRCm38.69) genomes and provides information about the specific transcripts that are bound by the RBP of interest.
Results and discussion
PIPE-CLIP’s performance on PAR-CLIP data
PAR-CLIP sequencing data of three FET family proteins were downloaded from the DNA Data Bank of Japan [DDBJ: SRA025082]. We mapped reads to the human genome (hg19) using Novoalign, and kept the uniquely mapped reads. To evaluate the performance of the PIPE-CLIP analysis, we compared the results from the PIPE-CLIP analysis with the original publication and also checked whether the results were consistent with the biological expectation.
[Table: Cross-linking regions identified by PIPE-CLIP for the FET family proteins data. Column: number of cross-linking regions.]
[Table: Comparison of the overlapping frequency of the 1,000 top enriched cross-linking regions of FET proteins identified in the original study versus by PIPE-CLIP software. Columns: number of genes (Hoell et al.); number of genes (PIPE-CLIP); P-value (Fisher’s exact test). Rows: FUS overlap TAF15; FUS overlap EWSR1; EWSR1 overlap TAF15.]
PIPE-CLIP’s performance on HITS-CLIP data
[Table: Cross-linking regions identified by PIPE-CLIP for the Ago HITS-CLIP data.]
PIPE-CLIP’s performance on iCLIP data
[Table: PIPE-CLIP results summary for the Nova iCLIP data. Columns: number of enriched clusters; number of reliable truncations; number of cross-linking regions.]
Comparing PIPE-CLIP’s performance with other computational tools
Recently, several computational tools were developed for analyzing PAR-CLIP data. Using the FET family protein data described above, we compared PIPE-CLIP’s performance with published computational tools, including Piranha, PARalyzer and MACS2. Piranha is a universal peak caller for CLIP-seq and RIP-seq data that bins all the mapped reads according to their starting point on the genome. The total number of reads counted in each bin, together with other covariates such as mappability, is used to fit a user-defined distribution model to determine whether a specific bin is enriched. For this analysis, a negative binomial distribution was selected since it generally performs well and matches the distribution used in PIPE-CLIP. MACS2 is a popular peak caller for ChIP-seq data, but it is also used for peak calling in various other types of high-throughput sequencing data. MACS2 models peaks on positive and negative strands based on a Poisson distribution. After that, peaks from positive and negative strands are paired and moved in the 3’ direction until their middle points are at the same position, and that position is then reported as a peak summit. The default parameters of MACS2 were used to generate results. PARalyzer is a computational algorithm designed for PAR-CLIP data. It groups adjacent mapped reads and generates two smoothed kernel density estimates within each read group, one for T-to-C transitions and one for non-transition events. Nucleotides within the read groups that maintain a minimum read depth, and where the likelihood of T-to-C conversion is higher than non-conversion, are considered interaction sites. Again, we used the default parameters in the PARalyzer package to identify cross-linking regions for the three FET family proteins.
Few computational tools currently exist to analyze HITS-CLIP or iCLIP data. PARalyzer was designed for PAR-CLIP data analysis, and MACS2, designed for ChIP-seq data, does not consider mutation or truncation information. We therefore applied the Piranha algorithm to the Ago HITS-CLIP data and the Nova iCLIP data, but it could not identify any binding targets using an FDR cutoff of 5%. As shown in the previous results, PIPE-CLIP identified reasonable cross-linking regions using the same FDR cutoff. In addition, simulation studies showed that PIPE-CLIP performed better than CIMS (Additional file 1).
PIPE-CLIP is a web-based resource designed for detecting cross-linking regions in HITS-CLIP, PAR-CLIP and iCLIP data. It is based on the Galaxy open-source framework, and accepts SAM/BAM format as input. It reports cross-linking regions with high reliability. Comparative analysis with several publicly available data sets and several existing computational tools showed that PIPE-CLIP performs comparably to other methods for identifying cross-linking sites from CLIP-seq experiments. Users can easily tailor the parameters of each processing step, choose statistical thresholds for identifying candidate binding sites, and compare all the results. All such user-specified parameters are well documented, and intermediate outputs are provided, so that users can conveniently trace back the analysis steps. Details of usage are available online. A script (barcodeRemover) to remove barcodes and PCR duplicates for iCLIP is also provided at the same website. In conclusion, PIPE-CLIP provides a comprehensive, user-friendly and reproducible analytical resource for various types of CLIP-seq data.
Abbreviations
CIMS: crosslinking-induced mutation sites
CLIP-seq: cross-linking immunoprecipitation coupled with high-throughput sequencing
FDR: false discovery rate
HITS-CLIP: high-throughput sequencing of RNA isolated by cross-linking immunoprecipitation
iCLIP: individual-nucleotide resolution CLIP
PCR: polymerase chain reaction
ZTNB: zero-truncated negative binomial.
This work was supported by NIH 5R01CA152301 and 4R33DA027592, and NASA grants NNJ05HD36G, CPRIT R1008, NIH R01CA120185, P01CA134292, and CPRIT RP101251.
References
1. Licatalosi DD, Darnell RB: RNA processing and its regulation: global insights into biological networks. Nat Rev Genet. 2010, 11: 75-87.
2. Darnell RB: HITS-CLIP: panoramic views of protein-RNA regulation in living cells. WIREs RNA. 2010, 1: 266-286. 10.1002/wrna.31.
3. Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A, Ascano M, Jungkamp A-C, Munschauer M, Ulrich A, Wardle GS, Dewell S, Zavolan M, Tuschl T: Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell. 2010, 141: 129-141. 10.1016/j.cell.2010.03.009.
4. Zhang C, Darnell RB: Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data. Nat Biotechnol. 2011, 29: 607-614. 10.1038/nbt.1873.
5. Konig J, Zarnack K, Rot G, Curk T, Kayikci M, Zupan B, Turner DJ, Luscombe NM, Ule J: iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat Struct Mol Biol. 2010, 17: 909-915. 10.1038/nsmb.1838.
6. Uren PJ, Bahrami-Samani E, Burns SC, Qiao M, Karginov FV, Hodges E, Hannon GJ, Sanford JR, Penalva LOF, Smith AD: Site identification in high-throughput RNA–protein interaction data. Bioinformatics. 2012, 28: 3013-3020. 10.1093/bioinformatics/bts569.
7. Corcoran DL, Georgiev S, Mukherjee N, Gottwein E, Skalsky RL, Keene JD, Ohler U: PARalyzer: definition of RNA binding sites from PAR-CLIP short-read sequence data. Genome Biol. 2011, 12: R79. 10.1186/gb-2011-12-8-r79.
8. Sievers C, Schlumpf T, Sawarkar R, Comoglio F, Paro R: Mixture models and wavelet transforms reveal high confidence RNA-protein interaction sites in MOV10 PAR-CLIP data. Nucleic Acids Res. 2012, 40: e160. 10.1093/nar/gks697.
9. Li Y, Zhao DY, Greenblatt JF, Zhang Z: RIPSeeker: a statistical package for identifying protein-associated transcripts from RIP-seq experiments. Nucleic Acids Res. 2013, 41: e94. 10.1093/nar/gkt142.
10. Khorshid M, Rodak C, Zavolan M: CLIPZ: a database and analysis environment for experimentally determined binding sites of RNA-binding proteins. Nucleic Acids Res. 2011, 39: D245-D252. 10.1093/nar/gkq940.
11. PIPE-CLIP source code. [https://github.com/QBRC/PIPE-CLIP]
12. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.
13. Chou C-H, Lin F-M, Chou M-T, Hsu S-D, Chang T-H, Weng S-L, Shrestha S, Hsiao C-C, Hung J-H, Huang H-D: A computational approach for identifying microRNA-target interactions using high-throughput CLIP and PAR-CLIP sequencing. BMC Genomics. 2013, 14: S2.
14. Lebedeva S, Jens M, Theil K, Schwanhäusser B, Selbach M, Landthaler M, Rajewsky N: Transcriptome-wide analysis of regulatory interactions of the RNA-binding protein HuR. Mol Cell. 2011, 43: 340-352. 10.1016/j.molcel.2011.06.008.
15. Licatalosi DD, Yano M, Fak JJ, Mele A, Grabinski SE, Zhang C, Darnell RB: Ptbp2 represses adult-specific splicing to regulate the generation of neuronal precursors in the embryonic brain. Genes Dev. 2012, 26: 1626-1642. 10.1101/gad.191338.112.
16. Macias S, Plass M, Stajuda A, Michlewski G, Eyras E, Cáceres JF: DGCR8 HITS-CLIP reveals novel functions for the Microprocessor. Nat Struct Mol Biol. 2012, 19: 760-766. 10.1038/nsmb.2344.
17. Hoell JI, Larsson E, Runge S, Nusbaum JD, Duggimpudi S, Farazi TA, Hafner M, Borkhardt A, Sander C, Tuschl T: RNA targets of wild-type and mutant FET family proteins. Nat Struct Mol Biol. 2011, 18: 1428-1431. 10.1038/nsmb.2163.
18. Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 2008, 36: 5221-5231. 10.1093/nar/gkn488.
19. Cleveland WS, Grosse E, Shyu WM: Local regression. Statistical Models in S. Edited by: Chambers JM, Hastie TJ. 1992, California: Wadsworth & Brooks/Cole, 312-316.
20. Agresti A: Introduction to generalized linear models. Categorical Data Analysis. 2002, New Jersey: John Wiley & Sons, 146-148. 2nd edition.
21. PIPE-CLIP source code for identifying enriched clusters. [https://github.com/QBRC/PIPE-CLIP/blob/master/ZTNB.R]
22. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995, 57: 289-300.
23. Fisher RA: Tests of goodness of fit, independence and homogeneity; with table of χ2. Statistical Methods for Research Workers. 1932, Edinburgh: Oliver and Boyd, 97-105. 4th edition.
24. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK: Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010, 38: 576-589. 10.1016/j.molcel.2010.05.004.
25. Novocraft. [http://www.novocraft.com/main/index.php]
26. Chi SW, Zang JB, Mele A, Darnell RB: Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature. 2009, 460: 479-486.
27. Chi SW, Hannon GJ, Darnell RB: An alternative mode of microRNA target recognition. Nat Struct Mol Biol. 2012, 19: 321-327. 10.1038/nsmb.2230.
28. Sugimoto Y, Konig J, Hussain S, Zupan B, Curk T, Frye M, Ule J: Analysis of CLIP and iCLIP methods for nucleotide-resolution studies of protein-RNA interactions. Genome Biol. 2012, 13: R67. 10.1186/gb-2012-13-8-r67.
29. Dredge BK, Darnell RB: Nova regulates GABAA receptor γ2 alternative splicing via a distal downstream UCAU-rich intronic splicing enhancer. Mol Cell Biol. 2003, 23: 4687-4700. 10.1128/MCB.23.13.4687-4700.2003.
30. Dredge BK, Stefani G, Engelhard CC, Darnell RB: Nova autoregulation reveals dual functions in neuronal splicing. EMBO J. 2005, 24: 1608-1620. 10.1038/sj.emboj.7600630.
31. Buckanovich RJ, Darnell RB: The neuronal RNA binding protein Nova-1 recognizes specific RNA targets in vitro and in vivo. Mol Cell Biol. 1997, 17: 3194-3201.
32. Yang YYL, Yin GL, Darnell RB: The neuronal RNA-binding protein Nova-2 is implicated as the autoantigen targeted in POMA patients with dementia. Proc Natl Acad Sci USA. 1998, 95: 13254-13259. 10.1073/pnas.95.22.13254.
33. Ule J, Jensen KB, Ruggiu M, Mele A, Ule A, Darnell RB: CLIP identifies Nova-regulated RNA networks in the brain. Science. 2003, 302: 1212-1215. 10.1126/science.1090095.
34. Licatalosi DD, Mele A, Fak JJ, Ule J, Kayikci M, Chi SW, Clark TA, Schweitzer AC, Blume JE, Wang X, Darnell JC, Darnell RB: HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature. 2008, 456: 464-469. 10.1038/nature07488.
35. Zhang Y, Liu T, Meyer C, Eeckhoute J, Johnson D, Bernstein B, Nussbaum C, Myers R, Brown M, Li W, Liu X: Model-based Analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9: R137. 10.1186/gb-2008-9-9-r137.
36. Han T, Kato M, Xie S, Wu L, Mirzaei H, Pei J, Chen M, Xie Y, Allen J, Xiao G, McKnight S: Cell-free formation of RNA granules: bound RNAs identify features and components of cellular assemblies. Cell. 2012, 149: 768-779. 10.1016/j.cell.2012.04.016.
37. PIPE-CLIP galaxy website. [http://pipeclip.qbrc.org/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.