BioWardrobe: an integrated platform for analysis of epigenomics and transcriptomics data
© Kartashov and Barski. 2015
Received: 8 January 2015
Accepted: 7 July 2015
Published: 7 August 2015
High-throughput sequencing has revolutionized biology by enhancing our ability to perform genome-wide studies. However, due to lack of bioinformatics expertise, modern technologies are still beyond the capabilities of many laboratories. Herein, we present the BioWardrobe platform, which allows users to store, visualize and analyze epigenomics and transcriptomics data using a biologist-friendly web interface, without the need for programming expertise. Predefined pipelines allow users to download data, visualize results on a genome browser, calculate RPKMs (reads per kilobase per million) and identify peaks. Advanced capabilities include differential gene expression and binding analysis, and creation of average tag -density profiles and heatmaps. BioWardrobe can be found at http://biowardrobe.com.
The recent proliferation of next-generation sequencing (NGS)-based methods for analysis of gene expression, chromatin structure and protein–DNA interactions has opened new horizons for molecular biology. These methods include RNA sequencing (RNA-Seq) , chromatin immunoprecipitation sequencing (ChIP-Seq) , DNase I sequencing (DNase-Seq) , micrococcal nuclease sequencing (MNase-Seq) , assay for transposase-accessible chromatin sequencing (ATAC-Seq) , and others. On the “wet lab” side, these methods are largely well established and can be performed by experienced molecular biologists; however, analysis of the sequenced data requires bioinformatics expertise that many molecular biologists do not possess. Re-utilizing published datasets is also challenging: although authors usually comply with the longstanding requirement to deposit raw data files into databases such as the Sequence Read Archive (SRA) or Gene Expression Omnibus (GEO), it is impossible to analyze these datasets without special expertise. Even when processed data files (e.g., gene expression values) are available, direct comparison between datasets is ill-advised because different laboratories use different pipelines (or different software versions). This situation means that biologists require the help of bioinformaticians even for the simplest of tasks, such as viewing their own data on a genome browser, putting these exciting techniques beyond the reach of many laboratories. Even when bioinformaticians are available, differences in priorities within collaborations can result in delays and misunderstandings that are damaging to the research effort. An optimal way to mitigate these problems is to enable biologists to perform at least basic tasks without the help of bioinformaticians by creation of user-friendly data analysis software.
Multiple standalone programs and web services are available for the analysis of NGS data. However, the majority of currently available tools have a command-line interface, perform one specific task and typically require file conversions between them. Some popular packages such as HOMER  or Tuxedo [7, 8] are organized into suites and include components capable of performing multiple tasks, thus solving the interoperability problem. HOMER, for example, includes tools for calling peaks, identifying motifs and performing analysis of Hi-C data. However, this excellent tool still requires the use of the command line and has limited visualization options. The commercial programs Genespring , Partek  and Golden Helix  can be run on regular desktop computers and allow analysis of gene expression or genetic variation. However, users have to load the data manually and store it on their desktop computers; given the sheer volume of NGS datasets, this setup makes data analysis complicated at best. Furthermore, these tools do not allow for seamless integration of multiple, published or locally produced datasets. Illumina Basespace  and Galaxy server  allow for both storage and analysis of data and have integrated viewing tools. However, they require transfer of data outside the institution (which may be prohibited by HIPAA (Health Insurance Portability and Accountability Act of 1996) regulations in some cases) and provide only limited storage space for user data. Although Galaxy provides the opportunity to run tools without using a command-line interface, users still have to manage file type conversions and select detailed parameters each time, which requires a deep understanding of each tool and file format. Absence of stable pipelines may result in inexperienced users comparing “apples to oranges”. In summary, few of the available tools provide a biologist-friendly interface, and none integrate such an interface with data storage, display and analysis.
To demonstrate the utility of the included quality controls and the ability of BioWardrobe to integrate and analyze data from various sources, we have performed re-analysis of two published datasets. The first study examined gene expression and chromatin changes during differentiation of human naïve T helper (Thn) cells into T helper type 1 (Th1) cells (SRA082670 ). The dataset included Helicos RNA-Seq performed in triplicates for both resting Thn cells and cells differentiated in Th1 conditions for 72 hours (Th1 cells) and H3K4me3 (histone 3 lysine 4 trimethylation) ChIP-Seq data of Th1 cells. In order to identify differentiation-related chromatin changes, we also included our own H3K4me3 ChIP-Seq data for Thn cells. After we entered sample information into the system (Additional file 1: Figure S2a), BioWardrobe downloaded the dataset and performed basic analysis (Additional file 1: Figure S3b). ChIP-Seq data demonstrated the expected percentage of reads mapped and base frequency (Additional file 1: Figure S3a–d), average tag density profiles showed high enrichment at promoters (Additional file 1: Figure S3e and f) and MACS2 identified a large number of islands (areas of enrichment), the majority of which (68–77 %) were located at promoters (Additional file 1: Figure S3g–j). However, RNA-Seq results demonstrated poor mapping to the human transcriptome, poor coverage and potential DNA and ribosomal RNA contamination (Additional file 1: Figs. S2 and S4). Keeping these problems in mind, we continued with data analysis and performed a comparison of gene expression using DESeq2. Replicates were defined, genes were grouped by common transcription start site (TSS) and differentially expressed genes were identified. These results were used to define lists of genes that were expressed or silent in both Thn and Th1 cells or induced during differentiation. Next, H3K4me3 average tag density profiles were created for these three gene lists (Fig. 1b). As demonstrated in the graphs and Mann-Whitney-Wilcoxon (MWW) statistical analysis (Fig. 1c), genes that are expressed in both Thn and Th1 cells have higher levels of H3K4me3 at their promoters than genes that are silent in both cell types. Interestingly, differentiation-induced genes had intermediate levels of this modification in naïve cells, in which they were silent, suggesting that H3K4me3 poises inducible genes for expression during differentiation.
In summary, we have developed a semi-automated system for storage, visualization and analysis of NGS data. BioWardrobe has been already used to analyze data in several publications [25–29]. The system can be installed on Mac or Linux computers and can provide a data analysis solution for an entire laboratory or institution.
Materials and methods
BioWardrobe allows users to upload, store and analyze NGS data. The workflow consists of two parts: basic and advanced analysis (Additional file 1: Figure S1). The basic analysis includes operations that do not require comparison of samples: data download, quality control, calculation of RPKMs (reads per kilobase of transcript per million reads mapped), peak identification and upload to an integrated mirror of the UCSC genome browser. Advanced analysis includes comparing gene expression or ChIP-Seq profiles between samples. BioWardrobe can work with multiple genomes (our instance currently uses human, mouse, rat, fly and frog) and additional genomes are easy to add, especially if the genome of interest is represented on the UCSC genome browser. A flexible data ownership system is implemented: though all users can see all experiments on a local mirror of the UCSC genome browser, only members of the laboratories that own the data can access and analyze datasets within the BioWardrobe web interface or download it. Laboratory-level administrators can elect to share data with other laboratories. However, trusted bioinformaticians can have access to all datasets outside of the BioWardrobe interface — e.g., via R/RStudio. We believe that this setup strikes a balance between maintaining data ownership and encouraging collaborations.
Basic analysis includes operations that are performed on a single library (Additional file 1: Figure S1a). Analysis starts by entering the experiment description into BioWardrobe. This information will be used to select the appropriate genome and analysis pipeline. Raw data can be directly downloaded by BioWardrobe via hypertext transfer protocol (http) or file transfer protocol (ftp) from core facilities or internet databases such as GEO or SRA. Compressed or uncompressed FASTQ (.fastq) or SRA (.sra) files can be used. We elected not to use prealigned BAM (.bam) files to ensure uniform alignment of samples.
For ChIP-Seq and similar experiments, reads are aligned to the genome with Bowtie , quality control analysis is conducted and data are summarized in a table (Additional file 1: Figure S2b). In addition to basic statistics (percentages of mapped/unmapped/non-uniquely mapped reads and average fragment length), BioWardrobe displays several other quality control measures. Base frequency plots are used to estimate adapter contamination, a frequent occurrence in low-input ChIP-Seq experiments (Additional file 1: Figure S3c). Average tag density profiles can be used to estimate ChIP enrichment for promoter proximal histone modifications (e.g., H3K4me3; Additional file 1: Figure S3e and f). The genome browser can be used to visually compare results with other experiments in the database (Additional file 1: Figure S3g and h). ChIP-Seq results are displayed on the genome browser as coverage per million reads mapped. For paired-end reads, coverage is calculated as the number of fragments covering each base pair (bp). To obtain coverage for single-read experiments, average fragment length is calculated by model-based analysis of ChIP-Seq (MACS2) , and individual reads are extended to this length in the 3’ direction. Islands (areas of enrichment) identified by MACS2 are displayed both on the browser (Additional file 1: Figure S3g and h) and as a table together with the nearest genes. This table can also be used to select a cutoff for significant peaks that will be used in the downstream analysis. Additionally, the fasta sequences of peaks can be obtained with the click of a button and used with third-party tools (e.g., MEME-ChIP ) to produce sequence logos. Use of different parameters or pipelines for different antibodies (e.g., “broad peaks” MACS2 option for H3K27me3) is possible. Additionally, users can elect to use one of the experiments in the database as an “input” control for MACS2. The distribution of the islands between genomic areas (promoters, exons, etc.) is displayed as a stacked bar graph (Additional file 1: Figure S3i and k).
For RNA-Seq analysis, reads are aligned to the genome using RNA STAR  provided with an appropriate annotation (e.g., RefSeq; other annotations can also be used). The quality control tab displays the number of reads aligned within and outside the transcriptome. The percentage of the reads mappable to ribosomal DNA is displayed to estimate the quality of ribosomal RNA depletion (Additional file 1: Figure S2b). Interpretation of quality control data is shown in Additional file 1: Figure S4. Data are deposited on the browser, and RPKM values are calculated for each transcript (algorithm to be described elsewhere). Depending on the application, RPKM values can be presented for each transcript or summed up for each TSS (for gene expression studies) or for each gene (for functional studies, e.g., Gene Ontology).
If satisfied with the quality of data obtained from sequencing, a user can proceed to advanced analysis, which involves integration of information from multiple experiments. For gene expression analysis, the typical task is identifying differentially expressed genes. We elected to incorporate the DESeq1/2 algorithm [18, 31] for this purpose because it does not require recreating transcript models and does not make too many assumptions. In order to perform gene expression profiling, a user can define replicates and utilize the DESeq algorithm to calculate p values and fold changes for all genes. On the basis of DESeq results, lists of genes whose expression changes can be created within BioWardrobe using expression levels, fold change, or p/q values, as well as other parameters, and downloaded, if needed, in a table form for further analysis (e.g., gene set enrichment analysis).
The gene sets can also be used to create average tag density profiles and heatmaps within BioWardrobe (Fig. 1b). Average tag density profiles are used to compare the enrichment of histone modifications or other proteins around the TSS or the gene bodies between different gene sets. Often gene bodies are used to estimate enrichment, for instance when comparing the levels of positive marks, such as H3K4me3, between expressed and silent genes. Heatmaps provide similar information but allow comparisons of modifications between individual genes. Statistical comparison of tag densities between groups of genes using MWW test can be performed by highlighting the area of interest with a mouse (Fig. 1b, insert). All graphs can be downloaded in publication-quality scalable vector graphics (SVG) format.
For ChIP-Seq, the task is usually the identification of areas that have different levels of binding between samples. The difficulty here is that the signal-to-background ratio (enrichment) is usually slightly different between ChIP-Seq experiments; thus, several assumptions have to be made in order to compare islands of enrichment. BioWardrobe uses the MAnorm algorithm , which assumes that modifications do not change in the majority of areas. This allows MAnorm to adjust for differential levels of enrichment between experiments. The lists of islands, fold changes, accompanying p values and the nearest genes are presented in table form, and islands can be viewed in the browser with the push of a button.
Although we sincerely believe that the set of quality control measurements and tools that we provide is the most useful, this may be a matter of personal preference. In order to allow for easy addition of custom analysis, we have incorporated R language script editing into the BioWardrobe web interface for both basic and advanced analysis steps. System administrators can add custom R scripts in the R tab, and biologists can run these scripts via the graphical web interface. In the basic analysis, customized R scripts can be run for each sample automatically or for selected samples. As an example, we have added scripts that provide the histogram of read pile-up or island length for ChIP-Seq data or gene body coverage and RPKM histogram for RNA-Seq data (Additional file 1: Figure S5). In the advanced analysis R interface, customized scripts can be provided by system administrators. Users can select records of interest via the graphical user interface and run the customized scripts as needed. As an example, we provide a principal component analysis (PCA) script that can be used for analysis of RNA-Seq data and an IDR2 script that can be used to analyze reproducibility of ChIP-Seq experiments (see  for output).
Chromatin immunoprecipitation sequencing
DNase I hypersensitive site sequencing
Gene Expression Omnibus
Histone 3 lysine 4 trimethlation
Ribonucleic acid sequencing
Reads per kilobase of transcript per million reads mapped
Sequence Read Archive (formerly known as Short Read Archive)
T helper type 1
naïve T helper
Transcription start site
University of California, Santa Cruz
The authors would like to thank Raja Jothi (National Institute of Environmental Health Sciences), Dustin E. Schones (City of Hope) and Matt Weirauch (Cincinnati Children’s Hospital Medical Center) for critically reading the manuscript and Shawna Hottinger (Cincinnati Children’s Hospital Medical Center) for editorial assistance. The work was supported in part by the Cincinnati Children’s Research Foundation, NHLBI NIH Career Transition Award to A.B. (K22 HL098691) and Center for Clinical and Translational Science and Training Innovative Core Facility grant (NIH/NCRR Institutional Clinical and Translational Science Award, 8UL1TR000077-06).
The cover image was designed by Yaroslav Ivanov.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–8.PubMedView Article
- Barski A, Cuddapah S, Cui K, Roh T-Y, Schones DE, Wang Z, et al. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129:823–37.PubMedView Article
- Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, et al. High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008;132:311–22.PubMed CentralPubMedView Article
- Schones DE, Cui K, Cuddapah S, Roh T-Y, Barski A, Wang Z, et al. Dynamic regulation of nucleosome positioning in the human genome. Cell. 2008;132:887–98.PubMedView Article
- Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods. 2013;10:1213–8.PubMed CentralPubMedView Article
- Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38:576–89.PubMed CentralPubMedView Article
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7:562–78.PubMed CentralPubMedView Article
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.PubMed CentralPubMedView Article
- Genespring NGS. http://www.genomics.agilent.com/en/NGS-Analysis-Software/GeneSpring-NGS-versions-12-5-12-6-1-/?cid=cat170010&tabId=AG-PR-1062.
- Partek. http://www.partek.com/.
- Golden Helix. http://www.goldenhelix.com.
- Illumina basespace. https://basespace.illumina.com/home/index.
- Goecks J, Nekrutenko A, Taylor J. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;11:R86.PubMed CentralPubMedView Article
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The human genome browser at UCSC. Genome Res. 2002;12:996–1006.PubMed CentralPubMedView Article
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21.PubMed CentralPubMedView Article
- FASTX. http://hannonlab.cshl.edu/fastx_toolkit/index.html.
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9:R137.PubMed CentralPubMedView Article
- Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106.PubMed CentralPubMedView Article
- Shao Z, Zhang Y, Yuan G-C, Orkin SH, Waxman DJ. MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets. Genome Biol. 2012;13:R16.PubMed CentralPubMedView Article
- BioWardrobe. http://biowardrobe.com.
- BioWardrobe github repository: https://github.com/SciDAP/biowardrobe.
- BioWardrobe demo: http://demo.biowardrobe.com.
- Hawkins RD, Larjo A, Tripathi SK, Wagner U, Luu Y, Lönnberg T, et al. Global chromatin state analysis reveals lineage-specific enhancers during the initiation of human T helper 1 and T helper 2 cell polarization. Immunity. 2013;38:1271–84.PubMedView Article
- Kidder BL, Hu G, Zhao K. KDM5B focuses H3K4 methylation near promoters and enhancers during embryonic stem cell self-renewal and differentiation. Genome Biol. 2014;15:R32.PubMed CentralPubMedView Article
- Rochman Y, Yukawa M, Kartashov AV, Barski A. Functional characterization of human T cell hyporesponsiveness induced by CTLA4-Ig. PLoS One. 2015;10, e0122198.PubMed CentralPubMedView Article
- Hasegawa K, Sin H-S, Maezawa S, Broering TJ, Kartashov AV, Alavattam KG, et al. SCML2 establishes the male germline epigenome through regulation of histone H2A ubiquitination. Dev Cell. 2015;574–588.
- Rochman M, Kartashov AV, Caldwell JM, Collins MH, Stucke EM, Kc K, et al. Neurotrophic tyrosine kinase receptor 1 is a direct transcriptional and epigenetic target of IL-13 involved in allergic inflammation. Mucosal Immunol. 2014;8:785–98.PubMedView Article
- Bouffi C, Rochman M, Zust CB, Stucke EM, Kartashov A, Fulkerson PC, et al. IL-33 markedly activates murine eosinophils by an NF-κB-dependent mechanism differentially dependent upon an IL-4-driven autoinflammatory loop. J Immunol. 2013;191:4317–25.PubMedView Article
- Sin HS, Barski A, Zhang F, Kartashov AV, Nussenzweig A, Chen J, et al. RNF8 regulates active epigenetic modifications and escape gene activation from inactive sex chromosomes in post-meiotic spermatids. Genes Dev. 2012;26:2737–48.PubMed CentralPubMedView Article
- Ma W, Noble WS, Bailey TL. Motif-based analysis of large nucleotide data sets using MEME-ChIP. Nat Protoc. 2014;9:1428–50.PubMed CentralPubMedView Article
- Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, et al. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat Protoc. 2013;8:1765–86.PubMedView Article