Fig. 3 | Genome Biology

Fig. 3

From: Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome

Virtual ChIP-seq predicts chromatin factor binding with high accuracy. Using ChIP-seq and RNA-seq data, we learned from the association of gene expression and chromatin factor binding for 63 chromatin factors.

a Box plots show the distribution of auPR among 4 chromosomes (5, 10, 15, and 20) for 63 chromatin factors assessed in four cell types (blue: H1-hESC; orange: IMR-90; green: K562; brown: MCF-7). Dashed line: medians. Grey shapes: prevalence of bound bins in the chromosome, which is the auPR baseline. Axis label colors categorize median auPR (purple: greater than 0.5, red: between 0.25 and 0.5, black: below 0.25). Sequence logos indicate one of a TF’s JASPAR motifs, when available. When multiple motifs existed, we displayed the shortest motif here. Numbers 1–9: the nine chromatin factors that the DREAM Challenge also evaluated in its final round.

b We compared Virtual ChIP-seq’s performance to that of the four top-performing methods in the DREAM Challenge across-cell-type final round. For CTCF, MAX, GABPA, REST, and JUND, we had enough cell types to train and validate the performance of Virtual ChIP-seq on DREAM data. For these chromatin factors, we trained on chromosomes 5, 10, 15, and 20 in training cell types and validated performance on merged data of chromosomes 1, 8, and 21 in validation cell types. For other chromatin factors, we trained the model and validated our performance using publicly available Cistrome and ENCODE data. auPR values are only directly comparable for the same cell type and test set. The black vertical line in each panel separates test sets based on genome assembly and source. Axis label color: reference genome assembly (black: GRCh37, grey: GRCh38).

c We compared Virtual ChIP-seq’s performance to that of Catchitt, the co-winner of the ENCODE-DREAM Challenge, on GRCh38 datasets. The x-axis shows the cell type used to train the model. Multiple: data of multiple cell types concatenated for training. Average: the average of the posterior probabilities from models trained on each of the training cell types. We examined performance on three different validation cell types: H1-hESC (circle), K562 (triangle), and MCF-7 (rectangle). Turquoise: Virtual ChIP-seq. Orange: Catchitt. Black horizontal lines clarify the vertical position (performance) of each point.
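The caption leans on two quantities that are easy to misread: the prevalence of bound bins as the auPR baseline (panel a) and the "Average" setting that averages posterior probabilities across models (panel c). The following is a minimal sketch of both, not the authors' implementation. It assumes that auPR is estimated with the step-wise average-precision formula and that ties in scores can be ignored; the function names and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise estimate of the area under the precision-recall curve.

    labels: 1 for a bound bin, 0 for an unbound bin.
    scores: predicted posterior probability of binding per bin.
    Assumption: ties in scores are broken arbitrarily by the sort.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("auPR is undefined without at least one bound bin")
    # Rank bins from the highest to the lowest predicted probability.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at the rank of this bound bin
    return total / n_pos


def prevalence(labels: Sequence[int]) -> float:
    """Fraction of bound bins: the expected auPR of a random predictor."""
    return sum(labels) / len(labels)


def average_posteriors(per_model: Sequence[Sequence[float]]) -> List[float]:
    """Average per-bin posterior probabilities across several models."""
    return [sum(column) / len(column) for column in zip(*per_model)]


# Hypothetical toy data: 8 genomic bins, 2 of them bound.
labels = [1, 0, 1, 0, 0, 0, 0, 0]
model_a = [0.90, 0.80, 0.30, 0.20, 0.10, 0.10, 0.05, 0.05]
model_b = [0.60, 0.20, 0.70, 0.40, 0.10, 0.30, 0.05, 0.05]

combined = average_posteriors([model_a, model_b])
print("baseline (prevalence):", prevalence(labels))
print("auPR, averaged models:", average_precision(labels, combined))
```

Because the baseline equals the prevalence, an auPR is only meaningful relative to how rare bound bins are in the test set, which is why the caption notes that values are directly comparable only for the same cell type and test set.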