An interpretable bimodal neural network characterizes the sequence and preexisting chromatin predictors of induced transcription factor binding

Background Transcription factor (TF) binding specificity is determined via a complex interplay between the transcription factor’s DNA binding preference and cell type-specific chromatin environments. The chromatin features that correlate with transcription factor binding in a given cell type have been well characterized. For instance, the binding sites for a majority of transcription factors display concurrent chromatin accessibility. However, concurrent chromatin features reflect the binding activities of the transcription factor itself and thus provide limited insight into how genome-wide TF-DNA binding patterns became established in the first place. To understand the determinants of transcription factor binding specificity, we therefore need to examine how newly activated transcription factors interact with sequence and preexisting chromatin landscapes. Results Here, we investigate the sequence and preexisting chromatin predictors of TF-DNA binding by examining the genome-wide occupancy of transcription factors that have been induced in well-characterized chromatin environments. We develop Bichrom, a bimodal neural network that jointly models sequence and preexisting chromatin data to interpret the genome-wide binding patterns of induced transcription factors. We find that the preexisting chromatin landscape is a differential global predictor of TF-DNA binding; incorporating preexisting chromatin features improves our ability to explain the binding specificity of some transcription factors substantially, but not others. Furthermore, by analyzing site-level predictors, we show that transcription factor binding in previously inaccessible chromatin tends to correspond to the presence of more favorable cognate DNA sequences. Conclusions Bichrom thus provides a framework for modeling, interpreting, and visualizing the joint sequence and chromatin landscapes that determine TF-DNA binding dynamics.

-Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? 1. Please clarify where your data comes from. It seems to me that all your data were from mouse cell lines, so claiming that your model provides a framework for modeling chromatin landscapes *in vivo* is a stretch. Induction of iAscl1 and iNeurog2 ES cells on a plate is considered as in vitro experiments, induction of those cassettes in live mice and then harvesting the cells for ChIPseq analysis is in vivo. 2. Incorporating prior chromatin information does lead to an increase of specificity of induced Ascl1 binding predictions, however, the median auPRC for both sequence and preexisting chromatin combined is only at 0.59. Is the improvement in prediction significant? Is the sequence+preexisting chromatin model accurate enough to predict TF binding? -Are the conclusions adequately supported by the data shown? 3. Your network does indeed rely on motif multiplicity and motif flanks in predicting Ascl1 binding, however simulated sequences is not enough to validate the effects of motifs on TF binding. 4. Your conclusions for other TFs are unclear. More details on your analysis and results is needed for the last section.
-Are sufficient details provided to allow replication and comparison with related analyses that may have been performed? Yes.
-Does the work represent a significant advance over previously published studies? Yes.
-Is the paper of broad interest to others in the field, or of outstanding interest to a broad audience of biologists? Yes, temporal analysis of TF binding is very important in understanding TF's complex binding specificity. The authors examined the determinants for newly activated TF binding in mouse cell lines (in vitro).
We thank all the reviewers for their detailed comments, which have led us to improve and clarify our manuscript in several places. In what follows, the reviewer's comments are given in black text, interleaved with our responses in blue. We have extensively revised the entire manuscript in response to the reviewer comments, and to further improve aspects beyond the reviewer comments. Thus, we do not provide a "tracked changes" manuscript or show individual edits below.

Reviewer #1
In this paper, Srivastava et al. studied how pre-existing chromatin environment (e.g., accessibility, histone modifications, etc.) can influence transcription factor (TF) binding at sequence specific sites. Their motivation is to discover the role of cell-type specific chromatin environment in establishing cell-type specific binding of TFs. As the authors have discussed, there are prior works on modeling genome wide TF binding from sequence and cell-type specific "concurrent" chromatin environment. However, they argue that this approach cannot discern which aspects of chromatin environment can induce TF binding rather than being a "parallel measurement" of TF binding itself. Therefore, the authors chose to model datasets that have assessed both prior chromatin environment and TF occupancy upon overexpression. Using a bimodal (one mode for sequence, the other mode for chromatin environment) neural network, the authors find that sequence specificity and chromatin environment have compensatory roles in overall TF binding. Variation in both of these inputs leads to variation in binding. Their models also suggest TF binding is positively influenced by multiplicity of sequence motif occurrences. This is an important mechanistic question, but it is not possible to establish a causal link between chromatin environment and TF binding with only two datasets. In particular, one needs to perform a number of controlled knock-outs to elicit causal links. The authors should make this clear that their motivating question and the question being answered are different. There are several concerns about their models and modeling approach (see below). But the major concern is, these new models did not discover any novel mechanism/hypothesis. Our other comments and concerns are as follows.
We thank the reviewer for their feedback, and we agree with their points here. We have extensively rewritten our manuscript to emphasize that our work provides a computational framework that can be used to generate hypotheses about the relationships between TF binding and prior chromatin landscapes. We have also removed language suggesting that the links found by our approach are necessarily causal. We now make it clear that our computational approach -Bichrom -aims to find sequence and prior chromatin predictors that explain observed TF binding sites. Such predictors can suggest hypotheses that would have to be tested before causal links are established.
While our revised manuscript now more clearly focuses on the goal of interpreting observed TF binding in the context of sequence and prior chromatin predictors, we also wanted to demonstrate that the hypotheses suggested by Bichrom can indeed capture causal links between the preexisting chromatin accessibility environment and induced TF binding. The reviewer rightly suggests that causality sometimes can be established with knock-out experiments, but this is not always simple for the types of temporal associations that Bichrom suggests. For example, it is difficult to think of a knock-out experiment that would confirm a causal link between general chromatin accessibility and future TF binding. However, the same hypothesis can be tested by using Bichrom to characterize the predictors of induced TF binding in one chromatin state and testing whether the conclusions hold up in a different chromatin environment.
In newly added work, we analyzed ChIP-seq data from 3 transcription factors -Brn2, Ebf2, and Onecut2 -that become expressed in two distinct chromatin environments. Specifically, we use Bichrom to predict the relative degree to which each TF's binding sites are predicted by prior chromatin accessibility. Importantly, we base this Bichrom analysis on data from only one of the two chromatin environments in which we characterized Brn2, Enf2, and Onecut2 binding. We find that prior accessibility adds a significant improvement in predictive capacity for 2 of the 3 TFs compared with using sequence information alone, but no significant improvement is observed for Onecut2. In our interpretation, Bichrom's results suggest that Onecut2 may be less sensitive to the prior chromatin environment than Brn2 or Ebf2. To test this hypothesis, we examined the 3 TFs' binding in the alternate chromatin environment. Supporting our interpretation, Onecut2 binding changes much less than the others across the alternate prior chromatin contexts, suggesting that it is less sensitive to the prior chromatin landscape ( Fig. 6A-B, Supp. Fig. 6 and Supp. Fig.  7). Furthermore, individual Brn2, Ebf2, and Onecut2 sites that Bichrom predicts as "chromatinpredicted" or "sequence-predicted" show higher or lower (respectively) variability in binding in the alternate chromatin environment ( Fig. 6C-F).
These results are more fully described in the new Results section titled "Bichrom predicts the relative dependence of neuronal TF binding sites on preexisting chromatin", and add an important validation of our approach and interpretations.
1. TFs that are already bound to the DNA can also impact binding of a TF when it is overexpressed. How do the authors deal with this fact?
Because chromatin states largely depend on TFs, we agree that TFs which are already bound in a cell type influence the future binding of TFs that become expressed in that cell type. Our method, Bichrom, can be used to explore such associations in several ways. If the user has ChIP-seq data available for a particular TF that they believe may impact future TF binding, that ChIP-seq data can be incorporated into Bichrom alongside the other prior chromatin predictors. Then Bichrom's results can explicitly address the degree to which an induced TF's binding is predicted by the preexisting TF. On the other hand, if TF ChIP-seq data is not available, the binding activities of preexisting TFs are encoded within other chromatin feature tracks (e.g. accessibility or enhancerassociated histone modifications). If we find that those features predict future TF binding, we naturally must be cautious with our interpretations. For example, an induced TF's binding may not be causally dependent on prior accessibility, but rather on interactions with a preexisting TF that is itself causing the accessibility.
Our revised manuscript more clearly acknowledges such issues involving interpretation and possible interactions with pre-bound TFs. Further, in the revised manuscript, we have included additional analysis of latent network embeddings to identify secondary TF motifs that may relate to pre-bound TFs at both pre-inaccessible and pre-accessible chromatin. We demonstrate that we are able to find instances where secondary motif enrichment predicts future TF binding at a subset of prior-chromatin-predicted TFs binding sites for Ascl1, Ebf2, and Brn2 (Fig. 4A, Fig. 6E, F).
2. Except for the choice of appropriate datasets, it is not clear whether and how the bimodal deep neural network is a novel contribution. It is arguable that the previous models (for modeling TF binding from sequence and "concurrent" chromatin environment) should provide similar conclusions if applied on these new datasets.
The Reviewers' comment made us realize that we failed to communicate the novelty of the bimodal neural network framework. We have rewritten the manuscript to highlight the novelty of our framework, and we included additional data to demonstrate its validity. We discuss this in brief here.
The majority of previously described methods that integrate sequence and concurrent chromatin data to predict genome-wide TF binding use an early-integration framework, i.e. sequence and chromatin data is integrated at the feature level, and is jointly modeled by a unified framework 2-6 . It is not clear how such methods can be applied to deconvolve the sequence and preexisting chromatin accessibility predictors of TF binding the level of individual TF binding sites. Notably, Arvey, et al. have previously described a late-integration framework within which separate SVMs are used to model TF binding as a function of DNA sequence and concurrent chromatin accessibility data 7 . However, in Arvey, et al., the weights of each input modality (sequence vs. concurrent chromatin accessibility) are not derived using the input training data, but rather a normalized sum of the sequence and chromatin SVM outputs is used to predict TF binding.
The novelty of our bimodal neural network is its design, which embeds TF binding sites in a latent 2-dimensional space (Fig. 1A, Fig. 3A, Supp. Fig. 9A-C). The coordinates of each TF binding site in this space represent the degree to which sequence features and prior chromatin features predict TF binding at individual sites. To our knowledge, Bichrom is the first method that quantifies the contributions of multiple modalities to TF binding prediction at individual sites across the genome. The types of applications demonstrated in our manuscript (e.g., examining compensatory sequence and prior chromatin predictors at individual sites) are not trivially enabled by applying previous TF binding prediction approaches to temporally separated data.
The novelty of our approach is now more clearly described in the revised manuscript.
3A. It is also not clear why the authors needed a deep neural network. How was this architecture selected?
We use a neural network because of the flexibility it provides while modeling multimodal data; it is more straightforward to integrate sequence and chromatin predictors in a unified framework with CNN-based neural networks than it would be with alternate machine learning approaches. We would like to emphasize that the innovation in our framework is not the use of neural networks, but rather the neural network design within which distinct sub-networks transform higherdimensional sequence and chromatin data into a directly interpretable lower-dimensional representation. We have tried to make this motivation clearer in our revised manuscript.
The architecture of the bimodal network (Bichrom) was designed such that both the sequence (BichromSEQ) and chromatin sub-networks (BichromCHR) output single real-valued activations, which are then weighted by a sigmoid-activated logistic node. The architectures and hyperparameters of the individual sub-networks (BichromSEQ and BichromCHR) were selected using a limited hyper-parameter random grid search. We used a random grid-search over the number of (1) number of dense layers, (2) size of the dense layers, (3) number of convolutional filters, (4) activation functions and the (5) dropout rate. We used the size of the convolutional kernels as previously defined by DeepBind (kernel size=24). This is now more clearly described in our methods.
3B. What type of sequence patterns did the 240 convolutions discover? Do they discover motifs of any potential co-factor?
The Reviewer raises an excellent point about interpreting the sequence features learned by our approach. While convolutional kernels in 1-layer convolutional neural networks can learn directly interpretable TF sequence motifs, it is much less clear whether convolutional filters are similarly interpretable in deeper convolutional neural networks (CNNs) or CNN-LSTM architectures. In fact, convolutional kernels often learn overlapping, non-independent patterns. Therefore, analyzing patterns learned within individual convolutional kernels may not be a meaningful representation of the sequence features identified by the network.
In the revised manuscript, we use gradient-based techniques to identify local regions in the input sequences that are important for the network's output predictions. We cluster these "important" regions based on their underlying k-mer frequencies, and perform motif discovery at these individual clusters (Fig. 4A, 6D-F). Our results demonstrate that in addition to motif multiplicity, secondary motif enrichment and flanking nucleotide composition can further influence the network output predictions.

3C. Also, we do not find any indication of what they achieved by including the LSTM layers.
Typically, these recurrent layers should discover some "grammar" in the sequence space, but the authors only found some role of motif multiplicity.
The Reviewer is correct that the typical motivation behind the use of recurrent neural networks (e.g., LSTMs) is to introduce the ability to model spatial dependencies between convolutional filter activations. In the TF binding setting, it is tempting to assume that a CNN-LSTM architecture should learn motif "grammars" within enhancer regions, for example. Using our gradient-based feature importance approach, we did not find that Bichrom learned any spatial dependencies between motifs for the examined TFs. This could either be because the learning or interpretation schemes fail to detect such dependencies, or because such dependencies don't biologically exist for the examined TFs -we honestly can't tell at this point.
So, what is the point in continuing to use a LSTM layer? We believe that they are useful in introducing spatial dependencies within TF binding motifs. If individual convolutional filters learn parts of a motif (or informative motif-flanking features), LSTMs are useful in stitching them together at an individual binding site. It's worth noting that the same effect can likely be achieved using multiple stacks of convolutional layers, but using an LSTM allows us to use a shallower neural network architecture overall.
In the revised manuscript, we demonstrate that our chosen CNN-LSTM architecture performs better than, or on par with, a wide range of CNN-only architectures, including some that have much deeper numbers of layers. Specifically, we compare Bichrom's CNN-LSTM with CNN-based networks with combinations of selected values for the following parameters: (1) number of convolution layers, (2) convolutional kernel size, (3) number of convolutional filters, (4) max pooling size, (5) max pooling stride, (6) number of dense layers, (7) number of dense nodes and (8) dropout rate (Supp. Fig. 1).
4. The parameters in the final dense layer of the sequence sub-network were not kept fixed when training the combined model. How can the authors rule out the possibility that the compensatory contributions from sequence and chromatin environment is not an artifact of model reparameterization?
To address this concern, we trained the bimodal network (Bichrom) while keeping the final dense layer in BichromSEQ fixed, demonstrating that network re-parameterization is not responsible for the observed pattern in which compensatory contributions from sequence and the prior chromatin environment predict TF binding (Supp. Fig. 5).
Minor comments: 1. Pg. 8: "Orthogonal strategy to confirm that … binding sites" -> should be "binding peaks"? This sentence has been removed in the revised manuscript.
2. Please give a more detail description of the feature attribution method.
We have added a more detailed description of the feature attribution method to the revised manuscript.
3. Regarding the posterior distribution of recall: this estimate may still be biased depending on which chromosome was held-out and the distribution of peaks across the chromosomes. Also, the authors should note that, for beta with a = 1, b = 1, it is simply a uniform distribution.
We agree that the recall estimates may be biased based on the chromosome being tested. In our revised manuscript, we have repeated training over 9 held-out test chromosomes for each TF. In addition to recall values, we report the auPRC for each training iteration. Further, to test whether the gain in auPRCs upon the addition of preexisting chromatin are significant, we compare distributions of auPRCs from the sequence-only network and Bichrom using the Wilcoxon Signed Rank test. Finally, we have added in a note in the manuscript that the beta(1,1) prior is equivalent to a uniform prior.

Reviewer #2
We thank the Reviewer for their insightful comments throughout.
Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included?
1. Please clarify where your data comes from. It seems to me that all your data were from mouse cell lines, so claiming that your model provides a framework for modeling chromatin landscapes *in vivo* is a stretch. Induction of iAscl1 and iNeurog2 ES cells on a plate is considered as in vitro experiments, induction of those cassettes in live mice and then harvesting the cells for ChIPseq analysis is in vivo.
The Reviewer's comment reflects the varying uses of in vitro versus in vivo terminology. Some of the TF-DNA binding literature uses "in vitro" to refer to experiments carried out on naked DNA templates, and "in vivo" to describe experiments performed within a cellular chromatin context (e.g. via ChIP-seq). We apologize for the confusion. As the reviewer pointed out, all of the ChIP-seq data we examined comes from cultured ES cells or fibroblast cell lines, and these are of course "in vitro" systems. To avoid confusion, we removed the "in vivo" terminology and more clearly describe the ES-based cellular systems used through much of the manuscript.
2. Incorporating prior chromatin information does lead to an increase of specificity of induced Ascl1 binding predictions, however, the median auPRC for both sequence and preexisting chromatin combined is only at 0.59. Is the improvement in prediction significant? Is the sequence+preexisting chromatin model accurate enough to predict TF binding?
We first address whether our bimodal architecture can accurately predict genome-wide TF binding (or put another way, is 0.59 a reasonable auPRC compared with other state-of-the-art methods?). Briefly, to evaluate the ability of our bimodal network (Bichrom) to accurately predict genomewide TF binding, we turned to the ENCODE-DREAM TF binding challenge, which evaluated the ability of a range of computational methods to predict TF binding within and across cell types. The ENCODE DREAM challenge was set up to predict TF binding as a function of sequence, concurrent chromatin accessibility (DNase-seq) and concurrent gene expression (RNA-seq) data. We used the same training regime as required by the challenge to ensure our results were comparable to other models (e.g., predictions were made in overlapping 200 bp windows, with a 50 bp stride). Importantly, we note that the goal of the ENCODE-DREAM challenge (predicting TF binding using concurrent accessibility and expression) is not the same as our motivation -we are primarily interested in building a framework that can deconvolve sequence and prior chromatin predictors of TF binding. That said, Bichrom performs at par with the current state-of-the-art models on the ENCODE-DREAM challenge datasets, albeit with slightly lower predictive accuracies (Supp. Fig. 2). In fact, for CTCF bound in induced pluripotent stem cells and PC-3 cells, the bimodal network outperforms the overall top-performing method, Anchor 6 . The auPRCs for predicting TF binding using concurrent chromatin data as a feature range from 0.30-0.79 for 12 TFs tested (Supp. Fig. 2). Thus, we are confident that the auPRC of 0.59 observed for predicting induced Ascl1 binding is a reasonable performance level for current approaches. There is still room for improvement in this task, of course.
Next, we address whether the gain in auPRC for Ascl1 is significant upon addition of prior mEB chromatin features. In our revised manuscript, we train both the bimodal network (Bichrom) and a sequence-only network using 9 independent training sets, each corresponding to a distinct held-out chromosome as a test set. We use the Wilcoxon signed rank test to compare whether the means of the paired samples (sequence vs. bimodal sequence-chromatin model auPRCs for each training set) are different. The p-value for Ascl1 is 0.003, suggesting that the gain in auPRC is statistically significant (Fig. 2A). We also perform significance testing for all other TFs analyzed (Fig. 7, Supp.  Fig. 7).
Finally, we now demonstrate that gain in network predictive capacity (i.e., gain in auPRC) upon incorporation of preexisting chromatin can be interpreted as reflecting the degree to which preexisting chromatin predicts induced TF binding. As described in the first response to Reviewer #1, we apply Bichrom analysis to interpret the binding sites of Brn2, Ebf2, and Onecut2 that become expressed in two distinct chromatin environments 1 . We show that Bichrom's estimates of the degree to which preexisting chromatin accessibility predicts these TF's binding are correlated with the overall degree of differential binding observed across the two distinct chromatin environments (Fig. 6A, B and Supp. Fig. 7), suggesting that the gain in auPRCs reflects the relative degree to which a TF depends on the preexisting chromatin.
Are the conclusions adequately supported by the data shown?
3. Your network does indeed rely on motif multiplicity and motif flanks in predicting Ascl1 binding, however simulated sequences is not enough to validate the effects of motifs on TF binding.
We agree with the Reviewer that simulated data isn't sufficient to validate the effects of sequence motif composition on TF binding. The relevant part of the results (Fig. 4D) was one of several analyses we performed in order to tease apart the various features learned by the BichromSEQ subnetwork. Our revised manuscript more clearly describes all such features as "predictors" of induced binding. While such predictors suggest causal hypotheses that may be examined using experimental approaches, such experiments are beyond the scope of the current work.
4. Your conclusions for other TFs are unclear. More details on your analysis and results is needed for the last section.
We have performed a more detailed analysis of the features learned by Bichrom when predicting the binding sites of TFs induced in mouse NIH-3T3 fibroblasts. We also include results from additional TFs induced in human fibroblasts (Fig. 7, Supp. Fig. 10, 11, 12).

Are sufficient details provided to allow replication and comparison with related analyses that may have been performed?
Yes.

Does the work represent a significant advance over previously published studies?
Yes.

Is the paper of broad interest to others in the field, or of outstanding interest to a broad audience of biologists?
Yes, temporal analysis of TF binding is very important in understanding TF's complex binding specificity. The authors examined the determinants for newly activated TF binding in mouse cell lines (in vitro).

Reviewer 1
Srivastava et al. revised their previous manuscript on a bimodal neural network that assesses the role of sequence and preexisting chromatin landscapes in the occupancy of newly activated TFs. The authors have addressed our previous comments to avoid causal statements. This has really clarified the exact scientific question Bichrom is designed to answer.
Major comments on the revised version: 1. However, the authors' discussion on our other concern about whether this model a novel contribution is not satisfactory. It's still not clear how the other studies provide less explanation. Previous studies did not look at temporally preceding data of histone modifications, but if given the same data, how would those models be any weaker?
The authors can check PMIDs 26291518 and 26818008. For example, PMID 26291518 noted: "In addition to sequence motif (SM), chromatin state (CS) and DNA structure (DS) features have been shown to be informative for TFBS identification … the relative contributions of SM, CS, and DS features, both singly and in combination, towards TF binding prediction remains unclear … To address these questions, we first examined the relationships between TF binding regions determined in genome-wide ChIP analysis of 40 TFs and 23 features including two SM, 11 CS, and 10 DS features … TFs differ greatly in which features are significantly different between bound and unbound regions. Taking RAP1 as an example, values for most of the CS features are significantly different between bound and unbound regions, yet none of the DS features is significant … In contrast, for ZAP1, many DS but few CS-related features have significant test statistics ... However, some TFs, such as MSN2, REI1, SPT23, and SWI5, show significant differences both in CS and DS features". The other paper, PMID 26818008, also noted similar conclusions.
Overall, the authors need to do a better comparison against other highly similar works and show how the new tool is more powerful or completely novel than the published works.
The authors have mentioned Arvey et al.'s work. They are right that Arvey et al. have done things differently, it doesn't mean that the conclusions would be different.
2. In the new presentation, the authors have focused on the role of existing histone environment in a newly expressed TF. They're using chromatin marks at a time T0 in a model to predict ChIP data of a TF at a later time T1. It does not, however, show the role (if any) of chromatin marks at time T1 in the occupancy of the TF at T1. For a pioneer TF like Ascl1, this is an important part of Ascl1's binding mechanism.
3. The authors noted "Thus, our approach may offer a metric for quantifying the "pioneering" activity of a