- METHOD
- Open Access

# The Local Edge Machine: inference of dynamic models of gene regulation

- Kevin A. McGoff
^{1}Email author, - Xin Guo
^{2}, - Anastasia Deckard
^{3}, - Christina M. Kelliher
^{4}, - Adam R. Leman
^{4}, - Lauren J. Francey
^{5}, - John B. Hogenesch
^{5}, - Steven B. Haase
^{4}and - John L. Harer
^{3}

**Received:**13 June 2016**Accepted:**3 October 2016**Published:**19 October 2016

## Abstract

We present a novel approach, the Local Edge Machine, for the inference of regulatory interactions directly from time-series gene expression data. We demonstrate its performance, robustness, and scalability on in silico datasets with varying behaviors, sizes, and degrees of complexity. Moreover, we demonstrate its ability to incorporate biological prior information and make informative predictions on a well-characterized in vivo system using data from budding yeast that have been synchronized in the cell cycle. Finally, we use an atlas of transcription data in a mammalian circadian system to illustrate how the method can be used for discovery in the context of large complex networks.

## Keywords

- Gene regulatory networks
- Inference
- Time series

## Background

Temporally dynamic gene expression programs have been observed in a wide variety of organisms. In some instances, it is believed that the observed temporal dynamics are an emergent property of underlying transcription networks, which consist of interacting collections of transcription factors (TFs) [1–3]. Although it is difficult to assay such transcription networks directly, high-throughput technologies allow the measurement of transcription levels in time-course experiments [4, 5]. However, using such time-course transcriptome data to infer the structure of transcription networks is considered a major problem in computational biology [6, 7]. To date, many inference methods have been proposed for reconstructing gene regulatory networks [8, 9], but successful network inference directly from time-series datasets has remained elusive [8]. In fact, practicing systems biologists continue to rely on the manual curation of network models [2, 3, 10–12]. Indeed, the network inference problem persists in systems biology, despite an abundance of regulatory evidence in the form of TF binding experiments, genetic screens for candidate nodes, and mutant expression profiling experiments.

We are particularly interested in the *functional* components of networks (rather than the most expansive or inclusive network), where the function of the network is manifested by the *dynamics* of the network. By functional network, we mean a network such that an experimental perturbation will likely alter the dynamical phenotype of the network. One of the best examples of a large functional network is the mammalian circadian oscillator, for which the current core network contains about 30 nodes.

Previous methods for network inference from dynamics data may be broadly classified according to the tools involved. Many methods rely on linear statistical models called vector auto-regressive models, including methods based on Granger causality [13–15]. Other popular approaches employ sparse linear regression and related techniques [16–18], calculations of mutual information [19], or dynamic Bayesian networks [20–23]. Most recently, several studies have developed inference methods based on nonlinear ordinary differential equations (ODEs) for the chemical kinetics and a Bayesian formalism on the network structure [24–27]. This article fits into the latter class and extends some of those ideas as follows.

The computational task of inferring network connections from steady-state TF perturbation experiments (gene knockouts or overexpression) has been attempted [8]; however, it is difficult to infer causality, directionality, and the function of network edges from single-point measurements. LEM attempts to overcome these challenges by basing edge predictions on dynamics data. Importantly, though, the abundance of data from perturbation experiments and other regulatory evidence from a given model organism could be used throughout the process of LEM network inference in several ways. Indeed, it may be used to inform the selection of nodes chosen to run through LEM, to inform the structure of the prior information used by the algorithm, and to evaluate the output of the algorithm. In particular, the LEM framework allows for the incorporation of a wide variety of evidence in the form of prior information, including genetic evidence (e.g., gene expression changes in TF targets upon TF knockout or overexpression), physical interaction evidence (e.g., high-throughput genomics experiments, such as ChIP techniques, and database compilation, such as ENCODE [29]), and direct regulation evidence (e.g., the fast-on technique to identify direct TF targets [30]). In our yeast cell-cycle analysis, we include TF function (activator, repressor, or unknown; see Additional files 1 and 2) as prior information to improve LEM inference further. Additionally, we use the available regulatory evidence from various TF binding and genetics experiments (see Additional file 3) to evaluate LEM predictions. We view the development and testing of specific prior distributions based on current regulation evidence as an important direction for future work.

## Methods

### Description of the method

Given a set of genes deemed to be potentially important for network function, LEM takes a Bayesian approach to answer the following question: of all possible regulators, which regulator and regulatory logic (activation or repression) best models the expression dynamics of each gene? Here, we provide a brief description of how the LEM algorithm models the gene expression of each node and scores each possible regulation in the network. For a complete description of the mathematical and computational details, see Additional file 1: Sections 1–4.

Consider a gene regulatory network with a set of *N* nodes, \(\mathcal {N} = \{X_{1}, \dots, X_{N}\}\). For *i*=1,…,*N*, we let *X*
_{
i
}(*t*) denote the expression level of gene *X*
_{
i
} at time *t*. The data, denoted by *D*, consist of the observed expression levels of the *N* nodes at *T* time points, \(\{t_{j}\}_{j=1}^{T}\).

*X*

_{ i }, our model is that

*X*

_{ i }satisfies

where **X**(*t*)=(*X*
_{1}(*t*),…,*X*
_{
N
}(*t*)), the function \(f_{i} : \mathbb {R}^{N} \to \mathbb {R}\) governs the type of regulation that *X*
_{
i
} experiences, *α*
_{
i
}>0 represents the strength of the regulation, *β*
_{
i
}≥0 represents the rate of degradation of *X*
_{
i
}, and *γ*
_{
i
}≥0 represents the basal rate of production of *X*
_{
i
}. In general, stochastic effects play a significant role in the dynamics of any individual cell, and such considerations lead one to stochastic differential equations. However, we consider data generated by averaging expression levels over many (∼10^{8}) individual cells, and we, therefore, assume that the stochastic effects are insignificant, leading to our use of ODEs.

*f*

_{ i }of the following forms, which correspond to regulation by a single gene:

More complex regulatory functions *f*
_{
i
} could be allowed in the model class if the goal is to infer simultaneous regulation by multiple genes. However, we choose to restrict attention to single regulation, since the information content of time-series datasets at present appears not to support the substantial increase in complexity of the model class that would result from inclusion of combinatorial regulation.

Thus, to specify a system of ODEs completely, as in Eq. 1, for each node *X*
_{
i
}, one must select a regulator *X*
_{
j
}, a type of regulation (activation or repression), and a vector of real-valued parameters (*α*
_{
i
},*n*
_{
i
},*K*
_{
i
},*β*
_{
i
},*γ*
_{
i
}). We refer to triples of the form (*X*
_{
i
},*X*
_{
j
},*a*) or (*X*
_{
i
},*X*
_{
j
},*r*) as edges, where we interpret (*X*
_{
i
},*X*
_{
j
},*a*) as the relationship that *X*
_{
i
} is activated by *X*
_{
j
} and (*X*
_{
i
},*X*
_{
j
},*r*) denotes that *X*
_{
i
} is repressed by *X*
_{
j
}. Note that these edges are both signed and directed.

The LEM inference method first involves making a local approximation, which allows us to infer the regulation of each node separately, rather than all at once (see Additional file 1: Section 2). To infer the regulation of the target *X* (here we drop the subscript *i* from the above notation without introducing ambiguity), LEM takes a Bayesian approach that relies on the Gibbs posterior principle [33, 34] and a Laplacian approximation in the computation of the posterior distribution.

*M*is a model (among several) and

*D*is a dataset, then Bayes’s rule yields a posterior probability of

*M*given the data

*D*:

*p*(

*D*|

*M*) is the likelihood of the data

*D*given the model

*M*,

*π*is a probability distribution on the possible models, called the prior distribution, and

*p*(

*D*) is the likelihood of

*D*(averaged over all the possible models). If one interprets the prior distribution as our belief in the veracity of each model prior to generation of the data, then the posterior distribution represents the optimal way to update our beliefs in light of the data. If

*M*requires an additional choice of parameter

*θ*to be a fully generative model, the posterior distribution may be written as an integral over

*θ*:

For LEM, we formulate the edge inference problem in a similar manner. Let *X* be a fixed node and *E* an edge with *X* as the target [i.e., *E*=(*X*,*Y*,*a*) or *E*=(*X*,*Y*,*r*) for some node *Y*]. We would like to view *E* as a model for explaining the behavior of *X* and employ the Bayesian framework above to compute its posterior probability. To do so, we need to specify a prior distribution on the set of possible models, which in our case is the set of possible edges with *X* as the target, and we need a likelihood function. Recall that in our model, each edge requires an additional choice of parameter vector *θ*=(*α*,*β*,*γ*,*n*,*K*) (as in Eqs. 1 and 2) in order to specify fully the corresponding differential equation.

*π*(

*E*) be the uniform distribution over the possible edges that have

*X*as a target. For each edge

*E*with

*X*as the target, we select a priori bounds on each of the parameters in

*θ*

_{ E }, resulting in a region

*R*

_{ E }(contained in \(\mathbb {R}^{5}\)) of biologically reasonable parameter values (see Additional file 1: Section 3). Once these bounds are selected, we choose the maximum entropy prior distribution subject to these bounds, which is the least informative prior on

*R*

_{ E }and ensures that we do not unnecessarily bias the result. This distribution is

*s*is the number of edges with

*X*as target and Vol(

*R*

_{ E }) is the volume of

*R*

_{ E }.

*p*(

*D*|

*M*,

*θ*) by

*ℓ*(

*D*,

*E*,

*θ*) is an appropriately chosen loss function. We specify a loss function

*ℓ*(

*D*,

*E*,

*θ*) as follows. For a triple (

*D*,

*E*,

*θ*), define the function \(F : [t_{1},t_{T}] \to \mathbb {R}\) on the points \(\{t_{j}\}_{j=1}^{T}\) by

*F*to the whole interval [

*t*

_{1},

*t*

_{ T }] by linearly interpolating between these values. That is, if

*t*=

*ut*

_{ j }+(1−

*u*)

*t*

_{ j+1}for some

*j*<

*T*and

*u*∈(0,1), then let

*F*(

*t*)=

*uF*(

*t*

_{ j })+(1−

*u*)

*F*(

*t*

_{ j+1}). Now set

*ℓ*(

*D*,

*E*,

*θ*) to be the mean squared error between the observed values \(\{X(t_{j})\}_{j=1}^{T}\) and the properly shifted model prediction \(\{\hat {X}(t_{j})\}_{j=1}^{T}\):

This choice of loss function is effectively equivalent to the choice of a Gaussian noise model.

*E*given the data is

As is common in many Bayesian methods, the above integral does not have a closed-form solution. We choose to estimate it using a Laplace approximation [35] (see Additional file 1: Section 3). From this approximation, one can see that LEM explicitly favors networks whose dynamics are more robust to a perturbation in the parameter space. In principle, one could attempt to compute other approximations of this integral, including Monte Carlo approximations. However, we have found that the Laplace approximation is computationally fast and produces sufficiently accurate results for our purposes.

Thus, the core output of LEM is *N* different probability distributions—one for each node in the network (see Additional file 1: Section 3.1). The distribution for node *X* should be interpreted as representing our beliefs about which edge is the dominant regulatory interaction (edge) controlling the expression of *X*. There are multiple ways to obtain a single network from this set of distributions, the simplest of which is to select the most likely edge from each distribution.

## Results

### Validation and testing: in silico and yeast cell-cycle data

LEM outperforms existing network inference algorithms on both in silico and biological data

Network |
| LEM (AUC) | Inferelator (AUC) | Granger Causality (AUC) | Hill-DBN (AUC) | Jump3 (AUC) |

In silico 1 | 3 | 1.0000 | 0.9000 | 0.7000 | 0.5000 | 0.9000 |

In silico 2 | 3 | 1.0000 | 0.5667 | 0.8111 | 0.3667 | 0.7222 |

In silico 3 | 5 | 0.9900 | 0.7857 | 0.7791 | 0.4003 | 0.6794 |

In silico 4 | 10 | 0.8884 | 0.5541 | 0.5949 | 0.5131 | 0.7727 |

In silico 5 | 20 | 0.8781 | 0.6789 | 0.7441 | 0.6770 | 0.7540 |

Yeast cell-cycle 1 | 17 | 0.8693 | 0.6705 | 0.6893 | 0.6253 | 0.6481 |

Network |
| LEM (MCC) | TD-ARACNE (MCC) | Banjo DBN (MCC) | ||

In silico 1 | 3 | 1.0000 | 0.0000 | −0.5000 | ||

In silico 2 | 3 | 1.0000 | 0.0000 | −0.5000 | ||

In silico 3 | 5 | 0.7379 | 0.4528 | −0.0624 | ||

In silico 4 | 10 | 0.7463 | 0.0636 | 0.0294 | ||

In silico 5 | 20 | 0.5908 | 0.2147 | 0.0086 | ||

Yeast cell-cycle 1 | 17 | 0.0478 | 0.0292 | −0.0380 |

To compare these algorithms, we first used several of our benchmark datasets of oscillatory dynamics from in silico networks. Then we examined the performance of the algorithms on transcriptome data generated from time-series experiments on synchronized yeast cells [2] (see Additional file 1: Section 10 for a description of the data and Additional file 1: Section 11 for a description of the curation of a yeast cell-cycle network). As LEM makes more detailed predictions than these algorithms, we weakened its predictive power to make these comparisons. Nonetheless, as shown in Table 1, LEM outperforms these algorithms on both in silico and in vivo networks.

To demonstrate the performance of LEM on biological data, we begin with time-series data collected in the study of the transcriptional oscillator underlying the yeast cell cycle [2]. Based on these data, as well as on previously available data, a tentative network model was previously manually curated [2]. We created a network consisting of the previously published network [2] and some other known targets (see Additional file 7 for the list of genes, Additional file 1: Section 11 for a discussion, and Additional files 3 and 8 for almost 100 citations supporting this network). Taking this network as the gold standard, we found that the LEM predictions obtained an AUC-ROC score of 0.8693, indicating that the LEM predictions were highly informative with respect to the manually curated network. Indeed, as can be seen in Additional file 9, the gold standard edges are ranked highly by LEM, even without the inclusion of any prior information. Recognizing that all model networks represent an approximation of the underlying reality and are subject to revision, we also constructed both smaller (more restrictive) and larger (more inclusive) networks from the available data (see Additional files 10, 11, and 12 for the networks, Additional file 1: Section 11 for a discussion, and Additional files 3 and 8 for citations). We then compared the output of LEM to these networks, as seen in Additional file 5.

Since LEM takes a Bayesian approach, it also easily incorporates prior information. For the testing on the yeast cell-cycle data, we used a simple form of prior information: each node should appear exclusively as an activator or as a repressor. For example, if TF *Y* is known to be an activator (repressor) of its target genes, then inclusion of this function as prior information would exclude any regulation of the form *X* is repressed (activated) by *Y*. Note that this type of restriction appears only according to the user-defined prior information, and by default a TF is allowed to appear as both a repressor and an activator (which appears to be the case in many systems, especially in mammals [38–41]). To get an idea about the performance of LEM with this type of prior information, we first simulated this type of prior information with our benchmark in silico datasets (see Additional file 1: Section 6 for details and Additional file 13 for results).

### Validation and testing: mammalian circadian data

Next, we used LEM to discover new nodes in a complex incomplete biological network. The mammalian circadian clock is a transcriptional network that regulates gene expression in tune with the 24-hour light/dark cycle. While genetics and biochemistry have identified many core components/nodes of the circadian network, its full complement of nodes and its topology remain uncharacterized. Recently, Zhang and colleagues built an atlas of circadian transcription from 12 mouse organs and found that at least 43 % of the protein-coding genes are under clock control [44]. In this and other data, known clock genes tend to have the highest amplitude and most statistically significant rhythms [45]. Reasoning that new clock genes are likely to have similar dynamics, we used LEM to search these data for new clock genes.

First, we assembled a list of 31 high-confidence core circadian clock genes (see Additional file 16). Next, we used a suite of periodicity detection algorithms (see Additional file 1: Section 12) to find clock-regulated genes in each of the 12 mouse organs. Reasoning that novel circadian regulators are likely to regulate the known circadian core components, we ran LEM to estimate the probability that each candidate regulates a known clock gene. By summing these probabilities across all known circadian core components in all 12 organs, we calculated a score reflecting how likely each periodic gene is to regulate known core components. We selected a threshold of 0.1 for significance of this score, identifying 333 potential regulators. Notably, ten known clock genes were in this list.

Clock genes regulate each other. To winnow down this list, we used the liver data and found 205 candidates that were regulated by known clock components (see Additional file 17). We focused on the liver, as it is the organ with the strongest regulated circadian rhythms and best companion datasets (e.g., ChIP-seq data on known clock components). Known clock genes are TFs, kinases, and ubiquitin ligases. Reasoning that new components are likely to be in these classes as well, we filtered the list of 205 genes down to 34 genes in these or other plausible classes. Based on practical considerations (time and cost), we chose to conduct functional studies on the ten highest ranked genes from this list of 34. We consider the other 24 genes to be good candidates for future experimental work.

*Rnf152*and

*Ppp1r3c*were not expressed in NIH 3T3 cells, so we substituted

*Nup62*(a nuclear pore complex factor) and

*Fus*(a DNA/RNA binding protein). Five out of these ten knockdowns led to significant changes in circadian period or loss of circadian rhythms (see Additional file 18 for results), and four of these five remain significant after multiple hypothesis correction (see Additional file 19), indicating their requirement for normal circadian rhythms in 3T3 cells. An earlier whole-genome screen in a cell model found knockdown of ∼2.5 % of genes had circadian rhythm phenotypes including shortened or lengthened period and arrhythmicity [46], suggesting LEM provided a significant enrichment over background (Fisher’s exact test gives a

*p*value of 1.089×10

^{−5}after multiple hypothesis correction). Interestingly,

*Ankrd23*, a poorly characterized globular protein, was found to be critical for circadian rhythmicity, as

*Ankrd23*knockdowns were arrhythmic (see Fig. 5), the most profound phenotype observed. Taken together, these results show how LEM can be combined with biological information and functional validation to find causal nodes in complex biological networks.

### Further validation and testing: noise, partial information, and computational resources

In addition to the studies described above, we used in silico networks to test the performance of LEM with respect to changes in several other qualities of the data: noise [47], incomplete prior information, and non-periodic systems. To test for robustness against noise, we added truncated Gaussian noise to the data and computed the corresponding AUC-ROC score for LEM. Our tests covered a range of noise scales, where the largest noise scale was chosen so that the variance of the noise was 32 % of the variance of the signal. For a full description of our noise testing and precise results, see Additional file 1: Section 6 and Additional file 19. Based on these experiments, it appears that noise of this type does not greatly reduce the performance of LEM.

We also examined how the performance of LEM changed with the inclusion of prior information. To mimic the type of prior information that we used in the yeast cell-cycle analysis, each node in our in silico networks was assigned an identity, either repressor, activator, or both. We incorporated this information in our prior distribution as follows: if *Y* is an activator, then all edges in which *Y* appears as a repressor have prior probability 0; if *Y* is a repressor, then all edges in which *Y* appears as an activator have prior probability 0; and if *Y* is both, then no change is made to the prior distribution. Next we tested the performance of LEM under increasing access to this prior information. For increasing numbers of nodes, we randomly selected nodes and included their identities as prior information, computing LEM’s AUC-ROC score in each case. As mentioned previously, partial inclusion of this information incrementally improves the results (see Additional file 1: Section 6 and Additional file 15).

Since we initially focused on producing in silico datasets with periodic behavior, we also asked whether LEM could infer network structure and parameters from non-periodic data. For this testing, we generated several non-periodic benchmark datasets in silico, and we used AUC-ROC as a measure of LEM’s performance. LEM appears to perform as well on these data as it does on oscillating data from networks with the same number of nodes (see Additional file 1: Section 7 for results and Additional file 5 for a comparison), with AUC-ROC scores between 0.8 and 1.

Lastly, we evaluated the computational requirements imposed by the LEM algorithm. These requirements can be large, since they depend quadratically on the number of nodes under consideration (see Additional file 1: Section 4 for a precise description of LEM’s implementation and computational complexity, including run-time tables). However, since LEM involves separately computing an approximate posterior distribution on the possible regulations of each node, it is highly parallelizable. Leveraging parallel computations, we observed that LEM is scalable to large networks of the order of hundreds of nodes.

## Discussion and conclusions

We have presented LEM as a tool to prioritize hypotheses for gene regulatory network structures. After validating the approach on in silico networks, we first compared LEM outputs to a gold standard gene regulatory network established by physical evidence gathered from ChIP-on-chip studies in a well-characterized budding yeast cell-cycle transcription network. In this analysis, outputs of LEM, which we view as functional edges, consisted of many of the edges previously characterized, along with one novel edge (YOX1 repressing CTS1), which provides an example of a potential discovery. LEM does not identify all edges that were detected by ChIP studies, and there are several possible explanations of this. For one, LEM is designed to identify the dominant regulatory signal in a given experimental condition, and therefore, it is possible that the gold standard edges not found by LEM are of secondary importance. Furthermore, we speculate that physical binding does not always predict a functional relationship between regulator and target in the conditions that were observed experimentally.

If we relax the *p*-value cutoffs used to construct the gold standard network, we obtain a more complex network with additional nodes and edges that have less experimental support (the yeast cell-cycle 5 network, see Additional file 12 and Additional file 1: Section 10). In general, we find that LEM (along with other algorithms) has a harder time finding evidence for this network in the time-series data. As seen in Additional files 5 and 6, all algorithms perform poorly when this network is treated as the gold standard. Indeed, all AUC scores are close to 0.5, which is what one would expect if edges were ranked randomly. This outcome suggests that there is little support in the time-series data for this as the underlying network.

To demonstrate how LEM could be applied to study larger biological networks, we used LEM to predict novel members of the circadian transcription network, for which both the regulators and the topology are incompletely characterized. Using LEM to look for novel regulators that both receive and transmit regulatory edges to known circadian network nodes, we generated a candidate list of about 200 potential circadian regulators, a dramatic reduction from the thousands of circadian oscillating genes that periodicity-detection algorithms reported in Zhang et al. [44]. For evidence that these results are informative, note that four regulators ranked in the top 20 have been previously shown to have circadian function (see Additional file 17). Furthermore, in a preliminary screen, we found that four out of ten tested genes from the LEM list showed a significant circadian phenotype, despite that previous high-throughput screens found that about 2 % of the genome has a circadian phenotype (see Additional file 18). In light of this performance, we believe that LEM is a powerful tool for reducing a hypothesis space while inferring network topology from the available data.

The issue of non-identifiability of network models for gene regulatory networks has been recognized [6, 7] but not widely studied. This issue arises when distinct networks (i.e., network topologies) have the capability of generating the same dynamics within similar parameter regimes. By definition, no inference algorithm can distinguish between such non-identifiable pairs of networks. Since LEM takes a Bayesian approach, it implicitly rewards models that are robust to changes of parameters (see Additional file 1: Section 3 for a theoretical justification and Additional file 20 for examples). Thus, if two distinct models generate the same dynamics, then LEM will place a higher posterior probability on the more robust model. Although LEM cannot overcome the theoretical limits on inference placed by non-identifiability issues, we observe that LEM, nonetheless, performs quite well, as evidenced by its ability to find global systems of differential equations that fit the data (see Additional file 1: Section 8 and Additional file 20 for examples). This phenomenon also appears in yeast cell-cycle network 1, where LEM does not capture all of the gold standard edges, but it does generate dynamics that closely approximate the observed data (see Additional file 21). Predictions made by LEM for yeast cell-cycle network 1 that were not previously identified by experiments are in the process of being tested.

In addition to the theoretical non-identifiability discussed above, there is a practical issue that arises when the data are not informative enough to distinguish between several models. This situation may arise when one considers sparsely sampled or noisy data, and it calls for additional data from other experimental conditions, such as genetic perturbations. If such data become available, they may be integrated into the framework of LEM via prior information.

As the size of a network grows, the degree of non-identifiability may also increase, since many nodes can present similar dynamics. Such an increase in non-identifiability will necessarily limit the performance of any edge inference algorithm on large networks. To illustrate this point, we created two additional in silico datasets, each with 100 nodes (see Additional files 22 and 23 for the network diagrams and Additional files 24 and 25 for the performance of the inference algorithms). By design, the network in silico 23 contains an extreme amount of redundancy, with 97 of the nodes having exactly the same time series. However, this redundancy is concentrated within the network in such a way that only one edge (out of 100) lacks identifiability, leading to very strong performance by LEM and other edge inference algorithms in our ROC analysis. The network in silico 24 has some practical non-identifiability, in the sense that some nodes have very similar time series (although strictly speaking no two nodes are exactly the same). Accordingly, the performance of the inference algorithms suffers slightly on this dataset. Despite that both of these networks have 100 nodes, LEM is able to infer the correct edges with high accuracy. In summary, non-identifiability (not size) appears to be the main factor limiting the accuracy of LEM.

One of the simplifying assumptions in LEM is that each node in the network has only one dominant regulator controlling its expression level, which is in contrast to some other algorithms, such as Inferelator, that allow one to model regulation with some form of combinatorial control. In principle, one could modify LEM to include such combinatorial terms. However, we believe that the data available in the foreseeable future will not be informative enough to overcome the increase in computational and statistical complexity that would be introduced by these terms, and therefore, inclusion of these terms at this time would result in longer run times and more overfitting. Furthermore, our results indicate that the present version of LEM performs well, even when the generating networks are known to contain combinatorial regulation. One possible explanation for this performance is that even when a gene experiences combinatorial regulation, at least one of the regulators fits the target data reasonably well by itself. In such cases, LEM will typically reward that regulator with a high score, leading to strong performance in a ROC analysis.

In Additional file 1: Section 4, we give details about the computational burden of LEM. Other algorithms that combine detailed differential equation models with a Bayesian formalism tend to employ MCMC to approximate the posterior distribution(s) [26, 48]. In general, LEM avoids the need for any MCMC, thereby reducing the computational burden. In an illustrative comparison against the method of Mazur et al. [48] on four small networks with three nodes, LEM runs substantially faster (see Additional file 26). We did not compare LEM directly to CheMA [26], as the available implementation is designed for protein signaling networks mediated by phosphorylation and therefore, it is not applicable in our setting. Some algorithms that rely on different underlying techniques run faster than LEM, such as Granger Causality [14] and Inferelator [17, 18]. Nonetheless, the parallelizability of LEM makes it applicable to large networks, and the improvement in inferential accuracy demonstrated by LEM over previous methods suggests that it is an especially valuable tool in the search for *functional* networks or network components, which are typically moderate in size.

In the examples considered here, we have focused largely on periodic time series of gene expression, as they clearly result from functional networks, but there is nothing inherent to LEM that limits its utility to gene expression. Indeed, we expect it to generalize as well to other dynamic processes, such as signal transduction pathways or developmental networks. In future work, we intend to extend LEM to allow for the explicit modeling of other cellular processes, such as phosphorylation and ubiquitination.

## Declarations

### Funding

This work was partially supported by DARPA under grant number D12AP0001 and by the National Science Foundation under grant number DMS-1045153. XG acknowledges support from the Research Grants Council of Hong Kong (project PolyU 25301115).

### Availability of data and materials

Local Edge Machine (LEM) is free software: it may be redistributed and/or modified under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the license or any later version. It may be found at https://gitlab.com/biochron_open/lem. The version of the source code used in the manuscript has been assigned DOI 10.5281/zenodo.154551.

Software will be available at http://cms.math.duke.edu/harer/?q=downloads.

### Authors’ contributions

AD, SBH, JLH, and KAM conceived of and contributed to the intellectual development of the project. XG helped develop and implemented the algorithm. AD created the in silico datasets and figures. CMK and ARL curated the yeast cell-cycle networks and datasets. LJF and JBH provided data and experimental validation for the circadian analysis. JLH and KAM prepared the manuscript, with contributions from AD, XG, SBH, CMK, and ARL. All authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

### Ethics approval and consent to participate

Ethics approval was not needed for this study.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Authors’ Affiliations

## References

- Barabasi AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004; 5(2):101–13.PubMedView ArticleGoogle Scholar
- Orlando DA, Lin CY, Bernard A, Wang JY, Socolar JE, Iversen ES, Hartemink AJ, Haase SB. Global control of cell-cycle transcription by coupled CDK and network oscillators. Nature. 2008; 453(7197):944–7.PubMedPubMed CentralView ArticleGoogle Scholar
- Simmons Kovacs LA, Mayhew MB, Orlando DA, Jin Y, Li Q, Huang C, Reed SI, Mukherjee S, Haase SB. Cyclin-dependent kinases are regulators and effectors of oscillations driven by a transcription factor network. Mol Cell. 2012; 45(5):669–79.PubMedView ArticleGoogle Scholar
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle–regulated genes of the yeast
*Saccharomyces cerevisiae*by microarray hybridization. Mol Biol Cell. 1998; 9(12):3273–97.PubMedPubMed CentralView ArticleGoogle Scholar - Pramila T, Wu W, Miles S, Noble WS, Breeden LL. The Forkhead transcription factor Hcm1 regulates chromosome segregation genes and fills the S-phase gap in the transcriptional circuitry of the cell cycle. Genes Dev. 2006; 20(16):2266–78.PubMedPubMed CentralView ArticleGoogle Scholar
- Hecker M, Lambeck S, Toepfer S, Van Someren E, Guthke R. Gene regulatory network inference: data integration in dynamic models—a review. Biosystems. 2009; 96(1):86–103.PubMedView ArticleGoogle Scholar
- Lillacci G, Khammash M. Parameter estimation and model selection in computational biology. PLoS Comput Biol. 2010; 6(3):1000696.View ArticleGoogle Scholar
- Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, Kellis M, Collins JJ, Stolovitzky G, et al.Wisdom of crowds for robust gene network inference. Nat Methods. 2012; 9(8):796–804.PubMedPubMed CentralView ArticleGoogle Scholar
- Oates CJ, Mukherjee S. Network inference and biological dynamics. Ann Appl Stat. 2012; 6(3):1209.PubMedPubMed CentralView ArticleGoogle Scholar
- Haase SB, Wittenberg C. Topology and control of the cell-cycle-regulated transcriptional circuitry. Genetics. 2014; 196(1):65–90.PubMedPubMed CentralView ArticleGoogle Scholar
- Simon I, Barnett J, Hannett N, Harbison CT, Rinaldi NJ, Volkert TL, Wyrick JJ, Zeitlinger J, Gifford DK, Jaakkola TS, et al.Serial regulation of transcriptional regulators in the yeast cell cycle. Cell. 2001; 106(6):697–708.PubMedView ArticleGoogle Scholar
- Zhang EE, Kay SA. Clocks not winding down: unravelling circadian networks. Nat Rev Mol Cell Biol. 2010; 11(11):764–76.PubMedView ArticleGoogle Scholar
- Bansal M, Della Gatta G, Di Bernardo D. Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics. 2006; 22(7):815–22.PubMedView ArticleGoogle Scholar
- Granger CWJ. Investigating causal relations by econometric models and cross-spectral methods. Econometrica. 1969; 37(3):424–38.View ArticleGoogle Scholar
- Morrissey ER, Juárez MA, Denby KJ, Burroughs NJ. Inferring the time-invariant topology of a nonlinear sparse gene regulatory network using fully Bayesian spline autoregression. Biostatistics. 2011; 12(4):682–94.PubMedView ArticleGoogle Scholar
- Banos DT, Millar AJ, Sanguinetti G. A Bayesian approach for structure learning in oscillating regulatory networks. Bioinformatics. 2015; 31(22):3617–24.Google Scholar
- Bonneau R, Reiss DJ, Shannon P, Facciotti M, Hood L, Baliga NS, Thorsson V. The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biol. 2006; 7(5):36.View ArticleGoogle Scholar
- Greenfield A, Hafemeister C, Bonneau R. Robust data-driven incorporation of prior knowledge into the inference of dynamic regulatory networks. Bioinformatics. 2013; 29(8):1060–7.PubMedPubMed CentralView ArticleGoogle Scholar
- Zoppoli P, Morganella S, Ceccarelli M. TimeDelay-ARACNE: reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinform. 2010; 11(1):154.View ArticleGoogle Scholar
- Yu J, Smith V, Wang P, Hartemink A, Jarvis E. Using Bayesian network inference algorithms to recover molecular genetic regulatory networks: 2002. International Conference on Systems Biology 2002 (ICSB02), December 2002.Google Scholar
- Yu J, Smith VA, Wang PP, Hartemink AJ, Jarvis ED. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics. 2004; 20(18):3594–603.PubMedView ArticleGoogle Scholar
- Hill SM, Lu Y, Molina J, Heiser LM, Spellman PT, Speed TP, Gray JW, Mills GB, Mukherjee S. Bayesian inference of signaling network topology in a cancer cell line. Bioinformatics. 2012; 28(21):2804–10.PubMedPubMed CentralView ArticleGoogle Scholar
- Dondelinger F, Lèbre S, Husmeier D. Non-homogeneous dynamic Bayesian networks with Bayesian regularization for inferring gene regulatory networks with gradually time-varying structure. Mach Learn. 2013; 90(2):191–230.View ArticleGoogle Scholar
- Äijö T, Lähdesmäki H. Learning gene regulatory networks from gene expression measurements using non-parametric molecular kinetics. Bioinformatics. 2009; 25(22):2937–44.PubMedView ArticleGoogle Scholar
- Huynh-Thu VA, Sanguinetti G. Combining tree-based and dynamical systems for the inference of gene regulatory networks. Bioinformatics. 2015; 31(10):1614–22.PubMedPubMed CentralView ArticleGoogle Scholar
- Oates CJ, Dondelinger F, Bayani N, Korkola J, Gray JW, Mukherjee S. Causal network inference using biochemical kinetics. Bioinformatics. 2014; 30(17):468–74.View ArticleGoogle Scholar
- Penfold CA, Buchanan-Wollaston V, Denby KJ, Wild DL. Nonparametric Bayesian inference for perturbed and orthologous gene regulatory networks. Bioinformatics. 2012; 28(12):233–41.View ArticleGoogle Scholar
- Kitano H. Computational systems biology. Nature. 2002; 420(6912):206–10.PubMedView ArticleGoogle Scholar
- ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489(7414):57–74.View ArticleGoogle Scholar
- McIsaac RS, Silverman SJ, McClean MN, Gibney PA, Macinskas J, Hickman MJ, Petti AA, Botstein D. Fast-acting and nearly gratuitous induction of gene expression and protein depletion in
*Saccharomyces cerevisiae*. Mol Biol Cell. 2011; 22(22):4447–59.PubMedPubMed CentralView ArticleGoogle Scholar - Gardner TS, Di Bernardo D, Lorenz D, Collins JJ. Inferring genetic networks and identifying compound mode of action via expression profiling. Science. 2003; 301(5629):102–5.PubMedView ArticleGoogle Scholar
- Yeung MS, Tegnér J, Collins JJ. Reverse engineering gene networks using singular value decomposition and robust regression. Proc Natl Acad Sci. 2002; 99(9):6163–8.PubMedPubMed CentralView ArticleGoogle Scholar
- Bissiri PG, Holmes CC, Walker SG. A general framework for updating belief distributions. J R Stat Soc Ser B Stat Methodol. 2016; 78(5):1103–30.View ArticleGoogle Scholar
- Jiang W, Tanner MA. Gibbs posterior for variable selection in high-dimensional classification and data mining. Ann Stat. 2008; 36(5):2207–31.View ArticleGoogle Scholar
- Wong R. Asymptotic approximation of integrals. Philadelphia: SIAM; 2001.View ArticleGoogle Scholar
- Buchler NE, Gerland U, Hwa T. On schemes of combinatorial transcription logic. Proc Natl Acad Sci. 2003; 100(9):5136–41.PubMedPubMed CentralView ArticleGoogle Scholar
- Setty Y, Mayo AE, Surette MG, Alon U. Detailed map of a cis-regulatory input function. Proc Natl Acad Sci. 2003; 100(13):7702–7.PubMedPubMed CentralView ArticleGoogle Scholar
- Bazil JN, Stamm KD, Li X, Thiagarajan R, Nelson TJ, Tomita-Mitchell A, Beard DA. The inferred cardiogenic gene regulatory network in the mammalian heart. PLoS ONE. 2014; 9(6):100842.View ArticleGoogle Scholar
- Nayak A, Glöckner-Pagel J, Vaeth M, Schumann JE, Buttmann M, Bopp T, Schmitt E, Serfling E, Berberich-Siebelt F. Sumoylation of the transcription factor nfatc1 leads to its subnuclear relocalization and interleukin-2 repression by histone deacetylase. J Biol Chem. 2009; 284(16):10935–46.PubMedPubMed CentralView ArticleGoogle Scholar
- Peng Y, Jahroudi N. The NFY transcription factor functions as a repressor and activator of the von Willebrand factor promoter. Blood. 2002; 99(7):2408–17.PubMedView ArticleGoogle Scholar
- Reynolds N, O’Shaughnessy A, Hendrich B. Transcriptional repressors: multifaceted regulators of gene expression. Development. 2013; 140(3):505–12.PubMedView ArticleGoogle Scholar
- Teixeira MC, Monteiro PT, Guerreiro JF, Gonçalves JP, Mira NP, dos Santos SC, Cabrito TR, Palma M, Costa C, Francisco AP, Madeira SC, Oliveira AL, Freitas AT, Sá-Correia I. The YEASTRACT database: an upgraded information system for the analysis of gene and genomic transcription regulation in
*Saccharomyces cerevisiae*. Nucleic Acids Res. 2014; 42(D1):161–6.View ArticleGoogle Scholar - Cherry JM, Hong EL, Amundsen C, et al. Saccharomyces genome database: the genomics resource of budding yeast. Nucleic Acids Res. 2011; 40(D1):D700–D705.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang R, Lahens NF, Ballance HI, Hughes ME, Hogenesch JB. A circadian gene expression atlas in mammals: implications for biology and medicine. Proc Natl Acad Sci. 2014; 111(45):16219–24.PubMedPubMed CentralView ArticleGoogle Scholar
- Anafi RC, Lee Y, Sato TK, Venkataraman A, Ramanathan C, Kavakli IH, Hughes ME, Baggs JE, Growe J, Liu AC, Kim J, Hogenesch JB. Machine learning helps identify chrono as a circadian clock component. PLoS Biol. 2014; 12(4):1001840.View ArticleGoogle Scholar
- Zhang EE, Liu AC, Hirota T, Miraglia LJ, Welch G, Pongsawakul PY, Liu X, Atwood A, Huss III JW, Janes J, Su AI, Hogenesch JB, Kay SA. A genome-wide RNAi screen for modifiers of the circadian clock in human cells. Cell. 2009; 139(1):199–210.PubMedPubMed CentralView ArticleGoogle Scholar
- Raser JM, O’Shea EK. Noise in gene expression: origins, consequences, and control. Science. 2005; 309(5743):2010–13.PubMedPubMed CentralView ArticleGoogle Scholar
- Mazur J, Ritter D, Reinelt G, Kaderali L. Reconstructing nonlinear dynamic models of gene regulation using stochastic sampling. BMC Bioinform. 2009; 10(1):448.View ArticleGoogle Scholar
- Elowitz MB, Leibler S. A synthetic oscillatory network of transcriptional regulators. Nature. 2000; 403(6767):335–8.PubMedView ArticleGoogle Scholar
- Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al.Transcriptional regulatory networks in
*Saccharomyces cerevisiae*. Science. 2002; 298(5594):799–804.PubMedView ArticleGoogle Scholar - Workman CT, Mak HC, McCuine S, Tagne JB, Agarwal M, Ozier O, Begley TJ, Samson LD, Ideker T. A systems approach to mapping DNA damage response pathways. Science. 2006; 312(5776):1054–9.PubMedPubMed CentralView ArticleGoogle Scholar
- Fisher RA. Statistical methods for research workers, 4th ed. Edinburgh: Oliver and Boyd; 1932.Google Scholar
- Price TS, Baggs JE, Curtis AM, FitzGerald GA, Hogenesch JB. Waveclock: wavelet analysis of circadian oscillation. Bioinformatics. 2008; 24(23):2794–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995; 57(1):289–300.Google Scholar