MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect

Abstract

Multiplex assays of variant effect (MAVEs) are a family of methods that includes deep mutational scanning experiments on proteins and massively parallel reporter assays on gene regulatory sequences. Despite their increasing popularity, a general strategy for inferring quantitative models of genotype-phenotype maps from MAVE data is lacking. Here we introduce MAVE-NN, a neural-network-based Python package that implements a broadly applicable information-theoretic framework for learning genotype-phenotype maps—including biophysically interpretable models—from MAVE datasets. We demonstrate MAVE-NN in multiple biological contexts, and highlight the ability of our approach to deconvolve mutational effects from otherwise confounding experimental nonlinearities and noise.

Background

Over the last decade, the ability to quantitatively study genotype-phenotype (G-P) maps has been revolutionized by the development of multiplex assays of variant effect (MAVEs), which can measure molecular phenotypes for thousands to millions of genotypic variants in parallel [1, 2]. MAVE is an umbrella term that describes a diverse set of experimental methods, some examples of which are illustrated in Fig. 1. Deep mutational scanning (DMS) experiments [3] are a type of MAVE commonly used to study protein sequence-function relationships. These assays work by linking variant proteins to their coding sequences, either directly or indirectly, then using deep sequencing to assay which variants survive a process of activity-dependent selection (e.g., Fig. 1a). Massively parallel reporter assays (MPRAs) are another major class of MAVE and are commonly used to study DNA or RNA sequences that regulate gene expression at a variety of steps, including transcription, mRNA splicing, cleavage and polyadenylation, translation, and mRNA decay [4–7]. MPRAs typically rely on either an RNA-seq readout of barcode abundances (Fig. 1c) or the sorting of cells expressing a fluorescent reporter gene (Fig. 1e).

Fig. 1

Diverse MAVEs and the datasets they produce. a DMS assays using either affinity purification or selective growth. (i) The DMS assay of Olson et al. [8] used a library of variant GB1 proteins physically linked to their coding mRNAs. Functional GB1 proteins were then enriched using IgG beads. (ii) The DMS studies of Seuma et al. [9] and Bolognesi et al. [10] used selective growth in genetically modified Saccharomyces cerevisiae to assay the functionality of variant Aβ and TDP-43 proteins, respectively. In all three experiments, deep sequencing was used to determine an enrichment ratio for each protein variant. b The resulting DMS datasets consist of variant protein sequences and their corresponding log enrichment values. c The MPSA of Wong et al. [11]. A library of 3-exon minigenes was constructed from exons 16, 17, and 18 of the human BRCA2 gene, with each minigene having a variant 5′ ss at exon 17 and a random 20 nt barcode in the 3′ UTR. This library was transfected into HeLa cells, and deep sequencing of RNA barcodes was used to quantify mRNA isoform abundance. d The resulting MPSA dataset comprises variant 5′ ss with (noisy) PSI values. e The sort-seq MPRA of Kinney et al. [12]. A plasmid library was generated in which randomly mutagenized versions of the Escherichia coli lac promoter drove the expression of GFP. Cells carrying these plasmids were sorted using FACS, and the variant promoters in each bin of sorted cells, as well as the initial library, were sequenced. f The resulting dataset comprises a list of variant promoter sequences, as well as a matrix of counts for each variant in each FACS bin. MAVE: multiplex assay of variant effect; DMS: deep mutational scanning; GB1: protein G domain B1; IgG: immunoglobulin G; Aβ: amyloid beta; TDP-43: TAR DNA-binding protein 43; MPSA: massively parallel splicing assay; BRCA2: breast cancer 2; 5′ ss: 5′ splice site(s); UTR: untranslated region; PSI: percent spliced in; GFP: green fluorescent protein; FACS: fluorescence-activated cell sorting

Most computational methods for analyzing MAVE data have focused on accurately quantifying the activity of individual assayed sequences [13–19]. However, MAVE measurements like enrichment ratios or cellular fluorescence levels usually cannot be interpreted as providing direct quantification of biologically meaningful activities, due to the presence of experiment-specific nonlinearities and noise. Moreover, MAVE data is usually incomplete, as one often wishes to understand G-P maps over vastly larger regions of sequence space than can be exhaustively assayed. The explicit quantitative modeling of G-P maps can address both the indirectness and incompleteness of MAVE measurements [1, 20]. The goal here is to determine a mathematical function that, given a sequence as input, will return a quantitative value for that sequence’s molecular phenotype. Such quantitative modeling has been of great interest since the earliest MAVE methods were developed [12, 21, 22], but no general-use software has yet been described for inferring G-P maps of arbitrary functional form from the diverse types of datasets produced by commonly used MAVEs.

Here we introduce a unified conceptual framework for the quantitative modeling of MAVE data. This framework is based on the use of latent phenotype models, which assume that each assayed sequence has a well-defined latent phenotype (specified by the G-P map), of which the MAVE experiment provides a noisy indirect readout (described by the measurement process). The quantitative forms of both the G-P map and the measurement process are then inferred from MAVE data simultaneously. We further introduce an information-theoretic approach for separately assessing the performance of the G-P map and the measurement process components of latent phenotype models. This strategy is implemented in an easy-to-use open-source Python package called MAVE-NN, which represents latent phenotype models as neural networks and infers the parameters of these models from MAVE data using a TensorFlow 2 backend [23].

In what follows, we expand on this unified MAVE modeling strategy and apply it to a diverse array of DMS and MPRA datasets. Doing so, we find that MAVE-NN provides substantial advantages over other MAVE modeling approaches. In particular, we illustrate how the ability of MAVE-NN to train custom G-P maps can shed light on the biophysical mechanisms of protein function and gene regulation. We also highlight the substantial benefits of including sequence variants with multiple mutations in assayed sequence libraries, as doing so allows MAVE-NN to deconvolve the features of the G-P map from potentially confounding effects of experimental nonlinearities and noise. Indeed, including just a modest number of multiple-mutation variants in a MAVE experiment can be beneficial even when one is primarily interested in the effects of single mutations.

Results

Latent phenotype modeling strategy

MAVE-NN supports the analysis of MAVE data on DNA, RNA, and protein sequences and can accommodate either continuous or discrete measurement values. Given a set of sequence-measurement pairs, MAVE-NN aims to infer a probabilistic mapping from sequences to measurements. Our primary enabling assumption, which is encoded in the structure of the latent phenotype model (Fig. 2a), is that this mapping occurs in two stages. Each sequence is first mapped to a latent phenotype by a deterministic G-P map. This latent phenotype is then mapped to possible measurement values via a stochastic measurement process. During training, the G-P map and measurement process are simultaneously learned by maximizing a regularized form of likelihood.

Fig. 2

MAVE-NN quantitative modeling strategy. a Structure of latent phenotype models. A deterministic G-P map f(x) maps each sequence x to a latent phenotype ϕ, after which a probabilistic measurement process p(y|ϕ) generates a corresponding measurement y. b Example of an MPA measurement process inferred from the sort-seq MPRA data of Kinney et al. [12]. MPA measurement processes are used when y values are discrete. c Structure of a GE regression model, which is used when y is continuous. A GE measurement process assumes that the mode of p(y|ϕ), called the prediction \(\hat {y}\), is given by a nonlinear function g(ϕ), and the scatter about this mode is described by a noise model \(p\left (y|\hat {y}\right)\). d Example of a GE measurement process inferred from the DMS data of Olson et al. [8]. Shown are the nonlinearity, the 68% PI, and the 95% PI. e Information-theoretic quantities used to assess model performance. Intrinsic information, Iint, is the mutual information between sequences x and measurements y. Predictive information, Ipre, is the mutual information between measurements y and the latent phenotype values ϕ assigned by a G-P map. Variational information, Ivar, is a linear transformation of the log likelihood of a full latent phenotype model. The model performance inequality, Iint ≥ Ipre ≥ Ivar, always holds on test data (modulo finite data uncertainties), with Iint=Ipre when the G-P map is correct, and Ipre=Ivar when the measurement process correctly describes the distribution of y conditioned on ϕ. G-P: genotype-phenotype; MPA: measurement process agnostic; MPRA: massively parallel reporter assay; GE: global epistasis; DMS: deep mutational scanning; PI: prediction interval

MAVE-NN includes four types of built-in G-P maps: additive, neighbor, pairwise, and black box. Additive G-P maps assume that each character at each position within a sequence contributes independently to the latent phenotype. Neighbor G-P maps incorporate interactions between adjacent (i.e., nearest-neighbor) characters in a sequence, while pairwise G-P maps include interactions between all pairs of characters in a sequence regardless of the distance separating the characters in each pair. Black box G-P maps have the form of a densely connected multilayer perceptron, the specific architecture of which can be controlled by the user. MAVE-NN also supports custom G-P maps that can be used, e.g., to represent specific biophysical hypotheses about the mechanisms of sequence function.
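To make this concrete, the sketch below shows how a latent phenotype model with one of these built-in G-P maps might be instantiated and trained. It is a minimal sketch based on the MAVE-NN documentation: the argument and column names follow the package's published API, but the specific values (sequence length, training settings) are illustrative assumptions, not recommendations.

```python
import mavenn

# Load a built-in example dataset (here, GB1 DMS data); the 'x' and 'y'
# column names below follow the examples in the MAVE-NN documentation.
data_df = mavenn.load_example_dataset('gb1')

# Define a latent phenotype model: an additive G-P map plus a GE
# measurement process with a heteroscedastic skewed-t noise model.
model = mavenn.Model(
    L=55,                            # sequence length
    alphabet='protein',
    gpmap_type='additive',           # or 'neighbor', 'pairwise', 'blackbox'
    regression_type='GE',            # 'MPA' for discrete bin-count data
    ge_noise_model_type='SkewedT',   # or 'Gaussian', 'Cauchy'
    ge_heteroskedasticity_order=2,   # polynomial order of the noise scale
)

# Attach data, then infer the G-P map and measurement process together.
# (A proper train/validation/test split is omitted for brevity.)
model.set_data(x=data_df['x'], y=data_df['y'])
model.fit(epochs=500, batch_size=100, early_stopping=True)
```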

To handle both discrete and continuous measurement values, two different strategies for modeling measurement processes are provided. Measurement process agnostic (MPA) regression uses techniques from the biophysics literature [12, 20, 24, 25] to analyze MAVE datasets that report discrete measurements. Here the measurement process is represented by an overparameterized neural network that takes the latent phenotype as input and outputs the probability of each possible categorical measurement (Fig. 2b). Global epistasis (GE) regression (Fig. 2c), by contrast, leverages ideas previously developed in the evolution literature for analyzing datasets that contain continuous measurements [26–29], and is becoming an increasingly popular strategy for modeling DMS data [30–33]. Here, the latent phenotype is nonlinearly mapped to a prediction that represents the most probable measurement value. A noise model is then used to describe the distribution of likely deviations from this prediction. MAVE-NN supports both homoscedastic and heteroscedastic noise models based on three different classes of probability distribution: Gaussian, Cauchy, and skewed-t. In particular, the skewed-t distribution, introduced by Jones and Faddy [34], reduces to Gaussian and Cauchy distributions in certain limits, but also accommodates asymmetric experimental noise. Figure 2d shows an example of a GE measurement process with a heteroscedastic skewed-t noise model.

Readers should note that the current implementation of MAVE-NN places constraints on input data and model architecture. Input sequences must be the same length, and when analyzing continuous data, only scalar measurements (as opposed to vectors of multiple measurements) can be used to train models. In addition, because our method for learning the form of experimental nonlinearities depends on observing how multiple mutations combine, MAVE-NN’s functionality is more limited when analyzing MAVE libraries that contain only single-mutation variants. More information on these constraints and the reasons behind them can be found below in the section “Constraints on datasets and models”.

Information-theoretic measures of model performance

We further propose three distinct quantities for assessing the performance of latent phenotype models: intrinsic information, predictive information, and variational information (Fig. 2e). These quantities come from information theory and are motivated by thinking of G-P maps in terms of information compression. In information theory, a quantity called mutual information quantifies the amount of information that the value of one variable communicates about the value of another [35, 36]. Mutual information is symmetric, is nonnegative, and is measured in units called “bits”. If the two variables in question are independent, their mutual information will be zero bits. If, instead, knowing the value of one of these variables allows you to narrow down the value of the other variable to one of two otherwise equally likely possibilities, their mutual information will be 1.0 bits. This mutual information will be 2.0 bits if this second variable’s value is narrowed down to one of four possibilities, 3.0 bits if it is narrowed down to one of eight possibilities, and so on. But importantly, mutual information does not require that the relationship between the two variables of interest be so clean cut, and can in fact be computed between any two types of variables—discrete, continuous, multi-dimensional, etc. This property makes the information-based quantities we propose applicable to all MAVE datasets, regardless of the specific type of experimental readout used. By contrast, many of the standard model performance metrics have restricted domains of applicability: accuracy can only be applied to data with categorical labels, Pearson and Spearman correlation can only be applied to data with univariate continuous labels, and so on. We note, however, that estimating mutual information and related quantities from finite data is nontrivial and that MAVE-NN uses a variety of approaches to do this.
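To make the bit-counting above concrete, here is a minimal sketch that computes the mutual information of a discrete joint distribution; the example distribution is hypothetical and constructed so that knowing one variable narrows the other down to one of two equally likely values.

```python
import numpy as np

def mutual_information_bits(p_xy):
    """Mutual information (in bits) of a discrete joint distribution p(x, y)."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                         # 0 * log 0 = 0 by convention
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Each value of x narrows y down to one of two equally likely values:
p = [[0.25, 0.25, 0.00, 0.00],
     [0.00, 0.00, 0.25, 0.25]]
print(mutual_information_bits(p))  # 1.0 bit
```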

Our information metrics are as follows. Intrinsic information, Iint, is the mutual information between the sequences and measurements contained within a MAVE dataset. This quantity provides a useful benchmark against which to compare the performance of inferred G-P maps. Predictive information, Ipre, is the mutual information between MAVE measurements and the latent phenotype values predicted by an inferred G-P map. This quantifies how well the latent phenotype preserves sequence-encoded information that is determinative of experimental measurements. When evaluated on test data, Ipre is bounded above by Iint, and equality is realized only when the G-P map losslessly compresses sequence-encoded information. Variational information, Ivar, is a linear transformation of log likelihood (or equivalently, cross-entropy) that provides a variational lower bound on Ipre [37–39]. The difference between Ipre and Ivar quantifies how accurately the inferred measurement process matches the observed distribution of measurements conditioned on latent phenotypes (see Additional file 1: Section S1).

MAVE-NN infers model parameters by maximizing an (often quite lightly) regularized form of likelihood. These computations are performed using the standard backpropagation-based training algorithms provided within TensorFlow 2. With certain caveats noted (see “Methods”), this optimization procedure maximizes Ipre while avoiding the costly estimates of mutual information at each iteration that have hindered the adoption of previous mutual-information-based modeling strategies [12, 25].

Application: deep mutational scanning assays

We now demonstrate the capabilities of MAVE-NN on three DMS datasets, starting with the study of Olson et al. [8]. These authors measured the effects of single and double mutations to residues 2–56 of the IgG binding domain of protein G (GB1). To assay the binding of GB1 variants to IgG, the authors combined mRNA display with ultra-high-throughput DNA sequencing [Fig. 1a(i)]. The resulting dataset reports log enrichment values for all 1045 possible single mutations and 530,737 (nearly all possible) double mutations to this 55 aa protein sequence (Fig. 1b).

Inspired by the work of Otwinowski et al. [29], we used MAVE-NN to infer a latent phenotype model comprising an additive G-P map and a GE measurement process. This inference procedure required only about 5 min on one node of a computer cluster (Additional file 1: Fig. S1e). Figure 3a illustrates the inferred additive G-P map via the effects that every possible single-residue mutation has on the latent phenotype. From this heatmap of additive effects, we can immediately identify all of the critical GB1 residues, including the IgG-interacting residues at 27, 31, and 43 [8]. We also observe that missense mutations to proline throughout the GB1 domain tend to negatively impact IgG binding, as expected due to this amino acid’s exceptional conformational rigidity.

Fig. 3

Analysis of DMS data for protein GB1. MAVE-NN was used to infer a latent phenotype model, consisting of an additive G-P map and a GE measurement process with a heteroscedastic skewed-t noise model, from the DMS data of Olson et al. [8]. All 530,737 pairwise variants reported for positions 2 to 56 of the GB1 domain were analyzed. Data were split 90:5:5 into training, validation, and test sets. a The inferred additive G-P map parameters. Gray dots indicate wildtype residues. Amino acids are ordered as in Olson et al. [8]. b GE plot showing measurements versus predicted latent phenotype values for 5000 randomly selected test-set sequences (blue dots), alongside the inferred nonlinearity (solid orange line) and the 95% PI (dotted orange lines) of the noise model. Gray line indicates the latent phenotype value of the wildtype sequence. c Measurements plotted against \(\hat {y}\) predictions for these same sequences. Dotted lines indicate the 95% PI of the noise model. Gray line indicates the wildtype value of \(\hat {y}\). Uncertainty in the value of R2 reflects standard error. d Corresponding information metrics computed during model training (using training data) or for the final model (using test data). The uncertainties in these estimates are very small—roughly the width of the plotted lines. Gray shaded area indicates allowed values for intrinsic information based on the upper and lower bounds estimated as described in “Methods.” e–g Test set predictions (blue dots) and GE nonlinearities (orange lines) for models trained using subsets of the GB1 data containing all single mutants and 50,000 (e), 5000 (f), or 500 (g) double mutants. The GE nonlinearity from panel b is shown for reference (yellow-green lines). DMS: deep mutational scanning; GB1: protein G domain B1; GE: global epistasis; G-P: genotype-phenotype; PI: prediction interval

Figure 3b illustrates the corresponding GE measurement process, revealing a sigmoidal relationship between log enrichment measurements and the latent phenotype values predicted by the G-P map. Nonlinearities like this are ubiquitous in DMS data due to the presence of background and saturation effects. Unless they are explicitly accounted for in one’s quantitative modeling efforts, as they are here, these nonlinearities can greatly distort the parameters of inferred G-P maps. Figure 3c shows that accounting for this nonlinearity yields predictions that correlate quite well with measurement values.

One particularly useful feature of MAVE-NN is that every inferred model can be used as a MAVE dataset simulator (see “Methods”). Among other things, this capability allows users to verify whether MAVE-NN can recover ground-truth models from realistic datasets in diverse biological contexts. By analyzing simulated data generated by the model we inferred for GB1, we observed that MAVE-NN could indeed accurately and robustly recover both the GE nonlinearity and the ground-truth G-P map parameters (Additional file 1: Fig. S1a-d).

Figure 3d summarizes the values of our information-theoretic metrics for model performance. On held-out test data we find that Ivar=2.178±0.027 bits and Ipre=2.225±0.017 bits, where the uncertainties in these values reflect standard errors. The similarity of these two values suggests that the inferred GE measurement process, which includes a heteroscedastic skewed-t noise model, describes the distribution of residuals very well. We further find that 2.741±0.013 bits≤Iint≤3.215±0.007 bits (see “Methods”), meaning that the inferred G-P map accounts for 69–81% of the total sequence-dependent information in the dataset. While this performance is impressive, the additive G-P map evidently misses some relevant aspect of the true genetic architecture. This observation motivates the more complex biophysical model for GB1 discussed later in “Results”.

The ability of MAVE-NN to deconvolve experimental nonlinearities from additive G-P maps requires that some of the assayed sequences contain multiple mutations. This is because such nonlinearities are inferred by reconciling the effects of single mutations with the effects observed for combinations of two or more mutations. To investigate how many multiple-mutation variants are required, we performed GE inference on subsets of the GB1 dataset containing all 1045 single-mutation sequences and either 50,000, 5000, or 500 double-mutation sequences (see “Methods”). The shapes of the resulting GE nonlinearities are illustrated in Fig. 3e–g. Remarkably, MAVE-NN is able to recover the underlying nonlinearity using only about 500 randomly selected double mutants, which represent only ~0.1% of all possible double mutants. The analysis of simulated data also supports the ability to accurately recover ground-truth model predictions using highly reduced datasets (Additional file 1: Fig. S1f). These findings have important implications for the design of DMS experiments: even if one only wants to determine an additive G-P map, including a modest number of multiple-mutation sequences in the assayed library is often advisable because it can enable the removal of artifactual nonlinearities.

To test the capabilities of MAVE-NN on less complete DMS datasets, we analyzed recent experiments on amyloid beta (Aβ) [9] and TDP-43 [10], both of which exhibit aggregation behavior in the context of neurodegenerative diseases. In these experiments, protein functionality was assayed using selective growth [Fig. 1a(ii)] in genetically modified Saccharomyces cerevisiae: Seuma et al. [9] positively selected for Aβ aggregation, whereas Bolognesi et al. [10] selected against TDP-43 toxicity. As with GB1, the variant libraries used in these two experiments included a substantial number of multiple-mutation sequences: 499 single- and 15,567 double-mutation sequences for Aβ; 1266 single- and 56,730 double-mutation sequences for TDP-43. But unlike with GB1, these datasets are highly incomplete due to the use of mutagenic PCR (for Aβ) or doped oligo synthesis (for TDP-43) in the construction of variant libraries.

We used MAVE-NN to infer additive G-P maps from these two datasets, adopting the same type of latent phenotype model used for GB1. Figure 4a illustrates the additive G-P map inferred from aggregation measurements of Aβ variants. In agreement with the original study, we see that most amino acid mutations between positions 30–40 have a negative effect on variant enrichment, suggesting that this region plays a major role in promoting nucleation. Figure 4b shows the corresponding measurement process (see also Additional file 1: Fig. S2). Even though these data are much sparser than the GB1 data, the inferred model performs well on held-out test data (Ivar=1.142±0.065 bits, Ipre=1.187±0.050 bits, R2=0.763±0.024). Similarly, Fig. 4c, d show the G-P map parameters and GE measurement process inferred from toxicity measurements of TDP-43 variants, revealing among other things the toxicity-determining hot-spot observed by Bolognesi et al. [10] at positions 310–340. Again, the resulting latent phenotype model performs well on held-out test data (Ivar=1.834±0.035 bits, Ipre=1.994±0.023 bits, R2=0.914±0.007).

Fig. 4

Analysis of DMS data for Aβ and TDP-43. a, b Seuma et al. [9] measured nucleation scores for 499 single mutants and 15,567 double mutants of Aβ. These data were used to train a latent phenotype model comprising a an additive G-P map and b a GE measurement process with a heteroscedastic skewed-t noise model. c, d Bolognesi et al. [10] measured toxicity scores for 1266 single mutants and 56,730 double mutants of TDP-43. The resulting data were used to train c an additive G-P map and d a GE measurement process of the same form as in panel b. In both cases, data were split 90:5:5 into training, validation, and test sets. In a, c, gray dots indicate the wildtype sequence and * indicates a stop codon. White squares [355/882 (40.2%) for Aβ; 433/1764 (24.5%) for TDP-43] indicate residues that were not observed in the training set and thus could not be assigned values for their additive effects. Amino acids are ordered as in the original publications [9, 10]. In b, d, blue dots indicate latent phenotype values plotted against measurements for held-out test data. Gray line indicates the latent phenotype value of the wildtype sequence. Solid orange line indicates the GE nonlinearity, and dotted orange lines indicate a corresponding 95% PI for the inferred noise model. Values for Ivar, Ipre, and R2 (between y and \(\hat {y}\)) are also shown. Uncertainties reflect standard errors. Additional file 1: Fig. S2 shows measurements plotted against the \(\hat {y}\) predictions of these models. DMS: deep mutational scanning; Aβ: amyloid beta; TDP-43: TAR DNA-binding protein 43; G-P: genotype-phenotype; GE: global epistasis; PI: prediction interval

Application: a massively parallel splicing assay

Exon/intron boundaries are defined by 5′ splice sites (5′ ss), which bind the U1 snRNP during the initial stages of spliceosome assembly. To investigate how 5′ ss sequence quantitatively controls alternative mRNA splicing, Wong et al. [11] used a massively parallel splicing assay (MPSA) to measure percent spliced in (PSI) values for nearly all 32,768 possible 5′ ss of the form NNN/GYNNNN in three different genetic contexts (Fig. 1c, d). Applying MAVE-NN to data from the BRCA2 exon 17 context, we inferred four different types of G-P maps: additive, neighbor, pairwise, and black box. As with GB1, these G-P maps were each inferred using GE regression with a heteroscedastic skewed-t noise model. For comparison, we also inferred an additive G-P map using the epistasis package of Sailer and Harms [28].

Figure 5a compares the performance of these G-P map models on held-out test data, while Fig. 5b–d illustrate the corresponding inferred measurement processes. We observe that the additive G-P map inferred using the epistasis package [28] exhibits less predictive information (Ipre=0.180±0.011 bits) than the additive G-P map found using MAVE-NN (P=3.8×10−6, two-sided Z-test). This is likely because the epistasis package estimates the parameters of the additive G-P map prior to estimating the GE nonlinearity. We also note that, while the epistasis package provides a variety of options for modeling the GE nonlinearity, none of these options appear to work as well as our more flexible sum-of-sigmoids approach (compare Fig. 5b,c). This finding again demonstrates that the accurate inference of G-P maps requires the explicit and simultaneous modeling of experimental nonlinearities.

Fig. 5

Analysis of MPSA data from Wong et al. [11]. This dataset reports PSI values, measured in the BRCA2 exon 17 context, for nearly all 32,768 variant 5′ ss of the form NNN/GYNNNN. Data were split 60:20:20 into training, validation, and test sets. Latent phenotype models were inferred, each comprising one of four types of G-P map (additive, neighbor, pairwise, or black box), together with a GE measurement process having a heteroscedastic skewed-t noise model. The epistasis package of Sailer and Harms [28] was also used to infer an additive G-P map and GE nonlinearity. a Performance of trained models as quantified by Ivar and Ipre, both computed on test data. The lower bound on Iint was estimated from experimental replicates (see “Methods”). The p-value reflects a two-sided Z-test. Ivar was not computed for the additive (epistasis package) model because that package does not infer an explicit noise model. b–d Measurement values versus latent phenotype values, computed on test data, using the additive (epistasis package) model (b), the additive model (c), and the pairwise model (d). The corresponding GE measurement processes are also shown. e Sequence logo [40] illustrating the additive component of the pairwise G-P map. Dashed line indicates the exon/intron boundary. The G at +1 serves as a placeholder because no other bases were assayed at this position. At position +2, only U and C were assayed. f Heatmap showing the pairwise component of the pairwise G-P map. White diagonals correspond to unobserved bases. Additional file 1: Fig. S3 shows the uncertainties in the values of parameters that are illustrated in panels e and f. Error bars indicate standard errors. MPSA: massively parallel splicing assay; PSI: percent spliced in; G-P: genotype-phenotype; GE: global epistasis

We also observe that increasingly complex G-P maps exhibit increased accuracy. For example, the additive G-P map gives Ipre=0.257±0.013 bits, whereas the pairwise G-P map (Fig. 5d, f) attains Ipre=0.374±0.014 bits. Using MAVE-NN’s built-in parametric bootstrap approach for quantifying parameter uncertainty, we find that both the additive and pairwise G-P map parameters are very precisely determined (see Additional file 1: Fig. S3). The black box G-P map, which comprises 5 densely connected hidden layers of 10 nodes each, performed the best of all four G-P maps, achieving Ipre=0.458±0.015 bits. Remarkably, this predictive information reaches the lower bound on the intrinsic information estimated from replicate experiments (Iint≥0.462±0.009 bits; see “Methods”). The black box G-P map can therefore explain all of the apparent sequence-dependence in this MPSA dataset. We thus conclude that pairwise interaction models are not flexible enough to fully account for how 5′ ss sequences control splicing. More generally, these results underscore the need for software that is capable of inferring and assessing a variety of different G-P maps through a uniform interface.

Application: biophysically interpretable G-P maps

Biophysical models, unlike the phenomenological models considered thus far, have mathematical structures that reflect specific hypotheses about how sequence-dependent interactions between macromolecules mechanistically define G-P maps. Thermodynamic models, which rely on a quasi-equilibrium assumption, are the most commonly used type of biophysical model [41–43]. Previous studies have shown that precise thermodynamic models can be inferred from MAVE datasets [12, 44–46], but no software intended for use by the broader MAVE community has yet been developed for doing this. MAVE-NN meets this need by enabling the inference of custom G-P maps. We now demonstrate this biophysical modeling capability in the contexts of protein-ligand binding [using the DMS data of Olson et al. [8]; Fig. 1a(i)] and bacterial transcriptional regulation (using sort-seq MPRA data from Kinney et al. [12]; Fig. 1e). An expanded discussion of how these models are mathematically formulated and specified within MAVE-NN is provided in Additional file 1: Section S3.

Otwinowski [47] showed that a three-state thermodynamic G-P map (Fig. 6a), one that accounts for GB1 folding energy in addition to GB1-IgG binding energy [48], can explain the DMS data of Olson et al. [8] better than a simple additive G-P map does. This biophysical model subsequently received impressive confirmation in the work of Nisthal et al. [49], who measured the thermostability of 812 single-mutation GB1 variants. We tested the ability of MAVE-NN to recover the same type of thermodynamic model that Otwinowski had inferred using custom analysis scripts. MAVE-NN yielded a G-P map with significantly improved performance on the data of Olson et al. [8] (Ivar=2.303±0.013 bits, Ipre=2.357±0.007 bits, R2=0.947±0.001) relative to the additive G-P map of Fig. 3a-d. Figure 6b shows the two inferred energy matrices that respectively describe the effects of every possible single-residue mutation on the Gibbs free energies of protein folding and protein-ligand binding. The folding energy predictions of our model also correlate as well with the data of Nisthal et al. [49] (R2=0.570±0.049) as the predictions of Otwinowski’s model do (R2=0.515±0.056). This demonstrates that MAVE-NN can infer accurate and interpretable quantitative models of protein biophysics.
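To sketch what this three-state model computes, the function below evaluates the latent phenotype, i.e., the fraction of time GB1 is in the folded-bound state, from folding and binding energy matrices. This is a minimal numpy sketch under stated assumptions (energies in kcal/mol, kT of roughly 0.59 kcal/mol near room temperature); within MAVE-NN itself, such a model is defined by subclassing the custom G-P map base class.

```python
import numpy as np

def gb1_latent_phenotype(x, theta_f, theta_b, kT=0.59):
    """Three-state model: unfolded, folded-unbound, folded-bound.

    x:        (L, C) one-hot encoded protein sequence
    theta_f:  (L, C) folding energy matrix, kcal/mol (hypothetical values)
    theta_b:  (L, C) binding energy matrix, kcal/mol (hypothetical values)
    """
    dG_F = np.sum(theta_f * x)              # Gibbs free energy of folding
    dG_B = np.sum(theta_b * x)              # Gibbs free energy of binding
    w_folded = np.exp(-dG_F / kT)           # Boltzmann weights (unfolded state = 1)
    w_bound = np.exp(-(dG_F + dG_B) / kT)
    return w_bound / (1 + w_folded + w_bound)   # fraction folded-bound
```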

Fig. 6

Biophysical models inferred from DMS and MPRA data. a Thermodynamic model for IgG binding by GB1. This model assumes that GB1 can be in one of three states (unfolded, folded-unbound, and folded-bound). The Gibbs free energies of folding (ΔGF) and binding (ΔGB) are computed from sequence using additive models, which in biophysical contexts like these are called energy matrices [1, 12]. The latent phenotype is given by the fraction of time GB1 is in the folded-bound state. b The ΔΔG parameters of the energy matrices for folding and binding, inferred from the data of Olson et al. [8] by fitting this thermodynamic model using GE regression. The amino acid ordering used here matches that of Otwinowski [47]. Additional file 1: Fig. S5 plots folding energy predictions against the measurements of Nisthal et al. [49]. c A four-state thermodynamic model for transcriptional activation at the E. coli lac promoter. The Gibbs free energies of RNAP-DNA binding (ΔGR) and CRP-DNA binding (ΔGC) are computed using energy matrices, whereas the CRP-RNAP interaction energy ΔGI is a scalar. The latent phenotype is the fraction of time a promoter is bound by RNAP. d, e The latent phenotype model inferred from the sort-seq MPRA data of Kinney et al. [12]. This model includes both the MPA measurement process (d) and the parameters of the thermodynamic G-P map (e). Additional file 1: Fig. S4 provides detailed definitions of the thermodynamic models in panels a and c. Sequence logos in panel e were generated using Logomaker [40]. Standard errors on ΔGI were determined using the parametric bootstrap approach described in “Methods”. DMS: deep mutational scanning; MPRA: massively parallel reporter assay; IgG: immunoglobulin G; GB1: protein G domain B1; GE: global epistasis; RNAP: σ70 RNA polymerase; CRP: cAMP receptor protein; MPA: measurement process agnostic; G-P: genotype-phenotype

To test MAVE-NN’s ability to infer thermodynamic models of transcriptional regulation, we re-analyzed the MPRA data of Kinney et al. [12], in which random mutations to a 75-bp region of the Escherichia coli lac promoter were assayed (Fig. 1e). This promoter region binds two regulatory proteins, σ70 RNA polymerase (RNAP) and the transcription factor CRP. As in Kinney et al. [12], we proposed a four-state thermodynamic model that quantitatively explains how promoter sequences control transcription rate (Fig. 6c). The parameters of this G-P map include the Gibbs free energy of interaction between CRP and RNAP (ΔGI), as well as energy matrices that describe the CRP-DNA and RNAP-DNA interaction energies (ΔGC and ΔGR, respectively). Because the sort-seq MPRA of Kinney et al. [12] yielded discrete measurements (Fig. 1f), we used an MPA measurement process in our latent phenotype model (Fig. 6d). The biophysical parameters we thus inferred (Fig. 6e), which include a CRP-RNAP interaction energy of ΔGI=−2.598±0.018 kcal/mol, largely match those of Kinney et al., but were obtained far more rapidly (in 10 min versus multiple days) thanks to the use of stochastic gradient descent rather than Metropolis Monte Carlo.
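For completeness, the four-state latent phenotype described in Fig. 6c can be written analogously. In the sketch below, the ΔG arguments would come from the CRP and RNAP energy matrices plus the scalar interaction energy; the state enumeration follows the figure caption rather than MAVE-NN's internal code.

```python
import numpy as np

def lac_latent_phenotype(dG_C, dG_R, dG_I, kT=0.59):
    """Fraction of time the promoter is bound by RNAP, given four states:
    empty, CRP-bound, RNAP-bound, and CRP+RNAP-bound (with interaction)."""
    w_C = np.exp(-dG_C / kT)                     # CRP bound alone
    w_R = np.exp(-dG_R / kT)                     # RNAP bound alone
    w_CR = np.exp(-(dG_C + dG_R + dG_I) / kT)    # both bound, cooperatively
    return (w_R + w_CR) / (1 + w_C + w_R + w_CR)
```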

Constraints on datasets and models

As mentioned above, MAVE-NN places certain limitations on both input datasets and latent phenotype models. Some of these constraints have been adopted to simplify the initial release of MAVE-NN and can be relaxed in future updates. Others reflect fundamental mathematical properties of latent phenotype models. Here we summarize the primary constraints users should be aware of.

MAVE-NN currently requires that all input sequences be the same length. This constraint has been adopted because a large fraction of MAVE datasets have this form, and all of the built-in G-P maps operate only on fixed-length sequences. Users who wish to analyze variable-length sequences can still do so by padding the ends of sequences with dummy characters. Alternatively, users can provide a multiple-sequence alignment as input and include the gap character as one of the characters to consider when training models.

As previously stated, MAVE-NN can analyze MAVE datasets that have either continuous or discrete measurements. At present, both types of measurements must be one-dimensional, i.e., users cannot fit a single model to vectors of multiple measurements (e.g., joint measurements of protein binding affinity and protein stability, as in Faure et al. [31]). This constraint has been adopted only to simplify the user interface of the initial release. It is not a fundamental limitation of latent phenotype models and is scheduled to be relaxed in upcoming versions of MAVE-NN.

The current implementation of MAVE-NN also supports only one-dimensional latent phenotypes (though the latent phenotype of custom G-P maps can depend on multiple precursor phenotypes, e.g., binding energy or folding energy). This restriction was made because accurately interpreting multi-dimensional latent phenotypes is substantially more fraught than interpreting one-dimensional latent phenotypes, and we believe that additional computational tools need to be developed to facilitate such interpretation. That being said, the mathematical form of latent phenotype models is fully compatible with multi-dimensional latent phenotypes. Indeed, this modeling strategy has been used in other work [24, 3133], and we plan to enable this functionality in future updates to MAVE-NN.

More fundamental constraints come into play when analyzing MAVE data that contains only single-mutation variants. In such experiments, the underlying effects of individual mutations are hopelessly confounded by the biophysical, physiological, and experimental nonlinearities that may be present. By contrast, when the same mutation is observed in multiple genetic backgrounds, MAVE-NN can use systematic differences in the mutational effects observed between stronger and weaker backgrounds to remove these confounding influences. Thus, for datasets that probe only single-mutant effects, we limit MAVE-NN to inferring only additive G-P maps using GE regression, and while the noise model in the GE measurement process is allowed to be heteroscedastic, the nonlinearity is constrained to be linear.

We emphasize that, in practice, only a modest number of multiple-mutation variants are required for MAVE-NN to learn the form of a nonlinear measurement process (see Fig. 3e–g). In this way, including a small fraction of the possible double-mutation variants in MAVE libraries can be beneficial even just for determining the effects of single mutations. Adding such non-comprehensive sets of double mutants to MAVE libraries is experimentally straightforward, and our numerical experiments suggest that assaying roughly the same number of double-mutation variants as single-mutation variants should often suffice. We therefore recommend that experimentalists—even those primarily interested in the effects of single mutations—consider augmenting their MAVE libraries with a small subset of double-mutation variants.

Discussion

We have presented a unified strategy for inferring quantitative models of G-P maps from diverse MAVE datasets. At the core of our approach is the conceptualization of G-P maps as a form of information compression, i.e., the G-P map first compresses an input sequence into a latent phenotype value, which is then read out indirectly via a noisy and nonlinear measurement process. By explicitly modeling this measurement process, one can remove potentially confounding effects from the G-P map, as well as accommodate diverse experimental designs. We have also introduced three information-theoretic metrics for assessing the performance of the resulting models. These capabilities have been implemented within an easy-to-use Python package called MAVE-NN.

We have demonstrated the capabilities of MAVE-NN in diverse biological contexts, including in the analysis of both DMS and MPRA data. We have also demonstrated the superior performance of MAVE-NN relative to the epistasis package of Sailer and Harms [28]. Along the way, we observed that MAVE-NN can deconvolve experimental nonlinearities from additive G-P maps when a relatively small number of sequences containing multiple mutations are assayed. This capability provides a compelling reason for experimentalists to include such sequences in their MAVE libraries, even if they are primarily interested in the effects of single mutations. Finally, we showed how MAVE-NN can learn biophysically interpretable G-P maps from both DMS and MPRA data.

MAVE-NN thus fills a critical need in the MAVE community, providing user-friendly software capable of learning quantitative models of G-P maps from diverse MAVE datasets. MAVE-NN has a streamlined user interface and is readily installed from PyPI by executing “pip install mavenn” at the command line. Comprehensive documentation and step-by-step tutorials are available at http://mavenn.readthedocs.io[50].

Methods

Notation

We represent each MAVE dataset as a set of N observations, \(\left \{ \left (x_{n},y_{n}\right) \right \}_{n = 0}^{N - 1}\), where each observation consists of a sequence xn and a measurement yn. Here, yn can be either a continuous real-valued number or a nonnegative integer representing a “bin” in which the nth sequence was found. Note that, in this representation, the same sequence x can be observed multiple times, potentially with different values for y due to experimental noise.

G-P maps

We assume that all sequences have the same length L, and that at each of the L positions in each sequence there is one of C possible characters. MAVE-NN represents sequences using a vector of one-hot encoded features of the form

$$ \begin{aligned} x_{l:c} & = \left\{ \begin{array}{ll} \ 1 & \text{if character }\ c\ \text{ occurs at position }l \\ \ 0 & \text{otherwise} \\ \end{array} \right., \\ \end{aligned} $$
(1)

where l=0,1,…,L−1 indexes positions within the sequence, and c indexes the C distinct characters in the alphabet. MAVE-NN supplies built-in alphabets for DNA, RNA, and protein (with or without stop codons), and supports custom alphabets as well.
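A minimal sketch of this encoding (the helper name is ours, not part of MAVE-NN):

```python
import numpy as np

def one_hot_encode(seq, alphabet='ACGT'):
    """Return the flattened one-hot feature vector x_{l:c} of Eq. (1)."""
    index = {c: i for i, c in enumerate(alphabet)}
    x = np.zeros((len(seq), len(alphabet)))
    for l, ch in enumerate(seq):
        x[l, index[ch]] = 1.0
    return x.ravel()   # length L*C

x = one_hot_encode('GATTACA')   # shape (7 * 4,) = (28,)
```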

We assume that the latent phenotype is given by a function ϕ(x;θ) that depends on a set of G-P map parameters θ. As mentioned in the main text, MAVE-NN supports four types of G-P map models, all of which can be inferred using either GE regression or MPA regression. The additive model is given by

$$ \phi_{\text{additive}}\left(x;\theta \right) = \theta_{0} + \sum\limits_{l = 0}^{L - 1}{\sum\limits_{c}\theta_{l:c}}x_{l:c}. $$
(2)

Here, each position in x contributes independently to the latent phenotype. The neighbor model is given by

$$ {\phi_{\text{neighbor}}\left(x;\, \theta \right) = \theta}_{0} + \sum\limits_{l = 0}^{L - 1}{\sum\limits_{c}\theta_{l:c}}x_{l:c} + \sum\limits_{l = 0}^{L - 2}{\sum\limits_{c,c^{\prime}}\theta_{l:c,(l + 1):c^{\prime}}} x_{l:c} x_{(l + 1):c^{\prime}}, $$
(3)

and further accounts for potential epistatic interactions between neighboring positions. The pairwise model is given by

$$ {\phi_{\text{pairwise}}\left(x;\theta \right) = \theta}_{0} + \sum\limits_{l = 0}^{L - 1}{\sum\limits_{c}\theta_{l:c}}x_{l:c} + \sum\limits_{l = 0}^{L - 2}{\sum\limits_{l^{\prime} = l + 1}^{L - 1}{\sum\limits_{c,c^{\prime}}\theta_{l:c,l^{\prime}:c^{\prime}}}}x_{l:c}x_{l^{\prime}:c^{\prime}}, $$
(4)

and includes interactions between all pairs of positions. Note our convention of requiring l′>l in the pairwise parameters \(\theta _{l:c,l^{\prime }:c^{\prime }}\).
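As a sketch, Eqs. (2) and (4) can be evaluated directly from the one-hot matrix; the loops below favor clarity over the vectorized computation MAVE-NN actually performs.

```python
import numpy as np

def phi_additive(x, theta_0, theta_lc):
    """Eq. (2): x is the (L, C) one-hot matrix, theta_lc an (L, C) array."""
    return theta_0 + np.sum(theta_lc * x)

def phi_pairwise(x, theta_0, theta_lc, theta_pair):
    """Eq. (4): theta_pair[l, c, lp, cp] is nonzero only for lp > l."""
    phi = phi_additive(x, theta_0, theta_lc)
    L, C = x.shape
    for l in range(L - 1):
        for lp in range(l + 1, L):
            phi += x[l] @ theta_pair[l, :, lp, :] @ x[lp]
    return phi
```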

Unlike these three parametric models, the black box G-P map does not have a fixed functional form. Rather, it is given by a multilayer perceptron that takes a vector of sequence features (additive, neighbor, or pairwise) as input, contains multiple fully connected hidden layers with nonlinear activations, and has a single node output with a linear activation. Users are able to specify the number of hidden layers, the number of nodes in each hidden layer, and the activation function used by these nodes.

MAVE-NN further supports custom G-P maps, which users can define by subclassing the G-P map base class. These G-P maps can have arbitrary functional forms, e.g., representing specific biophysical hypotheses. This feature of MAVE-NN is showcased in Fig. 6.

Gauge modes and diffeomorphic modes

G-P maps typically have non-identifiable degrees of freedom that must be fixed, i.e., pinned down, before the values of individual parameters can be meaningfully interpreted or compared between models. These degrees of freedom come in two flavors: gauge modes and diffeomorphic modes. Gauge modes are changes to θ that do not alter the values of the latent phenotype ϕ. Diffeomorphic modes [20, 24] are changes to θ that do alter ϕ, but do so in ways that can be undone by transformations of the measurement process p(y|ϕ). As shown by Kinney and Atwal [20, 24], the diffeomorphic modes of linear G-P maps (such as the additive and pairwise G-P maps featured in Figs. 3, 4, and 5) will typically correspond to affine transformations of ϕ, although additional unconstrained modes can occur in special situations.

MAVE-NN automatically fixes the gauge modes and diffeomorphic modes of inferred models (except when using custom G-P maps). The diffeomorphic modes of G-P maps are fixed by transforming θ via

$$ \theta_{0} \rightarrow \theta_{0} - a, $$
(5)

and then

$$ \theta \rightarrow \frac{\theta}{b}, $$
(6)

where a=mean({ϕn}) and b=std({ϕn}) are the mean and standard deviation of ϕ values computed on the training data. This produces a corresponding change in latent phenotype values ϕ→(ϕa)/b. To avoid altering model likelihood, MAVE-NN makes a corresponding transformation to the measurement process p(y|ϕ). In GE regression this is done by adjusting the GE nonlinearity via

$$ g\left(\phi \right) \rightarrow g\big(a + b\phi \big), \\ $$
(7)

while keeping the noise model \(p\left (y|\hat {y}\right)\) fixed. In MPA regression, MAVE-NN transforms the full measurement process via

$$ p\big(y | \phi \big) \rightarrow p\big(y | a + b\phi \big). \\ $$
(8)

For the three parametric G-P maps, gauge modes are fixed using what we call the “hierarchical gauge”. Here, the parameters θ are adjusted so that the lower-order terms in ϕ(x;θ) account for the highest possible fraction of variance in ϕ. This procedure requires a probability distribution on sequence space with respect to which these variances are computed. MAVE-NN assumes that such distributions factorize by position and can thus be represented by a probability matrix with elements pl:c, denoting the probability of character c at position l. MAVE-NN provides three built-in choices for this distribution: uniform, empirical, or wildtype. The corresponding values of pl:c are given by

$$ \begin{aligned} p_{l:c} & = & \left\{ \begin{array}{cc} 1/C & \text{for uniform} \\ n_{l:c}/N & \text{for empirical} \\ x_{l:c}^{\text{WT}} & \text{for wildtype} \\ \end{array} \right., \end{aligned} $$
(9)

where nl:c denotes the number of observations in the dataset (out of N total) for which the sequence has character c at position l, and \(x_{l:c}^{\text {WT}}\) is the one-hot encoding of a user-specified wildtype sequence. In particular, the wildtype gauge is used for illustrating the additive G-P maps in Figs. 3 and 4, while the uniform gauge is used for illustrating the pairwise G-P map in Fig. 5 and the energy matrices in Fig. 6. After a sequence distribution is chosen, MAVE-NN fixes the gauge of the pairwise G-P map by transforming

$$ \begin{aligned} \theta_{0} & \rightarrow \theta_{0} \\ & \quad\ \ \ + \sum\limits_{l}{\sum\limits_{c^{\prime}}\theta_{l:c^{\prime}}}p_{l:c^{\prime}} \\ & \quad\ \ \ + \sum\limits_{l}{\sum\limits_{l^{\prime} {>} l}{\sum\limits_{c,c^{\prime}}\theta_{l:c,l^{\prime}:c^{\prime}}}}p_{l:c}\, p_{l^{\prime}:c^{\prime}\ }, \end{aligned} $$
(10)
$$ \begin{aligned} \theta_{l:c} & \rightarrow \theta_{l:c} \\ & \quad\ \ \ - \sum\limits_{c^{\prime}}\theta_{l:c^{\prime}}\, p_{l:c^{\prime}} \\ & \quad\ \ \ + \sum\limits_{l^{\prime} {>} l}{\sum\limits_{c^{\prime}}\theta_{{l:c,l}^{\prime}:c^{\prime}}}p_{l^{\prime}:c^{\prime}} \\ & \quad\ \ \ + \sum\limits_{l^{\prime} {<} l}{\sum\limits_{c^{\prime}}\theta_{l^{\prime}:c^{\prime},l:c}}p_{l^{\prime}:c^{\prime}} \\ & \quad\ \ \ - \sum\limits_{l^{\prime} {>} l}{\sum\limits_{c^{\prime},c^{\prime\prime}}\theta_{l:c^{\prime},l^{\prime}:c^{\prime\prime}}}p_{l:c^{\prime}}\, p_{l^{\prime}:c^{\prime\prime}} \\ & \quad\ \ \ - \sum\limits_{l^{\prime} {<} l}{\sum\limits_{c^{\prime},c^{\prime\prime}}\theta_{l:c^{\prime\prime},l^{\prime}:c^{\prime}}}p_{l:c^{\prime}}\, p_{l^{\prime}:c^{\prime\prime}}, \\ \end{aligned} $$
(11)

and

$$ \begin{aligned} \theta_{l:c,l^{\prime}:c^{\prime}} & \rightarrow \theta_{l:c,l^{\prime}:c^{\prime}} \\ & \quad\ \ \ - \sum\limits_{c^{\prime\prime}}\theta_{l:c^{\prime\prime},l^{\prime}:c^{\prime}}p_{l:c^{\prime\prime}} \\ & \quad\ \ \ - \sum\limits_{c^{\prime\prime}}\theta_{l:c,l^{\prime}:c^{\prime\prime}}p_{l^{\prime}:c^{\prime\prime}} \\ & \quad\ \ \ + \sum\limits_{c^{\prime\prime},c^{\prime\prime\prime}}\theta_{l:c^{\prime\prime},l^{\prime}:c^{\prime\prime\prime}}\, p_{l:c^{\prime\prime}}\, p_{l^{\prime}:c^{\prime\prime\prime}}. \end{aligned} $$
(12)

This transformation is also used to gauge-fix the additive and neighbor G-P maps, but with \(\theta _{l:c,l^{\prime }:c^{\prime }} = 0\) for all l, l′ (additive model), or for all l′≠l+1 (neighbor model).
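For the additive case (all pairwise terms zero), Eqs. (10) and (11) reduce to mean-centering each position's parameters under the chosen sequence distribution. A minimal sketch:

```python
import numpy as np

def hierarchical_gauge_additive(theta_0, theta_lc, p_lc):
    """Eqs. (10)-(11) with pairwise terms set to zero.

    theta_lc, p_lc: (L, C) arrays, with each row of p_lc summing to 1
    (p_lc = 1/C gives the uniform gauge). After fixing, the p-weighted
    mean of theta_lc is zero at every position, and phi is unchanged
    for every sequence.
    """
    means = np.sum(theta_lc * p_lc, axis=1, keepdims=True)   # per-position mean effect
    return theta_0 + means.sum(), theta_lc - means
```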

GE nonlinearities

GE models assume that each measurement y is a nonlinear function g(·) of the latent phenotype ϕ, plus some noise. In MAVE-NN, this nonlinearity is represented as a sum of hyperbolic tangent sigmoids:

$$ g\left(\phi;\alpha \right) = a + \sum\limits_{k = 0}^{K - 1}b_{k}\text{tanh}\left(c_{k}\phi + d_{k} \right). $$
(13)

Here, K specifies the number of hidden nodes contributing to the sum, and α = {a,bk,ck,dk} are trainable parameters. We note that this mathematical form is an example of the bottleneck architecture previously used by others [27, 33] for modeling GE nonlinearities. By default, MAVE-NN constrains g(ϕ;α) to be monotonic in ϕ by requiring all bk≥0 and ck≥0, but this constraint can be relaxed.
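A direct numpy transcription of Eq. (13), with the trainable parameters passed as arrays:

```python
import numpy as np

def ge_nonlinearity(phi, a, b, c, d):
    """Eq. (13): g(phi) as a sum of K tanh sigmoids.

    b, c, d: length-K arrays; requiring b >= 0 and c >= 0 (MAVE-NN's
    default constraint) makes g monotonic in phi.
    """
    phi = np.asarray(phi, dtype=float)[..., None]   # broadcast over K nodes
    return a + np.sum(b * np.tanh(c * phi + d), axis=-1)
```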

GE noise models

MAVE-NN supports three types of GE noise model: Gaussian, Cauchy, and skewed-t. All of these noise models support the analytic computation of quantiles and prediction intervals, as well as the rapid sampling of simulated measurement values. The Gaussian noise model is given by

$$ p_{\text{gauss}}\left(y | \hat{y};\, s \right) = \frac{1}{\sqrt{2\pi s^{2}}}\text{exp}\ \left[- \frac{\left(y - \hat{y}\right)^{2}}{2s^{2}} \right], \\ $$
(14)

where s denotes the standard deviation. Importantly, MAVE-NN allows this noise model to be heteroscedastic by representing s as an exponentiated polynomial in \(\hat {y}\), i.e.,

$$ s\left(\hat{y} \right) = \text{exp}\ \left[\sum\limits_{k = 0}^{K}a_{k}{\hat{y}}^{k} \right], $$
(15)

where K is the order of the polynomial and {ak} are trainable parameters. The user has the option to set K, and setting K=0 renders this noise model homoscedastic. Quantiles are computed using \(y_{q} = \hat {y} + s\,\sqrt {2}\,\text {erf}^{- 1}(2q - 1)\) for user-specified values of q ∈ [0,1]. Similarly, the Cauchy noise model is given by

$$ p_{\text{cauchy}}\left(y | \hat{y};s \right) = \left[\pi{s}\left(1 + \frac{\left(y - \hat{y}\right)^{2}}{s^{2}} \right) \right]^{- 1}, $$
(16)

where the scale parameter s is an exponentiated K-order polynomial in \(\hat {y}\), and quantiles are computed using \(y_{q} = \hat {y} + s\,\tan \left [ \pi \left (q - \frac {1}{2} \right) \right ]\).
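The quantile formulas for both noise models follow directly from Eq. (15) and the expressions above; a sketch, with the polynomial coefficients {ak} passed explicitly:

```python
import numpy as np
from scipy.special import erfinv

def noise_scale(y_hat, a_coeffs):
    """Eq. (15): s(y_hat) = exp(sum_k a_k * y_hat**k), with K = len(a_coeffs) - 1."""
    powers = np.vander(np.atleast_1d(y_hat), N=len(a_coeffs), increasing=True)
    return np.exp(powers @ np.asarray(a_coeffs))

def gaussian_quantile(y_hat, a_coeffs, q):
    return y_hat + noise_scale(y_hat, a_coeffs) * np.sqrt(2) * erfinv(2 * q - 1)

def cauchy_quantile(y_hat, a_coeffs, q):
    return y_hat + noise_scale(y_hat, a_coeffs) * np.tan(np.pi * (q - 0.5))
```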

The skewed-t noise model is of the form described by Jones and Faddy [34] and is given by

$$ p_{\text{skewt}}\left(y | \hat{y};s,a,b \right) = s^{- 1}f\left(t;a,b \right), $$
(17)

where

$$ t = t^{*} + \frac{y - \hat{y}}{s},\ \ {t^{*}} = \frac{\left(a - b \right)\sqrt{a + b}}{\sqrt{2a + 1}\sqrt{2b + 1}}, $$
(18)

and

$$ f\left(t;a,b \right) = \frac{2^{1 - a - b}}{\sqrt{a + b}}\frac{\Gamma\left(a + b \right)}{\Gamma\left(a \right)\Gamma\left(b \right)}{\ \left[1 + \frac{t}{\sqrt{a + b + t^{2}}} \right]}^{a + \frac{1}{2}} \times {\ \left[ 1 - \frac{t}{\sqrt{a + b + t^{2}}} \right]}^{b + \frac{1}{2}}. $$
(19)

Note that the t statistic here is an affine function of y chosen so that the distribution’s mode (corresponding to t = t∗) is positioned at \(\hat {y}\). The three parameters of this noise model, {s,a,b}, are each represented using K-order exponentiated polynomials with trainable coefficients. Quantiles are computed using

$$ y_{q} = \hat{y} + \left(t_{q} - t^{*} \right)s, $$
(20)

where

$$ t_{q} = \frac{\left(2x_{q} - 1 \right)\sqrt{a + b}}{\sqrt{1 - \left(2x_{q} - 1\right)^{2}}},\ \ x_{q} = I_{q}^{- 1}\left(a,b \right), $$
(21)

and \(I_{q}^{- 1}\left (a,b \right)\) denotes the inverse of the regularized incomplete Beta function Ix(a,b).
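The skewed-t density and quantiles can be sketched as follows; the density is evaluated in log space for numerical stability, and scipy's betaincinv supplies the inverse regularized incomplete beta function.

```python
import numpy as np
from scipy.special import gammaln, betaincinv

def skewt_t_star(a, b):
    """t* of Eq. (18): the distribution's mode in t coordinates."""
    return (a - b) * np.sqrt(a + b) / (np.sqrt(2 * a + 1) * np.sqrt(2 * b + 1))

def skewt_log_density(y, y_hat, s, a, b):
    """Eqs. (17)-(19), computed in log space."""
    t = skewt_t_star(a, b) + (y - y_hat) / s
    u = t / np.sqrt(a + b + t**2)                 # u lies in (-1, 1)
    log_f = ((1 - a - b) * np.log(2) - 0.5 * np.log(a + b)
             + gammaln(a + b) - gammaln(a) - gammaln(b)
             + (a + 0.5) * np.log1p(u) + (b + 0.5) * np.log1p(-u))
    return log_f - np.log(s)

def skewt_quantile(y_hat, s, a, b, q):
    """Eqs. (20)-(21)."""
    x_q = betaincinv(a, b, q)
    t_q = (2 * x_q - 1) * np.sqrt(a + b) / np.sqrt(1 - (2 * x_q - 1)**2)
    return y_hat + (t_q - skewt_t_star(a, b)) * s
```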

Empirical noise models

MAVE-NN further supports the inference of GE regression models that account for user-specified measurement noise. In such cases, the user provides a set of measurement-specific standard errors, \({\{s_{n}\}}_{n=0}^{N-1}\), along with the corresponding observations. These uncertainties can, for example, be estimated by using a software package like Enrich2 [16] or DiMSum [19]. MAVE-NN then trains the parameters of latent phenotype models by assuming a Gaussian noise model of the form

$$ p_{\text{empirical}}\left(y_{n} | {\hat{y}}_{n},s_{n} \right) = \frac{1}{\sqrt{2\pi s_{n}^{2}}}\ \text{exp}\left[ - \frac{\left(y_{n} - {\hat{y}}_{n}\right)^{2}}{2s_{n}^{2}} \right], \\ $$
(22)

where \({\hat {y}}_{n} = g\left (f \left (x_{n};\theta \right);\alpha \right)\) is the expected measurement for sequence xn, θ denotes G-P map parameters, and α denotes the parameters of the GE nonlinearity. This noise model thus has the advantage of having no free parameters, but it may be problematically misspecified if the true error distribution is heavy-tailed or skewed.

MPA measurement process

In MPA regression, MAVE-NN directly models the measurement process p(y|ϕ). At present, MAVE-NN only supports MPA regression for discrete values of y, which must be indexed using nonnegative integers. MAVE-NN supports two alternative forms of input for MPA regression. One is a set of sequence-measurement pairs, \(\left \{ \left (x_{n},y_{n}\right) \right \}_{n = 0}^{N - 1}\), where N is the total number of reads, {xn} is a set of (typically non-unique) sequences, each yn ∈ {0,1,…,Y−1} is a bin number, and Y is the total number of bins. The other is a set of sequence-count-vector pairs, \(\left \{ (x_{m},c_{m}) \right \}_{m = 0}^{M - 1}\), where M is the total number of unique sequences and cm=(cm0,cm1,…,cm(Y−1)) is a vector whose component cmy lists the number of times the sequence xm was observed in bin y. MPA measurement processes are represented as a multilayer perceptron with one hidden layer (having tanh activations) and a softmax output layer. Specifically,

$$ p\big(y | \phi \big) = \frac{w_{y}\left(\phi \right)}{\sum_{y^{\prime}}^{}w_{y^{\prime}}\left(\phi \right)}, $$
(23)

where

$$ w_{y}\big(\phi \big) = \text{exp}\ \left[ a_{y} + \sum\limits_{k = 0}^{K - 1}b_{yk}\, \text{tanh}\big(c_{yk}\phi + d_{yk} \big) \right], $$
(24)

and K is the number of hidden nodes per value of y. The trainable parameters of this measurement process are η={ay,byk,cyk,dyk}.
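A sketch of Eqs. (23) and (24), evaluating p(y|ϕ) for all Y bins at once:

```python
import numpy as np

def mpa_measurement_process(phi, a, b, c, d):
    """Eqs. (23)-(24): p(y | phi) for Y bins and K tanh nodes per bin.

    a: (Y,) array; b, c, d: (Y, K) arrays of trainable parameters.
    Returns probabilities over bins, normalized via a stabilized softmax.
    """
    phi = np.asarray(phi, dtype=float)[..., None, None]      # (..., 1, 1)
    log_w = a + np.sum(b * np.tanh(c * phi + d), axis=-1)    # (..., Y)
    w = np.exp(log_w - log_w.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)
```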

Loss function

Let θ denote the G-P map parameters, and η denote the parameters of the measurement process. MAVE-NN optimizes these parameters using stochastic gradient descent on a loss function given by

$$ \mathcal{L} = \mathcal{L}_{\text{like}} + \mathcal{L}_{\text{reg}}\ , $$
(25)

where \(\mathcal {L}_{\text {like}}\) is the negative log likelihood of the model, given by

$$ \mathcal{L}_{\text{like}}\left[\theta,\eta \right] = - \sum\limits_{n = 0}^{N - 1}\text{log}\, \left[p\big(y_{n} | \phi_{n};\eta \big) \right], $$
(26)

where ϕn=ϕ(xn;θ), and \(\mathcal {L}_{\text {reg}}\) provides for the regularization of both θ and η.

In the context of GE regression, we can write η = (α,β) where α represents the parameters of the GE nonlinearity g(ϕ;α) and β denotes the parameters of the noise model \(p\big (y|\hat {y};\beta \big)\). The likelihood contribution from each observation n then becomes \(p\big (y_{n}|\phi _{n};\eta \big) = p\big (y_{n}|{\hat {y}}_{n};\beta \big)\) where \({\hat {y}}_{n} = g(\phi _{n};\alpha)\). In the context of MPA regression with a dataset of the form \(\left \{ (x_{m},c_{m}) \right \}_{m = 0}^{M - 1}\), the loss function can be written as

$$ \mathcal{L}_{\text{like}}\left[ \theta,\eta \right] = - \sum\limits_{m = 0}^{M - 1}\sum\limits_{y = 0}^{Y - 1}c_{my}\,\text{log}\left[ p\big(y | \phi_{m};\eta \big) \right], $$
(27)

where ϕm=ϕ(xm;θ). For the regularization term, MAVE-NN uses an L2 penalty of the form

$$ \mathcal{L}_{\text{reg}}\left[\theta,\eta \right] = \lambda_{\theta}\left\| \theta \right\|^{2} + \ \lambda_{\eta}\left\| \eta \right\|^{2}\ , $$
(28)

where the user-adjustable parameters \(\lambda_{\theta}\) (default value \(10^{-3}\)) and \(\lambda_{\eta}\) (default value \(10^{-1}\)) respectively control the strength of regularization for θ and η.
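The following sketch (ours, assuming flattened parameter vectors) combines the MPA likelihood of Eq. 27 with the L2 penalty of Eq. 28 into the full loss of Eq. 25; `mpa_measurement_process` refers to the illustrative function sketched above:

```python
import numpy as np

def mpa_nll(counts, p_y_given_phi):
    """Eq. 27: negative log likelihood for MPA regression.
    counts: (M, Y) matrix of c_my values; p_y_given_phi: (M, Y) matrix
    of p(y | phi_m; eta), e.g., from mpa_measurement_process."""
    return -np.sum(counts * np.log(p_y_given_phi))

def total_loss(nll, theta, eta, lambda_theta=1e-3, lambda_eta=1e-1):
    """Eqs. 25 and 28: likelihood loss plus L2 penalties on the
    (flattened) G-P map and measurement-process parameter vectors."""
    return nll + lambda_theta * np.sum(theta**2) + lambda_eta * np.sum(eta**2)
```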

Predictive information

In what follows, we use pmodel(y|ϕ) to denote a measurement process inferred by MAVE-NN, whereas ptrue(y|ϕ) denotes the empirical conditional distribution of y and ϕ values that would be observed in the limit of infinite test data. Predictive information is defined by Ipre=I[y;ϕ], where I[·;·] represents mutual information computed on data not used for training (i.e., a held-out test set or data from a different experiment). Ipre provides a measure of how strongly a G-P map predicts experimental measurements. Importantly, this quantity does not depend on the corresponding measurement process pmodel(y|ϕ). To estimate Ipre, we use k-nearest neighbor (kNN) estimators of entropy and mutual information adapted from the NPEET Python package [51]. Here, the user has the option of adjusting k, which controls a variance/bias tradeoff. When y is discrete (MPA regression), Ipre is computed using the classic kNN entropy estimator [52, 53] via the decomposition \(I[y;\phi] = H[\phi] - \sum_{y}p(y)H_{y}[\phi]\), where Hy[ϕ] denotes the entropy of ptrue(ϕ|y). When y is continuous (GE regression), I[y;ϕ] is estimated using the kNN-based Kraskov-Stögbauer-Grassberger (KSG) algorithm [53]. This approach optionally supports the local nonuniformity correction of Gao et al. [54], which is important when y and ϕ exhibit strong dependencies, but which also requires substantially more time to compute.
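MAVE-NN delegates these computations to NPEET [51]. Purely for illustration, a minimal self-contained version of the KSG estimator (algorithm 1 of [53], without the local nonuniformity correction) can be written as follows; it assumes continuous, duplicate-free samples (in practice a tiny jitter is often added):

```python
import numpy as np
from scipy.special import digamma
from sklearn.neighbors import NearestNeighbors

def ksg_mutual_information(x, y, k=3):
    """KSG estimate of I[x; y] in bits from paired 1D samples,
    e.g., measurements y and latent phenotypes phi on held-out data."""
    x = np.asarray(x, float).reshape(-1, 1)
    y = np.asarray(y, float).reshape(-1, 1)
    n = len(x)
    xy = np.hstack([x, y])

    # Distance to the k-th nearest neighbor in the joint space (max-norm);
    # index 0 is the point itself at distance zero.
    nn = NearestNeighbors(metric="chebyshev").fit(xy)
    eps = nn.kneighbors(xy, n_neighbors=k + 1)[0][:, -1]

    # Count marginal neighbors strictly within eps of each point (minus self)
    nx = np.array([np.sum(np.abs(x - x[i]) < eps[i]) - 1 for i in range(n)])
    ny = np.array([np.sum(np.abs(y - y[i]) < eps[i]) - 1 for i in range(n)])

    mi_nats = digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
    return max(mi_nats, 0.0) / np.log(2)  # clip tiny negative estimates
```

Larger k lowers the variance of the estimate at the cost of increased bias, which is the tradeoff the user-adjustable k exposes.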

Variational information

We define variational information as an affine transformation of \(\mathcal {L}_{\text {like}}\),

$$ I_{\text{var}} = H[y] - \frac{\text{log}_{2}\left(e \right)}{N}\mathcal{L}_{\text{like}}. $$
(29)

Here, H[y] is the entropy of the data \(\{y_{n}\}\), which is estimated using the kNN estimator from the NPEET package [51]. Noting that this quantity can also be written as \(I_{\text{var}} = H[y] - \text{mean}(\{Q_{n}\})\), where \(Q_{n} = -\log_{2} p(y_{n}|\phi_{n})\), we estimate the associated uncertainty (denoted by δ) using

$$ \delta I_{\text{var}}\left[ y;\phi \right] = \sqrt{\delta H[ y]^{2} + \frac{\text{var}\left(\left\{ Q_{n} \right\} \right)}{N}}. $$
(30)

The inference strategy used by MAVE-NN is based on the fact that Ivar provides a tight variational lower bound on Ipre. Indeed, in the large data limit,

$$ I_{\text{pre}} = I_{\text{var}} + D_{\text{KL}}\left(p_{\text{true}}\ \| \ {p}_{\text{model}}\right), $$
(31)

where DKL(·)≥0 is a Kullback-Leibler divergence that quantifies the accuracy of the inferred measurement process. From Eq. 31 one can see that, with appropriate caveats, maximizing Ivar (or equivalently, minimizing \(\mathcal {L}_{\text {like}}\)) will also maximize Ipre [24]. But unlike Ipre, Ivar is readily compatible with backpropagation and stochastic gradient descent. See Additional file 1 for a derivation of Eq. 31 and a discussion of relevant prior work [20, 24, 3739]. We note that Sharpee et al. [55] cleverly showed that Ipre can, in fact, be optimized using stochastic gradient descent. Computing gradients of Ipre, however, requires a time-consuming density estimation step, whereas optimizing Ivar can be done using standard per-datum backpropagation.
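As a sketch of how Eqs. 29 and 30 combine in practice, assuming per-datum log2 probabilities are available from a trained model:

```python
import numpy as np

def variational_information(H_y, dH_y, log2_p):
    """Compute I_var and its uncertainty (Eqs. 29-30).

    H_y: kNN entropy estimate of {y_n} in bits; dH_y: its uncertainty;
    log2_p: array of log2 p(y_n | phi_n) values over the dataset.
    """
    Q = -np.asarray(log2_p)        # Q_n = -log2 p(y_n | phi_n)
    N = len(Q)
    I_var = H_y - Q.mean()         # equals H[y] - (log2 e / N) * L_like
    dI_var = np.sqrt(dH_y**2 + Q.var() / N)
    return I_var, dI_var
```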

Intrinsic information

Intrinsic information, Iint=I[x;y], is the mutual information between the sequences x and measurements y in a dataset. This quantity is somewhat tricky to estimate due to the high-dimensional nature of sequence space. Here we used three different methods to obtain the upper and lower bounds on Iint shown in Figs. 3d and 5c. To compute the upper bound on Iint for the GB1 data of Olson et al. [8] (in Fig. 3d), we used the fact that

$$ I\big[x;y \big] = H\big[y\big] - \big\langle H_{x}\big[ y \big] \big\rangle_{x}, $$
(32)

where H[y] is the entropy of all measurements y, Hx[y] is the entropy of p(y|x) for a specific choice of sequence x, and 〈·〉x indicates averaging over the sequences x in the dataset. Here, the measurement value yn for each sequence xn was computed using Eq. 36 (below). H[y] was then estimated by applying the kNN estimator [52] to these measurements. We also estimated the uncertainty in each yn by propagating the errors expected from Poisson fluctuations in read counts:

$$ \delta y_{n} = \text{log}_{2}\left(e \right)\sqrt{\frac{1}{c_{n}^{\text{in}} + 1} + \frac{1}{c_{n}^{\text{out}} + 1}}. $$
(33)

Assuming p(y| x) to be Gaussian, we find the corresponding conditional entropy to be

$$ H_{x_{n}}\big[y_{n} \big] = \frac{1}{2}\text{log}_{2}\bigg(2\pi e\,\delta y_{n}^{2} \bigg). $$
(34)

Notice that we did not include the effects of fluctuations in \(c_{\text {WT}}^{\text {in}}\) or \(c_{\text {WT}}^{\text {out}}\) in Eq. 33, as these shift all y values by the same amount and thus do not affect either H[y] or Hx[y].

These H[y] and Hx[y] values were then used in Eq. 32 to estimate Iint. We expect this to provide an upper bound on the true value of Iint because the uncertainty in y must be at least that expected under Poisson sampling of reads. However, the use of linear error propagation and the assumption that p(y|x) is approximately Gaussian complicate this conclusion. Also, when applied to the MPSA data, this method yielded an upper bound of 0.96 bits, a value that is likely far higher than the true value of Iint. This mismatch likely resulted from read counts in the MPSA data being overdispersed.
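Assembled into code, the upper bound computation of Eqs. 32 to 34 looks roughly as follows (a sketch under the Poisson and Gaussian assumptions just described):

```python
import numpy as np

def intrinsic_info_upper_bound(H_y, c_in, c_out):
    """Upper bound on I_int via Eqs. 32-34.

    H_y: kNN entropy estimate of all measurements y, in bits;
    c_in, c_out: arrays of input/output read counts, one pair per sequence.
    """
    c_in, c_out = np.asarray(c_in), np.asarray(c_out)
    # Eq. 33: Poisson error propagation for each log2 enrichment value
    dy = np.log2(np.e) * np.sqrt(1.0 / (c_in + 1) + 1.0 / (c_out + 1))
    # Eq. 34: Gaussian conditional entropy for each sequence, in bits
    H_y_given_x = 0.5 * np.log2(2 * np.pi * np.e * dy**2)
    # Eq. 32: subtract the average conditional entropy
    return H_y - H_y_given_x.mean()
```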

To compute the lower bound on Iint for the GB1 dataset (Fig. 3d), we used the predictive information Ipre (evaluated on test data) of a GE regression model having a black box G-P map. This provides a lower bound because \(I_{\text{int}} \geq I_{\text{pre}}\) for any model (when evaluated on test data), due to the Data Processing Inequality and the Markov chain nature of the dependencies \(y \leftarrow x \rightarrow \phi\) in Fig. 2e [24, 36].

To compute a lower bound on Iint for the MPSA data (Fig. 5c), we leveraged the availability of replicate data in Wong et al. [11]. Let y and y′ represent the original and replicate measurements obtained for a sequence x. Because \(y \leftarrow x \rightarrow y'\) forms a Markov chain, \(I[x;y] \geq I[y;y']\) [36]. We therefore used an estimate of I[y;y′], computed using the KSG method [51, 53], as the lower bound on Iint.

Uncertainties in kNN estimates of mutual information

MAVE-NN quantifies uncertainties in H[y] and I[y;ϕ] using multiple random samples of half the data. Let \(\mathcal {D}_{100\%}\) denote the dataset on which we wish to compute these quantities, and let \(\mathcal {D}_{50\%,r}\) denote a 50% subsample (indexed by r) of this dataset. Given an estimator E(·) of either entropy or mutual information, as well as the number of subsamples R to use, the uncertainty in \(E(\mathcal {D}_{100\%})\) is estimated as

$$ \delta\text{E}\left(\mathcal{D}_{100\%} \right) = \frac{1}{\sqrt{2}}\text{std}\left[\left\{ E\left(\mathcal{D}_{50\%,r} \right) \right\}_{r = 0}^{R - 1} \right]. $$
(35)

MAVE-NN uses R=25 by default. Note that computing such uncertainty estimates substantially increases computation time, as E(·) must be evaluated R+1 times instead of just once. We also note that we favor this subsampling approach, as opposed to standard bootstrap resampling [56, 57], because the latter can systematically underestimate H[y] and overestimate I[y;ϕ].
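A minimal sketch of this subsampling scheme (Eq. 35), with the estimator passed in as a function and the data given as rows of a NumPy array:

```python
import numpy as np

def subsample_uncertainty(estimator, data, R=25, rng=None):
    """Eq. 35: uncertainty of an entropy/MI estimator E from R random
    50% subsamples of the dataset."""
    rng = np.random.default_rng(rng)
    n = len(data)
    halves = [estimator(data[rng.choice(n, n // 2, replace=False)])
              for _ in range(R)]
    return np.std(halves) / np.sqrt(2)  # scale half-data spread to full data
```

The \(1/\sqrt{2}\) factor converts the spread observed across half-size datasets into an uncertainty appropriate for the full dataset, assuming the estimator's variance scales inversely with sample size.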

Uncertainties in G-P map parameters

Given a trained latent phenotype model having G-P map parameters θ and measurement process parameters η, MAVE-NN can optionally assess model uncertainty using the following parametric bootstrap approach. Using the trained model with parameters (θ,η) as “ground truth”, MAVE-NN first simulates R MAVE datasets \(\mathcal {D}_{r} = \left \{ \left ({x}_{n},y_{n}^{(r)}\right) \right \}_{n = 0}^{N - 1}\), where r=0,1,…,R−1. For each simulated dataset \(\mathcal {D}_{r}\), MAVE-NN then trains a new model (by default using the same hyperparameters as were used for the ground truth model). This procedure yields a set \(\left \{ \left (\theta ^{\left (r \right)},\eta ^{\left (r \right)} \right) \right \}_{r = 0}^{R - 1} \) of simulation-inferred G-P map parameters and corresponding measurement process parameters. Users can then use this sampling of G-P map parameters to estimate uncertainties, e.g., by reporting \(\delta \theta _{k} = \text {std}\left [\left \{ \theta _{k}^{\left (r \right)} \right \}_{r = 0}^{R - 1} \right ]\).
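In outline, the procedure resembles the following sketch. Here `simulate_measurements` and `fit_model` are hypothetical stand-ins, not MAVE-NN functions, and the R retraining runs dominate the computational cost:

```python
import numpy as np

# Hypothetical helpers (assumed, not part of MAVE-NN's API):
#   simulate_measurements(model, x): draw y ~ p_model(y | phi(x; theta))
#   fit_model(x, y, hyperparams):    train a new latent phenotype model
#                                    and return it with gauge-fixed .theta

def parametric_bootstrap(model, x, hyperparams, R=20):
    """Parametric bootstrap over G-P map parameters: sequences stay fixed,
    measurements are resimulated from the trained 'ground truth' model."""
    thetas = []
    for _ in range(R):
        y_sim = simulate_measurements(model, x)   # stochastic measurements
        refit = fit_model(x, y_sim, hyperparams)  # retrain from scratch
        thetas.append(refit.theta)
    thetas = np.array(thetas)                     # shape (R, n_params)
    return thetas.std(axis=0)                     # delta theta_k estimates
```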

There are good reasons to estimate parameter uncertainties using this parametric bootstrap approach, rather than through other bootstrapping methods. In particular, the parametric bootstrap keeps the sequences in each simulated data set fixed, and treats only the measurements as being stochastic. This comports with the fact that, in many MAVE designs, a specific set of user-specified sequences are assayed, with each sequence receiving a single measurement value. If each simulated dataset were instead generated by resampling sequence-measurement pairs from the full dataset, virtually all of these simulated datasets would end up lacking measurements for a substantial subset of the original assayed sequences. Among other issues, this can lead to parameter non-identifiability in models trained on simulated data even when such non-identifiability is absent in models trained on the original dataset.

Another important detail when assessing parameter uncertainty is to ensure that both the gauge modes and the diffeomorphic modes of each model are fixed. This is necessary so that differences in the parameters that do not affect model predictions do not inflate uncertainty estimates. For additive, neighbor, and pairwise G-P maps, MAVE-NN automatically implements the procedure described in the “Gauge modes and diffeomorphic modes” section above, thereby removing these extra degrees of freedom. However, for more complex models such as those implemented by MAVE-NN’s custom G-P map functionality (e.g., representing biophysical models), different gauge freedoms and diffeomorphic modes may arise depending on the details of the model, and users must take care to determine and fix these prior to assessing parameter uncertainty. We also note that no meaningful computation of individual parameter uncertainties is likely to be possible for highly overparameterized models, such as the “black box” multilayer perceptron models supported by MAVE-NN.

Datasets

For the GB1 DMS dataset of Olson et al. [8], measurements were computed using

$$ y_{n} = \log_{2}\frac{\left({c}_{n}^{\text{out}} + 1\right) / \left({c}_{\text{WT}}^{\text{out}} + 1\right)}{\left({c}_{n}^{\text{in}} + 1\right) / \left({c}_{\text{WT}}^{\text{in}} + 1\right)}, $$
(36)

where \(c_{n}^{\text {in}}\) and \(c_{n}^{\text {out}}\) respectively represent the number of reads from the input and output samples (i.e., the pre-selection and post-selection libraries), and n→WT denotes the 55 aa wildtype sequence, corresponding to positions 2–56 of the GB1 domain. To infer the model in Fig. 3a–c and to compute the information metrics in Fig. 3d, only double-mutant sequences with \(c_{n}^{\text {in}} \geq 10\) were used; these represent 530,737 of the 536,085 possible double mutants. For the models in Fig. 3e–g, the yn values for the 1045 single-mutant sequences were also used in the inference procedure.
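For concreteness, Eq. 36 with its pseudocounts can be computed as follows (a sketch; the count arrays are assumed to be aligned with the variant list):

```python
import numpy as np

def log2_enrichment(c_in, c_out, c_wt_in, c_wt_out):
    """Eq. 36: log2 enrichment relative to wildtype, with +1 pseudocounts."""
    c_in, c_out = np.asarray(c_in), np.asarray(c_out)
    return np.log2(((c_out + 1) / (c_wt_out + 1)) /
                   ((c_in + 1) / (c_wt_in + 1)))
```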

For the A β DMS data of Seuma et al. [9] and TDP-43 DMS data of Bolognesi et al. [10], yn values respectively represent the nucleation scores and toxicity scores reported by the authors.

For the MPSA of Wong et al. [11], we used the data from replicate 1 of the BRCA2 minigene library 1. Measurements were computed as

$$ y_{n} = \log_{10}\left[100 \times \frac{\left({c}_{n}^{\text{inc}} + 1\right) / \left({c}_{\text{CONS}}^{\text{inc}} + 1\right)}{\left({c}_{n}^{\text{tot}} + 1\right) / \left({c}_{\text{CONS}}^{\text{tot}} + 1\right)} \right], $$
(37)

where \(c_{n}^{\text {inc}}\) and \(c_{n}^{\text {tot}}\) respectively represent the number of barcode reads obtained from exon inclusion isoforms and from total mRNA, and n→CONS corresponds to the consensus 5 ss sequence CAG/GUAAGU. Corresponding PSI values were computed as \(\text {PSI}_{n} = 10^{y_{n}}\). Only sequences with \(c_{n}^{\text {tot}} \geq 10\) were used for inference; these represent 30,483 of the 32,768 possible sequences of the form NNN/GYNNNN.
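Analogously, a sketch of Eq. 37 and the corresponding PSI values:

```python
import numpy as np

def log10_psi(c_inc, c_tot, c_cons_inc, c_cons_tot):
    """Eq. 37: y_n = log10 of PSI (in percent), with +1 pseudocounts
    and normalization to the consensus 5 ss."""
    c_inc, c_tot = np.asarray(c_inc), np.asarray(c_tot)
    y = np.log10(100 * ((c_inc + 1) / (c_cons_inc + 1)) /
                 ((c_tot + 1) / (c_cons_tot + 1)))
    return y  # PSI_n = 10**y
```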

For the lac promoter sort-seq MPRA data of Kinney et al. [12], the {cmy} values used for inference represent raw read counts from the “full-wt” experiment. We obtained these data from https://github.com/jbkinney/09_sortseq.

Availability of data and materials

MAVE-NN can be installed from PyPI by executing “pip install mavenn” at the POSIX command line. Comprehensive documentation, including step-by-step tutorials, is provided [50]. Source code, the data sets analyzed in this paper, and the scripts used for training the models and making the figures presented herein, are available under an MIT open-source license [58]. MAVE-NN version 1.0.1 was used for all of the analysis described in this manuscript and is archived on Zenodo [59].

References

  1. Kinney JB, McCandlish DM. Massively parallel assays and quantitative sequence–function relationships. Annu Rev Genomics Hum Genet. 2019; 20:99–127.

  2. Starita LM, et al. Variant interpretation: functional assays to the rescue. Am J Hum Genet. 2017; 101:315–25.

  3. Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nat Methods. 2014; 11:801–7.

  4. Levo M, Segal E. In pursuit of design principles of regulatory sequences. Nat Rev Genet. 2014; 15:453–68.

  5. White MA. Understanding how cis-regulatory function is encoded in DNA sequence using massively parallel reporter assays and designed sequences. Genomics. 2015; 106:165–70.

  6. Inoue F, Ahituv N. Decoding enhancers using massively parallel reporter assays. Genomics. 2015; 106:159–64.

  7. Peterman N, Levine E. Sort-seq under the hood: implications of design choices on large-scale characterization of sequence-function relations. BMC Genomics. 2016; 17:206.

  8. Olson CA, Wu NC, Sun R. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Curr Biol. 2014; 24:2643–51.

  9. Seuma M, Faure A, Badia M, Lehner B, Bolognesi B. The genetic landscape for amyloid beta fibril nucleation accurately discriminates familial Alzheimer’s disease mutations. Elife. 2021; 10:e63364.

  10. Bolognesi B, et al. The mutational landscape of a prion-like domain. Nat Commun. 2019; 10:4162.

  11. Wong MS, Kinney JB, Krainer AR. Quantitative activity profile and context dependence of all human 5 splice sites. Mol Cell. 2018; 71:1012–1026.e3.

  12. Kinney JB, Murugan A, Callan CG, Cox EC. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc Natl Acad Sci. 2010; 107:9158–63.

  13. Fowler DM, Araya CL, Gerard W, Fields S. Enrich: software for analysis of protein function by enrichment and depletion of variants. Bioinformatics. 2011; 27:3430–1.

  14. Alam KK, Chang JL, Burke DH. FASTAptamer: a bioinformatic toolkit for high-throughput sequence analysis of combinatorial selections. Mol Ther Nucleic Acids. 2015; 4:e230.

  15. Bloom JD. Software for the analysis and visualization of deep mutational scanning data. BMC Bioinformatics. 2015; 16:168.

  16. Rubin AF, et al. A statistical framework for analyzing deep mutational scanning data. Genome Biol. 2017; 18:1–15.

  17. Ashuach T, et al. MPRAnalyze: statistical framework for massively parallel reporter assays. Genome Biol. 2019; 20:183.

  18. Niroula A, Ajore R, Nilsson B. MPRAscore: robust and non-parametric analysis of massively parallel reporter assays. Bioinformatics. 2019; 35:5351–3.

  19. Faure AJ, Schmiedel JM, Baeza-Centurion P, Lehner B. DiMSum: an error model and pipeline for analyzing deep mutational scanning data and diagnosing common experimental pathologies. Genome Biol. 2020; 21:207.

  20. Atwal GS, Kinney JB. Learning quantitative sequence–function relationships from massively parallel experiments. J Stat Phys. 2016; 162:1203–43.

  21. Melnikov A, et al. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat Biotechnol. 2012; 30:271–7.

  22. Mogno I, Kwasnieski JC, Cohen BA. Massively parallel synthetic promoter assays reveal the in vivo effects of binding site variants. Genome Res. 2013; 23:1908–15.

  23. Abadi M, et al. TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016); 2016.

  24. Kinney JB, Atwal GS. Parametric inference in the large data limit using maximally informative models. Neural Comput. 2014; 26:637–53.

  25. Kinney JB, Tkacik G, Callan CG. Precise physical models of protein–DNA interaction from high-throughput data. Proc Natl Acad Sci. 2007; 104:501–6.

  26. Otwinowski J, Nemenman I. Genotype to phenotype mapping and the fitness landscape of the E. coli lac promoter. PLoS One. 2013; 8:e61570.

  27. Sarkisyan KS, et al. Local fitness landscape of the green fluorescent protein. Nature. 2016; 533:397–401.

  28. Sailer ZR, Harms MJ. Detecting high-order epistasis in nonlinear genotype-phenotype maps. Genetics. 2017; 205:1079–88.

  29. Otwinowski J, McCandlish DM, Plotkin JB. Inferring the shape of global epistasis. Proc Natl Acad Sci. 2018; 115:E7550–8.

  30. Gelman S, Fahlberg SA, Heinzelman P, Romero PA, Gitter A. Neural networks to learn protein sequence–function relationships from deep mutational scanning data. Proc Natl Acad Sci. 2021; 118:e2104878118.

  31. Faure AJ, et al. Global mapping of the energetic and allosteric landscapes of protein binding domains. bioRxiv. 2021. https://doi.org/10.1101/2021.09.14.460249.

  32. Tonner PD, Pressman A, Ross D. Interpretable modeling of genotype-phenotype landscapes with state-of-the-art predictive power. bioRxiv. 2021. https://doi.org/10.1101/2021.06.11.448129.

  33. Pokusaeva VO, et al. An experimental assay of the interactions of amino acids from orthologous sequences shaping a complex fitness landscape. PLoS Genet. 2019; 15:e1008079.

  34. Jones MC, Faddy MJ. A skew extension of the t-distribution, with applications. J R Stat Soc Ser B Stat Methodol. 2003; 65:159–74.

  35. Kinney JB, Atwal GS. Equitability, mutual information, and the maximal information coefficient. Proc Natl Acad Sci. 2014; 111:3354–9.

  36. Cover TM, Thomas JA. Elements of Information Theory. Wiley; 2006.

  37. Barber D, Agakov F. The IM algorithm: a variational approach to information maximization. Adv Neural Inf Process Syst. 2004; 16:201.

  38. Alemi AA, Fischer I, Dillon JV, Murphy K. Deep variational information bottleneck. arXiv:1612.00410 [cs.LG]. 2016.

  39. Chalk M, Marre O, Tkačik G. Relevant sparse codes with variational information bottleneck. arXiv:1605.07332 [stat.ML]. 2016.

  40. Tareen A, Kinney JB. Logomaker: beautiful sequence logos in Python. Bioinformatics. 2020; 36:2272–4.

  41. Bintu L, et al. Transcriptional regulation by the numbers: models. Curr Opin Genet Dev. 2005; 15:116–24.

  42. Sherman MS, Cohen BA. Thermodynamic state ensemble models of cis-regulation. PLoS Comput Biol. 2012; 8:e1002407.

  43. Wong F, Gunawardena J. Gene regulation in and out of equilibrium. Annu Rev Biophys. 2020; 49:199–226.

  44. Belliveau NM, et al. Systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria. Proc Natl Acad Sci. 2018; 115:201722055.

  45. Barnes SL, Belliveau NM, Ireland WT, Kinney JB, Phillips R. Mapping DNA sequence to transcription factor binding energy in vivo. PLoS Comput Biol. 2019; 15:e1006226.

  46. Tareen A, Kinney JB. Biophysical models of cis-regulation as interpretable neural networks. In: Proceedings of the 14th Conference on Machine Learning in Computational Biology (MLCB 2019); 2019.

  47. Otwinowski J. Biophysical inference of epistasis and the effects of mutations on protein stability and function. Mol Biol Evol. 2018; 35:2345–54.

  48. Manhart M, Morozov AV. Protein folding and binding can emerge as evolutionary spandrels through structural coupling. Proc Natl Acad Sci. 2015; 112:1797–802.

  49. Nisthal A, Wang CY, Ary ML, Mayo SL. Protein stability engineering insights revealed by domain-wide comprehensive mutagenesis. Proc Natl Acad Sci. 2019; 116:16367–77.

  50. Tareen A, Kooshkbaghi M, Posfai A, Ireland WT, McCandlish DM, Kinney JB. MAVE-NN documentation. https://mavenn.readthedocs.io. Accessed 24 Mar 2022.

  51. Steeg GV. Non-Parametric Entropy Estimation Toolbox (NPEET). GitHub repository. https://github.com/gregversteeg/NPEET. Accessed 24 Mar 2022.

  52. Vasicek O. A test for normality based on sample entropy. J R Stat Soc Ser B. 1976; 38:54–9.

  53. Kraskov A, Stögbauer H, Grassberger P. Estimating mutual information. Phys Rev E. 2004; 69:066138.

  54. Gao S, Steeg GV, Galstyan A. Efficient estimation of mutual information for strongly dependent variables. arXiv:1411.2003 [cs.IT]. 2014.

  55. Sharpee T, Rust NC, Bialek W. Analyzing neural responses to natural signals: maximally informative dimensions. Neural Comput. 2004; 16:223–50.

  56. Efron B. Bootstrap methods: another look at the jackknife. Ann Stat. 1979; 7:1–26.

  57. Efron B, Tibshirani R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci. 1986; 1:54–75.

  58. Tareen A, Kooshkbaghi M, Kinney JB. MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect. GitHub repository. https://github.com/jbkinney/mavenn. Accessed 24 Mar 2022.

  59. Tareen A, Kooshkbaghi M, Kinney JB. MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect. Zenodo. 2022. https://doi.org/10.5281/zenodo.595405.


Review history

The review history is available as Additional file 2.

Peer review information

Tim Sands was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Funding

This work was supported by NIH grant R35GM133777 (awarded to JBK), NIH Grant R35GM133613 (awarded to DMM), an Alfred P. Sloan Research Fellowship (awarded to DMM), a grant from the CSHL/Northwell Health partnership, and funding from the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory.

Author information


Contributions

AT, WTI, DMM, and JBK conceived the project. AT and JBK wrote the software with assistance from AP and MK. WTI and JBK wrote a preliminary version of the software. AT, MK, and JBK performed the data analysis. JBK, AT, and DMM wrote the manuscript with contributions from MK and AP. All author(s) read and approved the final manuscript.

Corresponding author

Correspondence to Justin B. Kinney.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1

Appendix. Contains a derivation of variational information as a lower bound on predictive information, analyses of multiple simulated data sets, and an in-depth description of the biophysical models featured in Fig. 6.

Additional file 2

Peer review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Tareen, A., Kooshkbaghi, M., Posfai, A. et al. MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect. Genome Biol 23, 98 (2022). https://doi.org/10.1186/s13059-022-02661-7
