MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect

Multiplex assays of variant effect (MAVEs) are a family of methods that includes deep mutational scanning experiments on proteins and massively parallel reporter assays on gene regulatory sequences. Despite their increasing popularity, a general strategy for inferring quantitative models of genotype-phenotype maps from MAVE data is lacking. Here we introduce MAVE-NN, a neural-network-based Python package that implements a broadly applicable information-theoretic framework for learning genotype-phenotype maps—including biophysically interpretable models—from MAVE datasets. We demonstrate MAVE-NN in multiple biological contexts, and highlight the ability of our approach to deconvolve mutational effects from otherwise confounding experimental nonlinearities and noise. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-022-02661-7.

As in the main text, let p true (y, φ) denote the joint density of y and model-assigned φ values that would be observed on test data in the N → ∞ limit, and let p true (y|φ) be the corresponding conditional density. Also let p model (y|φ) represent the inferred measurement process of the model in question. We then recover Eq. 31 in Online Methods as follows: where D KL denotes the Kullback-Leibler divergence, and · true indicates averaging over p true (y, φ).
Now consider the infinite training data limit. Also assume that the correct G-P map and measurement process are within the class of models considered, as is often the case when analyzing simulated data. Then it is simple to see that minimizing the loss L like (or equivalently, maximizing I var ) will push D KL → 0 and I pre → I int . For such models, we therefore recover I var = I pre = I int .
On finite datasets we must use approximate methods to estimate these information values. The model performance inequality I var ≤ I pre ≤ I true becomes approximate as a result. Still, the uncertainties in these information quantities are often quite small, due to the large size of typical MAVE datasets, and this model performance inequality can serve well to guide judgements about model accuracy and completeness.

S2 Analysis of simulated GB1 data
Here we assess the performance of MAVE-NN on realistic simulated DMS data, thereby demonstrating its ability to recover ground-truth models when one's modeling assumptions are accurate. sequences. This was done using the model's simulate_dataset() method, which is included in all MAVE-NN models to streamline the process of dataset simulation. We then trained a new additive GE model on these simulated data, and compared both the parameters and predictions of this inferred model to those of the true model ( Fig. S1a-d).
Fig. S1a shows that the inferred GE nonlinearity g(·) agrees remarkably well with the true nonlinearity. Model predictions on held-out test data also correlate highly with the predictions of the true model (Fig. S1b). Plotting inferred latent phenotype values against true latent phenotype values reveals a near-perfect correspondence (Fig. S1c), and the same is found when comparing inferred G-P map parameters against true G-P map parameters (Fig. S1d).
Additionally, we fit additive GE models to simulated datasets of varying size and recorded the model inference time required on one CPU node of a computer cluster. These times are plotted against dataset size in Fig. S1e. Model inference for the original published dataset was fast and took only ∼ 5 minutes. While the performance of these inferred models improved with increasing simulated dataset size, even moderately sized datasets (e.g., 10,000 observations) yielded models with high correspondence to ground truth (Fig. S1f). Here we review the general rationale for models of this form, after which we derive the equations for the two specific models. Please refer to the online documentation for details about how these equations were coded as G-P maps within the MAVE-NN API.

S3.1 General mathematical form of thermodynamic models
Both of the biophysical models featured in Fig. 6 are examples of thermodynamic models, which are defined by three key assumptions: 1. At each instant in time, the system of interest can occupy one of a fixed number of possible states. 2. The probability of the system being in each state depends on that state's Gibbs free energy via Boltzmann's law, which describes the probability one would observe for a system in thermal equilibrium. 3. The molecular phenotype of interest is a probability-weighted average of the activities of each individual state.
Mathematically, this means that the latent phenotype φ(x), for a given sequence x, is a weighted sum of state-specific activities: where s indexes possible states of the system, φ s (x) is the (generally sequence-dependent) activity of state s, and p(s|x) is the Boltzmann probability of state s. Botlzmann's law tells us that this probability is given by where G s (x) is the (generally sequence-dependent) Gibbs free energy of state s, is a normalization factor that ensures the probabilities of all states sum to one, k B = 1.987 × 10 −3 kcal mol·K is Boltzmann's constant, and T is temperature in kelvin. Note that expressing model parameters in units of kcal/mol thus requires knowing the temperature at which the latent phenotype was measured. For example, k B T = 0.582 kcal mol at 20 • C (room temperature), whereas k B T = 0.616 kcal mol at 37 • C (body temperature). To infer biophysical models from MAVE data, one must propose specific mathematical formulas for how the state activities φ s (x) and Gibbs free energies G s (x) depend on sequence. These quantities will depend on some set of a priori unknown parameters; these are the G-P map parameters θ. The mathematical form of the G-P map itself, φ(x; θ), then follows from Eqs. S4, S5, and S6. To use a thermodynamic model as a G-P map within MAVE-NN, users must provide the specific equation for φ(x; θ), written in terms of a one-hot encoded sequence vector x, to the MAVE-NN API. Please refer to the online documentation for details on how to do this.
S3.2 Thermodynamic model for protein GB1 Fig. S4a illustrates the thermodynamic model for protein GB1 that was proposed by Otwinowski (2018) to explain the DMS data of Olson et al. (2014). This model assumes three possible states for GB1: (1) unfolded, (2) folded but not bound to IgG, and (3) folded and bound to IgG. The activity in question, φ(x), is the fraction of time a GB1 molecule with peptide sequence x is bound to IgG. The corresponding state-specific activities are φ (1) = 0, φ (2) = 0, and φ (3) = 1. Note that, both here and in the models below, we assume that all φ s are fixed binary numbers that are known a priori, i.e., they do not depend on sequence x or on any trainable parameters in θ.
The unfolded state (1) is taken to be the reference state and assigned a Gibbs free energy of 0. ∆G F denotes the Gibbs free energy of the folded state relative to the unfolded state, and thus state (2) has energy ∆G F . ∆G B is the Gibbs free energy change upon the binding of a folded GB1 molecule to IgG, and so the energy of state (3) is ∆G F + ∆G B . Each of these energies is assumed to be an additive function of the GB1 sequence x; the specific mathematical formulas for ∆G F and ∆G B in terms of the one-hot encoded vector x are as shown. The parameters that control these two energies are θ F , θ 0 F (for ∆G F ) and θ B , θ 0 B (for ∆G B ). The resulting set of trainable G-P map parameters is To infer values for these parameters using MAVE-NN, we paired the biophysical G-P map φ(x; θ) with a GE measurement process p(y|φ) having a trainable nonlinearity and a heteroscedastic skewed-t noise model. The resulting latent phenotype model was trained on the DMS data of Olson et al. (2014). Figure S4: Thermodynamic models featured in Fig. 6 of the main text. (a) Three-state model for GB1 folding and binding to IgG. x denotes the one-hot encoding of a variant GB1 protein sequence. The trainable G-P map parameters are θ F , θ 0 F , θ B , and θ 0 B . (b) Four-state model for transcriptional regulation of the E. coli lac promoter by CRP (purple) and RNAP (blue). Variable promoter DNA is indicated in brown. x C and x R respectively denote one-hot encodings of a variant 26 nt CRP binding site and a variant 41 nt RNAP binding site. The trainable G-P map parameters are θ C , θ 0 C , θ R , θ 0 R , and ∆G I . GB1: protein G domain B1; IgG: immunoglobulin G; G-P: genotype-phenotype; CRP: cAMP receptor protein; RNAP: σ 70 RNA polymerase.
The resulting values for θ F and θ B are illustrated as heatmaps in Fig. 6b of the main text. These are expressed in the "wildtype" gauge, i.e., θ 0 F and θ 0 B are the folding and binding energies of the wildtype GB1 sequence, while the elements of the vectors θ F and θ B describe the energetic effects of single amino acid mutations. All of these parameter values represent Gibbs free energies computed using k B T = 0.582 kcal mol , since the RNA display experiment of Olson et al. (2014) was performed at room temperature.
S3.3 Thermodynamic model for the lac promoter Fig. S4b illustrates a thermodynamic model for transcriptional activation at the lac promoter of Escherichia coli. This model describes the DNA binding energies of two transcription factors, the cAMP receptor protein (CRP) and σ 70 RNA polymerase (RNAP), as well as a cooperative interaction between these two proteins that occurs when both are bound to promoter DNA. This model, which was originally proposed by Kinney et al. (2010), was trained on data from their sort-seq MPRA, in which a 75 bp region spanning the CRP and RNAP binding sites of the wildtype lac promoter was mutagenized at 12% per nucleotide, resulting in ∼9 mutations per sequence on average.
In this model, promoter DNA is assumed to be in one of four possible states: (1) empty, (2) bound by CRP, (3) bound by RNAP, and (4) bound by both CRP and RNAP. State (1) is taken to be the reference state. ∆G C denotes the Gibbs free energy of CRP binding to DNA, and ∆G R is the energy of RNAP binding to DNA. Both of these quantities are assumed to be additive functions of the lac promoter sequence x. More specifically, ∆G C is assumed to depend only on a 26 nt subsequence (the CRP binding site), the one-hot encoding of which we denote by x C . Similarly, ∆G R is taken to depend on a 41 nt subsequence x R representing the RNAP binding site. The corresponding parameters of these additive models are θ 0 C , θ C for CRP, and θ 0 R , θ R for RNAP. ∆G I is the Gibbs free energy of interaction between DNA-bound CRP and DNA-bound RNAP, and is assumed to be independent of sequence. The latent phenotype of interest, φ, is the fraction of time that RNAP is bound to promoter DNA. We assume this quantity is proportional to the rate of transcript initiation and thus the level of gene expression. The state-specific activities are therefore fixed and given by φ (1) = φ (2) = 0 and φ (3) = φ (4) = 1. The resulting set of trainable G-P map parameters is θ = θ 0 C , θ C , θ 0 R , θ R , ∆G I .
This biophysical G-P map, φ(x; θ), was paired with an MPA measurement process and trained on the data of Kinney et al. 2010. The resulting inferred values for θ C and θ R are illustrated in Fig. 6d as sequence logos; the inferred value of ∆G I is also shown. The standard errors on ∆G I were estimated by repeating the inference procedure on 10 simulated datasets using the inferred model's built-in bootstrap() method.