MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect

Multiplex assays of variant effect (MAVEs) are a family of methods that includes deep mutational scanning (DMS) experiments on proteins and massively parallel reporter assays (MPRAs) on gene regulatory sequences. However, a general strategy for inferring quantitative models of genotype-phenotype (G-P) maps from MAVE data is lacking. Here we introduce MAVE-NN, a neural-network-based Python package that implements a broadly applicable information-theoretic framework for learning G-P maps—including biophysically interpretable models—from MAVE datasets. We demonstrate MAVE-NN in multiple biological contexts, and highlight the ability of our approach to deconvolve mutational effects from otherwise confounding experimental nonlinearities and noise.


Introduction
accounting for this nonlinearity yields predictions that correlate quite well with measurement values. Moreover, every latent phenotype model inferred by MAVE-NN can be used as a MAVE dataset simulator (see Methods). By analyzing simulated data generated by our inferred model for this GB1 experiment, we further observed that MAVE-NN can accurately and robustly recover the GE nonlinearity and ground-truth G-P map parameters (Supplementary Fig. S1).

Fig. 3d summarizes the values of our information-theoretic metrics for model performance. On held-out test data, we find that I_var = 2.178 ± 0.027 bits and I_pred = 2.225 ± 0.017 bits. The similarity of these two values suggests that the inferred GE measurement process, which includes a heteroscedastic skewed-t noise model, is nearly sufficient to fully describe the distribution of residuals. We further find that 2.741 ± 0.013 bits ≤ I_int ≤ 3.215 ± 0.007 bits (see Methods), meaning that the inferred G-P map accounts for 69%-81% of the total sequence-dependent information in the dataset. While this performance is impressive, the additive G-P map evidently misses some relevant sequence features. This observation … randomly selected double mutants, which represent only ~0.1% of all possible double mutants.
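For intuition about these metrics, note that when model predictions and measurements are jointly Gaussian, the mutual information between them has a closed form in the Pearson correlation. The sketch below illustrates this relationship on simulated data; it is not the estimator MAVE-NN itself uses, and the simulated signal-to-noise ratio is an arbitrary choice for illustration.

```python
import numpy as np

def gaussian_mutual_info_bits(yhat, y):
    """Mutual information (in bits) between predictions and measurements
    under a joint-Gaussian assumption: I = -0.5 * log2(1 - rho^2),
    where rho is the Pearson correlation coefficient."""
    rho = np.corrcoef(yhat, y)[0, 1]
    return -0.5 * np.log2(1.0 - rho**2)

# Simulate a latent phenotype read out through additive Gaussian noise.
rng = np.random.default_rng(0)
phi = rng.normal(size=50_000)               # latent phenotype values
y = phi + 0.5 * rng.normal(size=50_000)     # noise variance 0.25 -> SNR = 4
I = gaussian_mutual_info_bits(phi, y)
# Exact channel capacity here is 0.5 * log2(1 + 4) ≈ 1.16 bits
```

The same logic explains why I_pred approaching I_var (and I_int) indicates that the model is capturing nearly all of the usable sequence-dependent signal.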

The analysis of simulated data also supports the ability to accurately recover ground-truth model … We used MAVE-NN to infer additive G-P maps from these two datasets, adopting the same type of latent phenotype model used for GB1. Fig. 4a illustrates the additive G-P map inferred from aggregation measurements of Aβ variants. In agreement with the original study, we see that most amino acid mutations between positions 30-40 have a negative effect on nucleation, suggesting that this region plays a major role in nucleation behavior. Fig. 4b shows the corresponding measurement process (see also Supplemental Information Fig. S2). Even though these data are much sparser than the GB1 data, the inferred model performs well on … (Fig. 1c,d). Applying MAVE-NN to data from the BRCA2 exon 17 context, we inferred four different types of G-P maps: additive, neighbor, pairwise, and black box. As with GB1, these G-P maps were each inferred using GE regression with a heteroscedastic skewed-t noise model. For comparison, we also inferred an additive G-P map using the epistasis package of Sailer and Harms [24].

Fig. 5a compares the performance of these G-P map models on held-out test data, while Figs. 5b-d illustrate the corresponding inferred measurement processes. We observe that the additive G-P map inferred using the epistasis package [24] exhibits less predictive information (I_pred = 0.180 ± 0.011 bits) than the additive G-P map found using MAVE-NN (P = 3.8 × 10^…, two-sided z-test). This is likely because the epistasis package estimates the parameters of the additive G-P map prior to estimating the GE nonlinearity. We also note that, while the epistasis package provides a variety of options for modeling the GE nonlinearity, none of these options appear to work as well as our mixture-of-sigmoids approach (compare Figs. 5b,c).
This finding again demonstrates that accurate inference of G-P maps requires the explicit and simultaneous modeling of experimental nonlinearities.

We also observe that increasingly complex G-P maps exhibit increased accuracy. For example, the additive G-P map gives I_pred = 0.257 ± 0.013 bits, whereas the pairwise G-P map …

Otwinowski [42] showed that a three-state thermodynamic G-P map (Fig. 6a), one that …

… relative to the additive G-P map of Fig. 3. Fig. 6b shows the two inferred energy matrices that …

As stated above, MAVE-NN places certain limitations on both input datasets and latent phenotype models. Some of these constraints have been adopted to simplify the initial release of MAVE-NN and can potentially be relaxed in future updates. Others reflect fundamental mathematical properties of latent phenotype models. Here we summarize the primary constraints users should be aware of.
… built-in G-P maps operate only on fixed-length sequences. Users who wish to analyze variable-length sequences can still do so by padding the ends of sequences with dummy characters.

Alternatively, users can provide a multiple-sequence alignment as input and include the gap character as one of the characters to consider when training models.
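The padding strategy described above can be sketched in a few lines; the function name and pad character below are hypothetical, not part of the MAVE-NN API.

```python
def pad_sequences(seqs, pad_char="-"):
    """Pad variable-length sequences to a common fixed length by appending
    a dummy/gap character, as suggested in the text. The pad character must
    then be included in the model's alphabet during training."""
    L = max(len(s) for s in seqs)
    return [s + pad_char * (L - len(s)) for s in seqs]

padded = pad_sequences(["ACGT", "ACG", "AC"])
# → ["ACGT", "ACG-", "AC--"]
```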

As stated above, MAVE-NN can analyze MAVE datasets that have either continuous or discrete measurements. At present, both types of measurements must be one-dimensional, i.e., users cannot fit a single model to vectors of multiple measurements (e.g., joint measurements of protein binding affinity and protein stability, as in Ref. 27). This constraint has been adopted only to simplify the user interface of the initial release. It is not a fundamental limitation of latent phenotype models and is scheduled to be relaxed in upcoming versions of MAVE-NN.

The current implementation of MAVE-NN also supports only one-dimensional latent phenotypes (though the latent phenotype of custom G-P maps can depend on multiple precursor phenotypes, e.g., binding energy or folding energy). This restriction was made because accurately interpreting multi-dimensional latent phenotypes is substantially more fraught than interpreting one-dimensional latent phenotypes, and we believe that additional computational tools need to be developed to facilitate such interpretation. That being said, the mathematical form of latent phenotype models is fully compatible with multi-dimensional latent phenotypes. Indeed, this modeling strategy has been used in other work [20,27,28,45], and we plan to enable this functionality in future updates to MAVE-NN.

More fundamental constraints come into play when analyzing MAVE data that contain only single-mutation variants. In such experiments, the underlying effects of individual mutations are hopelessly confounded by the biophysical, physiological, and experimental nonlinearities that may be present. By contrast, when the same mutation is observed in multiple genetic backgrounds, MAVE-NN can use systematic differences in the mutational effects observed … allowed to be heteroscedastic, the nonlinearity is constrained to be linear.

We emphasize that, in practice, only a modest number of multiple-mutant variants are required for MAVE-NN to learn the form of a nonlinear measurement process (see Fig. 3e-g).

In this way, including a small fraction of the possible double-mutation variants in MAVE libraries …

In this work we have presented a unified strategy for inferring quantitative models of G-P maps from diverse MAVE datasets. At the core of our approach is the conceptualization of G-P maps as a form of information compression, i.e., the G-P map first compresses an input sequence into a latent phenotype value, which the MAVE then reads out indirectly via a noisy nonlinear measurement process. By explicitly modeling this measurement process, one can remove potentially confounding effects from the G-P map, as well as accommodate diverse experimental designs. We have also introduced three information-theoretic metrics for assessing the performance of the resulting models. These capabilities have been implemented within an easy-to-use Python package called MAVE-NN.
… step-by-step tutorials, is available at http://mavenn.readthedocs.io. Source code, the datasets analyzed in this paper, and the scripts used for training the models and making the figures presented herein are available under an MIT open-source license at https://github.com/jbkinney/mavenn.

We assume that the latent phenotype is given by a linear function φ(x; θ) that depends …
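For the additive case, the linear latent phenotype reduces to a constant plus one parameter per (position, character) pair, φ(x; θ) = θ₀ + Σ_l θ_{l:c_l}. A minimal numpy sketch of this computation follows; the alphabet, function names, and parameter values are hypothetical illustrations, not the MAVE-NN implementation.

```python
import numpy as np

ALPHABET = "ACGT"  # DNA here; protein and RNA alphabets work identically

def one_hot(seq):
    """One-hot encode a sequence: x[l, c] = 1 iff character c occurs at position l."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for l, ch in enumerate(seq):
        x[l, ALPHABET.index(ch)] = 1.0
    return x

def additive_phi(seq, theta_0, theta_lc):
    """Additive latent phenotype: phi = theta_0 + sum_l theta_lc[l, c_l],
    where theta_lc is an L x |alphabet| matrix of additive effects."""
    return theta_0 + np.sum(theta_lc * one_hot(seq))

L = 4
theta_lc = np.zeros((L, len(ALPHABET)))
theta_lc[1, ALPHABET.index("G")] = -0.7   # hypothetical: G at position 1 is deleterious
phi = additive_phi("AGCT", 1.0, theta_lc)  # 1.0 + (-0.7) = 0.3
```

Pairwise and neighbor G-P maps extend the same scheme with parameters on pairs of (position, character) features.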

The corresponding values of θ_{l:c} are given by …
563
where N_{l:c} denotes the number of sequences (out of N total) that have character c at position l, and x^{wt}_{l:c} is the one-hot encoding of a user-specified wildtype sequence. In particular, the wildtype gauge was used for illustrating the additive G-P maps in Fig. 3 and Fig. 4, while the …
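The key property of a gauge transformation is that it changes the parameters without changing the latent phenotype of any sequence. The sketch below illustrates the wildtype gauge for an additive map: the wildtype character's parameter at each position is set to zero and the total shift is absorbed into the constant. The parameter layout and function name are assumptions for illustration.

```python
import numpy as np

def fix_wildtype_gauge(theta_0, theta_lc, wt_idx):
    """Transform additive parameters into the wildtype gauge: at each
    position, subtract the wildtype character's parameter and absorb the
    total shift into the constant theta_0. Because every sequence selects
    exactly one character per position, phi is unchanged for all sequences."""
    L = len(wt_idx)
    shifts = theta_lc[np.arange(L), wt_idx]       # wildtype parameter per position
    new_theta_lc = theta_lc - shifts[:, None]     # wildtype row entries become 0
    new_theta_0 = theta_0 + shifts.sum()
    return new_theta_0, new_theta_lc

# Demo with hypothetical parameters (5 positions, 4-letter alphabet).
rng = np.random.default_rng(1)
theta_lc = rng.normal(size=(5, 4))
wt_idx = np.array([0, 1, 2, 3, 0])
theta_0_new, theta_lc_new = fix_wildtype_gauge(0.5, theta_lc, wt_idx)
```

Under this gauge, each parameter is directly interpretable as the effect of mutating the wildtype character to another character at that position.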

GE nonlinearities
GE models assume that each measurement y is a nonlinear function g(φ) of the latent phenotype φ plus some noise. In MAVE-NN, this nonlinearity is represented as a sum of …
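A sum-of-sigmoids nonlinearity can be sketched as follows; the exact parameterization and constraint handling in MAVE-NN may differ, so treat the form below (a constant plus K sigmoid components, monotonic when all b_k·c_k ≥ 0) as an illustration with hypothetical parameter values.

```python
import numpy as np

def ge_nonlinearity(phi, a, b, c, d):
    """Monotonic GE nonlinearity sketched as a constant plus a sum of sigmoids:
    g(phi) = a + sum_k b_k * sigmoid(c_k * phi + d_k).
    Monotonicity holds when every product b_k * c_k is nonnegative."""
    phi = np.asarray(phi, dtype=float)
    z = c[None, :] * phi[:, None] + d[None, :]         # (N, K) sigmoid arguments
    return a + (b[None, :] / (1.0 + np.exp(-z))).sum(axis=1)

# Hypothetical parameters: two components produce a saturating, S-shaped curve.
a, b = 0.0, np.array([1.0, 0.5])
c, d = np.array([1.0, 2.0]), np.array([0.0, -1.0])
g = ge_nonlinearity(np.linspace(-5, 5, 101), a, b, c, d)
```

Because each sigmoid saturates, this family naturally captures the floor and ceiling effects commonly seen in MAVE readouts.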

These all support the analytic computation of quantiles and confidence intervals, as well as the rapid sampling of simulated measurement values. The Gaussian noise model is given by … and I^{-1} denotes the inverse of the regularized incomplete beta function I_x(a, b).
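As a concrete illustration of analytic quantiles under a heteroscedastic Gaussian noise model, the sketch below lets the log of the noise scale vary polynomially with the prediction; this polynomial-in-ŷ form and the function name are assumptions for illustration, not the MAVE-NN parameterization.

```python
import numpy as np
from scipy.stats import norm

def heteroscedastic_gaussian_quantiles(yhat, log_sigma_coeffs, q=(0.025, 0.5, 0.975)):
    """Sketch of a heteroscedastic Gaussian noise model:
    y ~ Normal(yhat, sigma(yhat)^2), with log sigma a polynomial in yhat.
    Quantiles are analytic: Q_p(yhat) = yhat + sigma(yhat) * Phi^{-1}(p)."""
    sigma = np.exp(np.polyval(log_sigma_coeffs, yhat))
    return {p: yhat + sigma * norm.ppf(p) for p in q}

# Noise scale grows with yhat: log sigma = 0.3 * yhat - 1.0 (hypothetical).
qs = heteroscedastic_gaussian_quantiles(np.array([0.0, 1.0, 2.0]), [0.3, -1.0])
```

The same recipe, with heavier-tailed or skewed base distributions, yields the Cauchy and skewed-t noise models.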

Empirical noise models

MAVE-NN further supports the inference of GE regression models that account for user-specified measurement noise. In such cases, the user provides a set of measurement-specific …

where " ! = ( ( ! ; ); ) is the expected measurement for sequence ! , denotes G-P map 622 parameters, and denotes the parameters of the GE nonlinearity. This noise model thus has 623 the advantage of having no free parameters, but it may be problematically mis-specified if the 624 true error distribution is heavy-tailed or skewed. 625

MPA measurement process
In MPA regression, MAVE-NN directly models the measurement process p(y | φ). At … is a vector that lists the number of times c_{my} that the sequence x_m was observed in each bin y.

MPA measurement processes are represented as a multilayer perceptron with one hidden layer (having tanh activations) and a softmax output layer. Specifically, …
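The architecture just described can be sketched directly in numpy; the weight shapes and values below are hypothetical stand-ins for the trained parameters, and the function name is not part of the MAVE-NN API.

```python
import numpy as np

def mpa_measurement_process(phi, W1, b1, W2, b2):
    """Sketch of an MPA measurement process p(y | phi): a perceptron with one
    tanh hidden layer of K units followed by a softmax over Y bins.
    Returns an array of shape (N, Y) whose rows each sum to 1."""
    phi = np.asarray(phi, dtype=float)[:, None]          # (N, 1) latent phenotypes
    h = np.tanh(phi @ W1 + b1)                           # (N, K) hidden activations
    logits = h @ W2 + b2                                 # (N, Y) one logit per bin
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
K, Y = 8, 4                                              # hidden units, sorting bins
p = mpa_measurement_process(rng.normal(size=10),
                            rng.normal(size=(1, K)), rng.normal(size=K),
                            rng.normal(size=(K, Y)), rng.normal(size=Y))
```

Training then maximizes the multinomial likelihood of the observed bin counts c_{my} under these probabilities.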

To compute a lower bound on I_int for MPSA data (Fig. 5c) …