MPRAnalyze: statistical framework for massively parallel reporter assays

Massively parallel reporter assays (MPRAs) can measure the regulatory function of thousands of DNA sequences in a single experiment. Despite growing popularity, MPRA studies are limited by a lack of a unified framework for analyzing the resulting data. Here we present MPRAnalyze: a statistical framework for analyzing MPRA count data. Our model leverages the unique structure of MPRA data to quantify the function of regulatory sequences, compare sequences’ activity across different conditions, and provide necessary flexibility in an evolving field. We demonstrate the accuracy and applicability of MPRAnalyze on simulated and published data and compare it with existing methods. Electronic supplementary material The online version of this article (10.1186/s13059-019-1787-z) contains supplementary material, which is available to authorized users.


Introduction
MPRAnalyze aims to infer the true induced transcription rate α for each of the candidate enhancers included in a given MPRA dataset, by modeling the raw counts and avoiding ratio-based summary statistics.
Formally, the MPRA data for a given enhancer can be denoted as d, r where d =

Regression Model
MPRAnalyze assumes a linear relationship between RNA and DNA, an assumption easily checked and mostly supported by MPRA data. Formally, we assume that for some transcription rate α, the true DNA and RNA measures follow a linear relationship with slope α: However, the data is usually generated by a set of different distributions which depend on wanted sources of variation like different cell types or cellular environments, or unwanted sources of variation like batch effects or barcode-specific effects. To enable controlling these factors in the analysis, MPRAnalyze assumes all effects are independent and additive in log space and implements a flexible and extendable regression model. We construct a separate regression model for the DNA and RNA models, and fit them by likelihood maximization. Formally, we model each observation as follows: Where S d,i , S r,i are library depth normalization factors, , m relate the observation index to the appropriate batch factor, and β, γ are the set of parameters to optimize.
In this framework, β (i) is the estimate of the true number of DNA molecules, and m γ m(i) is the estimate of α, the transcription rate. Note that library depth factors are purely technical and therefore S d,i is only corrected for when fitting the DNA model and not included in the RNA model, and that these factors can be estimated externally. Fitting the parameters across the observations can be done by matrix multiplication with binary design matrices X d , X r , simplifying to: We then optimize these parameters by optimizing the likelihood function of the data, conditioned on some distributional assumptions (detailed below). Formally we use the vector of estimates d , r as estimators for the first moment of the distribution the observations were generated by. The exact likelihood computation depends on the distributional model, but conceptually we get: With L D , L R are the likelihood functions of the DNA, RNA distributions (respectively), and D , R their respective log-likelihood functions.

Distributional Models
MPRAnalyze models the generation of the data as: The main complexity of the model therefore stems from the RNA observations being a convolution of the scaled DNA distribution and the RNA measurement distribution. We offer two ways to handle this:

Closed Form Solution
Some distributions provide a closed form for this scenario. MPRAnalyze uses this by assuming a Gamma-Poisson model: we assume the DNA model follows a Gamma distribution: d ∼ Gamma (k, b) and that r| d ∼ P oisson α · d . The resulting likelihood for the RNA model has a closed form as a Negative-Binomial distribution, which is a well established approximation of sequencing data. Formally: To incorporate the regression scheme described above in this setting, we share the dispersion parameter a between the DNA and RNA models. The DNA regression model is the estimate for 1 b , and the RNA regression model is the estimate for α. Therefore, the MLE estimates are: Alternatively, the model can be simplified by fitting the DNA model and collapsing it to a point estimator. Unlike the closed form solution, this means that not all of the information of the DNA model will be shared with the RNA model. However, this means that this option allows for more flexibility in the distributional assumptions, essentially allowing any pair of distributions to be used for the DNA and RNA models. This flexibility is mostly important for the restrictions imposed on the variance estimation. MPRAnalyze implements this approach in two ways, allowing for two sets of constraints on the variance.
2.1 log-Normal DNA, Negative-Binomial RNA assuming: Which, along with the regression scheme, translates to: which, along with the regression scheme, translates to:

Hypothesis testings
MPRAnalyze implements a variety of hypothesis testing options to accommodate the variety of different MPRA experimental design settings. The two major types of analyses MPRA facilitates are quantitative analysis: determining the level of transcriptional activity induced by each candidate enhancer; and comparative analysis: identifying differential activity of a candidate enhancer between biological conditions. MPRAnalyze provides a robust statistical framework that enables addressing these questions while leveraging the existence of negative controls (scramble sequences) in the experiment. The various use cases are summarized by the following diagram, and fully explained below:

Comparative Analysis
A comparative analysis is one where the leading question is about the difference in activity of a given enhancer between conditions. Examples of these include comparing enhancer activity between cell types to detect tissue-specific activity, comparing different time-points in a differentiation process and comparing activity in unstimulated cells with cells exposed to various stimuli. MPRAnalyze offers several possible settings and testing frameworks to address these kinds of questions.

Coefficient-based Testing
A coefficient-based testing framework addresses these comparisons by directly testing the regression coefficients for significane, using a Wald test. This approach enables testing multiple different hypotheses using the same model fit, which can be useful in some cases. For example, if effects of different stimuli are to be tested, a single model can be fit to the entire dataset, then testing the effect of each stimulus is simply computing the Wald-test for the appropriate regression coefficient, with the unstimulated samples as the reference.

Likelihood-Ratio Testing
A likelihood Ratio test lacks the post-fitting flexibility of a coefficient-based testing, but allows more complex questions to be asked. For instace, looking at a time-series dataset of a differentiation process, one might ask "does this enhancer change their activity over time". This question can be difficult to address with coefficient-based testing, since each single coefficient only corresponds to one time point. However, this can easily be addressed with a likelihood-ratio test. This is done by fitting two models: a 'full' model with the complete design including the conditional factor (in this case, the factors that identify the time-point each data-point is associated with), and another model -a 'reduced' modelthat is identical to the full model in everything expect that it lacks this factor of interest. Since the reduced space is a sub-space of the full space (in which the time-related coefficients are 0), these models are nested and a standard Likelihood-Ratio Test can be used to test for statistical significance.

Classification Analysis
A classification analysis address the question of the overall degree of activity of a given enhancer. Note that addressing this question without an established baseline for the basal transcription rate is difficult, and the power and interpretability of these analyses are very limited in the absence of negative controls. MPRAnalyze performs classification by testing each enhancer's estimated transcription rate, α, against the null distribution. This null distribution is based on the alphas of the negative controls if available, and on the bottom half of the total distribution otherwise, under the implicit assumption that active enhancers will be more active than the median enhancer. In either case, MPRAnalyze computes both a Z-score and a MAD-score, which is a median-based variant of the Z-score. These scores are normally distributed under the null, and we can therefore compute p-values from the standard normal distribution.

Classification with Controls
Negative controls can be used directly to establish the null distribution, then computing the scores within the null. Formally, for α c = (α c1 , . . . , α c ) the estimated rates for the controls included in the experiment: Additionally, MPRAnalyze also calculates the empirical p-value:

Classification without Controls
Without the advantage of negative controls to establish the null transcription rate, MPRAnalyze empirically determines the distributional properties of the null set. For this, MPRAnalyze relies on three assumptions: (1) the data contains both active and inactive enhancers; (2) active enhancers generally have a higher estimated rate of transcription than inactive enhancers; (3) not all active enhancers are equally active, but all inactive enhancers are equally inactive, meaning generated by the same distribution determined by the properties of the promoter used. MPRAnalyze empirically estimates the mode of the distribution, by taking the center of the window which contains the maximal number of samples, with a window size set to be 0.02 of the total range of values in the sample. This mode is assumed to be the center of the null distribution, due to the third assumption. Then samples that are lower from the mode are then defined as the null set, and are used to estimate the width of the distribution, assuming the null is symmetric with respect to the mode. The calculation of the scores is done using the null set:

Incorporating Negative Controls
MPRAnalyze enables incorporating negative controls into the model to correct systemic bias in the data. Such bias often stems from inherent biological differences between condition that are not related to the regulatory mechanism studied, e.g cellular size differences. Including negative controls in the experiment, preferably by including randomized "scrambled" sequences in the experiment, allows quantifying and correcting these effects. This can be done in several ways, all of which involve some manner of fitting a joint model for control enhancers. All negative controls are assumed to be generated by the same, baseline, RNA model. However, since there's no reason to assume this has any effect on transfection efficiency, a separate DNA model is still fit to each control enhancer. Formally, for n control enhancers, the joint model is: Fitting a joint models of or with controls can be used in several ways to account for systemic bias in the data.

Control-based Correction Factors
The joint control model described above can be used to generate normalization factors based on systemic bias reflected in the A separate model can be fit for the controls, and correction factors extracted from it and included in the general model. Similar to library-depth correction, this would amount to an additional scaling factor added to the regression scheme that corrects the inherent bias in the system as measured by the negative controls.

Gamma-Poisson convolution proof
Proof of the closed form relationship of the likelihood of the RNA model in the Gamma-Poisson model: Let D ∼ Gamma (k, β) and R|D ∼ P oisson (λ = αD).