MAGeCK-MLE: a maximum likelihood approach for essential gene detection
The Negative Binomial model for high-throughput CRISPR screening read counts
After read mapping, the sequencing results of CRISPR screening are presented as a read count table, where rows correspond to sgRNAs and columns correspond to samples. Read counts generated from high-throughput sequencing data have higher variances when a high number of read counts are observed (also called ‘over-dispersion’). This is usually modeled using Negative Binomial (NB) distribution, such as in the statistical models used in many RNA-seq differential expression analysis algorithms: edgeR, DESeq/DESeq2, and so on [35–37]. MAGeCK-MLE uses a similar model; briefly, the read count of sgRNA i in sample j, or x
ij
, is modeled as:
$$ {x}_{ij}\sim NB\left({\mu}_{ij},{\alpha}_i\right) $$
Where μ
ij
and α
i
are the mean and over-dispersion factor of the NB distribution, respectively. The mean value μ
ij
is further modeled as:
$$ {\mu}_{ij}={s}_j{q}_{ij} $$
(1)
Where s
j
is the size factor of sample j for adjusting sequencing depths of the samples, and qij is a variable modeling the behavior of sgRNA i in sample j that will be discussed in later sections. s
j
is calculated by the ‘median ratio method’ in MAGeCK and DESeq2 [10, 37]:
$$ {s}_j= media{n}_i\left\{\frac{x_{ij}}{{\hat{x}}_{i\cdot }}\right\} $$
Here, \( {\widehat{x}}_i \) is the geometric mean of the read counts of sgRNA i across all J samples: \( {\widehat{x}}_i={\left({\displaystyle \prod_{k=1}^J}{x}_{ik}\right)}^{1/J}. \)
s
j
can also be calculated based on a set of predefined ‘control’ sgRNAs instead of all sgRNAs. This is particularly useful when a majority of the genes in the library are supposed to be essential; in such cases it is not suitable to calculate s
j
based on all sgRNAs. Both methods are implemented in MAGeCK-VISPR and users can specify which method to use.
The over-dispersion factor α
i
is calculated based on the regression residual and will be discussed in more details in the last Methods section.
Modeling sgRNA knockout efficiency and complex experimental settings
Different studies demonstrated that sgRNAs have various DNA cutting efficiencies [11, 38], but such information is not considered in most essential gene calling algorithms (including MAGeCK). In MAGeCK-MLE, we use a binary variable π
i
to model whether sgRNA i is efficient or not: π
i
= 1 corresponds to an efficient sgRNA i and vice versa. Since π
i
is unknown, the probability of observing a read count x from x
ij
is a mixture of two distributions:
$$ P\left({x}_{ij}=x\right)=p\left({x}_{ij}=x\Big|{\pi}_i=1\right)p\left({\pi}_i=1\right)+p\left({x}_{ij}=x\Big|{\pi}_i=0\right)p\left({\pi}_i=0\right) $$
In CRISPR screening experiments, it is common to have cells treated with different conditions. For example in melanoma activation dataset [5], cell lines underwent different sgRNA expression selection methods (cells are first selected using puromycin or zeocin), duration of treatment (3-day or 21-day treatment) and drug treatments (DMSO or PLX). For an efficient sgRNA i (π
i
= 1), MAGeCK-MLE uses a generalized linear model (GLM) to model the effect of q
ij
as a linear combination of effects from different sources:
$$ \begin{array}{c}\hfill P\left({x}_{ij}=x\Big|{\pi}_i=1\right)\sim NB\left(x;{s}_j{q}_{ij},{\alpha}_i\right)\hfill \\ {}\hfill log\left({q}_{ij}\right)={\beta}_{i0}+{\displaystyle \sum_r}{d}_{jr}{\beta}_{gr}\hfill \end{array} $$
(2)
Here, β
i0 is the baseline abundance of sgRNA i, corresponding to its abundance in an initial state (in plasmid or day 0). d
jr
is an element of a design matrix given by the user (explained later), and β
gr
is the (unknown) coefficient that we would like to estimate.
If sgRNA i is inefficient (π
i
= 0), then its read counts in all samples are not determined by any experimental conditions except the baseline abundance:
$$ \begin{array}{c}\hfill P\left({x}_{ij}=x\Big|{\pi}_i=0\right)\sim NB\left(x;{s}_j{q}_{ij},{\alpha}_i\right)\hfill \\ {}\hfill log{q}_{ij}={\beta}_{i0}\hfill \end{array} $$
(3)
The design matrix
Design matrices have been used in many gene expression analysis algorithms for modeling complex experimental designs, including LIMMA [39], VOOM [40], DESeq2 [37], and so on. The design matrix D models the combination of effects of different conditions. For J samples that are affected by R conditions, D is a J * R binary matrix with element d
jr
= 1 if sample j is affected by condition R, and 0 otherwise. An example of the design matrix is presented in Additional file 1.
Based on the design matrix, the equations in (2) and (3) can be written in a matrix form. For a gene g with N sgRNAs in J samples, let \( {\overrightarrow{q}}_g \) be the vector of q values of all sgRNAs in all samples in gene g:
$$ {\overrightarrow{q}}_g={\left({q}_{11},{q}_{21}, \dots, {q}_{N1},\dots, {q}_{1J},{q}_{2J},\dots, {q}_{NJ}\right)}^T $$
It can be written as:
$$ \log \left({\overrightarrow{q}}_g\right)=D\hbox{'}*{\overrightarrow{\beta}}_g $$
Where \( {\overrightarrow{\beta}}_g \) is a N + r vector of β values in Equations (2) and (3). The first N elements of \( {\overrightarrow{\beta}}_g \) are the baseline abundances of N sgRNAs, and the following R elements of \( {\overrightarrow{\beta}}_g \) are the coefficients corresponding to R columns in the design matrix:
$$ {\overrightarrow{\beta}}_g={\left({\beta}_{00},{\beta}_{10},\dots,\ {\beta}_{N0},{\beta}_1,\dots, {\beta}_r\right)}^T. $$
The binary extended design matrix D’ is used to set up the linear relationship between \( {\overrightarrow{\beta}}_g \) and \( {\overrightarrow{q}}_g \), and can be derived directly from the design matrix. See Additional file 1 for the definition and an example of D’.
The EM approach
MAGeCK-MLE uses a maximum likelihood estimation (MLE) approach to find the values of \( {\overrightarrow{\beta^{\ast}}}_g \) The objective function of MAGeCK-MLE is:
$$ \left({\overrightarrow{\beta}}_g^{*},{\pi_i}^{*}\right)= arg\underset{\upbeta_{\mathrm{g}},\ {\pi}_i}{max}\left({\displaystyle \sum_{\begin{array}{c}\hfill i\in g,\ \hfill \\ {}\hfill j=1,\dots J\hfill \end{array}}} \log\ p\left({x}_{ij}\right)\ \right) $$
Similar to DESeq2 [37], MAGeCK-MLE also adds a prior \( p\left({\overrightarrow{\beta}}_g\right) \) that follows a normal distribution centered on zero in the objective function. Adding this prior makes sure \( {\overrightarrow{\beta^{\ast}}}_g \) does not become arbitrarily large, when the sgRNA knockout efficiency is low and the differences of read counts between samples are high.
The objective function can be maximized using expectation maximization (EM). At the beginning, we have an initial guess of p(π
i
= 1). Subsequently, we iteratively update the values of p(π
i
= 1) and \( \overrightarrow{\beta} \) in the E step and the M step, respectively.
The initial guess of sgRNA knockout efficiency
We demonstrated that the SSC (Spacer Scoring of CRISPR) algorithm accurately predicts sgRNA knockout efficiency from genomic sequence content [11]. For each sgRNA, SSC generates an efficiency score in the range (−2,2). We scale the score linearly to the range (0,1) as an initial guess of p(π
i
= 1). If no initial estimates are given, MAGeCK-MLE starts with p(π
i
= 1) = 1 for all sgRNAs.
The expectation step
In the E step, we re-estimate the posterior probability p(π
i
= 1) and the current estimation of \( {\overrightarrow{\beta}}_g \):
$$ p\left({\pi}_i=1\Big|{x}_{ij},{\overrightarrow{\beta}}_g\ \right)=\frac{{\displaystyle {\prod}_j}p\left({x}_{ij}\left|{\pi}_i=1,\ {\overrightarrow{\beta}}_g\ \right)p\left({\pi}_i=1\right|{\overrightarrow{\beta}}_g\right)}{{\displaystyle {\prod}_j}p\left({x}_{ij}\Big|{\pi}_i=1,{\overrightarrow{\beta}}_g\right)p\left({\pi}_i=1\Big|{\overrightarrow{\beta}}_g\right) + {\displaystyle {\prod}_j}p\left({x}_{ij}\left|{\pi}_i=0,\ {\overrightarrow{\beta}}_g\ \right)p\left({\pi}_i=0\right|{\overrightarrow{\beta}}_g\right)\ } $$
The maximization step
In the M step, we maximize the values of \( {\overrightarrow{\beta}}_g \) based on the values of p(π
i
= 1). To derive the formula for updating \( {\overrightarrow{\beta}}_g \), we write the probability of observing a read count x of x
ij
as:
$$ P\left({x}_{ij}=x\right)=P{\left({x}_{ij}=x\Big|{\pi}_i=1\right)}^{I\left({\pi}_i=1\right)}*P{\left({x}_{ij}=x\Big|{\pi}_i=0\right)}^{I\left({\pi}_i=0\right)} $$
where I(.) is an indicator function. Taking the logarithm on both sides of the equation, we get
$$ \log P\left({x}_{ij}=x\right)=I\left({\pi}_i=1\right) \log P\left({x}_{ij}=x\Big|{\pi}_i=1\right)+I\left({\pi}_i=0\right) \log P\left({x}_{ij}=x\Big|{\pi}_i=0\right) $$
In the EM algorithm, it can be approximated by replacing the indicator function I(π
i
= 1) and I(π
i
= 0) with the posterior probability of P(π
i
= 1) and P(π
i
= 0), respectively [41], using the results from the E step. Therefore, the log likelihood function from the mixture model can be written as:
$$ \begin{array}{l}{\displaystyle \sum_{i,j}} \log P\left({x}_{ij}=x\right)\hfill \\ {}={\displaystyle {\sum}_{i,j}P\left({\pi}_i=1\Big|{x}_{ij},{\overrightarrow{\beta}}_g\right) logP\left({x}_{ij}=x\Big|\ {\pi}_i=1\right)+P\left({\pi}_i=0\Big|{x}_{ij},{\overrightarrow{\beta}}_g\right) logP\left({x}_{ij}=x\Big|{\pi}_i=0\right)}\hfill \end{array} $$
Since NB distribution belongs to exponential family distributions, a fast algorithm exists for the maximum likelihood estimation of generalized linear models [42]. Taking the prior of \( {\overrightarrow{\beta}}_g \) into consideration, the objective function can be maximized using iteratively reweighted ridge regression, or weighted updates, the same the algorithm used in DESeq2 [37]. The update rule for calculating \( {\overrightarrow{\beta^t}}_g \) at step t of the iteration can be written as:
$$ \overrightarrow{\beta_g^t}={\left({\mathrm{D}}^{\hbox{'}\mathrm{T}}W\mathrm{D}\hbox{'}+\lambda I\right)}^{-1}{\mathrm{D}}^{\hbox{'}\mathrm{T}}W\overrightarrow{z^t} $$
Here, W is the diagonal matrix with its values given by w
ii
= e
i
t/(1/μ
i
+ α
i
), where e
i
t is the current estimate of the efficiency of sgRNA \( i:{e}_i^t=P\left({\pi}_i=1\Big|{x}_{ij},\overrightarrow{\beta_g^{t-1}}\right) \), λ is the regularization parameter in the ridge regression, and μ
i
is the current estimate of the mean of the NB variable:
$$ \begin{array}{c}\kern1em \overrightarrow{\mu^t}={s}_j exp\left(\overrightarrow{h^t}\right)\kern1em \\ {}\kern1em \overrightarrow{{\mathrm{h}}^{\mathrm{t}-1}}=\mathrm{D}\hbox{'}\overrightarrow{\beta_g^{t-1}}\kern1em \end{array} $$
(4)
\( {\overrightarrow{\mathrm{z}}}^t \) is the residue vector of the current estimate, with its ith element:
$$ {z}_i^t={\mathrm{h}}_{\mathrm{i}}^{t-1}+{e}_i^t\left({\mathrm{x}}_{\mathrm{i}}-{\mu}_i^t\right)/{\mu}_i^t $$
Here, x
i
is the read count of sgRNA i.
Convergence
The EM approach iterates the E step and the M step until it converges or reaches a predefined maximum number of iteration.
Statistical significance
The statistical significance of \( {\overrightarrow{\beta}}_g \) is calculated in both permutation and Wald test. In permutation test, MAGeCK-MLE shuffles all sgRNAs in a gene to generate empirical null distribution of \( {\overrightarrow{\beta}}_g \) . The number of shufflings is a parameter specified by the user, and the default value is set to be 2*(total number of genes). In the Wald test, MAGeCK-MLE compares the value of \( {\overrightarrow{\beta}}_g/SE\left({\overrightarrow{\beta}}_g\right) \) to the standard Normal distribution, where \( SE\left({\overrightarrow{\beta}}_g\right) \) is the standard error of \( {\overrightarrow{\beta}}_g \):
$$ \begin{array}{c}\hfill SE\left({\overrightarrow{\beta}}_g\right)=\sqrt{diag\Big(Cov\left(\left({\overrightarrow{\beta}}_g\right)\right)\ }\hfill \\ {}\hfill Cov\left(\left({\overrightarrow{\beta}}_g\right)\right)={\left({D}^{\hbox{'}T}W{D}^{\hbox{'}}+\lambda I\right)}^{-1}\left({D}^{\hbox{'}T}W{D}^{\hbox{'}}\right){\left({D}^{\hbox{'}T}W{D}^{\hbox{'}}+\lambda I\right)}^{-1}\hfill \end{array} $$
Here, \( diag\left(Cov\left({\overrightarrow{\beta}}_g\right)\right) \) are the diagonal elements of the covariance matrix of \( {\overrightarrow{\beta}}_g \).
Calculating the over-dispersion factor
The over-dispersion factor, α
i
, is calculated based on the mean and variance estimation algorithm used in MAGeCK [10] and VOOM [40]. We first calculate the fitted values of \( {\overrightarrow{\beta}}_g \), or \( {\widehat{\beta}}_g \), using the EM algorithm proposed before, with the over-dispersion factor set to a fixed value (for example, 0.01). Then the fitted means \( {\widehat{\mu}}_i \) are calculated using Equation (4), and the residual variances are calculated using the following equation:
$$ {{\widehat{\sigma}}_i}^2={\left({x}_i-{\widehat{\mu}}_i\right)}^2 $$
MAGeCK-MLE then models the sample residual variance \( {\widehat{\sigma}}^2 \) and fitted mean \( \widehat{\mu} \) using the same model as in MAGeCK [10]:
$$ {\widehat{\sigma}}^2=\widehat{\mu}+k{\widehat{\mu}}^b $$
Where k and b are learned from the fitted means and residual variances of all sgRNA read counts. The values of α
i
are then calculated based on the fitted values of sample residual variance \( {{\widehat{\sigma}}_f}^2 \) from this model:
$$ {\alpha}_i=\frac{{\widehat{\sigma}}_f^2-{\widehat{\mu}}_i}{{{\widehat{\mu}}_i}^2} $$
Availability
The MAGeCK-VISPR workflow is available open source at http://bitbucket.org/liulab/mageck-vispr under the MIT license.