 Method
 Open Access
 Published:
Quality control, modeling, and visualization of CRISPR screens with MAGeCKVISPR
Genome Biology volume 16, Article number: 281 (2015)
Abstract
Highthroughput CRISPR screens have shown great promise in functional genomics. We present MAGeCKVISPR, a comprehensive quality control (QC), analysis, and visualization workflow for CRISPR screens. MAGeCKVISPR defines a set of QC measures to assess the quality of an experiment, and includes a maximumlikelihood algorithm to call essential genes simultaneously under multiple conditions. The algorithm uses a generalized linear model to deconvolute different effects, and employs expectationmaximization to iteratively estimate sgRNA knockout efficiency and gene essentiality. MAGeCKVISPR also includes VISPR, a framework for the interactive visualization and exploration of QC and analysis results. MAGeCKVISPR is freely available at http://bitbucket.org/liulab/mageckvispr.
Background
The clustered regularly interspaced short palindromic repeats (CRISPR)/Cas9 system is a powerful genetic engineering technique, allowing direct modifications of genomic loci in most model organisms in a costeffective way. Based on this system, the recent development of highthroughput CRISPR screening technology has shown great promise in functional genomics, allowing researchers to systematically identify genes associated with various phenotypes [1–4]. CRISPR screens can be performed by either direct knockout of genes using CRISPR/Cas9 [1, 2], or perturbing gene expressions using CRISPR and a deadCas9 (dCas9) fused with activation or repression effectors [5, 6].
While CRISPR screening is a powerful technique, it creates computational challenges that include: (1) how to evaluate the data quality; (2) how to identify gene or pathway hits from the screens and assess their statistical significance; and (3) how to visualize and explore the screening results efficiently. Until now, a comprehensive quality control (QC), data analysis, and visualization method for CRISPR screen was not available. Several algorithms are developed for screening analysis on microarray or highthroughput sequencing data, such as RIGER [7], RSA [8], HitSelect [9], as well as the MAGeCK algorithm we previously developed [10]. These algorithms are designed based on a comparison of two conditions, although many screens are conducted simultaneously across several time points, under many treatment conditions or over many cell lines. In addition, these algorithms do not consider the knockout efficiency of single guide RNAs (sgRNA) on target genes. The knockout efficiency is the ability of a sgRNA to induce cutting events that lead to the knockout of the targeted gene. It is influenced by sgRNA sequence content [11], chromatin accessibility and exon position of the targeting gene [12], and so on.
In this study, we present MAGeCKVISPR to overcome the computational challenges of CRISPR screens. MAGeCKVISPR (1) defines a set of QC measurements and (2) extends the MAGeCK algorithm by a maximum likelihood estimation method (MAGeCKMLE) to call essential genes under multiple conditions while considering sgRNA knockout efficiency. Further, MAGeCKVISPR (3) provides a webbased visualization framework (VISPR) for interactive exploration of CRISPR screen quality control and analysis results. MAGeCKVISPR employs a Snakemake [13] workflow to combine MAGeCK and VISPR in a scalable and reproducible way (Fig. 1).
Results and discussion
Quality control measurements for CRISPR screening experiments
Apart from the determination of essential genes with MAGeCK, a central purpose of MAGeCKVISPR is to collect quality control (QC) measurements at various levels (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). The proposed measurements (Table 1) can be divided into four categories: sequence level, read count level, sample level, and gene level (Fig. 2).
Sequence level QC measurements aim to detect problems with the sequencing, similar as in other nextgeneration sequencing (NGS) experiments. Two measurements are reported: sample GC content distribution (Fig. 2a) and the base quality distribution of sequencing reads (Fig. 2b, c). Ideally, sequencing reads should have reasonable base qualities (median value >25), and samples from the same experiment should have similar GC content distributions.
The second level of QC measurements is based on the sgRNA read counts collected from MAGeCK. Raw sequencing reads are first mapped to sgRNA sequences in the library with no mismatches tolerated. After that, the number of sequencing reads, mapped reads (and thereof the percentage of mapped reads), sgRNAs with zero read count, and the Gini index of read count distribution are reported for each sample (Fig. 2df). The percentage of mapped reads is a good indicator of sample quality, and low mappability could be due to sequencing error, oligonucleotide synthesis error, or sample contamination. Good statistical power of downstream analysis relies on sufficient reads (preferably over 300 reads) for each sgRNA, with low number of zerocount sgRNAs in the plasmid library or early time points. Gini index, a common measure of income inequality in economics, can measure the evenness of sgRNA read counts [14]. It is perfectly normal for later time points in positive selection experiments to have higher Gini index since a few surviving clones (a few sgRNA with extreme high counts) could dominate the final pool while most of the other cells die (more sgRNAs with zerocount). In contrast, high Gini index in plasmid library, in early time points, or in negative selection experiments may indicate CRISPR oligonucleotide synthesis unevenness, low viral transfection efficiency, and over selection, respectively.
Sample level QC (Fig. 2gj) checks the consistency between samples. MAGeCKVISPR reports the distributions of normalized read counts by box plots and cumulative distribution functions. It also calculates pairwise Pearson correlations of sample log read counts, and draws the samples on the first three components of a Principle Component Analysis (PCA). Biological replicates or samples with similar conditions should have similar read count distributions and higher correlations, and appear closer to each other in the PCA plot. PCA plots can also identify potential batch effects if the screens are conducted under different batches.
Finally, gene level QC determines the extent of negative selection in the screens. Since knocking out ribosomal genes lead to a strong negative selection phenotype [1, 2], the significance of negative selection on ribosomal genes can be evaluated in MAGeCKVISPR by Gene Ontology (GO) enrichment analysis using GOrilla [15]. A working negative selection experiment should have a significant P value (<0.001), although many good experiments could have much smaller P values (<1e10, see Section A of Additional file 1).
Calling essential genes under multiple conditions with MAGeCKMLE
MAGeCKVISPR includes a new algorithm, ‘MAGeCKMLE’, to estimate the essentiality of genes in various screening conditions using a maximum likelihood estimation (MLE) approach. Compared with the original MAGeCK algorithm using Robust Rank Aggregation (‘MAGeCKRRA’) that can only compare samples between two conditions, MAGeCKMLE is able to model complex experimental designs. Furthermore, MAGeCKMLE explicitly models the sgRNA knockout efficiency, which may vary depending on different sequence contents and chromatin structures [11, 12]. In MAGeCKMLE, the read count of a sgRNA i targeting gene g in sample j is modeled as a Negative Binomial (NB) random variable. The mean of the NB distribution (μ _{ ij }) is dependent on three factors: the sequencing depth of sample j (s _{ j }), the knockout efficiency of sgRNA i, and a linear combination of the effects in different conditions (that is, different drug treatments) on gene g. If sgRNA i knocks out target gene g efficiently, then μ _{ ij } is modeled as:
The effects of r different conditions are represented as the score ‘β _{ gr }’, a measurement of gene selections similar to the term of ‘log fold change’ in differential expression analysis. The presence or absence of each condition on each sample is encoded into binary elements of the design matrix d _{ jr }, and can be obtained from experiment designs. ‘β’ scores reflect the extent of selection in each condition: β _{ gr } >0 (or <0) means g is positively (or negatively) selected in condition r. μ _{ ij } is also dependent on β _{ i0}, the initial sgRNA abundance which is usually measured in plasmid or the day 0 of the experiment.
The values of β, together with the information whether an sgRNA is efficient, can be estimated by maximizing the joint loglikelihood of observing all sgRNA read counts of g on all different samples, and are optimized using an ExpectationMaximization (EM) algorithm. In the EM algorithm, MAGeCKMLE iteratively determines the knockout efficiency of each sgRNA based on the current estimation of ‘β’ scores (the E step), and uses the updated knockout efficiency information to recalculate ‘β’ scores (the M step). By examining the patterns of read counts of each sgRNA across all samples, the EM algorithm minimizes the effect of inefficient sgRNAs. A detailed description of the method is presented in the Methods section.
We tested MAGeCK algorithms on four public datasets. The first two datasets (the ‘ESC’ and ‘leukemia’ dataset) correspond to negative selection experiments on mouse embryonic stem cells (ESCs) and two human leukemia cell lines (KBM7 and HL60), respectively (Fig. 3a and b) [1, 4]. In both datasets, cells were grown with their natural growing condition and negative selections occurred in cells after CRISPR/Cas9 is activated. The other two datasets (‘melanoma’ knockout and activation dataset) are different CRISPR screens on the human melanoma cell line A375 that harbors a BRAF V600E mutation (Figs. 4 and 5). The cells were treated with BRAF inhibitor vemurafenib (PLX) or dimethyl sulfoxide (DMSO) control, and screened either with GeCKO [2] or with CRISPR/dCas9 Synergistic Activation Mediator (SAM) libraries [5]. These two datasets include multiple experimental conditions that are difficult to compare directly using the original MAGeCKRRA algorithm. In the melanoma knockout dataset, cells were under 7day or 14day selection [2]. In the melanoma activation dataset, two different drugs (puromycin and zeocin) were used to select cells with lentiviral infection, and both DMSO and PLX treatments were profiled under 3day or 21day selection [5].
In twocondition comparisons, MAGeCKMLE gives similar results with existing methods such as MAGeCKRRA, RSA, and RIGER. All the algorithms identified genes that are commonly essential to different cell types [16], as well as known positively selected genes in PLX treated conditions in two melanoma datasets (Fig. 3; also see Section A and B of Additional file 1). In the leukemia dataset, twocondition comparison algorithms (like MAGeCKRRA) identified genes that are differentially selected in two cell lines by a direct comparison of HL60 and KBM7 (Fig. 3a) [10]. However, not all of these genes are equally biologically interesting, as MAGeCKMLE further distinguished them into two groups: genes having little effect in one (β scores close to zero) but strong selection effect in the other cell line (large absolute β scores), and genes having weak and opposite effects in two cell lines (Fig. 3c). The first group of genes are often more biologically interesting as they are cell typespecific genes. This includes some wellknown driver genes (like BCR in KBM7) as well as genes that may be functional in only one cell type: CDK6 and TRIB1 in HL60 [17, 18], and RUNX1 in KBM7 [19].
One of the advantages of MAGeCKMLE over other methods is that it enables accurate comparisons of gene essentialities across multiple conditions and experiments in one run (Fig. 4 and Section C of Additional file 1). In the melanoma knockout dataset, a kmeans clustering of the β scores of top selected genes demonstrated that these genes have various essentialities across conditions (Fig. 4a). Some of the genes are universally positively or negatively selected in all conditions (cluster 3), while others have different essentiality across different conditions (clusters 1, 2, and 4). Genes in cluster 4 are particularly interesting as they show strong positive selection in 14day PLX treated condition. Indeed, genes whose knockout leads to strong positive selection in PLXtreated cells are in cluster 4, including NF1, NF2, MED12, CUL3 [2]. In contrast, the kmeans clusters of measurements from other algorithms did not reveal the strong effect of genes in cluster 4 (Section C of Additional file 1). This is because their score distributions are similar across different conditions (Fig. 4b), and do not reflect the fact that the one condition (PLX 14day treatment) induces much stronger positive selection than other conditions [2]. This is partly because MAGeCKRRA, RIGER, and RSA all use a rankbased method to compare sgRNA between two conditions, which may lose quantitative information.
Another example of using MAGeCKMLE on multiple conditions is demonstrated in the melanoma activation dataset, where cells underwent different selection methods (using puromycin or zeocin), drug treatments (DMSO or PLX), and durations (3day or 21day treatment) (Fig. 5). Similar to the melanoma knockout dataset, we performed kmeans clustering of the topselected gene β scores. Many positively selected genes are dependent on the selection method, which might not be biologically interesting. For example, genes in clusters 2 and 4 correspond to positively selected genes that are specific to puromycin or zeocin selection, respectively. A small set of genes (cluster 5) are consistently selected in both zeocin and puromycin, including genes that are validated in the original study, for example, EGFR, GPR35, LPAR1/5 [5]. We further examined the genes in cluster 5 (Fig. 5b), and focused on genes positively selected in the CRISPR activation experiment but strongly negatively selected in the knockout experiment. These genes include EGFR and BRAF, two known kinases that drive melanoma progression and PLX resistance [20, 21], and CRKL, a protein kinase that activates RAS and JUN pathway. CRKL amplification is reported to lead to drug resistance against EGFR inhibitors by activating EGFR downstream pathways [22], implying its potential role in PLX drug resistance.
Visualization of QC measurements and gene essentiality with VISPR
VISPR (VISualization of crisPR screens) is a webbased frontend for interactive visualization of CRISPR screen QC and comparison results. Interactive access is provided by an HTML5 based browser interface, while visualizations are realized with Vega [23], a declarative visualization grammar on top of DataDriven Documents (D3) [24]. VISPR provides three types of views for interactive exploration of CRISPR screening: a quality control view, a result view, and an experiment comparison view. The quality control view shows the QC measurements described before (Fig. 2).
In the result view, screening results can be interactively explored. It contains a table showing the comparison results of each gene (Fig. 6a). The table can be sorted by different columns and filtered (from ‘Search’) via gene names or regular expressions. Further, the distribution of P values is displayed as cumulative distribution function (CDF) (Fig. 6b) and as a histogram (Fig. 6c). For each gene, the normalized sgRNA counts in all samples can be displayed in a parallel coordinate visualization (Fig. 6d). If available, knockout efficiency predictions [11] and gene coordinates of each sgRNA are displayed as separate axes. Axes can be reordered or toggled on or off, and sgRNAs can be highlighted by selecting ranges on each axis. Genes selected in the table are highlighted in the CDF, allowing to assess their occurrence within the P value distribution of all genes.
VISPR provides various ways to further explore the analysis results. Individual genes can be viewed in Ensembl [25] and IGV [26]. Selected genes can be visualized in terms of their interaction network and function via GeneMANIA [27]. Functional analysis can be performed with GOrilla [15], an online Gene Ontology (GO) enrichment analysis tool. GOrilla takes a ranked list of genes (here based on the P values reported by MAGeCK) to perform a thresholdfree enrichment analysis. The resulting GO term enrichments can be further used for genelevel quality control.
The comparison view of VISPR can compare different experiments by visualizing the common and exclusive significant genes via Euler diagrams (Fig. 6e). Clicking on segments of the Euler diagram opens the result views of the corresponding experiments. For example, clicking on the intersection between two experiments will open ‘restricted’ result views for each experiment, where only the common significant genes are displayed. These views provide the same features as the unconstrained result views described above. However, in this case, GO enrichment analysis with GOrilla is performed with the shown genes (that is, the genes from the intersection) as foreground and the other genes of the experiment as background.
The visualizations displayed in VISPR can be downloaded as publicationready SVG files. In addition, a command line interface is provided to store visualizations as Vega specifications. This format allows users to modify and style the output of VISPR programmatically.
Implementation of the MAGeCKVISPR workflow with Snakemake
We implemented the MAGeCKVISPR workflow with the workflow management system Snakemake [13], allowing an automatic execution of some or all of the MAGeCKVISPR functions: quality control, essential gene analysis, and visualization. Choosing a workflow management system like Snakemake has several advantages. First, the workflow steps can be automatically parallelized and executed on workstations, servers, and compute clusters without the modification of the workflow. Second, Snakemake tracks metadata (like creation date, input, and log files) for all generated result and intermediate files. This way, used data, methods, and parameters are documented comprehensively for each analysis (also called data provenance), an important requirement of reproducible science. MAGeCKVISPR provides a command line interface to initialize the workflow in a given work directory. This installs the workflow definition as a socalled Snakefile, along with a configuration file and documentation. The configuration file is used to define locations of raw data and additional parameters for MAGeCKVISPR. Once configured, the Snakefile can be executed with Snakemake. Since the Snakefile is installed into the given work directory, it can be easily modified or extended by the user.
We provide all components of the workflow as Conda packages [28], such that MAGeCKVISPR can be installed with a single command. Optionally, the Conda package manager can create isolated environments for the workflow to, for example, freeze or compare different software versions or publish snapshots of a MAGeCKVISPR workflow instance along with all data and used software. This further increases the reproducibility of the generated results.
Conclusion
The recently developed CRISPR screening is a powerful technology in functional studies with different foci, including tumor progression and metastasis [29], drug resistance [3], immune response [30], and stem cell differentiation [4]. To our knowledge, MAGeCKVISPR is the first comprehensive pipeline developed for quality control, analysis, and visualization of CRISPR screens, and highlights new features compared with existing screening analysis algorithms. For example, a typical CRISPR screening experiment usually includes complex designs that are difficult to analyze using existing algorithms, as they are all designed for twocondition comparisons. To address this challenge, MAGeCKVISPR uses a maximum likelihood approach to estimate the effect of different conditions using a generalized linear model (GLM). It also incorporates sgRNA knockout efficiency information by using a probabilistic mixture model. We demonstrated that MAGeCKMLE provides additional insights into cell typespecific essential genes and is able to compare gene essentiality scores across conditions or even experiments. Also, MAGeCKVISPR is able to handle screens of different types including CRISPR knockout and CRISPR activation screens, and can be potentially applied to highthroughput sequencing datasets of traditional RNA interference (RNAi) screens.
The MAGeCKMLE approach is able to estimate sgRNA knockout efficiencies from CRISPR screens besides gene essentiality. We previously reported that sequencespecific features learned from CRISPR screening data helped the design of efficient sgRNAs [11]. With more CRISPR screen data becoming available, the algorithm will help us identify sgRNAs with the best behavior and learn patterns of ‘good’ sgRNAs. The information will further guide the design of optimized sgRNAs for CRISPR screens and individual gene knockouts.
One potential limitation of MAGeCKMLE is that its EM algorithm uses an iterative process involving matrix operations, making it slower than our previous MAGeCK RRA method and other competing algorithms. Future approaches to speed up MAGeCKMLE include improving parametric tests for P value estimation (instead of using permutation) and implementing the algorithm in Cython instead of Python. Another potential limitation of MAGeCKVISPR on the quality control assessment is that the current QC thresholds for ‘successful’ experiments are determined heuristically due to limited number of publicly available CRISPR screening datasets. We and other researchers have previously reported that bigger collections of ChIPseq datasets provide better criteria on ChIPseq quality control [31, 32]. As more public CRISPR screening datasets become available, the QC metrics (and other parts of MAGeCKVISPR) can be further refined.
As CRISPR screens become more popular, complications in the data such as batch effects will be unavoidable which need proper correction for meaningful downstream analysis. Existing batch removal algorithms, including ComBat [33] and RUVseq [34], have been widely used to remove batch effects in gene expression analysis. In the future, these algorithms can be integrated into MAGeCKVISPR pipeline. After that, MAGeCKVISPR will be able to identify cancer and diseasespecific essential genes by a direct comparison between different datasets or experiments, providing potentially new therapeutic insights into the mechanisms of diseases and cancers.
Methods
MAGeCKMLE: a maximum likelihood approach for essential gene detection
The Negative Binomial model for highthroughput CRISPR screening read counts
After read mapping, the sequencing results of CRISPR screening are presented as a read count table, where rows correspond to sgRNAs and columns correspond to samples. Read counts generated from highthroughput sequencing data have higher variances when a high number of read counts are observed (also called ‘overdispersion’). This is usually modeled using Negative Binomial (NB) distribution, such as in the statistical models used in many RNAseq differential expression analysis algorithms: edgeR, DESeq/DESeq2, and so on [35–37]. MAGeCKMLE uses a similar model; briefly, the read count of sgRNA i in sample j, or x _{ ij }, is modeled as:
Where μ _{ ij } and α _{ i } are the mean and overdispersion factor of the NB distribution, respectively. The mean value μ _{ ij } is further modeled as:
Where s _{ j } is the size factor of sample j for adjusting sequencing depths of the samples, and q_{ij} is a variable modeling the behavior of sgRNA i in sample j that will be discussed in later sections. s _{ j } is calculated by the ‘median ratio method’ in MAGeCK and DESeq2 [10, 37]:
Here, \( {\widehat{x}}_i \) is the geometric mean of the read counts of sgRNA i across all J samples: \( {\widehat{x}}_i={\left({\displaystyle \prod_{k=1}^J}{x}_{ik}\right)}^{1/J}. \) s _{ j } can also be calculated based on a set of predefined ‘control’ sgRNAs instead of all sgRNAs. This is particularly useful when a majority of the genes in the library are supposed to be essential; in such cases it is not suitable to calculate s _{ j } based on all sgRNAs. Both methods are implemented in MAGeCKVISPR and users can specify which method to use.
The overdispersion factor α _{ i } is calculated based on the regression residual and will be discussed in more details in the last Methods section.
Modeling sgRNA knockout efficiency and complex experimental settings
Different studies demonstrated that sgRNAs have various DNA cutting efficiencies [11, 38], but such information is not considered in most essential gene calling algorithms (including MAGeCK). In MAGeCKMLE, we use a binary variable π _{ i } to model whether sgRNA i is efficient or not: π _{ i } = 1 corresponds to an efficient sgRNA i and vice versa. Since π _{ i } is unknown, the probability of observing a read count x from x _{ ij } is a mixture of two distributions:
In CRISPR screening experiments, it is common to have cells treated with different conditions. For example in melanoma activation dataset [5], cell lines underwent different sgRNA expression selection methods (cells are first selected using puromycin or zeocin), duration of treatment (3day or 21day treatment) and drug treatments (DMSO or PLX). For an efficient sgRNA i (π _{ i } = 1), MAGeCKMLE uses a generalized linear model (GLM) to model the effect of q _{ ij } as a linear combination of effects from different sources:
Here, β _{ i0} is the baseline abundance of sgRNA i, corresponding to its abundance in an initial state (in plasmid or day 0). d _{ jr } is an element of a design matrix given by the user (explained later), and β _{ gr } is the (unknown) coefficient that we would like to estimate.
If sgRNA i is inefficient (π _{ i } = 0), then its read counts in all samples are not determined by any experimental conditions except the baseline abundance:
The design matrix
Design matrices have been used in many gene expression analysis algorithms for modeling complex experimental designs, including LIMMA [39], VOOM [40], DESeq2 [37], and so on. The design matrix D models the combination of effects of different conditions. For J samples that are affected by R conditions, D is a J * R binary matrix with element d _{ jr } = 1 if sample j is affected by condition R, and 0 otherwise. An example of the design matrix is presented in Additional file 1.
Based on the design matrix, the equations in (2) and (3) can be written in a matrix form. For a gene g with N sgRNAs in J samples, let \( {\overrightarrow{q}}_g \) be the vector of q values of all sgRNAs in all samples in gene g:
It can be written as:
Where \( {\overrightarrow{\beta}}_g \) is a N + r vector of β values in Equations (2) and (3). The first N elements of \( {\overrightarrow{\beta}}_g \) are the baseline abundances of N sgRNAs, and the following R elements of \( {\overrightarrow{\beta}}_g \) are the coefficients corresponding to R columns in the design matrix:
The binary extended design matrix D’ is used to set up the linear relationship between \( {\overrightarrow{\beta}}_g \) and \( {\overrightarrow{q}}_g \), and can be derived directly from the design matrix. See Additional file 1 for the definition and an example of D’.
The EM approach
MAGeCKMLE uses a maximum likelihood estimation (MLE) approach to find the values of \( {\overrightarrow{\beta^{\ast}}}_g \) The objective function of MAGeCKMLE is:
Similar to DESeq2 [37], MAGeCKMLE also adds a prior \( p\left({\overrightarrow{\beta}}_g\right) \) that follows a normal distribution centered on zero in the objective function. Adding this prior makes sure \( {\overrightarrow{\beta^{\ast}}}_g \) does not become arbitrarily large, when the sgRNA knockout efficiency is low and the differences of read counts between samples are high.
The objective function can be maximized using expectation maximization (EM). At the beginning, we have an initial guess of p(π _{ i } = 1). Subsequently, we iteratively update the values of p(π _{ i } = 1) and \( \overrightarrow{\beta} \) in the E step and the M step, respectively.
The initial guess of sgRNA knockout efficiency
We demonstrated that the SSC (Spacer Scoring of CRISPR) algorithm accurately predicts sgRNA knockout efficiency from genomic sequence content [11]. For each sgRNA, SSC generates an efficiency score in the range (−2,2). We scale the score linearly to the range (0,1) as an initial guess of p(π _{ i } = 1). If no initial estimates are given, MAGeCKMLE starts with p(π _{ i } = 1) = 1 for all sgRNAs.
The expectation step
In the E step, we reestimate the posterior probability p(π _{ i } = 1) and the current estimation of \( {\overrightarrow{\beta}}_g \):
The maximization step
In the M step, we maximize the values of \( {\overrightarrow{\beta}}_g \) based on the values of p(π _{ i } = 1). To derive the formula for updating \( {\overrightarrow{\beta}}_g \), we write the probability of observing a read count x of x _{ ij } as:
where I(.) is an indicator function. Taking the logarithm on both sides of the equation, we get
In the EM algorithm, it can be approximated by replacing the indicator function I(π _{ i } = 1) and I(π _{ i } = 0) with the posterior probability of P(π _{ i } = 1) and P(π _{ i } = 0), respectively [41], using the results from the E step. Therefore, the log likelihood function from the mixture model can be written as:
Since NB distribution belongs to exponential family distributions, a fast algorithm exists for the maximum likelihood estimation of generalized linear models [42]. Taking the prior of \( {\overrightarrow{\beta}}_g \) into consideration, the objective function can be maximized using iteratively reweighted ridge regression, or weighted updates, the same the algorithm used in DESeq2 [37]. The update rule for calculating \( {\overrightarrow{\beta^t}}_g \) at step t of the iteration can be written as:
Here, W is the diagonal matrix with its values given by w _{ ii } = e _{ i } ^{t}/(1/μ _{ i } + α _{ i }), where e _{ i } ^{t} is the current estimate of the efficiency of sgRNA \( i:{e}_i^t=P\left({\pi}_i=1\Big{x}_{ij},\overrightarrow{\beta_g^{t1}}\right) \), λ is the regularization parameter in the ridge regression, and μ _{ i } is the current estimate of the mean of the NB variable:
\( {\overrightarrow{\mathrm{z}}}^t \) is the residue vector of the current estimate, with its ith element:
Here, x _{ i } is the read count of sgRNA i.
Convergence
The EM approach iterates the E step and the M step until it converges or reaches a predefined maximum number of iteration.
Statistical significance
The statistical significance of \( {\overrightarrow{\beta}}_g \) is calculated in both permutation and Wald test. In permutation test, MAGeCKMLE shuffles all sgRNAs in a gene to generate empirical null distribution of \( {\overrightarrow{\beta}}_g \) . The number of shufflings is a parameter specified by the user, and the default value is set to be 2*(total number of genes). In the Wald test, MAGeCKMLE compares the value of \( {\overrightarrow{\beta}}_g/SE\left({\overrightarrow{\beta}}_g\right) \) to the standard Normal distribution, where \( SE\left({\overrightarrow{\beta}}_g\right) \) is the standard error of \( {\overrightarrow{\beta}}_g \):
Here, \( diag\left(Cov\left({\overrightarrow{\beta}}_g\right)\right) \) are the diagonal elements of the covariance matrix of \( {\overrightarrow{\beta}}_g \).
Calculating the overdispersion factor
The overdispersion factor, α _{ i }, is calculated based on the mean and variance estimation algorithm used in MAGeCK [10] and VOOM [40]. We first calculate the fitted values of \( {\overrightarrow{\beta}}_g \), or \( {\widehat{\beta}}_g \), using the EM algorithm proposed before, with the overdispersion factor set to a fixed value (for example, 0.01). Then the fitted means \( {\widehat{\mu}}_i \) are calculated using Equation (4), and the residual variances are calculated using the following equation:
MAGeCKMLE then models the sample residual variance \( {\widehat{\sigma}}^2 \) and fitted mean \( \widehat{\mu} \) using the same model as in MAGeCK [10]:
Where k and b are learned from the fitted means and residual variances of all sgRNA read counts. The values of α _{ i } are then calculated based on the fitted values of sample residual variance \( {{\widehat{\sigma}}_f}^2 \) from this model:
Availability
The MAGeCKVISPR workflow is available open source at http://bitbucket.org/liulab/mageckvispr under the MIT license.
Abbreviations
 AML:

acute myeloid leukemia
 CDF:

cumulative distribution function
 CML:

chronic myeloid leukemia
 CRISPR:

clustered regularly interspaced short palindromic repeats
 D3:

DataDriven Documents
 dCas9:

dead Cas9
 DMSO:

dimethyl sulfoxide
 EM:

expectationmaximization
 GeCKO:

genomescale CRISPR/Cas9 knockout
 GLM:

generalized linear model
 GO:

gene ontology
 MAGeCK:

Modelbased Analysis of Genomewide CRISPR/Cas9 Knockout
 MLE:

maximumlikelihood estimation
 NB:

negative binomial
 NGS:

nextgeneration sequencing
 PCA:

principle component analysis
 QC:

quality control
 RNAi:

RNA interference
 RRA:

robust rank aggregation
 SAM:

Synergistic Activation Mediator
 sgRNA:

singleguide RNA
 VISPR:

VISualization of crisPR screens
References
 1.
Wang T, Wei JJ, Sabatini DM, Lander ES. Genetic screens in human cells using the CRISPRCas9 system. Science. 2014;343:80–4.
 2.
Shalem O, Sanjana NE, Hartenian E, Shi X, Scott DA, Mikkelsen TS, et al. Genomescale CRISPRCas9 knockout screening in human cells. Science. 2014;343:84–7.
 3.
Zhou Y, Zhu S, Cai C, Yuan P, Li C, Huang Y, et al. Highthroughput screening of a CRISPR/Cas9 library for functional genomics in human cells. Nature. 2014;509:487–91.
 4.
KoikeYusa H, Li Y, Tan EP, VelascoHerrera MDC, Yusa K. Genomewide recessive genetic screening in mammalian cells with a lentiviral CRISPRguide RNA library. Nat Biotechnol. 2014;32:267–73.
 5.
Konermann S, Brigham MD, Trevino AE, Joung J, Abudayyeh OO, Barcena C, et al. Genomescale transcriptional activation by an engineered CRISPRCas9 complex. Nature. 2015;517:583–8.
 6.
Gilbert LA, Horlbeck MA, Adamson B, Villalta JE, Chen Y, Whitehead EH, et al. Genomescale CRISPRmediated control of gene repression and activation. Cell. 2014;159:647–61.
 7.
Luo B, Cheung HW, Subramanian A, Sharifnia T, Okamoto M, Yang X, et al. Highly parallel identification of essential genes in cancer cells. Proc Natl Acad Sci U S A. 2008;105:20380–5.
 8.
König R, Chiang CY, Tu BP, Yan SF, DeJesus PD, Romero A, et al. A probabilitybased approach for the analysis of largescale RNAi screens. Nat Methods. 2007;4:847–9.
 9.
Diaz AA, Qin H, RamalhoSantos M, Song JS. HiTSelect: a comprehensive tool for highcomplexitypooled screen analysis. Nucleic Acids Res. 2015;43:e16–6.
 10.
Li W, Xu H, Xiao T, Cong L, Love MI, Zhang F, et al. MAGeCK enables robust identification of essential genes from genomescale CRISPR/Cas9 knockout screens. Genome Biol. 2014;15:554.
 11.
Xu H, Xiao T, Chen CH, Li W, Meyer C, Wu Q, et al. Sequence determinants of improved CRISPR sgRNA design. Genome Res. 2015;25:1147–57.
 12.
Wu X, Scott DA, Kriz AJ, Chiu AC, Hsu PD, Dadon DB, et al. Genomewide binding of the CRISPR endonuclease Cas9 in mammalian cells. Nat Biotechnol. 2014;32:670–6.
 13.
Köster J, Rahmann S. Snakemake  a scalable bioinformatics workflow engine. Bioinformatics. 2012;28:2520–2.
 14.
Wittebolle L, Marzorati M, Clement L, Balloi A, Daffonchio D, Heylen K, et al. Initial community evenness favours functionality under selective stress. Nature. 2009;458:623–6.
 15.
Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics. 2009;10:48.
 16.
Hart T, Brown KR, Sircoulomb F, Rottapel R, Moffat J. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Mol Syst Biol. 2014;10:733–3.
 17.
Placke T, Faber K, Nonami A, Putwain SL, Salih HR, Heidel FH, et al. Requirement for CDK6 in MLLrearranged acute myeloid leukemia. Blood. 2014;124:13–23.
 18.
Röthlisberger B, Heizmann M, Bargetzi MJ, Huber AR. TRIB1 overexpression in acute myeloid leukemia. Cancer Genet Cytogenet. 2007;176:58–60.
 19.
Zhao LJ, Wang YY, Li G, Ma LY, Xiong SM, Weng XQ, et al. Functional features of RUNX1 mutants in acute transformation of chronic myeloid leukemia and their contribution to inducing murine fullblown leukemia. Blood. 2012;119:2873–82.
 20.
Davies H, Bignell GR, Cox C, Stephens P, Edkins S, Clegg S, et al. Mutations of the BRAF gene in human cancer. Nature. 2002;417:949–54.
 21.
Prahallad A, Sun C, Huang S, Di Nicolantonio F, Salazar R, Zecchin D, et al. Unresponsiveness of colon cancer to BRAF(V600E) inhibition through feedback activation of EGFR. Nature. 2012;483:100–3.
 22.
Cheung HW, Du J, Boehm JS, He F, Weir BA, Wang X, et al. Amplification of CRKL induces transformation and epidermal growth factor receptor inhibitor resistance in human nonsmall cell lung cancers. Cancer Discov. 2011;1:608–25.
 23.
VEGA. A Visualization Grammar. [https://vega.github.io].
 24.
Bostock M, Ogievetsky V, Heer J. D^{3}: DataDriven Documents. IEEE Trans Vis Comput Graph. 2011;17:2301–9.
 25.
Cunningham F, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2015. Nucleic Acids Res. 2015;43(Database issue):D662–9.
 26.
Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): highperformance genomics data visualization and exploration. Brief Bioinformatics. 2013;14:178–92.
 27.
WardeFarley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 2010;38(Web Server issue):W214–20.
 28.
The Conda project [https://anaconda.org].
 29.
Chen S, Sanjana NE, Zheng K, Shalem O, Lee K, Shi X, et al. Genomewide CRISPR screen in a mouse model of tumor growth and metastasis. Cell. 2015;160:1246–60.
 30.
Parnas O, Jovanovic M, Eisenhaure TM, Herbst RH, Dixit A, Ye CJ, et al. A Genomewide CRISPR Screen in Primary Immune Cells to Dissect Regulatory Networks. Cell. 2015;162:675–86.
 31.
Wang Q, Huang J, Sun H, Liu J, Wang J, Wang Q, et al. CR Cistrome: a ChIPSeq database for chromatin regulators and histone modification linkages in human and mouse. Nucleic Acids Res. 2014;42(Database issue):D450–8.
 32.
Diaz A, Nellore A, Song JS. CHANCE: comprehensive software for quality control and validation of ChIPseq data. Genome Biol. 2012;13:R98.
 33.
Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–27.
 34.
Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNAseq data using factor analysis of control genes or samples. Nat Biotechnol. 2014;32:896–902.
 35.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009;26:139–40.
 36.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106.
 37.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014;15:550.
 38.
Doench JG, Hartenian E, Graham DB, Tothova Z, Hegde M, Smith I, et al. Rational design of highly active sgRNAs for CRISPRCas9mediated gene inactivation. Nat Biotechnol. 2014;32:1262–7.
 39.
Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article3–25.
 40.
Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014;15:R29.
 41.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B. 1977;39:1–38.
 42.
Fox J. Applied Regression Analysis and Generalized Linear Models. London: SAGE Publications; 2015.
Acknowledgements
The authors would like to thank Michael I. Love, Clifford Meyer, Peng Jiang, Bo Li, and Graham McVicker for helpful discussions.
Funding
The project was supported by the NIH grant U01 CA180980 (to XSL), R01 HG008728 (to MB and XSL), Department of Defense Synergistic Idea Development Award PC140817 (to MB and XSL), R01 GM11324201 (to JSL), NSF grant DMS1120368 (to JSL), and the Claudia Adams Barr Award in Innovative Basic Cancer Research from the DanaFarber Cancer Institute.
Author information
Additional information
Competing interests
The authors declare no competing financial interests.
Authors’ contributions
WL, JSL, and XSL designed the statistical model. WL and JK developed the algorithm, designed and performed the analysis. XH and CHC performed sgRNA efficiency prediction analysis. WL, JK, and XSL wrote the manuscript with help from all other authors. XSL and MB supervised the whole project. All authors read and approved the final manuscript.
Wei Li and Johannes Köster contributed equally to this work.
Additional file
Additional file 1:
Supplementary materials. (PDF 2045 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 CRISPR/Cas9
 Screening
 Maximum likelihood
 ExpectationMaximization
 Negative binomial
 Datadriven documents
 D3
 Visualization
 Quality control