Quality control, modeling, and visualization of CRISPR screens with MAGeCK-VISPR

High-throughput CRISPR screens have shown great promise in functional genomics. We present MAGeCK-VISPR, a comprehensive quality control (QC), analysis, and visualization workflow for CRISPR screens. MAGeCK-VISPR defines a set of QC measures to assess the quality of an experiment, and includes a maximum-likelihood algorithm to call essential genes simultaneously under multiple conditions. The algorithm uses a generalized linear model to deconvolute different effects, and employs expectation-maximization to iteratively estimate sgRNA knockout efficiency and gene essentiality. MAGeCK-VISPR also includes VISPR, a framework for the interactive visualization and exploration of QC and analysis results. MAGeCK-VISPR is freely available at http://bitbucket.org/liulab/mageck-vispr.
Electronic supplementary material: The online version of this article (doi:10.1186/s13059-015-0843-6) contains supplementary material, which is available to authorized users.

In the melanoma knockout dataset [2], both MAGeCK-RRA and MAGeCK-MLE also identified known genes related to PLX resistance, including NF1, NF2, MED12, and CUL3 (Table M1). In the melanoma activation dataset [3], both methods identified known genes whose over-expression contributes to faster cell growth in the PLX-treated condition, including EGFR, GPR35, and LPAR1/5 (Table M2).
Table M1: The scores and ranks of known genes whose knockout leads to PLX resistance in the melanoma knockout dataset. Here, the scores and rankings of 14-day PLX treated samples vs. DMSO treated samples are shown.
Table M2: The scores and ranks of known genes whose over-expression leads to PLX resistance in the melanoma activation dataset. Here, the scores and rankings of puromycin selected 21-day PLX treated samples vs. DMSO treated samples are shown.

B. Comparing MAGeCK-MLE with other methods in two-condition comparisons
We also compared MAGeCK-MLE with MAGeCK-RRA and other published algorithms, including RIGER [4] and RSA [5]. To evaluate performance, we adopted a set of "gold-standard" essential and non-essential genes from multiple RNAi screens [6], and calculated the precision and recall for the outputs of the different algorithms. Precision and recall are defined as $\mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$ and $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$, respectively, where TP (True Positive) is the number of essential genes identified by the algorithm, FP (False Positive) is the number of non-essential genes identified by the algorithm, and FN (False Negative) is the number of essential genes not identified by the algorithm.
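The definitions above can be illustrated with a minimal Python sketch. The function name `precision_recall` and the toy gene labels are hypothetical and serve only to show the arithmetic; genes that belong to neither gold-standard set are assumed to be ignored.

```python
def precision_recall(called, essential, nonessential):
    """Precision and recall of a set of called genes against gold-standard sets."""
    called = set(called)
    essential = set(essential)
    nonessential = set(nonessential)
    tp = len(called & essential)       # essential genes identified
    fp = len(called & nonessential)    # non-essential genes identified
    fn = len(essential - called)       # essential genes not identified
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy input (placeholder gene labels, not real results).
called = ["G1", "G2", "G3", "G4"]   # G4 is in neither gold-standard set
essential = ["G1", "G2", "G5"]
nonessential = ["G3", "G6"]
p, r = precision_recall(called, essential, nonessential)
```

In this toy case TP = 2, FP = 1, and FN = 1, so both precision and recall equal 2/3.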
We compared the precision-recall curves of each algorithm on the leukemia and melanoma knockout datasets, as well as the AUCPR (Area Under the Precision-Recall Curve) value for each algorithm, in Figure M3. For two-condition comparisons, MAGeCK-RRA, MAGeCK-MLE, and RSA reached similar performance in both datasets.
Figure M3: The Precision-Recall curves of all algorithms for the leukemia dataset (left) and the melanoma knockout dataset (right). For the ranked list of genes generated by each algorithm, "gold-standard" essential and non-essential genes from RNAi screens [6] are used to calculate the values of precision and recall. The Area Under the Precision-Recall Curve value (AUCPR) for each algorithm is also displayed in the legend.
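As a rough sketch of how a precision-recall curve and its area can be obtained from a ranked gene list, the Python code below walks down the ranking and records one (recall, precision) point per gold-standard gene. The function names are hypothetical, and the trapezoidal integration is an assumption: the text does not state how AUCPR was computed.

```python
def pr_curve(ranked_genes, essential, nonessential):
    """Return (recall, precision) points while walking down a ranked gene list."""
    essential = set(essential)
    nonessential = set(nonessential)
    points = []
    tp = fp = 0
    for gene in ranked_genes:
        if gene in essential:
            tp += 1
        elif gene in nonessential:
            fp += 1
        else:
            continue  # genes outside both gold-standard sets are skipped
        # Recall uses all gold-standard essential genes as the denominator.
        points.append((tp / len(essential), tp / (tp + fp)))
    return points


def aucpr(points):
    """Trapezoidal area under the curve (assumed integration scheme)."""
    area = 0.0
    prev_r, prev_p = 0.0, (points[0][1] if points else 0.0)
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area


# Hypothetical ranking: E* are essential, N* non-essential, X is unlabeled.
points = pr_curve(["E1", "N1", "E2", "X"], ["E1", "E2"], ["N1"])
area = aucpr(points)
```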
The melanoma knockout dataset includes measurements at two time points (day 7 and day 14) for each condition (PLX or DMSO treatment). We used this dataset to evaluate the consistency of the results between the two time points. For each algorithm and each condition, we defined a set of "reference" genes as those that are consistently ranked at the top at both time points, and used them to calculate the true positive rate and false positive rate for each algorithm. If the top $k$ genes identified by each algorithm include $m$ "reference" genes, then the false positive rate is defined as $(k - m)/k$, and the true positive rate is defined as $m/R$, where $R$ is the number of "reference" genes.
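The consistency evaluation can be sketched in Python as follows. This is only an illustration under stated assumptions: "consistently ranked at the top" is taken to mean membership in the top `top` genes at both time points, the cutoff `top` and the function names are hypothetical, and the sketch covers only the reference set and the true positive rate (reference genes recovered among the top genes, divided by the number of reference genes).

```python
def reference_genes(ranking_day7, ranking_day14, top):
    """Genes within the top `top` positions at both time points (assumed rule)."""
    return set(ranking_day7[:top]) & set(ranking_day14[:top])


def true_positive_rate(ranking, reference, k):
    """Fraction of reference genes recovered among the top k genes of a ranking."""
    if not reference:
        return 0.0
    hits = sum(1 for gene in ranking[:k] if gene in reference)
    return hits / len(reference)


# Hypothetical rankings with placeholder gene labels.
day7 = ["A", "B", "C", "D"]
day14 = ["B", "A", "D", "C"]
ref = reference_genes(day7, day14, top=2)
tpr_at_1 = true_positive_rate(day7, ref, k=1)
tpr_at_2 = true_positive_rate(day7, ref, k=2)
```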
We compared the Receiver Operating Characteristic (ROC) curves, as well as the Area Under the Curve (AUC) values, of the different algorithms in Figure M4. MAGeCK-MLE performs better than the other algorithms, indicating that its results are more consistent across similar conditions. This may be due to the ability of MAGeCK-MLE to model multiple conditions concurrently.
Figure M4: The ROC (Receiver Operating Characteristic) curves for two conditions (DMSO and PLX treatment) of different algorithms using the melanoma knockout dataset. The "gold-standard" genes are defined separately for each method as those genes that are consistently identified as essential at two time points (day 7 and day 14). The Area Under the Curve value (AUC) for each algorithm is displayed in the legend.

C. Comparing MAGeCK-MLE with other methods in multiple conditions
Although MAGeCK-MLE is specifically designed for multiple-condition screens, other algorithms based on two-condition comparisons can also be used for multiple-condition comparisons. For these algorithms, a straightforward approach is to simply combine and compare the proper "scores" generated from two-condition comparisons.

D. sgRNA efficiency estimation improves gene identification
We investigated whether sgRNA efficiency information can improve the performance of gene identification. To do this, we ran MAGeCK-MLE in two modes: one that takes sgRNA efficiency estimates from the SSC algorithm [7] and iteratively updates sgRNA efficiency during the EM algorithm, and one that assumes all sgRNAs are efficient. Table M3 and Table M4 demonstrate the improvement in identifying essential genes when sgRNA efficiency estimation is used. We also plotted the Precision-Recall curves (PR curves) of the two modes using the "gold-standard" essential and non-essential genes, following the approach in Section B (Figure M7). The AUCPR (Area Under the Precision-Recall Curve) value with sgRNA efficiency estimation is higher than that without such information, demonstrating that efficiency estimation improves gene calling.
Figure M7: The Precision-Recall curves of MAGeCK-MLE with or without sgRNA efficiency in the leukemia dataset, including the estimation of HL60 essential genes (left) and KBM7 essential genes (right). We used "gold-standard" essential and non-essential genes from RNAi screens [6] to calculate the values of precision and recall. The Area Under the Precision-Recall Curve value (AUCPR) for each algorithm is also displayed in the legend.

E. The design matrix and extended design matrix
In MAGeCK-MLE, we use a design matrix to model complex experimental conditions. Using an extended design matrix, the values in the Methods section can be written conveniently in matrix form. The extended design matrix $D'$ has $N \cdot M$ rows and $N + r$ columns, and can be derived from the design matrix $D$.
For a given design matrix $D$ of size $M \times r$, the extended design matrix can be written in the form shown in Table M5, where $I_N$ is the identity matrix of size $N$, and $d_{jk}$ is the element in row $j$ and column $k$ of $D$.
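One way to picture such an extended matrix is the Python sketch below. It is not taken from the original tables: it assumes a gene with `n_sgrnas` sgRNAs, a design matrix holding one row per sample and one column per condition effect (without a separate baseline column), a block of identity columns that gives each sgRNA its own baseline term, and rows grouped by sample. The function name and these ordering choices are assumptions made for illustration.

```python
def extended_design_matrix(design, n_sgrnas):
    """Stack one identity block per sample next to that sample's design row.

    design   : list of M rows, each with r entries (condition indicators)
    n_sgrnas : number of sgRNAs N targeting the gene
    returns  : list of N*M rows, each with N + r entries
    """
    rows = []
    for sample_row in design:          # one block of N rows per sample
        for i in range(n_sgrnas):      # one row per sgRNA within the block
            identity_part = [1 if c == i else 0 for c in range(n_sgrnas)]
            rows.append(identity_part + list(sample_row))
    return rows


# Hypothetical input: 3 samples, 2 conditions, 2 sgRNAs.
design = [[0, 0], [1, 0], [1, 1]]
extended = extended_design_matrix(design, n_sgrnas=2)
```

With these inputs the result has 2 × 3 = 6 rows and 2 + 2 = 4 columns, matching the row and column counts described in the text.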
Under the mixture model, the extended design matrix in Table M5 assumes that all sgRNAs are efficient ($\pi_i = 1$). If the sgRNAs are not efficient, the extended design matrix takes a different form, with columns indexed $1, \dots, N$, $N+1$, $N+2$, and so on.
An example of the design matrix and extended design matrix is given below. Suppose we have 5 samples: sample 0 is the baseline sample (e.g., plasmid), samples 1 and 2 are from cell line A, and samples 3 and 4 are from cell line B. Samples 1 and 3 are treated with drug X, while samples 2 and 4 are treated with drug Y. In this example, three factors should be considered: the baseline factor ($\beta_0$), the cell-line-specific factors for cell lines A and B ($\beta_A$, $\beta_B$), and the drug-specific factors ($\beta_X$, $\beta_Y$). The design matrix ($5 \times 5$) then has one row per sample and one column per coefficient.
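The five-sample example can be written out as a small Python sketch. It assumes binary indicator coding, a baseline column equal to 1 for every sample, samples 3 and 4 belonging to cell line B, and the column order baseline, cell line A, cell line B, drug X, drug Y. The column labels and the helper `design_row` are hypothetical.

```python
# (sample label, cell line, drug); None means the factor does not apply.
samples = [
    ("sample0", None, None),   # baseline sample, e.g. plasmid
    ("sample1", "A", "X"),
    ("sample2", "A", "Y"),
    ("sample3", "B", "X"),
    ("sample4", "B", "Y"),
]
columns = ["baseline", "cell_A", "cell_B", "drug_X", "drug_Y"]


def design_row(cell_line, drug):
    """Indicator row: baseline is always 1, other entries flag membership."""
    return [
        1,
        int(cell_line == "A"),
        int(cell_line == "B"),
        int(drug == "X"),
        int(drug == "Y"),
    ]


design = [design_row(cell, drug) for _, cell, drug in samples]

for (label, _, _), row in zip(samples, design):
    print(label, row)
```

Each row then reads as the set of effects acting on that sample; for instance, sample 3 carries the baseline, cell line B, and drug X effects.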

F. The design matrices of four datasets
The leukemia dataset [8] includes 4 samples: the initial state and the final state of HL60 and KBM7 cells. MAGeCK-MLE uses both initial-state samples as "baseline" samples, and the design matrix is specified in Table M7.
The melanoma knockout dataset [2] includes 9 samples: PLX-treated and DMSO-treated A375 cells at two time points (7-day and 14-day), with 2 replicates for each condition, and 1 plasmid sample. We model the effects of 4 factors: the 7-day and 14-day time points, and the DMSO and PLX treatments. The design matrix is shown in Table M9.
Table M9: The design matrix used in the melanoma knockout dataset.
The melanoma activation dataset [3] includes 14 samples: two are plasmid libraries, four are zeocin- and puromycin-selected samples at day 3 (2 replicates for each), and eight are zeocin- and puromycin-selected, PLX- or DMSO-treated samples at day 21 (2 replicates for each). The design matrix is shown in Table M10.
Table M10: The design matrix used in the melanoma activation dataset.
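For the leukemia dataset, a design matrix consistent with the description (both initial-state samples acting as baseline, one effect per cell line) could look like the Python sketch below. The sample labels, column names, row order, and the tab-delimited layout with a `Samples` header are assumptions made for illustration and are not copied from Table M7.

```python
columns = ["baseline", "HL60", "KBM7"]

# Hypothetical sample labels; initial-state samples carry only the baseline.
design = [
    ("HL60_initial", [1, 0, 0]),
    ("KBM7_initial", [1, 0, 0]),
    ("HL60_final",   [1, 1, 0]),
    ("KBM7_final",   [1, 0, 1]),
]


def to_table(columns, design):
    """Serialize the design matrix as tab-delimited text with a header row."""
    lines = ["\t".join(["Samples"] + columns)]
    for label, row in design:
        lines.append("\t".join([label] + [str(v) for v in row]))
    return "\n".join(lines)


table = to_table(columns, design)
print(table)
```

A text layout of this kind keeps sample labels next to their indicator values, which makes it easier to check that each sample is assigned to the intended factors.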