Skip to content

Advertisement

What do you think about BMC? Take part in

Open Access

Quality control, modeling, and visualization of CRISPR screens with MAGeCK-VISPR

Contributed equally
Genome Biology201516:281

https://doi.org/10.1186/s13059-015-0843-6

Received: 31 July 2015

Accepted: 23 November 2015

Published: 16 December 2015

Abstract

High-throughput CRISPR screens have shown great promise in functional genomics. We present MAGeCK-VISPR, a comprehensive quality control (QC), analysis, and visualization workflow for CRISPR screens. MAGeCK-VISPR defines a set of QC measures to assess the quality of an experiment, and includes a maximum-likelihood algorithm to call essential genes simultaneously under multiple conditions. The algorithm uses a generalized linear model to deconvolute different effects, and employs expectation-maximization to iteratively estimate sgRNA knockout efficiency and gene essentiality. MAGeCK-VISPR also includes VISPR, a framework for the interactive visualization and exploration of QC and analysis results. MAGeCK-VISPR is freely available at http://bitbucket.org/liulab/mageck-vispr.

Keywords

CRISPR/Cas9ScreeningMaximum likelihoodExpectation-MaximizationNegative binomialData-driven documentsD3VisualizationQuality control

Background

The clustered regularly interspaced short palindromic repeats (CRISPR)/Cas9 system is a powerful genetic engineering technique, allowing direct modifications of genomic loci in most model organisms in a cost-effective way. Based on this system, the recent development of high-throughput CRISPR screening technology has shown great promise in functional genomics, allowing researchers to systematically identify genes associated with various phenotypes [14]. CRISPR screens can be performed by either direct knockout of genes using CRISPR/Cas9 [1, 2], or perturbing gene expressions using CRISPR and a dead-Cas9 (dCas9) fused with activation or repression effectors [5, 6].

While CRISPR screening is a powerful technique, it creates computational challenges that include: (1) how to evaluate the data quality; (2) how to identify gene or pathway hits from the screens and assess their statistical significance; and (3) how to visualize and explore the screening results efficiently. Until now, a comprehensive quality control (QC), data analysis, and visualization method for CRISPR screen was not available. Several algorithms are developed for screening analysis on microarray or high-throughput sequencing data, such as RIGER [7], RSA [8], HitSelect [9], as well as the MAGeCK algorithm we previously developed [10]. These algorithms are designed based on a comparison of two conditions, although many screens are conducted simultaneously across several time points, under many treatment conditions or over many cell lines. In addition, these algorithms do not consider the knockout efficiency of single guide RNAs (sgRNA) on target genes. The knockout efficiency is the ability of a sgRNA to induce cutting events that lead to the knockout of the targeted gene. It is influenced by sgRNA sequence content [11], chromatin accessibility and exon position of the targeting gene [12], and so on.

In this study, we present MAGeCK-VISPR to overcome the computational challenges of CRISPR screens. MAGeCK-VISPR (1) defines a set of QC measurements and (2) extends the MAGeCK algorithm by a maximum likelihood estimation method (MAGeCK-MLE) to call essential genes under multiple conditions while considering sgRNA knockout efficiency. Further, MAGeCK-VISPR (3) provides a web-based visualization framework (VISPR) for interactive exploration of CRISPR screen quality control and analysis results. MAGeCK-VISPR employs a Snakemake [13] workflow to combine MAGeCK and VISPR in a scalable and reproducible way (Fig. 1).
Fig. 1

An overview of the MAGeCK-VISPR workflow. Given FASTQ files and an sgRNA design, the workflow employs several preprocessing steps, including using FastQC and MAGeCK to collect quality control metrics and calculate per-sgRNA read counts. Then, either MAGeCK-RRA or MAGeCK-MLE is used to determine essential genes under user-defined conditions. Finally, results are composed for visualization and interactive exploration in VISPR

Results and discussion

Quality control measurements for CRISPR screening experiments

Apart from the determination of essential genes with MAGeCK, a central purpose of MAGeCK-VISPR is to collect quality control (QC) measurements at various levels (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). The proposed measurements (Table 1) can be divided into four categories: sequence level, read count level, sample level, and gene level (Fig. 2).
Table 1

Quality control (QC) measures from MAGeCK-VISPR

QC term

Description

Expected

GC content

GC content distribution of the sequencing reads

Similar distribution for all samples from same library

Base quality

Base quality distribution of the sequencing reads

Single-peak distribution with median base quality at least 25

Sequencing reads

Total number of sequencing reads

Varies depending on sequencing platform

Mapped reads

Total number of reads mapped to the sgRNA library

300 * (number of sgRNAs)

% Mapped reads

Percentage of mapped reads to the total number of sequencing reads

At least 65 %

Zero sgRNAs

Number of sgRNAs with zero read counts

At most 1 % of total sgRNAs

Gini index

Gini index of log-scaled read count distributions

At most 0.1 for plasmid or initial state samples, and at most 0.2 for negative selection samples

Sample correlation

Pearson correlation coefficient between samples

At least 0.8 for replicates

Correlation clustering or PCA clustering

Hierarchical clustering of samples or first three PCA components

Samples with similar conditions should cluster together

Ribosomal gene selection

Negative selection enrichment statistics of ribosomal genes

Significant P values (<0.001) for ribosomal subunit (GO:0044391) in negative selection experiments

Fig. 2

The quality control (QC) view of VISPR, the visualization framework of MAGeCK-VISPR. The measurements include the distribution of GC content (a), median base quality (b), the distribution of mean sequence quality (c), the number of zero-count sgRNAs (d), Gini-index (e), total number of reads and the percentage of mapped reads (f), Principle Component Analysis (PCA) plot (g), normalized read count distribution (h, i), and pairwise sample correlations (j). Shown results are from ESC (a-f) and melanoma dataset (g-j)

Sequence level QC measurements aim to detect problems with the sequencing, similar as in other next-generation sequencing (NGS) experiments. Two measurements are reported: sample GC content distribution (Fig. 2a) and the base quality distribution of sequencing reads (Fig. 2b, c). Ideally, sequencing reads should have reasonable base qualities (median value >25), and samples from the same experiment should have similar GC content distributions.

The second level of QC measurements is based on the sgRNA read counts collected from MAGeCK. Raw sequencing reads are first mapped to sgRNA sequences in the library with no mismatches tolerated. After that, the number of sequencing reads, mapped reads (and thereof the percentage of mapped reads), sgRNAs with zero read count, and the Gini index of read count distribution are reported for each sample (Fig. 2d-f). The percentage of mapped reads is a good indicator of sample quality, and low mappability could be due to sequencing error, oligonucleotide synthesis error, or sample contamination. Good statistical power of downstream analysis relies on sufficient reads (preferably over 300 reads) for each sgRNA, with low number of zero-count sgRNAs in the plasmid library or early time points. Gini index, a common measure of income inequality in economics, can measure the evenness of sgRNA read counts [14]. It is perfectly normal for later time points in positive selection experiments to have higher Gini index since a few surviving clones (a few sgRNA with extreme high counts) could dominate the final pool while most of the other cells die (more sgRNAs with zero-count). In contrast, high Gini index in plasmid library, in early time points, or in negative selection experiments may indicate CRISPR oligonucleotide synthesis unevenness, low viral transfection efficiency, and over selection, respectively.

Sample level QC (Fig. 2g-j) checks the consistency between samples. MAGeCK-VISPR reports the distributions of normalized read counts by box plots and cumulative distribution functions. It also calculates pairwise Pearson correlations of sample log read counts, and draws the samples on the first three components of a Principle Component Analysis (PCA). Biological replicates or samples with similar conditions should have similar read count distributions and higher correlations, and appear closer to each other in the PCA plot. PCA plots can also identify potential batch effects if the screens are conducted under different batches.

Finally, gene level QC determines the extent of negative selection in the screens. Since knocking out ribosomal genes lead to a strong negative selection phenotype [1, 2], the significance of negative selection on ribosomal genes can be evaluated in MAGeCK-VISPR by Gene Ontology (GO) enrichment analysis using GOrilla [15]. A working negative selection experiment should have a significant P value (<0.001), although many good experiments could have much smaller P values (<1e-10, see Section A of Additional file 1).

Calling essential genes under multiple conditions with MAGeCK-MLE

MAGeCK-VISPR includes a new algorithm, ‘MAGeCK-MLE’, to estimate the essentiality of genes in various screening conditions using a maximum likelihood estimation (MLE) approach. Compared with the original MAGeCK algorithm using Robust Rank Aggregation (‘MAGeCK-RRA’) that can only compare samples between two conditions, MAGeCK-MLE is able to model complex experimental designs. Furthermore, MAGeCK-MLE explicitly models the sgRNA knockout efficiency, which may vary depending on different sequence contents and chromatin structures [11, 12]. In MAGeCK-MLE, the read count of a sgRNA i targeting gene g in sample j is modeled as a Negative Binomial (NB) random variable. The mean of the NB distribution (μ ij ) is dependent on three factors: the sequencing depth of sample j (s j ), the knockout efficiency of sgRNA i, and a linear combination of the effects in different conditions (that is, different drug treatments) on gene g. If sgRNA i knocks out target gene g efficiently, then μ ij is modeled as:
$$ {\mu}_{ij}={s}_j\ exp\left({\beta}_{i0}+{\displaystyle \sum_r}{d}_{jr}{\beta}_{gr}\right) $$

The effects of r different conditions are represented as the score ‘β gr ’, a measurement of gene selections similar to the term of ‘log fold change’ in differential expression analysis. The presence or absence of each condition on each sample is encoded into binary elements of the design matrix d jr , and can be obtained from experiment designs. ‘β’ scores reflect the extent of selection in each condition: β gr >0 (or <0) means g is positively (or negatively) selected in condition r. μ ij is also dependent on β i0, the initial sgRNA abundance which is usually measured in plasmid or the day 0 of the experiment.

The values of β, together with the information whether an sgRNA is efficient, can be estimated by maximizing the joint log-likelihood of observing all sgRNA read counts of g on all different samples, and are optimized using an Expectation-Maximization (EM) algorithm. In the EM algorithm, MAGeCK-MLE iteratively determines the knockout efficiency of each sgRNA based on the current estimation of ‘β’ scores (the E step), and uses the updated knockout efficiency information to re-calculate ‘β’ scores (the M step). By examining the patterns of read counts of each sgRNA across all samples, the EM algorithm minimizes the effect of inefficient sgRNAs. A detailed description of the method is presented in the Methods section.

We tested MAGeCK algorithms on four public datasets. The first two datasets (the ‘ESC’ and ‘leukemia’ dataset) correspond to negative selection experiments on mouse embryonic stem cells (ESCs) and two human leukemia cell lines (KBM7 and HL-60), respectively (Fig. 3a and b) [1, 4]. In both datasets, cells were grown with their natural growing condition and negative selections occurred in cells after CRISPR/Cas9 is activated. The other two datasets (‘melanoma’ knockout and activation dataset) are different CRISPR screens on the human melanoma cell line A375 that harbors a BRAF V600E mutation (Figs. 4 and 5). The cells were treated with BRAF inhibitor vemurafenib (PLX) or dimethyl sulfoxide (DMSO) control, and screened either with GeCKO [2] or with CRISPR/dCas9 Synergistic Activation Mediator (SAM) libraries [5]. These two datasets include multiple experimental conditions that are difficult to compare directly using the original MAGeCK-RRA algorithm. In the melanoma knockout dataset, cells were under 7-day or 14-day selection [2]. In the melanoma activation dataset, two different drugs (puromycin and zeocin) were used to select cells with lentiviral infection, and both DMSO and PLX treatments were profiled under 3-day or 21-day selection [5].
Fig. 3

The gene essentiality scores (β scores) reported from MAGeCK-MLE on two conditions. a, b the β scores of two leukemia cell lines in the leukemia dataset (a), and two biological replicates of mouse ESC cells in the ESC dataset (b). In (a), some well-known driver genes and cell type-specific genes are also labeled. These genes may play distinct roles in two different leukemia subtypes (HL60: acute myeloid leukemia; KBM7: chronic myeloid leukemia), including CDK6 and TRIB1 for HL60, and RUNX1 in KBM7. CDK6 is required in AML growth [17] and TRIB1 over-expression is observed in AML patients compared with CML patients [18]. On the other hand, the frequent RUNX1 loss-of-function mutations are observed in CML to AML transformations [19]. c An illustration of differentially selected genes identified by two-condition comparison algorithms (like RRA, blue rectangles). MAGeCK-MLE can further distinguish cell type-specific genes (red dots) from other genes. Cell type-specific genes are genes having no essentiality in one condition but strong essentiality in the other, and are usually more biologically interesting

Fig. 4

The β scores of MAGeCK-MLE on the melanoma knockout dataset. a A k-means clustering view of β scores of all conditions from top selected genes (k = 4). Only genes with the highest or lowest 1 % β scores in DMSO or PLX 14-day treatment conditions are shown. b The distribution of scores across four conditions using different algorithms. The red rectangle in MAGeCK-MLE indicates genes in cluster 4 in Fig. 4a, or genes that are strongly positively selected in PLX 14-day condition. Some validated genes in the original study are marked as red dots, including NF1, NF2, MED12, and CUL3

Fig. 5

The β scores of MAGeCK-MLE on the melanoma activation dataset. a A k-means clustering view of β scores of all conditions from top selected genes (k = 5). Only genes with the highest or lowest 1 % β scores in DMSO or PLX 21-day treatment conditions are shown. b the average β scores of genes in cluster 5 of A (consistently positive selected genes in both zeocin and puromycin conditions), as well as the β scores of these genes in melanoma knockout dataset. Similar to (a), the k-means clustering algorithm is applied to the selected genes (k = 4)

In two-condition comparisons, MAGeCK-MLE gives similar results with existing methods such as MAGeCK-RRA, RSA, and RIGER. All the algorithms identified genes that are commonly essential to different cell types [16], as well as known positively selected genes in PLX treated conditions in two melanoma datasets (Fig. 3; also see Section A and B of Additional file 1). In the leukemia dataset, two-condition comparison algorithms (like MAGeCK-RRA) identified genes that are differentially selected in two cell lines by a direct comparison of HL60 and KBM7 (Fig. 3a) [10]. However, not all of these genes are equally biologically interesting, as MAGeCK-MLE further distinguished them into two groups: genes having little effect in one (β scores close to zero) but strong selection effect in the other cell line (large absolute β scores), and genes having weak and opposite effects in two cell lines (Fig. 3c). The first group of genes are often more biologically interesting as they are cell type-specific genes. This includes some well-known driver genes (like BCR in KBM7) as well as genes that may be functional in only one cell type: CDK6 and TRIB1 in HL60 [17, 18], and RUNX1 in KBM7 [19].

One of the advantages of MAGeCK-MLE over other methods is that it enables accurate comparisons of gene essentialities across multiple conditions and experiments in one run (Fig. 4 and Section C of Additional file 1). In the melanoma knockout dataset, a k-means clustering of the β scores of top selected genes demonstrated that these genes have various essentialities across conditions (Fig. 4a). Some of the genes are universally positively or negatively selected in all conditions (cluster 3), while others have different essentiality across different conditions (clusters 1, 2, and 4). Genes in cluster 4 are particularly interesting as they show strong positive selection in 14-day PLX treated condition. Indeed, genes whose knockout leads to strong positive selection in PLX-treated cells are in cluster 4, including NF1, NF2, MED12, CUL3 [2]. In contrast, the k-means clusters of measurements from other algorithms did not reveal the strong effect of genes in cluster 4 (Section C of Additional file 1). This is because their score distributions are similar across different conditions (Fig. 4b), and do not reflect the fact that the one condition (PLX 14-day treatment) induces much stronger positive selection than other conditions [2]. This is partly because MAGeCK-RRA, RIGER, and RSA all use a rank-based method to compare sgRNA between two conditions, which may lose quantitative information.

Another example of using MAGeCK-MLE on multiple conditions is demonstrated in the melanoma activation dataset, where cells underwent different selection methods (using puromycin or zeocin), drug treatments (DMSO or PLX), and durations (3-day or 21-day treatment) (Fig. 5). Similar to the melanoma knockout dataset, we performed k-means clustering of the top-selected gene β scores. Many positively selected genes are dependent on the selection method, which might not be biologically interesting. For example, genes in clusters 2 and 4 correspond to positively selected genes that are specific to puromycin or zeocin selection, respectively. A small set of genes (cluster 5) are consistently selected in both zeocin and puromycin, including genes that are validated in the original study, for example, EGFR, GPR35, LPAR1/5 [5]. We further examined the genes in cluster 5 (Fig. 5b), and focused on genes positively selected in the CRISPR activation experiment but strongly negatively selected in the knockout experiment. These genes include EGFR and BRAF, two known kinases that drive melanoma progression and PLX resistance [20, 21], and CRKL, a protein kinase that activates RAS and JUN pathway. CRKL amplification is reported to lead to drug resistance against EGFR inhibitors by activating EGFR downstream pathways [22], implying its potential role in PLX drug resistance.

Visualization of QC measurements and gene essentiality with VISPR

VISPR (VISualization of crisPR screens) is a web-based frontend for interactive visualization of CRISPR screen QC and comparison results. Interactive access is provided by an HTML5 based browser interface, while visualizations are realized with Vega [23], a declarative visualization grammar on top of Data-Driven Documents (D3) [24]. VISPR provides three types of views for interactive exploration of CRISPR screening: a quality control view, a result view, and an experiment comparison view. The quality control view shows the QC measurements described before (Fig. 2).

In the result view, screening results can be interactively explored. It contains a table showing the comparison results of each gene (Fig. 6a). The table can be sorted by different columns and filtered (from ‘Search’) via gene names or regular expressions. Further, the distribution of P values is displayed as cumulative distribution function (CDF) (Fig. 6b) and as a histogram (Fig. 6c). For each gene, the normalized sgRNA counts in all samples can be displayed in a parallel coordinate visualization (Fig. 6d). If available, knockout efficiency predictions [11] and gene coordinates of each sgRNA are displayed as separate axes. Axes can be reordered or toggled on or off, and sgRNAs can be highlighted by selecting ranges on each axis. Genes selected in the table are highlighted in the CDF, allowing to assess their occurrence within the P value distribution of all genes.
Fig. 6

The result and comparison views of VISPR. The result view includes a gene comparison table (a), the distribution of P values as CDF (b) and histogram (c), and the normalized sgRNA counts in all samples of selected genes along with chromosome positions and predicted efficiency (d). The comparison view of VISPR (e) displays the overlap between significant genes under different selectable conditions and experiments as an Euler diagram. Shown results are from ESC (a, d) and melanoma dataset (b, c, e)

VISPR provides various ways to further explore the analysis results. Individual genes can be viewed in Ensembl [25] and IGV [26]. Selected genes can be visualized in terms of their interaction network and function via GeneMANIA [27]. Functional analysis can be performed with GOrilla [15], an online Gene Ontology (GO) enrichment analysis tool. GOrilla takes a ranked list of genes (here based on the P values reported by MAGeCK) to perform a threshold-free enrichment analysis. The resulting GO term enrichments can be further used for gene-level quality control.

The comparison view of VISPR can compare different experiments by visualizing the common and exclusive significant genes via Euler diagrams (Fig. 6e). Clicking on segments of the Euler diagram opens the result views of the corresponding experiments. For example, clicking on the intersection between two experiments will open ‘restricted’ result views for each experiment, where only the common significant genes are displayed. These views provide the same features as the unconstrained result views described above. However, in this case, GO enrichment analysis with GOrilla is performed with the shown genes (that is, the genes from the intersection) as foreground and the other genes of the experiment as background.

The visualizations displayed in VISPR can be downloaded as publication-ready SVG files. In addition, a command line interface is provided to store visualizations as Vega specifications. This format allows users to modify and style the output of VISPR programmatically.

Implementation of the MAGeCK-VISPR workflow with Snakemake

We implemented the MAGeCK-VISPR workflow with the workflow management system Snakemake [13], allowing an automatic execution of some or all of the MAGeCK-VISPR functions: quality control, essential gene analysis, and visualization. Choosing a workflow management system like Snakemake has several advantages. First, the workflow steps can be automatically parallelized and executed on workstations, servers, and compute clusters without the modification of the workflow. Second, Snakemake tracks metadata (like creation date, input, and log files) for all generated result and intermediate files. This way, used data, methods, and parameters are documented comprehensively for each analysis (also called data provenance), an important requirement of reproducible science. MAGeCK-VISPR provides a command line interface to initialize the workflow in a given work directory. This installs the workflow definition as a so-called Snakefile, along with a configuration file and documentation. The configuration file is used to define locations of raw data and additional parameters for MAGeCK-VISPR. Once configured, the Snakefile can be executed with Snakemake. Since the Snakefile is installed into the given work directory, it can be easily modified or extended by the user.

We provide all components of the workflow as Conda packages [28], such that MAGeCK-VISPR can be installed with a single command. Optionally, the Conda package manager can create isolated environments for the workflow to, for example, freeze or compare different software versions or publish snapshots of a MAGeCK-VISPR workflow instance along with all data and used software. This further increases the reproducibility of the generated results.

Conclusion

The recently developed CRISPR screening is a powerful technology in functional studies with different foci, including tumor progression and metastasis [29], drug resistance [3], immune response [30], and stem cell differentiation [4]. To our knowledge, MAGeCK-VISPR is the first comprehensive pipeline developed for quality control, analysis, and visualization of CRISPR screens, and highlights new features compared with existing screening analysis algorithms. For example, a typical CRISPR screening experiment usually includes complex designs that are difficult to analyze using existing algorithms, as they are all designed for two-condition comparisons. To address this challenge, MAGeCK-VISPR uses a maximum likelihood approach to estimate the effect of different conditions using a generalized linear model (GLM). It also incorporates sgRNA knockout efficiency information by using a probabilistic mixture model. We demonstrated that MAGeCK-MLE provides additional insights into cell type-specific essential genes and is able to compare gene essentiality scores across conditions or even experiments. Also, MAGeCK-VISPR is able to handle screens of different types including CRISPR knockout and CRISPR activation screens, and can be potentially applied to high-throughput sequencing datasets of traditional RNA interference (RNAi) screens.

The MAGeCK-MLE approach is able to estimate sgRNA knockout efficiencies from CRISPR screens besides gene essentiality. We previously reported that sequence-specific features learned from CRISPR screening data helped the design of efficient sgRNAs [11]. With more CRISPR screen data becoming available, the algorithm will help us identify sgRNAs with the best behavior and learn patterns of ‘good’ sgRNAs. The information will further guide the design of optimized sgRNAs for CRISPR screens and individual gene knockouts.

One potential limitation of MAGeCK-MLE is that its EM algorithm uses an iterative process involving matrix operations, making it slower than our previous MAGeCK RRA method and other competing algorithms. Future approaches to speed up MAGeCK-MLE include improving parametric tests for P value estimation (instead of using permutation) and implementing the algorithm in Cython instead of Python. Another potential limitation of MAGeCK-VISPR on the quality control assessment is that the current QC thresholds for ‘successful’ experiments are determined heuristically due to limited number of publicly available CRISPR screening datasets. We and other researchers have previously reported that bigger collections of ChIP-seq datasets provide better criteria on ChIP-seq quality control [31, 32]. As more public CRISPR screening datasets become available, the QC metrics (and other parts of MAGeCK-VISPR) can be further refined.

As CRISPR screens become more popular, complications in the data such as batch effects will be unavoidable which need proper correction for meaningful downstream analysis. Existing batch removal algorithms, including ComBat [33] and RUVseq [34], have been widely used to remove batch effects in gene expression analysis. In the future, these algorithms can be integrated into MAGeCK-VISPR pipeline. After that, MAGeCK-VISPR will be able to identify cancer- and disease-specific essential genes by a direct comparison between different datasets or experiments, providing potentially new therapeutic insights into the mechanisms of diseases and cancers.

Methods

MAGeCK-MLE: a maximum likelihood approach for essential gene detection

The Negative Binomial model for high-throughput CRISPR screening read counts

After read mapping, the sequencing results of CRISPR screening are presented as a read count table, where rows correspond to sgRNAs and columns correspond to samples. Read counts generated from high-throughput sequencing data have higher variances when a high number of read counts are observed (also called ‘over-dispersion’). This is usually modeled using Negative Binomial (NB) distribution, such as in the statistical models used in many RNA-seq differential expression analysis algorithms: edgeR, DESeq/DESeq2, and so on [3537]. MAGeCK-MLE uses a similar model; briefly, the read count of sgRNA i in sample j, or x ij , is modeled as:
$$ {x}_{ij}\sim NB\left({\mu}_{ij},{\alpha}_i\right) $$
Where μ ij and α i are the mean and over-dispersion factor of the NB distribution, respectively. The mean value μ ij is further modeled as:
$$ {\mu}_{ij}={s}_j{q}_{ij} $$
(1)
Where s j is the size factor of sample j for adjusting sequencing depths of the samples, and qij is a variable modeling the behavior of sgRNA i in sample j that will be discussed in later sections. s j is calculated by the ‘median ratio method’ in MAGeCK and DESeq2 [10, 37]:
$$ {s}_j= media{n}_i\left\{\frac{x_{ij}}{{\hat{x}}_{i\cdot }}\right\} $$

Here, \( {\widehat{x}}_i \) is the geometric mean of the read counts of sgRNA i across all J samples: \( {\widehat{x}}_i={\left({\displaystyle \prod_{k=1}^J}{x}_{ik}\right)}^{1/J}. \) s j can also be calculated based on a set of predefined ‘control’ sgRNAs instead of all sgRNAs. This is particularly useful when a majority of the genes in the library are supposed to be essential; in such cases it is not suitable to calculate s j based on all sgRNAs. Both methods are implemented in MAGeCK-VISPR and users can specify which method to use.

The over-dispersion factor α i is calculated based on the regression residual and will be discussed in more details in the last Methods section.

Modeling sgRNA knockout efficiency and complex experimental settings

Different studies demonstrated that sgRNAs have various DNA cutting efficiencies [11, 38], but such information is not considered in most essential gene calling algorithms (including MAGeCK). In MAGeCK-MLE, we use a binary variable π i to model whether sgRNA i is efficient or not: π i  = 1 corresponds to an efficient sgRNA i and vice versa. Since π i is unknown, the probability of observing a read count x from x ij is a mixture of two distributions:
$$ P\left({x}_{ij}=x\right)=p\left({x}_{ij}=x\Big|{\pi}_i=1\right)p\left({\pi}_i=1\right)+p\left({x}_{ij}=x\Big|{\pi}_i=0\right)p\left({\pi}_i=0\right) $$
In CRISPR screening experiments, it is common to have cells treated with different conditions. For example in melanoma activation dataset [5], cell lines underwent different sgRNA expression selection methods (cells are first selected using puromycin or zeocin), duration of treatment (3-day or 21-day treatment) and drug treatments (DMSO or PLX). For an efficient sgRNA i (π i  = 1), MAGeCK-MLE uses a generalized linear model (GLM) to model the effect of q ij as a linear combination of effects from different sources:
$$ \begin{array}{c}\hfill P\left({x}_{ij}=x\Big|{\pi}_i=1\right)\sim NB\left(x;{s}_j{q}_{ij},{\alpha}_i\right)\hfill \\ {}\hfill log\left({q}_{ij}\right)={\beta}_{i0}+{\displaystyle \sum_r}{d}_{jr}{\beta}_{gr}\hfill \end{array} $$
(2)

Here, β i0 is the baseline abundance of sgRNA i, corresponding to its abundance in an initial state (in plasmid or day 0). d jr is an element of a design matrix given by the user (explained later), and β gr is the (unknown) coefficient that we would like to estimate.

If sgRNA i is inefficient (π i  = 0), then its read counts in all samples are not determined by any experimental conditions except the baseline abundance:
$$ \begin{array}{c}\hfill P\left({x}_{ij}=x\Big|{\pi}_i=0\right)\sim NB\left(x;{s}_j{q}_{ij},{\alpha}_i\right)\hfill \\ {}\hfill log{q}_{ij}={\beta}_{i0}\hfill \end{array} $$
(3)

The design matrix

Design matrices have been used in many gene expression analysis algorithms for modeling complex experimental designs, including LIMMA [39], VOOM [40], DESeq2 [37], and so on. The design matrix D models the combination of effects of different conditions. For J samples that are affected by R conditions, D is a J * R binary matrix with element d jr  = 1 if sample j is affected by condition R, and 0 otherwise. An example of the design matrix is presented in Additional file 1.

Based on the design matrix, the equations in (2) and (3) can be written in a matrix form. For a gene g with N sgRNAs in J samples, let \( {\overrightarrow{q}}_g \) be the vector of q values of all sgRNAs in all samples in gene g:
$$ {\overrightarrow{q}}_g={\left({q}_{11},{q}_{21}, \dots, {q}_{N1},\dots, {q}_{1J},{q}_{2J},\dots, {q}_{NJ}\right)}^T $$
It can be written as:
$$ \log \left({\overrightarrow{q}}_g\right)=D\hbox{'}*{\overrightarrow{\beta}}_g $$
Where \( {\overrightarrow{\beta}}_g \) is a N + r vector of β values in Equations (2) and (3). The first N elements of \( {\overrightarrow{\beta}}_g \) are the baseline abundances of N sgRNAs, and the following R elements of \( {\overrightarrow{\beta}}_g \) are the coefficients corresponding to R columns in the design matrix:
$$ {\overrightarrow{\beta}}_g={\left({\beta}_{00},{\beta}_{10},\dots,\ {\beta}_{N0},{\beta}_1,\dots, {\beta}_r\right)}^T. $$

The binary extended design matrix D’ is used to set up the linear relationship between \( {\overrightarrow{\beta}}_g \) and \( {\overrightarrow{q}}_g \), and can be derived directly from the design matrix. See Additional file 1 for the definition and an example of D’.

The EM approach

MAGeCK-MLE uses a maximum likelihood estimation (MLE) approach to find the values of \( {\overrightarrow{\beta^{\ast}}}_g \) The objective function of MAGeCK-MLE is:
$$ \left({\overrightarrow{\beta}}_g^{*},{\pi_i}^{*}\right)= arg\underset{\upbeta_{\mathrm{g}},\ {\pi}_i}{max}\left({\displaystyle \sum_{\begin{array}{c}\hfill i\in g,\ \hfill \\ {}\hfill j=1,\dots J\hfill \end{array}}} \log\ p\left({x}_{ij}\right)\ \right) $$

Similar to DESeq2 [37], MAGeCK-MLE also adds a prior \( p\left({\overrightarrow{\beta}}_g\right) \) that follows a normal distribution centered on zero in the objective function. Adding this prior makes sure \( {\overrightarrow{\beta^{\ast}}}_g \) does not become arbitrarily large, when the sgRNA knockout efficiency is low and the differences of read counts between samples are high.

The objective function can be maximized using expectation maximization (EM). At the beginning, we have an initial guess of p(π i  = 1). Subsequently, we iteratively update the values of p(π i  = 1) and \( \overrightarrow{\beta} \) in the E step and the M step, respectively.

The initial guess of sgRNA knockout efficiency

We demonstrated that the SSC (Spacer Scoring of CRISPR) algorithm accurately predicts sgRNA knockout efficiency from genomic sequence content [11]. For each sgRNA, SSC generates an efficiency score in the range (−2,2). We scale the score linearly to the range (0,1) as an initial guess of p(π i  = 1). If no initial estimates are given, MAGeCK-MLE starts with p(π i  = 1) = 1 for all sgRNAs.

The expectation step

In the E step, we re-estimate the posterior probability p(π i  = 1) and the current estimation of \( {\overrightarrow{\beta}}_g \):
$$ p\left({\pi}_i=1\Big|{x}_{ij},{\overrightarrow{\beta}}_g\ \right)=\frac{{\displaystyle {\prod}_j}p\left({x}_{ij}\left|{\pi}_i=1,\ {\overrightarrow{\beta}}_g\ \right)p\left({\pi}_i=1\right|{\overrightarrow{\beta}}_g\right)}{{\displaystyle {\prod}_j}p\left({x}_{ij}\Big|{\pi}_i=1,{\overrightarrow{\beta}}_g\right)p\left({\pi}_i=1\Big|{\overrightarrow{\beta}}_g\right) + {\displaystyle {\prod}_j}p\left({x}_{ij}\left|{\pi}_i=0,\ {\overrightarrow{\beta}}_g\ \right)p\left({\pi}_i=0\right|{\overrightarrow{\beta}}_g\right)\ } $$

The maximization step

In the M step, we maximize the values of \( {\overrightarrow{\beta}}_g \) based on the values of p(π i  = 1). To derive the formula for updating \( {\overrightarrow{\beta}}_g \), we write the probability of observing a read count x of x ij as:
$$ P\left({x}_{ij}=x\right)=P{\left({x}_{ij}=x\Big|{\pi}_i=1\right)}^{I\left({\pi}_i=1\right)}*P{\left({x}_{ij}=x\Big|{\pi}_i=0\right)}^{I\left({\pi}_i=0\right)} $$
where I(.) is an indicator function. Taking the logarithm on both sides of the equation, we get
$$ \log P\left({x}_{ij}=x\right)=I\left({\pi}_i=1\right) \log P\left({x}_{ij}=x\Big|{\pi}_i=1\right)+I\left({\pi}_i=0\right) \log P\left({x}_{ij}=x\Big|{\pi}_i=0\right) $$
In the EM algorithm, it can be approximated by replacing the indicator function I(π i  = 1) and I(π i  = 0) with the posterior probability of P(π i  = 1) and P(π i  = 0), respectively [41], using the results from the E step. Therefore, the log likelihood function from the mixture model can be written as:
$$ \begin{array}{l}{\displaystyle \sum_{i,j}} \log P\left({x}_{ij}=x\right)\hfill \\ {}={\displaystyle {\sum}_{i,j}P\left({\pi}_i=1\Big|{x}_{ij},{\overrightarrow{\beta}}_g\right) logP\left({x}_{ij}=x\Big|\ {\pi}_i=1\right)+P\left({\pi}_i=0\Big|{x}_{ij},{\overrightarrow{\beta}}_g\right) logP\left({x}_{ij}=x\Big|{\pi}_i=0\right)}\hfill \end{array} $$
Since NB distribution belongs to exponential family distributions, a fast algorithm exists for the maximum likelihood estimation of generalized linear models [42]. Taking the prior of \( {\overrightarrow{\beta}}_g \) into consideration, the objective function can be maximized using iteratively reweighted ridge regression, or weighted updates, the same the algorithm used in DESeq2 [37]. The update rule for calculating \( {\overrightarrow{\beta^t}}_g \) at step t of the iteration can be written as:
$$ \overrightarrow{\beta_g^t}={\left({\mathrm{D}}^{\hbox{'}\mathrm{T}}W\mathrm{D}\hbox{'}+\lambda I\right)}^{-1}{\mathrm{D}}^{\hbox{'}\mathrm{T}}W\overrightarrow{z^t} $$
Here, W is the diagonal matrix with its values given by w ii  = e i t /(1/μ i  + α i ), where e i t is the current estimate of the efficiency of sgRNA \( i:{e}_i^t=P\left({\pi}_i=1\Big|{x}_{ij},\overrightarrow{\beta_g^{t-1}}\right) \), λ is the regularization parameter in the ridge regression, and μ i is the current estimate of the mean of the NB variable:
$$ \begin{array}{c}\kern1em \overrightarrow{\mu^t}={s}_j exp\left(\overrightarrow{h^t}\right)\kern1em \\ {}\kern1em \overrightarrow{{\mathrm{h}}^{\mathrm{t}-1}}=\mathrm{D}\hbox{'}\overrightarrow{\beta_g^{t-1}}\kern1em \end{array} $$
(4)
\( {\overrightarrow{\mathrm{z}}}^t \) is the residue vector of the current estimate, with its ith element:
$$ {z}_i^t={\mathrm{h}}_{\mathrm{i}}^{t-1}+{e}_i^t\left({\mathrm{x}}_{\mathrm{i}}-{\mu}_i^t\right)/{\mu}_i^t $$

Here, x i is the read count of sgRNA i.

Convergence

The EM approach iterates the E step and the M step until it converges or reaches a predefined maximum number of iteration.

Statistical significance

The statistical significance of \( {\overrightarrow{\beta}}_g \) is calculated in both permutation and Wald test. In permutation test, MAGeCK-MLE shuffles all sgRNAs in a gene to generate empirical null distribution of \( {\overrightarrow{\beta}}_g \) . The number of shufflings is a parameter specified by the user, and the default value is set to be 2*(total number of genes). In the Wald test, MAGeCK-MLE compares the value of \( {\overrightarrow{\beta}}_g/SE\left({\overrightarrow{\beta}}_g\right) \) to the standard Normal distribution, where \( SE\left({\overrightarrow{\beta}}_g\right) \) is the standard error of \( {\overrightarrow{\beta}}_g \):
$$ \begin{array}{c}\hfill SE\left({\overrightarrow{\beta}}_g\right)=\sqrt{diag\Big(Cov\left(\left({\overrightarrow{\beta}}_g\right)\right)\ }\hfill \\ {}\hfill Cov\left(\left({\overrightarrow{\beta}}_g\right)\right)={\left({D}^{\hbox{'}T}W{D}^{\hbox{'}}+\lambda I\right)}^{-1}\left({D}^{\hbox{'}T}W{D}^{\hbox{'}}\right){\left({D}^{\hbox{'}T}W{D}^{\hbox{'}}+\lambda I\right)}^{-1}\hfill \end{array} $$

Here, \( diag\left(Cov\left({\overrightarrow{\beta}}_g\right)\right) \) are the diagonal elements of the covariance matrix of \( {\overrightarrow{\beta}}_g \).

Calculating the over-dispersion factor

The over-dispersion factor, α i , is calculated based on the mean and variance estimation algorithm used in MAGeCK [10] and VOOM [40]. We first calculate the fitted values of \( {\overrightarrow{\beta}}_g \), or \( {\widehat{\beta}}_g \), using the EM algorithm proposed before, with the over-dispersion factor set to a fixed value (for example, 0.01). Then the fitted means \( {\widehat{\mu}}_i \) are calculated using Equation (4), and the residual variances are calculated using the following equation:
$$ {{\widehat{\sigma}}_i}^2={\left({x}_i-{\widehat{\mu}}_i\right)}^2 $$
MAGeCK-MLE then models the sample residual variance \( {\widehat{\sigma}}^2 \) and fitted mean \( \widehat{\mu} \) using the same model as in MAGeCK [10]:
$$ {\widehat{\sigma}}^2=\widehat{\mu}+k{\widehat{\mu}}^b $$
Where k and b are learned from the fitted means and residual variances of all sgRNA read counts. The values of α i are then calculated based on the fitted values of sample residual variance \( {{\widehat{\sigma}}_f}^2 \) from this model:
$$ {\alpha}_i=\frac{{\widehat{\sigma}}_f^2-{\widehat{\mu}}_i}{{{\widehat{\mu}}_i}^2} $$

Availability

The MAGeCK-VISPR workflow is available open source at http://bitbucket.org/liulab/mageck-vispr under the MIT license.

Notes

Abbreviations

AML: 

acute myeloid leukemia

CDF: 

cumulative distribution function

CML: 

chronic myeloid leukemia

CRISPR: 

clustered regularly interspaced short palindromic repeats

D3: 

Data-Driven Documents

dCas9: 

dead Cas9

DMSO: 

dimethyl sulfoxide

EM: 

expectation-maximization

GeCKO: 

genome-scale CRISPR/Cas9 knockout

GLM: 

generalized linear model

GO: 

gene ontology

MAGeCK: 

Model-based Analysis of Genome-wide CRISPR/Cas9 Knockout

MLE: 

maximum-likelihood estimation

NB: 

negative binomial

NGS: 

next-generation sequencing

PCA: 

principle component analysis

QC: 

quality control

RNAi: 

RNA interference

RRA: 

robust rank aggregation

SAM: 

Synergistic Activation Mediator

sgRNA: 

single-guide RNA

VISPR: 

VISualization of crisPR screens

Declarations

Acknowledgements

The authors would like to thank Michael I. Love, Clifford Meyer, Peng Jiang, Bo Li, and Graham McVicker for helpful discussions.

Funding

The project was supported by the NIH grant U01 CA180980 (to XSL), R01 HG008728 (to MB and XSL), Department of Defense Synergistic Idea Development Award PC140817 (to MB and XSL), R01 GM113242-01 (to JSL), NSF grant DMS-1120368 (to JSL), and the Claudia Adams Barr Award in Innovative Basic Cancer Research from the Dana-Farber Cancer Institute.

Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Authors’ Affiliations

(1)
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard T.H. Chan School of Public Health
(2)
Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute
(3)
Division of Molecular and Cellular Oncology, Department of Medical Oncology, Dana-Farber Cancer Institute
(4)
Broad Institute of MIT and Harvard
(5)
Department of Statistics, Harvard University
(6)
Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School
(7)
School of Life Science and Technology, Tongji University

References

  1. Wang T, Wei JJ, Sabatini DM, Lander ES. Genetic screens in human cells using the CRISPR-Cas9 system. Science. 2014;343:80–4.PubMed CentralPubMedView ArticleGoogle Scholar
  2. Shalem O, Sanjana NE, Hartenian E, Shi X, Scott DA, Mikkelsen TS, et al. Genome-scale CRISPR-Cas9 knockout screening in human cells. Science. 2014;343:84–7.PubMed CentralPubMedView ArticleGoogle Scholar
  3. Zhou Y, Zhu S, Cai C, Yuan P, Li C, Huang Y, et al. High-throughput screening of a CRISPR/Cas9 library for functional genomics in human cells. Nature. 2014;509:487–91.PubMedView ArticleGoogle Scholar
  4. Koike-Yusa H, Li Y, Tan E-P, Velasco-Herrera MDC, Yusa K. Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library. Nat Biotechnol. 2014;32:267–73.PubMedView ArticleGoogle Scholar
  5. Konermann S, Brigham MD, Trevino AE, Joung J, Abudayyeh OO, Barcena C, et al. Genome-scale transcriptional activation by an engineered CRISPR-Cas9 complex. Nature. 2015;517:583–8.PubMed CentralPubMedView ArticleGoogle Scholar
  6. Gilbert LA, Horlbeck MA, Adamson B, Villalta JE, Chen Y, Whitehead EH, et al. Genome-scale CRISPR-mediated control of gene repression and activation. Cell. 2014;159:647–61.PubMed CentralPubMedView ArticleGoogle Scholar
  7. Luo B, Cheung HW, Subramanian A, Sharifnia T, Okamoto M, Yang X, et al. Highly parallel identification of essential genes in cancer cells. Proc Natl Acad Sci U S A. 2008;105:20380–5.PubMed CentralPubMedView ArticleGoogle Scholar
  8. König R, Chiang C-Y, Tu BP, Yan SF, DeJesus PD, Romero A, et al. A probability-based approach for the analysis of large-scale RNAi screens. Nat Methods. 2007;4:847–9.PubMedView ArticleGoogle Scholar
  9. Diaz AA, Qin H, Ramalho-Santos M, Song JS. HiTSelect: a comprehensive tool for high-complexity-pooled screen analysis. Nucleic Acids Res. 2015;43:e16–6.PubMed CentralPubMedView ArticleGoogle Scholar
  10. Li W, Xu H, Xiao T, Cong L, Love MI, Zhang F, et al. MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biol. 2014;15:554.PubMed CentralPubMedView ArticleGoogle Scholar
  11. Xu H, Xiao T, Chen C-H, Li W, Meyer C, Wu Q, et al. Sequence determinants of improved CRISPR sgRNA design. Genome Res. 2015;25:1147–57.PubMedView ArticleGoogle Scholar
  12. Wu X, Scott DA, Kriz AJ, Chiu AC, Hsu PD, Dadon DB, et al. Genome-wide binding of the CRISPR endonuclease Cas9 in mammalian cells. Nat Biotechnol. 2014;32:670–6.PubMed CentralPubMedView ArticleGoogle Scholar
  13. Köster J, Rahmann S. Snakemake - a scalable bioinformatics workflow engine. Bioinformatics. 2012;28:2520–2.PubMedView ArticleGoogle Scholar
  14. Wittebolle L, Marzorati M, Clement L, Balloi A, Daffonchio D, Heylen K, et al. Initial community evenness favours functionality under selective stress. Nature. 2009;458:623–6.PubMedView ArticleGoogle Scholar
  15. Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics. 2009;10:48.PubMed CentralPubMedView ArticleGoogle Scholar
  16. Hart T, Brown KR, Sircoulomb F, Rottapel R, Moffat J. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Mol Syst Biol. 2014;10:733–3.PubMed CentralPubMedView ArticleGoogle Scholar
  17. Placke T, Faber K, Nonami A, Putwain SL, Salih HR, Heidel FH, et al. Requirement for CDK6 in MLL-rearranged acute myeloid leukemia. Blood. 2014;124:13–23.PubMed CentralPubMedView ArticleGoogle Scholar
  18. Röthlisberger B, Heizmann M, Bargetzi MJ, Huber AR. TRIB1 overexpression in acute myeloid leukemia. Cancer Genet Cytogenet. 2007;176:58–60.PubMedView ArticleGoogle Scholar
  19. Zhao L-J, Wang Y-Y, Li G, Ma L-Y, Xiong S-M, Weng X-Q, et al. Functional features of RUNX1 mutants in acute transformation of chronic myeloid leukemia and their contribution to inducing murine full-blown leukemia. Blood. 2012;119:2873–82.PubMedView ArticleGoogle Scholar
  20. Davies H, Bignell GR, Cox C, Stephens P, Edkins S, Clegg S, et al. Mutations of the BRAF gene in human cancer. Nature. 2002;417:949–54.PubMedView ArticleGoogle Scholar
  21. Prahallad A, Sun C, Huang S, Di Nicolantonio F, Salazar R, Zecchin D, et al. Unresponsiveness of colon cancer to BRAF(V600E) inhibition through feedback activation of EGFR. Nature. 2012;483:100–3.PubMedView ArticleGoogle Scholar
  22. Cheung HW, Du J, Boehm JS, He F, Weir BA, Wang X, et al. Amplification of CRKL induces transformation and epidermal growth factor receptor inhibitor resistance in human non-small cell lung cancers. Cancer Discov. 2011;1:608–25.PubMed CentralPubMedView ArticleGoogle Scholar
  23. VEGA. A Visualization Grammar. [https://vega.github.io].
  24. Bostock M, Ogievetsky V, Heer J. D3: Data-Driven Documents. IEEE Trans Vis Comput Graph. 2011;17:2301–9.PubMedView ArticleGoogle Scholar
  25. Cunningham F, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2015. Nucleic Acids Res. 2015;43(Database issue):D662–9.PubMed CentralPubMedView ArticleGoogle Scholar
  26. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinformatics. 2013;14:178–92.PubMed CentralPubMedView ArticleGoogle Scholar
  27. Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 2010;38(Web Server issue):W214–20.PubMed CentralPubMedView ArticleGoogle Scholar
  28. The Conda project [https://anaconda.org].
  29. Chen S, Sanjana NE, Zheng K, Shalem O, Lee K, Shi X, et al. Genome-wide CRISPR screen in a mouse model of tumor growth and metastasis. Cell. 2015;160:1246–60.PubMedView ArticleGoogle Scholar
  30. Parnas O, Jovanovic M, Eisenhaure TM, Herbst RH, Dixit A, Ye CJ, et al. A Genome-wide CRISPR Screen in Primary Immune Cells to Dissect Regulatory Networks. Cell. 2015;162:675–86.PubMedView ArticleGoogle Scholar
  31. Wang Q, Huang J, Sun H, Liu J, Wang J, Wang Q, et al. CR Cistrome: a ChIP-Seq database for chromatin regulators and histone modification linkages in human and mouse. Nucleic Acids Res. 2014;42(Database issue):D450–8.PubMed CentralPubMedView ArticleGoogle Scholar
  32. Diaz A, Nellore A, Song JS. CHANCE: comprehensive software for quality control and validation of ChIP-seq data. Genome Biol. 2012;13:R98.PubMed CentralPubMedView ArticleGoogle Scholar
  33. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–27.PubMedView ArticleGoogle Scholar
  34. Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol. 2014;32:896–902.PubMed CentralPubMedView ArticleGoogle Scholar
  35. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009;26:139–40.PubMed CentralPubMedView ArticleGoogle Scholar
  36. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106.PubMed CentralPubMedView ArticleGoogle Scholar
  37. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.PubMed CentralPubMedView ArticleGoogle Scholar
  38. Doench JG, Hartenian E, Graham DB, Tothova Z, Hegde M, Smith I, et al. Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nat Biotechnol. 2014;32:1262–7.PubMed CentralPubMedView ArticleGoogle Scholar
  39. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article3–25.Google Scholar
  40. Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15:R29.PubMed CentralPubMedView ArticleGoogle Scholar
  41. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B. 1977;39:1–38.Google Scholar
  42. Fox J. Applied Regression Analysis and Generalized Linear Models. London: SAGE Publications; 2015.Google Scholar

Copyright

© Li et al. 2015

Advertisement