gscreend: modelling asymmetric count ratios in CRISPR screens to decrease experiment size and improve phenotype detection

Imkeller, Katharina; Ambrosi, Giulia; Boutros, Michael; Huber, Wolfgang

doi:10.1186/s13059-020-1939-1

Method
Open access
Published: 02 March 2020

gscreend: modelling asymmetric count ratios in CRISPR screens to decrease experiment size and improve phenotype detection

Katharina Imkeller^1,2,
Giulia Ambrosi¹,
Michael Boutros¹ &
…
Wolfgang Huber ORCID: orcid.org/0000-0002-0474-2218²

Genome Biology volume 21, Article number: 53 (2020) Cite this article

8506 Accesses
22 Citations
35 Altmetric
Metrics details

Abstract

Pooled CRISPR screens are a powerful tool to probe genotype-phenotype relationships at genome-wide scale. However, criteria for optimal design are missing, and it remains unclear how experimental parameters affect results. Here, we report that random decreases in gRNA abundance are more likely than increases due to bottle-neck effects during the cell proliferation phase. Failure to consider this asymmetry leads to loss of detection power. We provide a new statistical test that addresses this problem and improves hit detection at reduced experiment size. The method is implemented in the R package gscreend, which is available at http://bioconductor.org/packages/gscreend.

Background

Genetic perturbation screens are a powerful tool to systematically probe gene function and genotype-phenotype relationships in many different cell types. Their applications include identification of genotype-specific vulnerabilities in human cancer cells [1–5], discovery of genes involved in drug resistance [6–8], and virus replication [9, 10].

Currently, the most widespread technology to induce specific genetic perturbations is based on CRISPR (clustered regularly interspaced short palindromic repeats)-associated enzymes. In this approach, DNA constructs encoding a guide RNA (gRNA) and the CRISPR-associated enzyme are stably integrated into cells, e.g., via lentiviral transduction. The gRNA directs the CRISPR-associated enzyme to its sequence-specific target site in the genome. To generate gene knockout perturbations, a common choice of enzyme is the endonuclease Cas9 (CRISPR associated 9) [6, 7, 11], which induces DNA cleavage at the genomic site it is directed to. Subsequent DNA repair via non-homologous end joining leads to frame-shift mutations and premature stop-codons, nonsense-mediated RNA decay, and finally gene knockout (CRISPR-KO). Alternatively, it is possible to introduce more subtle perturbations such as altered splicing patterns [12, 13] or quantitative modulation of gene expression [14]. To this end, modified CRISPR-associated enzymes which function as epigenetic modifiers [15–17], transcriptional modulators [18–20], or single-base editors [21, 22] are used.

Pooled screens enable the measurement of the effects of many genetic perturbations in parallel in a single experiment. To this end, a library of gRNAs is introduced into a pool of cells at a low multiplicity of infection such that no more than one gRNA is present in the vast majority of cells [23, 24]. The gRNA sequence simultaneously serves as a barcode that is used to trace which perturbation each cell carries. In the case of negative selection screens, the transduced cell pool is allowed to grow for several divisions during which the relative abundance of cells with a particular gRNA increases or decreases depending on the extent to which the targeted gene determines cell fitness. These effects are detected by amplifying, sequencing, and counting the gRNAs before (library or T0) and after (T1) the proliferation phase (Fig. 1a).

A typical genome-wide CRISPR screening library for a mammalian genome contains between 70,000 and 120,000 gRNAs [2, 6, 7, 11, 26, 27]. To ensure statistical power, each gRNA must be represented by a sufficient number of cells during each step of the screen. When designing screening experiments, it is convenient to assume that all gRNAs are present in the library at approximately the same relative frequencies, and the library composition is summarized by the mean gRNA abundance, also referred to as coverage or representation. This measure is then used to calculate the necessary size of an experiment [23, 24]. For a library of 100,000 gRNAs and a desired coverage of 500 for example, 50 million cells (500 times 100,000) must be cultured. Published recommendations on optimal library coverage selection range from 200 [28] to 500 [23]. Nagy et al. used computational simulations to investigate the impact of screen parameters on the robustness of screening results, highlighting how coverage and screen duration can influence signal-to-noise ratios [29]. Further optimizing such experimental choices is a major thrust of this work, since they have substantial consequences on the size, costs, and outcomes of an experiment.

To compare the gRNA abundances before and after the proliferation phase, a range of statistical models and computational tools are available [30–37]. Common approaches are to model the joint bivariate null distribution of the normalized counts before and after the proliferation phase, or the null distribution of a univariate summary statistic, the ratio of these counts, hereafter referred to as the “before/after ratio.” gRNAs whose data fall sufficiently outside the null distribution present evidence of a fitness effect of their target gene. Since it is common that each gene is targeted by multiple gRNAs, a subsequent step of the analysis consists of aggregating gRNA-level evidence to the gene level. This can be achieved for example using Bayesian hierarchical modelling [31, 35, 36] or robust rank aggregation [30] (see also Additional file 1: Table S1 for a short summary of the methods compared in this work).

For the null distribution modelling and hypothesis testing, approaches derived from RNA sequencing and differential gene expression analysis have been used [38, 39]. Here, we show that the distribution of the before/after ratios for negative controls is often asymmetric in CRISPR-KO screens; a similar observation has previously been reported for CRISPRi/CRISPRa screens [35]. Such asymmetry means that even in the absence of a fitness effect, a gRNA’s relative abundance x is more likely to randomly decrease to, say, x/q (q>1), rather than increase to xq during the screen. Failing to account for such asymmetry (as is done when using the RNA-seq-based tools, which are designed for data that do not exhibit asymmetry) leads to needlessly elevated false-positive rates and/or decreased detection power.

Here, we present a biology-based, generative model that explains the asymmetry of the before/after ratios in pooled CRISPR screens and mechanistically links it to specific steps in the screening experiment. Based on our model, we derive a statistical test that we implemented in the R package gscreend and that enables accurate phenotype detection at reduced experiment size compared to existing approaches. Moreover, through our model, we can calculate the minimal experiment size necessary for a given screening library and a required detection power, a point that has never been systematically addressed in any published CRISPR screening protocol.

Results

Before/after ratios from pooled genetic screens have an asymmetric null distribution

We studied the distributions of the gRNA counts at T0 and T1 in two pooled CRISPR-KO conducted in human cell lines [1, 2] (Fig. 1). After scaling normalization of the counts to the total counts at T0, we computed the logarithm of the ratio of the counts after and before the proliferation phase (logarithmic fold change, LFC, see the “Methods” section). We focused on two classes of gRNAs: (a) those that should not have a fitness effect because their sequence does not match any region in the human genome (Fig. 1b–d [1]) and (b) those that target genes that are not essential for cell fitness according to a study by Hart et al. [25] (Fig. 1e–g [2]). The sign of their LFCs was uncorrelated between replicates, in agreement with the assumption that the LFCs were due to random experimental variability and not due to target-dependent fitness effects (Fig. 1d, g). However, the distribution of LFCs was not symmetric, in particular at the tails: values of LFC <−1 (strongly decreased abundance) were approximately 5–10% more frequent than those with LFC >+1 (strongly increased abundance) (Fig. 1b–c and e–f). These results are qualitatively in accordance with those of Daley et al. [35].

Computational simulation of pooled CRISPR screens

To investigate the origin of this asymmetry and possible impact of experimental design parameters, we developed a quantitative model and computational simulation of pooled CRISPR screens. The state space of the model is the tuple of integer counts of the gRNAs, which the model tracks as a function of time throughout the different steps of the screen (Fig. 2a). The temporal evolution of the state is described by endomorphic functions simulating subsampling during transduction, cell splitting, and sequencing as well as exponential cell growth according to a gRNA-specific growth rate. In our simulations, we considered screens performed with a library of 50,000 gRNAs (targeting 12,500 genes with 4 independent gRNAs per gene). For 10% of the genes, the knockout leads to a growth defect, and for 1% to increased growth. Table 1 summarizes the simulation parameters. A detailed description of the simulation algorithm is provided in the “Methods” section.

Table 1 Parameters of CRISPR screen simulation

Full size table

Plasmid library is a better reference sample than T0 cell pool

We first investigated the effect of the choice of reference sample. Previous publications used gRNA counts from either the plasmid library or the T0 cell pool as reference (Fig. 1a) [1, 2, 27, 40, 41], and it is unclear to what extent this choice influences the analysis outcome. Time point T0 is after the antibiotics selection of cells that were successfully transduced, in other words, up to four cell doublings after transduction. Such selection is necessary because at typical multiplicities of infection, only a fraction of cells is infected. In our simulations, we observed that counts of gRNAs targeting essential genes were already decreased at T0, especially for fast growing cells (Additional file 1: Figure S1A). To confirm this experimentally, we transduced pools of Cas9 expressing HCT116 and RKO cells with a genome-wide CRISPR library and selected the successfully transduced cells for 4 days, similar to the period before T0 in a screen. We sequenced the gRNAs in the plasmid library and at T0 and compared their normalized abundances. Similar to the prediction from the simulation, gRNAs targeting essential genes [42] had reduced abundance at T0 (Additional file 1: Figure S1B). This result implies that plasmid library rather than T0 sequencing should be used as a reference to avoid premature under-representation of gRNAs targeting essential genes.

The asymmetry of before/after ratios is caused by cell splitting during the proliferation phase

We next investigated the effect of experimental parameters on the distribution of before/after ratios for gRNAs without effect on cell fitness (Fig. 2b–e). The key parameter for cell culture during a screen is the mean gRNA coverage, which is reflected in the number of cells that are seeded after every round of splitting. We found that the smaller the coverage during cell splitting, the greater the asymmetry of before/after ratios (Fig. 2b–c). Higher levels of asymmetry lead to impairment of phenotype detection by MAGeCK-RRA [30], a current state-of-the-art analysis tool, which lost 10% of recall at 95% precision when, for example, reducing the cell splitting coverage from 400 to 100 (Fig. 2d). Similarly, the asymmetry increased when using faster growing cell lines, as we observed in our simulations (Fig. 2e) and in experimental datasets (Additional file 1: Figure S2) [40, 43]. We also tested the effect of other parameters, such as coverage during transduction or polymerase chain reaction (PCR). These, however, only marginally influenced the asymmetry of before/after ratios in our simulations (Additional file 1: Figure S3). Decreasing the cell splitting coverage led to up to 20% of gRNAs with LFC <−1, whereas for similar changes in PCR or transduction coverage, this fraction was only 3% (Fig. 2c and Additional file 1: Figure S3).

We conclude that transduction and PCR do not cause major technical biases in the data and that it is better to sequence the gRNA pool in the plasmid library rather than at T0. The observed asymmetry however can be mechanistically explained by multiple rounds of cell splitting bottlenecks and exponential growth (Fig. 2f). With every round of exponential growth followed by random sampling of cells, the distribution of gRNA abundances gets wider, i.e., there are more and more gRNAs that are underrepresented in comparison with the mean gRNA coverage. We confirmed the gradual broadening of the abundance distribution of gRNA targeting non-essential genes [25] in a published dataset from a CRISPR-knockout screen performed in HCT116 cells (Fig. 2g) [2].

Wide initial gRNA abundance distributions increase asymmetry of before/after ratios

Since the observed asymmetry is caused by broadening of the gRNA abundance distribution, we hypothesized that the width of the gRNA abundance distribution in the plasmid library also influences the data quality of the screen. A measure of the width of this distribution, i.e., the difference in abundance between low and high abundant gRNAs, is the ratio between the 90 and 10% percentiles. This measure, elsewhere also named “skew ratio” [23], will hereafter be referred to as “distribution width.” If, for example, the most abundant 10% gRNAs of a library have an abundance higher than 500 whereas the least abundant 10% have less than 100 counts, the distribution width is 5.

We performed simulations starting from three different gRNA libraries with varying distribution width (Fig. 3a). Our simulations showed that with higher width, the reproducibility between replicates decreased (Fig. 3b) and at the same time the frequency of gRNAs with LFC < −1 increased (Fig. 3c). Experimental data from screens conducted using plasmid libraries with different distribution widths confirmed this finding (Fig. 3d–f) [2, 27, 40, 41, 43, 44]. Using plasmid libraries with narrower gRNA abundance distributions thus increases data quality by reducing the asymmetry of the distribution of before/after ratios.

Furthermore, we also found that the gRNA sequence composition of a library correlates with its width and that gRNAs with specific sequence properties are more likely to be over- or underrepresented (Additional file 1: Figure S4). To show this, we selected five datasets from published CRISPR libraries [1, 27, 40, 43–45]. These libraries have different distribution widths ranging from 2.4 to 8.8 (Additional file 1: Figure S4A-B). To examine the sequence composition, we generated probability sequence motifs for the least and most abundant gRNAs (Additional file 1: Figure S4C) [46]. Wider libraries tend to have poly-G-stretches in low abundant gRNAs and poly-T-stretches high abundant gRNAs (Additional file 1: Figure S4D). This is probably due to sequence-specific biases during the generation of the plasmid library, for example during synthesis or PCR amplification of gRNAs.

New statistical method for improved phenotype detection

We showed that before/after ratio distributions in pooled CRISPR screens are asymmetric due to technical artifacts arising during the cell proliferation phase. This asymmetry is influenced not only by cell splitting parameters but also by the width of the gRNA abundance distribution in the plasmid library. In principle, it would be possible to eliminate this asymmetry by using plasmid libraries with minimal distribution width and to perform the screen at very high coverage. However, since this is generally neither feasible nor economically reasonable, we developed a new statistical test that accounts for the asymmetric null distribution. The underlying idea of our method is to use a skew normal distribution to model the LFC null distribution.

The workflow of our new analysis method gscreend is depicted in Fig. 4a. After scaling of gRNA counts and calculation of LFCs, the data is split into slices according to the gRNA abundance in the reference sample (e.g., plasmid library). We introduce this stratification since it allows the parameters of the null distribution to be different for gRNAs with low and high abundance, consistent with what we observed in datasets. We model the LFCs in each stratum as a mixture of a parametric null distribution, the skew normal, and an unspecified alternative distribution [47, 48]. The first mixture component corresponds to gRNAs without fitness effect, the second to those with effect, where we assume that these are only a minority. gscreend uses least quantile of squares regression [49] to fit the null distribution to the LFCs in each stratum. Least quantile of squares regression fits a model by only taking into account a defined proportion of residuals, e.g., those between the 10 and 90% percentiles. In contrast to the commonly used least sum of squares regression, it is thus more robust to outliers. In the gscreend workflow, the resulting null models for every stratum are used to calculate p values, which are then employed to rank the gRNAs. Subsequently, robust rank aggregation [30, 50] is applied to aggregate the ranked gRNA list to the gene level.

We first tested how well gscreend performed in accurately ranking genes in experimental datasets. Using a published list of essential and non-essential genes [25, 42], we calculated the recall at 95% precision (as in Fig. 2d) of our and other tools [30, 31, 33, 35, 37]. Additional file 1: Table S1 summarizes the different statistical concepts underlying the six methods. gscreend outperformed MAGeCK, ScreenBEAM, CRISPhieRmix, and CRISPRBetaBinom when ranking genes in a CRISPR-knockout screen performed in HCT116 cells (Fig. 4b) [2]. BAGEL was the only tool that had a better precision-recall performance than gscreend on these data. However, its algorithm was trained on the same benchmark set of essential and non-essential genes that we used here to calculate precision-recall statistics, which might explain some of this performance. Indeed, when ranking components of the ribosome, whose knockout is likely to be lethal, gscreend outperformed BAGEL, MAGeCK, ScreenBEAM, CRISPhieRmix, and CRISPRBetaBinom, especially within the 1000 lowest ranked genes (Fig. 4c). In order to illustrate some examples, we highlighted the results for five selected genes (Fig. 4d, e). MRPL34 and MRPS12 (components of the mitochondrial ribosome) are detected with low rank only by gscreend, although their gRNA abundance profile indicates that they are truly essential, because two of the corresponding gRNAs are strongly depleted at T18 in all three replicates (Fig. 4e). The other three genes NDUFAF3, PRRC2A, and UVRAG are assigned low ranks in BAGEL or MAGeCK but remain above the 1% false discovery rate (FDR) threshold in the gscreend results (Fig. 4d). Their gRNA abundance profile indicates that the observed negative LFCs are technical artifacts, because they were not reproduced between replicates (Fig. 4e). Taken together, these results indicate that gscreend delivers superior accuracy in ranking and identifying essential genes in pooled negative selection screens.

We also investigated whether our method is robust against different levels of the asymmetry of before/after ratios by simulations. Similar to what we found using the experimental data, gscreend had a better ranking accuracy than the other tools when detecting genes that either increase or decrease cell fitness (Fig. 4f, g). When increasing the asymmetry by reducing cell splitting coverage or increasing library distribution width, the method maintained a better ranking accuracy than MAGeCK-RRA (Fig. 4h). gscreend enables reduction of the cell splitting coverage by approximately 50% for a library distribution width of 7.5: using 300 (gscreend) instead of 600 (MAGeCK) mean gRNA coverage maintained at least 95% recall at 99% precision (Fig. 4h). For libraries with larger distribution widths, the gain in accuracy is even more substantial.

To further assess the relevance of these findings for the analysis of experimental data, we compared the performance of gscreend and MAGeCK on a series of datasets with highly different experimental setups. The screens performed by Hart et al. [2] were conducted with a coverage of 200 and a library distribution width of around 7.8; according to our analysis, these parameters lead to high asymmetry and high levels of noise. In contrast, screens from the DepMap consortium [40, 43] were conducted at a coverage of 500 and a library width of around 4.6, which represents a more favorable experimental setup. When comparing precision-recall performance, and more specifically recall at 95% precision, gscreend outperforms MAGeCK on both datasets (Fig. 4i). However, the difference was small for the DepMap data, while it was more substantial for the Hart et al. dataset. These results show that gscreend can provide tangible improvements on real data, although the improvements are less pronounced for data produced from optimal experimental designs.

Implications for the design of screening experiments

Based on the findings reported above, we suggest that when designing a screen, the distribution width of the gRNA plasmid library should first be measured. Based on this measure, our simulation tool (Fig. 4h, left panel), can then be used to predict the corresponding optimal coverage (summarized in Table 2). Libraries with a narrow width can be screened at lower coverage than wide ones to achieve the same signal-to-noise ratio, since the impact of asymmetric loss is smaller. This, in turn, may have a significant impact on the costs and effort associated with the experiment.

Table 2 Recommended screening coverage for different library distributions

Full size table

Discussion

Accurately detecting phenotypes in pooled genetic perturbation screens is key to generating hypotheses that justify follow-up. Screens that correctly distinguish all genes that negatively or positively regulate cell fitness can be used not only to identify the strongest “hits,” but also to measure subtle differences in growth rate and thus map whole pathways and potentially identify mechanisms.

To achieve high data quality and accurate analysis, we need to understand how the experimental design influences the results. Previously reported simulations of CRISPR-based screens highlighted the importance of coverage for reducing the signal-to-noise ratio [29]. Our study is the first to systematically explore the influence of experimental design, including the quality of the gRNA library—as measured by the library distribution width—on phenotype detection in pooled screens. Given a certain gRNA library distribution width, our method provides a quantitative prescription for the choice of cell splitting coverage. We show that gRNA coverage during PCR and transduction, provided it is in the same range as the cell splitting coverage, only marginally influences data quality. We also find that screens are best analyzed when using plasmid library sequencing as reference. We do not discuss the influence of the multiplicity of infection during viral transduction, as there is already literature and a good model available to address this point [24]. Our most important novel finding is that the asymmetry of the distribution of before/after ratios is caused during the proliferation phase of pooled negative selection screens. Multiple consecutive rounds of cell splitting and exponential growth gradually lead to random loss of low abundant gRNAs.

Using this understanding of the asymmetric null distribution of before/after ratios, we developed a new statistical test that improves phenotype detection. gscreend outperforms existing analysis methods, which rely on the assumption that the null distribution is symmetric. From the point of view of screen design, our method enables reduction of experiment size by up to 50% compared to other tools, because it maintains high analysis accuracy throughout a broad range of experimental settings. Especially for experiments that are limited by their size because of limited supply of cells, for example in primary cell cultures [39, 51], our method may help to improve phenotype detection.

Our results also provide indications on how to optimize the experimental design by choosing the screening coverage according to the width of the plasmid library (Table 2). Intriguingly, the width of the library distribution is the limiting parameter that dictates the minimal size of a screening experiment. It would thus be possible to strongly reduce the experiment size by using a library with a narrower distribution. Our analyses indicate that gRNAs with specific sequence characteristics are likely to be over- or underrepresented in gRNA libraries obtained using arrayed synthesis approaches and cloning. We hypothesize that the broadening of library distribution is due to sequence-specific differences in synthesis or amplification efficiency. A recently published approach to synthesize covalently-closed-circular-synthesized (3Cs) gRNA libraries may thus be a promising technology for substantial reduction of library width and experiment size [52].

Finally, the discovery of sequence specific representation differences of gRNAs in a library also has important implications for the evaluation of their gene knockout efficiency. gRNAs with specific sequence properties might seem more efficient than others simply because they are less abundant in the library and thus more likely to suffer from the here described asymmetric loss phenomenon [27, 53].

Conclusion

We conclude that the asymmetry of the before/after ratio distribution in pooled CRISPR screens is primarily caused by insufficient coverage of gRNAs during the cellular growth phase of a screen. Our results can be used to predict necessary experiment sizes, which are most importantly dictated by the width of the plasmid library. Our R package gscreend takes into account the asymmetry of the null distribution and improves phenotype detection at reduced experiment size.

Methods

Experimental datasets

The following datasets from published CRISPR knockout screens were used: screen in KBM7 cells [1] (Fig. 1); screen with TKO library in HCT116 cells, time points T08, T15, and T18 [2] (Figs. 1, 2, 4); screen in mouse ESC using mouse genome-wide libraries V1 and V2 [27] (Fig. 3). gRNA count data from the DepMap project [40, 43] was downloaded together with a dataset of cell doubling times [3] (Fig. 3f, 4i, Additional file 1: Figure S2). Data from project Score was downloaded from the project data repository [41] (Fig. 3f).

Data from library and T0 sequencing used in Additional file 1: Figure S1 was collected during a CRISPR screen in HCT116 and RKO cells. The 90k Toronto human Knockout pooled library (TKO) was a gift from Dr. Jason Moffat (1000000069, Addgene). Plasmid library was amplified using ElectroMAXTM Stbl4TM cells (Invitrogen) accordingly to the manufacturer’s protocol. Library vector was transfected into HEK293T cells (ATCC) with TransIT-LT1 (Mirus Bio) transfection reagent along with psPAX2 (12260, Addgene) and pMD2.G (12259, Addgene) packaging plasmids to produce lentivirus. HCT116 and RKO cells (ATCC) stably expressing Cas9 (73310, Addgene) were infected in the presence of 8 μg/ml polybrene (Merck Millipore) with the 90k TKO gRNAs library at a multiplicity of infection (MOI) equal to 0.3 such that each gRNA was present in 500 cells on average. The day after, puromycin-containing medium was added to the infected cells for 48 h. On day 4 after transduction, a portion of cells were harvested as T0 time point. Genomic DNA from cell pellets was extracted using QIAamp DNA Blood Maxi kit (Qiagen). To amplify the gRNA sequences, a total of 140 PCR reactions were performed using 1 μg of genomic or plasmid library DNA each (250-fold coverage), Q5 Hot Start HF polymerase (NEB), and primers harboring the Illumina TruSeq adapter sequences. PCR products were purified using DNA Clean and Concentrator TM-100 (Zymo Research) and MagSi-NGSprep Plus beads (Steinbrenner). Sample concentrations were measured using Qubit HS DNA Assay (Thermo Fisher). Library amplicon size was verified using DNA High Sensitivity Assay on a BioAnalyzer 2100 (Agilent) and then sequenced on a NextSeq (Illumina) by 75 bp single-end sequencing and addition of 25% PhiX control v3 (Illumina). gRNAs were counted using the count function with automatic sequence trimming provided by MAGeCK [30].

Simulation of pooled CRISPR screens

We simulate a complete pooled CRISPR-knockout screen, providing output files that represent gRNA counts after sequencing of the plasmid library and T0 and T1 samples (see also Fig. 2a). The simulation depends on several parameters that reflect the experimental setup (see Table 1).

In a first step, the abundance n_lib,g of every gRNA g (where g=1,…,N_tot) in the plasmid library is sampled from a lognormal distribution LN(μ,σ), where μ=5 and σ is chosen to match the user-specified library distribution width L. We chose μ=5 because resulting distributions resemble those seen in experimental data. The sequencing counts \(n_{\textsf {lib}, g}^{\textsf {seq}}\) are obtained by making C_PCRN_tot draws from the multivariate hypergeometric distribution with probabilities \(p_{g} = n_{\textsf {lib}, g}/\sum _{g} n_{\textsf {lib}, g}\). This is repeated N_libpcr times, to model the technical replicates.

In the next step, the abundance of gRNAs in the transduced cell pool n_trans,g is obtained by making C_PCRN_tot draws from the multivariate hypergeometric distribution with probabilities \(p_{g} = n_{\textsf {lib}, g}/\sum _{g} n_{\textsf {lib}, g}\).

The pool of gRNAs of total size N_tot is partitioned into three sets: gRNAs without effect on cell fitness (Γ_neutral), gRNAs increasing cell fitness (Γ_pos), and gRNAs decreasing cell fitness (Γ_neg). The sets Γ_neutral, Γ_pos, and Γ_neg have respective sizes N_neutral, N_pos, and N_neg such that N_neg=ϕ_negN_tot, N_pos=ϕ_posN_tot and N_neg+N_pos+N_neutal=N_tot. gRNAs from the different categories are assigned to essential, non-essential, or growth-suppressing genes according to N_gRNA.

In general, the cell proliferation-induced change in abundance of gRNA g between times t and t+Δ_t can be modeled as n_g(t+Δ_t)=e^βn_g(t), where β is the baseline cellular growth factor between two splittings and Δ_t the time between two splittings. β for a specific cell doubling time τ can thus be calculated as \(\beta = \log \left (2^{\frac {\Delta _{t}}{\tau }}\right)\).

A gRNA specific growth rate β_g is then derived from β_baseline such that:

β_g=β for every g∈Γ_neutral,

β_g=β(1+ε) for every g∈Γ_pos,

β_g=β(1−ε) for every g∈Γ_neg,

where ε is randomly chosen from 0, 0.01, 0.02,...,0.2.

The gRNA abundances at time t0 are calculated from the abundances in the transduced cell pool as:

\(\phantom {\dot {i}\!}n_{\textsf {t0},g} = e^{\beta _{g}}n_{\textsf {trans},g}\) (real numbers are converted to integer by only taking the integer part).

The sequencing counts from T0 \(n_{\textsf {t0}, g}^{\textsf {seq}}\) are obtained by making C_PCRN_tot draws from the multivariate hypergeometric distribution with probabilities \(p_{g} = n_{\textsf {t0}, g}/\sum _{g} n_{\textsf { t0}, g}\). This is repeated N_biopcr times, to model the technical replicates.

Next, the proliferation phase of the screen is simulated N_bio independent times to model the biological replicates. For i=1…N_split, the gRNA abundances after cell splitting n_i,split,g are obtained by making C_cellsN_tot draws from the multivariate hypergeometric distribution with probabilities \(p_{g} = n_{\textsf {i}, g}/\sum _{g} n_{\textsf {i}, g}\). This random sampling step is followed by an exponential growth step \(\phantom {\dot {i}\!}n_{\textsf {i+1},g} = e^{\beta _{g}}n_{\textsf {i,split},g}\). After completion of all cell splittings, the gRNAs in all biological replicates (time point T1) are sequenced by making C_PCRN_tot draws from the multivariate hypergeometric distribution with probabilities \(p_{g} = n_{{N_{\textsf {split}}}, g}/\sum _{g} n_{{N_{\textsf {split}}}, g}\). This is repeated N_biopcr times, to model the technical replicates.

For the analyses shown in Figs. 2b–e, 3a–c, and 4f–h, C_PCR and C_virus are chosen as indicated in the following table. The values of C_PCR and C_virus are chosen so that the 10% percentile of low abundant gRNAs in the library have a coverage of 100 fold.

Normalization and LFC calculation

Counts from experimental data were normalized using size normalization to the total read counts of the reference sample. This was not necessary for simulated datasets, because these already had the same read counts. For a given gRNA with count n_lib in the library and n₁ at time point T1, the log fold change was calculated as \(\textsf {LFC} = \text {log}2\left (\frac {n_{1} + 1}{n_{lib} + 1}\right)\). Pseudo-counts had to be added for division and log transformation since some of the low abundant gRNAs had 0 counts in one or more of the replicates.

Library width calculation

The width of a distribution of gRNA abundances can be quantified by calculating the ratio between the 90 and 10% percentile of the distribution [23]: \(\textsf {library width} = \frac {\text {percentile}_{90}(n_{\textsf {lib,g}})}{\text {percentile}_{10}(n_{\textsf {lib,g}})}\). n_lib,g is the distribution of gRNA abundances in the library.

gRNA sequence composition

The sequence probability logos in Additional file 1: Figure S4 were generated using the output of the plogo online tool [46] and R. gRNAs were ranked according to their abundance and the sequence logos generated for the lower and upper 1% and 5% of gRNAs.

gscreend method

gscreend is designed to account for asymmetric distribution of before/after ratios in pooled genetic perturbation screens (see also Fig. 4a).

gscreend takes (non-normalized) gRNA counts from several samples as its input. One of these is the reference sample (e.g., the library or T0); the others are one or several replicates of a post-screen time point (e.g. T1). The counts are scaled (a.k.a. normalized) to the total counts of the reference sample. Log2 fold changes are calculated as described above. The data are split into slices according to the gRNA abundance in the reference sample; the current implementation uses 10 slices split at the 10%, 20%, …quantiles. We use this stratification because the null distributions of the fold changes depend on it and are fit separately in each stratum.

We model the overall LFC data as a mixture of a parametric null distribution, the skew normal, and an unspecified alternative distribution [47, 48]. The first mixture component corresponds to gRNAs without fitness effect, the second to those with effect, and we will assume that these are only a minority. We use the R package fGarch for computations involving the skew normal distribution and use least quantile of square regression [49] on a 10–90% percentile of the log-likelihood to infer the model parameters from the distribution of LFCs (function lbfgs from R package nloptr).

In the next step, for every stratum, p values are calculated for every gRNA. The gRNAs are ranked based on their pvalues (if there are multiple replicates, each gRNA gets as many ranks). On this ranking, gscreend uses an α-RRA (robust rank aggregation) algorithm with an α cutoff of 5% to aggregate the data to the gene level [30, 50]. Gene level LFCs are calculated by averaging the LFCs over all gRNAs belonging to the gene.

Comparison of analysis tools

Results from the analysis of simulated and experimental data using the different analysis tools were compared as follows:

gscreend analysis was performed with 10–90% percentile for least quantile of square method and 5% threshold for α-RRA algorithm. Genes were ranked according to the p value and for genes with the same p value according to their mean LFC over all corresponding gRNAs.
MAGeCK [30] analysis was performed using the RRA algorithm, without normalization to controls. Genes were ranked according to the rank provided by MAGeCK.
BAGEL [33] analysis was performed only on experimental data because the algorithms need a list of essential and non-essential genes as training sets. Creating this type of list on a set of simulated data would be arbitrary, since its quality cannot be compared to the currently available lists of essential and non-essential genes. BAGEL analysis was run without removal of low counts. Ranking of genes was performed based on Bayes factors.
ScreenBEAM [31] analysis was performed without removal of low counts. Genes were ranked according to p values.
CRISPhieRmix [35] analysis was performed according to the software default settings. The packages take LFC data, which was calculated as described in the above LFC calculation section. Genes were ranked based on the genescore returned by the CRISPhieRmix analysis.
CRISPRBetaBinomial [37] results were ranked according to the fdr_pa parameter, which corresponds to enrichment in the after-screen time point at the gene level. For genes with identical fdr_pa, rank was attributed according to LFC.

References

Wang T, Birsoy K, Hughes NW, Krupczak KM, Post Y, Wei JJ, Lander ES, Sabatini DM. Identification and characterization of essential genes in the human genome. Science. 2015; 350(6264):1096–101. https://doi.org/10.1126/science.aac7041.
Article CAS PubMed PubMed Central Google Scholar
Hart T, Chandrashekhar M, Aregger M, Steinhart Z, Brown KR, MacLeod G, Mis M, Zimmermann M, Fradet-Turcotte A, Sun S, Mero P, Dirks P, Sidhu S, Roth FP, Rissland OS, Durocher D, Angers S, Moffat J. High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell. 2015; 163(6):1515–26. https://doi.org/10.1016/j.cell.2015.11.015.
Article CAS PubMed Google Scholar
Tsherniak A, Vazquez F, Montgomery PG, Weir BA, Kryukov G, Cowley GS, Gill S, Harrington WF, Pantel S, Krill-Burger JM, Meyers RM, Ali L, Goodale A, Lee Y, Jiang G, Hsiao J, Gerath WFJ, Howell S, Merkel E, Ghandi M, Garraway LA, Root DE, Golub TR, Boehm JS, Hahn WC. Defining a cancer dependency map. Cell. 2017; 170(3):564–57616. https://doi.org/10.1016/J.CELL.2017.06.010.
Article CAS PubMed PubMed Central Google Scholar
Wang T, Yu H, Hughes NW, Liu B, Kendirli A, Klein K, Chen WW, Lander ES, Sabatini DM. Gene essentiality profiling reveals gene networks and synthetic lethal interactions with oncogenic Ras. Cell. 2017; 168(5):890–90315. https://doi.org/10.1016/j.cell.2017.01.013.
Article CAS PubMed PubMed Central Google Scholar
Steinhart Z, Pavlovic Z, Chandrashekhar M, Hart T, Wang X, Zhang X, Robitaille M, Brown KR, Jaksani S, Overmeer R, Boj SF, Adams J, Pan J, Clevers H, Sidhu S, Moffat J, Angers S. Genome-wide CRISPR screens reveal a Wnt–FZD5 signaling circuit as a druggable vulnerability of RNF43-mutant pancreatic tumors. Nat Med. 2017; 23(1):60–8. https://doi.org/10.1038/nm.4219.
Article CAS PubMed Google Scholar
Wang T, Wei JJ, Sabatini DM, Lander ES. Genetic screens in human cells using the CRISPR-Cas9 system. Science. 2014; 343(6166):80–4. https://doi.org/10.1126/SCIENCE.1246981.
Article CAS PubMed Google Scholar
Shalem O, Sanjana NE, Hartenian E, Shi X, Scott DA, Mikkelson T, Heckl D, Ebert BL, Root DE, Doench JG, Zhang F. Genome-scale CRISPR-Cas9 knockout screening in human cells. Science. 2014; 343(6166):84–7. https://doi.org/10.1126/science.1247005.
Article CAS PubMed Google Scholar
Shi J, Wang E, Milazzo JP, Wang Z, Kinney JB, Vakoc CR. Discovery of cancer drug targets by CRISPR-Cas9 screening of protein domains. Nat Biotechnol. 2015; 33(6):661–7. https://doi.org/10.1038/nbt.3235.
Article CAS PubMed PubMed Central Google Scholar
Kim HS, Lee K, Bae S, Park J, Lee C-K, Kim M, Kim E, Kim M, Kim S, Kim C, Kim J-S. CRISPR/Cas9-mediated gene knockout screens and target identification via whole-genome sequencing uncover host genes required for picornavirus infection. J Biol Chem. 2017; 292(25):10664–71. https://doi.org/10.1074/jbc.M117.782425.
Article CAS PubMed PubMed Central Google Scholar
Han J, Perez JT, Chen C, Li Y, Benitez A, Kandasamy M, Lee Y, Andrade J, TenOever B, Manicassamy B. Genome-wide CRISPR/Cas9 screen identifies host factors essential for influenza virus replication. Cell Rep. 2018; 23(2):596–607. https://doi.org/10.1016/j.celrep.2018.03.045.
Article CAS PubMed PubMed Central Google Scholar
Koike-Yusa H, Li Y, Tan E-P, Velasco-Herrera MDC, Yusa K. Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library. Nat Biotechnol. 2014; 32(3):267–73. https://doi.org/10.1038/nbt.2800.
Article CAS PubMed Google Scholar
Charton K, Suel L, Henriques SF, Moussu J-P, Bovolenta M, Taillepierre M, Becker C, Lipson K, Richard I. Exploiting the CRISPR/Cas9 system to study alternative splicing in vivo: application to titin. Hum Mol Genet. 2016; 25(20):280. https://doi.org/10.1093/hmg/ddw280.
Article CAS Google Scholar
Gapinske M, Luu A, Winter J, Woods WS, Kostan KA, Shiva N, Song JS, Perez-Pinera P. CRISPR-SKIP: programmable gene splicing with single base editors. Genome Biol. 2018; 19(1):107. https://doi.org/10.1186/s13059-018-1482-5.
Article PubMed PubMed Central CAS Google Scholar
Horlbeck MA, Gilbert LA, Villalta JE, Adamson B, Pak RA, Chen Y, Fields AP, Park CY, Corn JE, Kampmann M, Weissman JS. Compact and highly active next-generation libraries for CRISPR-mediated gene repression and activation. eLife. 2016; 5:19760. https://doi.org/10.7554/eLife.19760.
Article CAS Google Scholar
Hilton IB, D’Ippolito AM, Vockley CM, Thakore PI, Crawford GE, Reddy TE, Gersbach CA. Epigenome editing by a CRISPR-Cas9-based acetyltransferase activates genes from promoters and enhancers. Nat Biotechnol. 2015; 33(5):510–7. https://doi.org/10.1038/nbt.3199.
Article CAS PubMed PubMed Central Google Scholar
Liu XS, Wu H, Ji X, Stelzer Y, Wu X, Czauderna S, Shu J, Dadon D, Young RA, Jaenisch R. Editing DNA methylation in the mammalian genome. Cell. 2016; 167(1):233–24717. https://doi.org/10.1016/J.CELL.2016.08.056.
Article CAS PubMed PubMed Central Google Scholar
Morita S, Noguchi H, Horii T, Nakabayashi K, Kimura M, Okamura K, Sakai A, Nakashima H, Hata K, Nakashima K, Hatada I. Targeted DNA demethylation in vivo using dCas9–peptide repeat and scFv–TET1 catalytic domain fusions. Nat Biotechnol. 2016; 34(10):1060–5. https://doi.org/10.1038/nbt.3658.
Article CAS PubMed Google Scholar
Gilbert LA, Horlbeck MA, Adamson B, Villalta JE, Chen Y, Whitehead EH, Guimaraes C, Panning B, Ploegh HL, Bassik MC, Qi LS, Kampmann M, Weissman JS. Genome-scale CRISPR-mediated control of gene repression and activation. Cell. 2014; 159(3):647–61. https://doi.org/10.1016/J.CELL.2014.09.029.
Article CAS PubMed PubMed Central Google Scholar
Konermann S, Brigham MD, Trevino AE, Joung J, Abudayyeh OO, Barcena C, Hsu PD, Habib N, Gootenberg JS, Nishimasu H, Nureki O, Zhang F. Genome-scale transcriptional activation by an engineered CRISPR-Cas9 complex. Nature. 2015; 517(7536):583–8. https://doi.org/10.1038/nature14136.
Article CAS PubMed Google Scholar
Chavez A, Scheiman J, Vora S, Pruitt BW, Tuttle M, P R Iyer E, Lin S, Kiani S, Guzman CD, Wiegand DJ, Ter-Ovanesyan D, Braff JL, Davidsohn N, Housden BE, Perrimon N, Weiss R, Aach J, Collins JJ, Church GM. Highly efficient Cas9-mediated transcriptional programming. Nat Methods. 2015; 12(4):326–8. https://doi.org/10.1038/nmeth.3312.
Article CAS PubMed PubMed Central Google Scholar
Nishida K, Arazoe T, Yachie N, Banno S, Kakimoto M, Tabata M, Mochizuki M, Miyabe A, Araki M, Hara KY, Shimatani Z, Kondo A. Targeted nucleotide editing using hybrid prokaryotic and vertebrate adaptive immune systems. Science. 2016; 353(6305):8729. https://doi.org/10.1126/science.aaf8729.
Article CAS Google Scholar
Komor AC, Kim YB, Packer MS, Zuris JA, Liu DR. Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage. Nature. 2016; 533(7603):420–4. https://doi.org/10.1038/nature17946.
Article CAS PubMed PubMed Central Google Scholar
Joung J, Konermann S, Gootenberg JS, Abudayyeh OO, Platt RJ, Brigham MD, Sanjana NE, Zhang F. Genome-scale CRISPR-Cas9 knockout and transcriptional activation screening. Nat Protoc. 2017; 12(4):828–63. https://doi.org/10.1038/nprot.2017.016.
Article CAS PubMed PubMed Central Google Scholar
Doench JG. Am I ready for CRISPR? A user’s guide to genetic screens. Nat Rev Genet. 2018; 19(2):67–80. https://doi.org/10.1038/nrg.2017.97.
Article CAS PubMed Google Scholar
Hart T, Brown KR, Sircoulomb F, Rottapel R, Moffat J. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Mol Syst Biol. 2014; 10(7):733. https://doi.org/10.15252/msb.20145216.
Article PubMed PubMed Central Google Scholar
Sanjana NE, Shalem O, Zhang F. Improved vectors and genome-wide libraries for CRISPR screening. Nat Methods. 2014; 11(8):783–4. https://doi.org/10.1038/nmeth.3047.
Article CAS PubMed PubMed Central Google Scholar
Tzelepis K, Koike-Yusa H, De Braekeleer E, Li Y, Metzakopian E, Dovey OM, Mupo A, Grinkevich V, Li M, Mazan M, Gozdecka M, Ohnishi S, Cooper J, Patel M, McKerrell T, Chen B, Domingues AF, Gallipoli P, Teichmann S, Ponstingl H, McDermott U, Saez-Rodriguez J, Huntly BJP, Iorio F, Pina C, Vassiliou GS, Yusa K. A CRISPR dropout screen identifies genetic vulnerabilities and therapeutic targets in acute myeloid leukemia. Cell Rep. 2016; 17(4):1193–205. https://doi.org/10.1016/j.celrep.2016.09.079.
Article CAS PubMed PubMed Central Google Scholar
Aregger M, Chandrashekhar M, Tong AHY, Chan K, Moffat J. Pooled lentiviral CRISPR-Cas9 screens for functional genomics in mammalian cells. In: Methods in Molecular Biology, vol 1869. New York: Humana Press: 2019. p. 169–88. https://doi.org/10.1007/978-1-4939-8805-1_15.
Google Scholar
Nagy T, Kampmann M. CRISPulator: A discrete simulation tool for pooled genetic screens. BMC Bioinformatics. 2017; 18(1):1–12. https://doi.org/10.1186/s12859-017-1759-9.
Article CAS Google Scholar
Li W, Xu H, Xiao T, Cong L, Love MI, Zhang F, Irizarry RA, Liu JS, Brown M, Liu XS. MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biol. 2014; 15(12):554. https://doi.org/10.1186/s13059-014-0554-4.
Article PubMed PubMed Central CAS Google Scholar
Yu J, Silva J, Califano A. ScreenBEAM: a novel meta-analysis algorithm for functional genomics screens via Bayesian hierarchical modeling. Bioinformatics. 2015; 32(2):556. https://doi.org/10.1093/bioinformatics/btv556.
Article CAS Google Scholar
Diaz AA, Qin H, Ramalho-Santos M, Song JS. HiTSelect: a comprehensive tool for high-complexity-pooled screen analysis. Nucleic Acids Res. 2015; 43(3):16. https://doi.org/10.1093/nar/gku1197.
Article CAS Google Scholar
Hart T, Moffat J. BAGEL: A computational framework for identifying essential genes from pooled library screens. BMC Bioinformatics. 2016; 17(1):1–7. https://doi.org/10.1186/s12859-016-1015-8.
Article CAS Google Scholar
Jia G, Wang X, Xiao G. A permutation-based non-parametric analysis of CRISPR screen data. BMC Genomics. 2017; 18(1):545. https://doi.org/10.1186/s12864-017-3938-5.
Article PubMed PubMed Central CAS Google Scholar
Daley TP, Lin Z, Lin X, Liu Y, Wong WH, Qi LS. CRISPhieRmix: a hierarchical mixture model for CRISPR pooled screens. Genome Biol. 2018; 19(1):159. https://doi.org/10.1186/s13059-018-1538-6.
Article PubMed PubMed Central CAS Google Scholar
Allen F, Khodak A, Behan F, Iorio F, Yusa K, Garnett M, Parts L. JACKS: joint analysis of CRISPR/Cas9 knockout screens. Genome Res. 2019; 29:464–71. https://doi.org/10.1101/gr.238923.118.
Article CAS PubMed PubMed Central Google Scholar
Jeong H-H, Kim SY, Rousseaux MWC, Zoghbi HY, Liu Z. Beta-binomial modeling of CRISPR pooled screen data identifies target genes with greater sensitivity and fewer false negatives. Genome Res. 2019; 29(6):999–1008. https://doi.org/10.1101/gr.245571.118.
Article CAS PubMed PubMed Central Google Scholar
Zhou Y, Zhu S, Cai C, Yuan P, Li C, Huang Y, Wei W. High-throughput screening of a CRISPR/Cas9 library for functional genomics in human cells. Nature. 2014; 509(7501):487–91. https://doi.org/10.1038/nature13166.
Article CAS PubMed Google Scholar
Parnas O, Jovanovic M, Eisenhaure TM, Herbst RH, Dixit A, Ye CJ, Przybylski D, Platt RJ, Tirosh I, Sanjana NE, Shalem O, Satija R, Raychowdhury R, Mertins P, Carr SA, Zhang F, Hacohen N, Regev A. A genome-wide CRISPR screen in primary immune cells to dissect regulatory networks. Cell. 2015; 162(3):675–86. https://doi.org/10.1016/J.CELL.2015.06.059.
Article CAS PubMed PubMed Central Google Scholar
DepMap; Broad. DepMap Achilles 19Q1 Public. figshare. 2019; Fileset. https://doi.org/10.6084/m9.figshare.7655150.
Behan FM, Iorio F, Picco G, Gonçalves E, Beaver CM, Migliardi G, Santos R, Rao Y, Sassi F, Pinnelli M, Ansari R, Harper S, Jackson DA, McRae R, Pooley R, Wilkinson P, van der Meer D, Dow D, Buser-Doepner C, Bertotti A, Trusolino L, Stronach EA, Saez-Rodriguez J, Yusa K, Garnett MJ. Prioritization of cancer therapeutic targets using CRISPR–Cas9 screens. Nature. 2019; 568(7753):511–6. https://doi.org/10.1038/s41586-019-1103-9.
Article CAS PubMed Google Scholar
Hart T, Tong AHY, Chan K, Van Leeuwen J, Seetharaman A, Aregger M, Chandrashekhar M, Hustedt N, Seth S, Noonan A, Habsid A, Sizova O, Nedyalkova L, Climie R, Tworzyanski L, Lawson K, Sartori MA, Alibeh S, Tieu D, Masud S, Mero P, Weiss A, Brown KR, Usaj M, Billmann M, Rahman M, Costanzo M, Myers CL, Andrews BJ, Boone C, Durocher D, Moffat J. Evaluation and design of genome-wide CRISPR/SpCas9 knockout screens. G3 Genes|Genomes|Genetics. 2017; 7(8):2719–27. https://doi.org/10.1534/g3.117.041277.
Article CAS PubMed PubMed Central Google Scholar
Meyers RM, Bryan JG, McFarland JM, Weir BA, Sizemore AE, Xu H, Dharia NV, Montgomery PG, Cowley GS, Pantel S, Goodale A, Lee Y, Ali LD, Jiang G, Lubonja R, Harrington WF, Strickland M, Wu T, Hawes DC, Zhivich VA, Wyatt MR, Kalani Z, Chang JJ, Okamoto M, Stegmaier K, Golub TR, Boehm JS, Vazquez F, Root DE, Hahn WC, Tsherniak A. Computational correction of copy number effect improves specificity of CRISPR–Cas9 essentiality screens in cancer cells. Nat Genet. 2017; 49(12):1779–84. https://doi.org/10.1038/ng.3984.
Article CAS PubMed PubMed Central Google Scholar
Ong SH, Li Y, Koike-Yusa H, Yusa K. Optimised metrics for CRISPR-KO screens with second-generation gRNA libraries. Sci Rep. 2017; 7(1):1–10. https://doi.org/10.1038/s41598-017-07827-z.
Article CAS Google Scholar
Sidik SM, Huet D, Ganesan SM, Huynh MH, Wang T, Nasamu AS, Thiru P, Saeij JPJ, Carruthers VB, Niles JC, Lourido S. A genome-wide CRISPR screen in toxoplasma identifies essential apicomplexan genes. Cell. 2016; 166(6):1423–143512. https://doi.org/10.1016/j.cell.2016.08.019.
Article CAS PubMed PubMed Central Google Scholar
O’Shea J. P, Chou MF, Quader SA, Ryan JK, Church GM, Schwartz D. PLogo: a probabilistic approach to visualizing sequence motifs. Nat Methods. 2013; 10(12):1211–2. https://doi.org/10.1038/nmeth.2646.
Article PubMed CAS Google Scholar
Efron B. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J Am Stat Assoc. 2004; 99:96–104.
Article Google Scholar
Strimmer K. A unified approach to false discovery rate estimation. BMC Bioinformatics. 2008; 9(1):303. https://doi.org/10.1186/1471-2105-9-303.
Article PubMed PubMed Central Google Scholar
Rousseeuw PJ, Leroy AM. Robust regression and outlier detection.Wiley Ser Probab Stat; 1987, p. 329. https://doi.org/10.1002/0471725382.
Kolde R, Laur S, Adler P, Vilo J. Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics. 2012; 28(4):573–80. https://doi.org/10.1093/bioinformatics/btr709.
Article CAS PubMed PubMed Central Google Scholar
Shifrut E, Carnevale J, Tobin V, Roth TL, Woo JM, Bui CT, Li PJ, Diolaiti ME, Ashworth A, Marson A. Genome-wide CRISPR screens in primary human T cells reveal key regulators of immune function. Cell. 2018; 175(7):1958–197115. https://doi.org/10.1016/j.cell.2018.10.024.
Article CAS PubMed PubMed Central Google Scholar
Wegner M, Diehl V, Bittl V, de Bruyn R, Wiechmann S, Matthess Y, Hebel M, Hayes MG, Schaubeck S, Benner C, Heinz S, Bremm A, Dikic I, Ernst A, Kaulich M. Circular synthesized CRISPR/Cas gRNAs for functional interrogations in the coding and noncoding genome. eLife. 2019; 8. https://doi.org/10.7554/eLife.42549.
Chen C-H, Xiao T, Xu H, Jiang P, Meyer CA, Li W, Brown M, Liu XS. Improved design and analysis of CRISPR knockout screens. Bioinformatics (June). 2018:1–7. https://doi.org/10.1093/bioinformatics/bty450.
Imkeller K. Simulation of pooled screens. Github. 2019. https://github.com/imkeller/simulate_pooled_screen. Accessed 31 Jan 2020.
Imkeller K, Huber W. gscreend - analysis of pooled CRISPR screens. Bioconductor. 2019. http://bioconductor.org/s/gscreend. Accessed 31 Jan 2020.

Download references

Acknowledgements

We would like to thank Sabine Ottinger for the help with library preparation and acknowledge the support provided by the Genomics Core Facility of the European Molecular Biology Laboratory (Heidelberg, Germany).

Review history

The review history is available as Additional file 2.

Funding

This work was supported by the NCT 3.0 Integrative Projects in Basic Cancer Research Program and the DFG Sonderforschungsbereich 1324.

Author information

Authors and Affiliations

German Cancer Research Center (DKFZ) and Heidelberg University, Heidelberg, Germany
Katharina Imkeller, Giulia Ambrosi & Michael Boutros
European Molecular Biology Laboratory, Heidelberg, Germany
Katharina Imkeller & Wolfgang Huber

Authors

Katharina Imkeller
View author publications
You can also search for this author in PubMed Google Scholar
Giulia Ambrosi
View author publications
You can also search for this author in PubMed Google Scholar
Michael Boutros
View author publications
You can also search for this author in PubMed Google Scholar
Wolfgang Huber
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Authors’ contributions

KI implemented the model and performed the analyses. GA performed the screening experiment. MB and WH designed and supervised the research. KI and WH wrote the paper. All authors read and approved the final manuscript.

Authors’ information

Twitter handles: @K_Imkeller (Katharina Imkeller), @Michael_Boutros (Michael Boutros), @wolfgangkhuber (Wolfgang Huber).

Corresponding author

Correspondence to Wolfgang Huber.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1

Table S1 and Figures S1–S4.

Additional file 2

Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Imkeller, K., Ambrosi, G., Boutros, M. et al. gscreend: modelling asymmetric count ratios in CRISPR screens to decrease experiment size and improve phenotype detection. Genome Biol 21, 53 (2020). https://doi.org/10.1186/s13059-020-1939-1

Download citation

Received: 18 August 2019
Accepted: 19 January 2020
Published: 02 March 2020
DOI: https://doi.org/10.1186/s13059-020-1939-1

gscreend: modelling asymmetric count ratios in CRISPR screens to decrease experiment size and improve phenotype detection

Abstract

Background

Results

Before/after ratios from pooled genetic screens have an asymmetric null distribution

Computational simulation of pooled CRISPR screens

Plasmid library is a better reference sample than T0 cell pool

The asymmetry of before/after ratios is caused by cell splitting during the proliferation phase

Wide initial gRNA abundance distributions increase asymmetry of before/after ratios

New statistical method for improved phenotype detection

Implications for the design of screening experiments

Discussion

Conclusion

Methods

Experimental datasets

Simulation of pooled CRISPR screens

Normalization and LFC calculation

Library width calculation

gRNA sequence composition

gscreend method

Comparison of analysis tools

References

Acknowledgements

Review history

Funding

Author information

Authors and Affiliations

Contributions

Authors’ contributions

Authors’ information

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Competing interests

Additional information

Publisher’s Note

Supplementary information

Additional file 1

Additional file 2

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Genome Biology

Contact us