We developed a novel web-based tool, CRISPOR (http://crispor.org), to assist with guide selection in 120 genomes, including plants and many emerging model organisms, and pre-calculated results for all human coding exons, available as a UCSC Genome Browser track. To evaluate off-target prediction accuracy, we took advantage of eight recently published studies that detected and quantified off-target cleavage sites [2, 3, 7, 26–29] (summarized in Additional file 1: Table S1) and from these collected 650 off-target sequences that were experimentally identified for 31 different guides (Additional file 2). The assays differed mostly in sensitivity (Additional file 3: Figure S1; Additional file 4: Table S2; Additional file 5: Table S3). Two studies [3, 28] did not validate the identified off-targets with PCR amplicon sequencing in the same cell type and may therefore include false positives.
We noticed two outliers, VEGFA_site2 and HEK293_sgRNA4, from the study by Tsai et al. [3], responsible for 151 and 133 off-targets, respectively. Together they account for 44 % (284/650) of all off-target sequences in our dataset and 71 % (84/119) of the off-targets with five or more mismatches. They also have the highest GC content in the Tsai et al. data, 80 % and 75 %, respectively (Additional file 6: Figure S2). A relationship between GC content and specificity is known from siRNA design [30] and would explain the previously observed difficulty of targeting GC-rich genes [2, 8, 31] and quadruplex-forming sequences [32]. Of the four million unique -NGG guide sequences in human coding exons, only 13 % have a GC content >75 %, so such guides can usually be avoided. We therefore removed these two guides from further analysis.
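The GC filter itself is a one-line computation; the following is a minimal sketch (our illustration, not CRISPOR's code; the example sequences are hypothetical):

```python
def gc_content(guide: str) -> float:
    """Fraction of G/C bases in a guide sequence (assumes A/C/G/T only)."""
    g = guide.upper()
    return (g.count("G") + g.count("C")) / len(g)

# Hypothetical example guides: the first is GC-rich and would be flagged.
guides = ["GCCGCCGGGCCCGGCGCCCC", "GATTACAGATTACAGATTAC"]
avoid = [g for g in guides if gc_content(g) > 0.75]  # the >75 % GC threshold
```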
One issue with the remaining data was the sensitivity of the assays. The two assays using targeted sequencing of predicted sites reported off-targets with modification frequencies below 0.001 % [7, 29], while all whole-genome assays estimated their sensitivity at around 0.1–0.2 % [2, 3, 33] (Additional file 1: Table S1; [26, 28] did not report sensitivity). The rare off-targets found in targeted sequencing studies therefore cannot be compared directly with those from whole-genome assays. We consequently chose to analyze only off-targets that can be detected with whole-genome assays, i.e., those with a modification frequency >0.1 %.
Of the remaining 225 off-targets, most (88.4 %) had up to four mismatches relative to the guide (Fig. 1). All others had five or six mismatches, with low modification frequencies (<3 % and <1.1 %, respectively). Most of these were found by Frock et al. [28], a study that appears to favor more degenerate off-targets and did not validate them by PCR. Allowing indels (“bulges”) in the alignment would have made a difference for only two of the 225 off-targets, with cleavage frequencies of 0.1 % and 0.2 %, as previously observed [2, 28] (Additional file 1). In addition, the ranking of the guides by MIT specificity score (see below) was largely unchanged when the number of mismatches was increased beyond four (Additional file 7: Figure S3). Therefore, CRISPOR does not allow indels, ranks guides based on potential off-targets with up to four mismatches, and allows five mismatches for the detailed analysis of a single guide.
It has been reported that the off-target predictors on the CRISPR Design website (http://crispr.mit.edu) and E-CRISP [10] failed to detect many off-target sites [3, 34], including off-targets with a single mismatch to the guide. In contrast, we confirmed that the BWA [35] sequence search algorithm used in CRISPOR, as well as the novel algorithm in the recently published Cas-OFFinder [11], found all validated off-targets (Additional file 8: Table S4), demonstrating that the problem is a software issue limited to certain tools, not an inherent limitation. For example, for the EMX1 guide, CRISPOR and Cas-OFFinder predict 1288 off-targets with up to four mismatches, while the MIT site predicts only 334 and as a result misses five of 15 validated off-targets, one of which has only two mismatches and a >20 % modification frequency confirmed by two different assays (Additional file 4: Table S2 and Additional file 5: Table S3).
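To make the search step concrete, here is a brute-force reference sketch (ours, not CRISPOR's BWA-based pipeline): it enumerates all NGG-adjacent 20-mers on one strand and keeps those within four mismatches of the guide. Tools such as Cas-OFFinder and BWA perform the equivalent search efficiently on an indexed genome and on both strands:

```python
def find_offtargets(guide: str, sequence: str, max_mm: int = 4):
    """Brute-force scan of one strand (upper-case sequence assumed) for
    NGG-adjacent 20-mers within max_mm mismatches of the guide.
    A full search would also scan the reverse complement."""
    hits = []
    glen = len(guide)  # typically 20
    for i in range(len(sequence) - glen - 2):
        pam = sequence[i + glen : i + glen + 3]
        if pam[1:] != "GG":  # require an NGG PAM
            continue
        site = sequence[i : i + glen]
        # Ungapped comparison: no indels ("bulges"), matching CRISPOR's rule
        mm = sum(1 for a, b in zip(guide, site) if a != b)
        if mm <= max_mm:
            hits.append((i, site, mm))
    return hits
```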
In order to rank potential off-targets, many prediction tools calculate a score based on the positions of the mismatches relative to the guide sequence. Initially, systematic testing of the effect of mismatches led to a weight for each possible nucleotide change at each position and a formula combining these weights into a score [7]. The score of the MIT website (http://crispr.mit.edu/about) is based on these data but reduced to one weight per position. The off-target predictors CCTop [36] and CROP-IT [37] independently devised heuristics based on the distances of the mismatches to the protospacer adjacent motif (PAM). The more recent CFD score [34] is based on the largest dataset to date: cleavage data obtained by infecting cells with a lentiviral library of thousands of guides targeting the CD33 gene with all PAMs, including guides for every possible nucleotide mismatch and 1-bp indel at every position. In addition, all scores except CCTop include a penalty for mismatches located close to each other.
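The CFD score requires the published lookup table of per-mismatch activities, so as an illustration we sketch only the simpler MIT score, following the formula commonly reimplemented from the MIT website (e.g., in CRISPOR's source); the per-position weights derive from the Hsu et al. data [7]:

```python
# Per-position mismatch weights of the MIT website (position 1 = PAM-distal
# end of the 20-nt guide), derived from Hsu et al. [7].
HSU_WEIGHTS = [0.000, 0.000, 0.014, 0.000, 0.000, 0.395, 0.317, 0.000,
               0.389, 0.079, 0.445, 0.508, 0.613, 0.851, 0.732, 0.828,
               0.615, 0.804, 0.685, 0.583]

def mit_hit_score(guide: str, offtarget: str) -> float:
    """MIT off-target ('hit') score for a 20-nt guide/off-target pair,
    0-100; higher = off-target more likely to be cleaved."""
    mm = [i for i in range(20) if guide[i] != offtarget[i]]
    if not mm:
        return 100.0
    score = 1.0
    for i in mm:                          # per-position mismatch penalties
        score *= 1.0 - HSU_WEIGHTS[i]
    if len(mm) > 1:                       # penalty for clustered mismatches
        mean_dist = (mm[-1] - mm[0]) / (len(mm) - 1)
        score *= 1.0 / (((19 - mean_dist) / 19) * 4 + 1)
    score /= len(mm) ** 2                 # penalty for the mismatch count
    return score * 100.0
```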
For off-targets with up to four mismatches, receiver operating characteristic (ROC) analysis (Fig. 2) of these four algorithms shows that the CFD score distinguishes best between validated and false-positive off-targets, with an area under the curve (AUC) of 0.91. The MIT score as calculated by the CRISPOR website is slightly less discriminative, with an AUC of 0.87. As expected, when calculated by the MIT site itself, the AUC of the MIT score is considerably lower because that tool misses many off-target alignments in the genome. The ROC plot also shows that requiring a minimum CFD off-target score of 0.023 removes 57 % of the false positives while losing only 2 % of the true positives. At this cutoff, no off-targets with a modification frequency >1 % are missed (data not shown).
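The ROC analysis itself is standard; a minimal sketch with hypothetical toy data (one row per candidate off-target, scikit-learn assumed available):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy data: CFD score and validation label per candidate.
cfd = np.array([0.60, 0.20, 0.05, 0.020, 0.015, 0.30, 0.01, 0.10])
validated = np.array([1, 1, 1, 0, 0, 1, 0, 0])  # 1 = experimentally validated

auc = roc_auc_score(validated, cfd)  # AUC of the CFD score on the toy data

cutoff = 0.023                        # the cutoff discussed in the text
kept = cfd >= cutoff
frac_tp_kept = validated[kept].sum() / validated.sum()
frac_fp_kept = (1 - validated)[kept].sum() / (1 - validated).sum()
```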
We next examined the ranking of guides by specificity. The MIT scores of all potential off-targets of a guide can be summarized into the “guide specificity score” defined by [7], which ranges from 0 to 100 (100 = best). Figure 3a shows that higher specificity scores are generally associated with fewer off-target sites and lower off-target modification frequencies, as expected. However, a few guides had unusually strong off-targets, illustrating that the scoring model could still be improved, possibly by using the CFD off-target score or by taking the chromatin context into account [3, 7]. A single guide specificity score, though, may not always be informative. For example, intergenic off-targets may be a minor issue for functional studies in cultured cells, and when transgenic animals are back-crossed, off-targets on a different chromosome will not co-segregate with the mutation of interest and may often be acceptable. Therefore, while CRISPOR shows the MIT specificity score as an indicator of guide quality, all potential off-targets are annotated and shown for detailed inspection.
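The aggregation into the guide specificity score is a short formula; a sketch reusing `mit_hit_score` from above, again following the common reimplementation of the MIT website's calculation:

```python
def mit_specificity_score(guide, offtarget_sites):
    """Guide specificity score as defined by [7], 0-100 (100 = best):
    guides with many strong potential off-targets score low.
    offtarget_sites: the genomic 20-mers found by the off-target search."""
    hit_sum = sum(mit_hit_score(guide, ot) for ot in offtarget_sites)
    return 100.0 / (100.0 + hit_sum) * 100.0
```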
We ranked the four million unique guide sequences in human coding regions by MIT specificity score and observed that the guides tested in the eight off-target studies have lower specificity scores than the genome average (Fig. 3b), which makes the high number of off-targets found for them less surprising. As a result, there are currently limited data on the highly specific guides that are more relevant when designing an experiment. Figure 3b shows that the more specific guide RNAs that were tested, as well as about 30 % of the guide sequences in human coding regions, exceed a specificity score of 50. The CRISPOR website therefore highlights guides with a minimum MIT specificity score of 50. Because the MIT website misses some off-targets, a higher cutoff of around 70–80 should be used there (Additional file 9: Figure S4).
In addition to off-target cleavage, we evaluated predictions of on-target efficiency, comprising eight different scoring models and two heuristics. For this purpose, we collected activity data for more than 19,000 guides, including the datasets used to build the scoring models [6, 8, 23–25, 34, 38, 39] as well as data from independent studies in cultured cells, in ascidian oocytes, and from zebrafish screens [31, 40–43]. Additional file 10: Table S5 summarizes the studies and the different assay types.
For the datasets with available replicates, the Spearman correlation between replicates of the same assay in the same cell type is in the range 0.71–0.77 (Additional file 11: Table S6; Hct116, mouse embryonic stem cells). This gives an indication of the quality of the data and suggests that a correlation of about 0.7 constitutes an upper bound for any prediction. For some datasets, the assay was repeated in a different cell type. In these cases, the correlations were almost identical for some cell type combinations (e.g., 0.75 for Hl60/Kbm7 [38]; Additional file 11: Table S6) and lower for others (0.53–0.63 for Rpe1 cells [41]). If these lower correlations are due to differences in chromatin state, its influence apparently varies and is relatively modest, accounting for at most 10–20 % of the rank correlation.
The heat map in Fig. 4 shows that on independent datasets, i.e., those not used to train any algorithm, such as Hart et al. [41], current predictions achieve Spearman correlations of 0.341–0.436 (see Additional file 12 for plots of the individual data points). When algorithms are applied to their own training dataset the correlations are higher, but this is an artifact of overfitting; the corresponding correlation values are shown in grey in Fig. 4.
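For readers wishing to reproduce this comparison, the core computation is a rank correlation; a minimal sketch with hypothetical values:

```python
from scipy.stats import spearmanr

# Hypothetical toy data: measured activity and prediction score per guide.
measured = [0.10, 0.45, 0.32, 0.80, 0.05, 0.55]
predicted = [0.20, 0.40, 0.35, 0.70, 0.15, 0.30]

rho, pval = spearmanr(measured, predicted)  # rank correlation as in Fig. 4
```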
We observed that the quality of the assay is an important parameter. For example, for a dataset obtained with Surveyor nuclease, we found no significant correlation between guide activity and any of the scores (see “Liu” in Additional file 13: Figure S5). Another example is the Housden et al. score [44], which did not predict activity well in any dataset, including its own. This may again be due to the accuracy of the activity measurements, or to the statistical model used by Housden et al., a weight matrix. The dataset “Wang 2015” was designed with a scoring algorithm and shows very little signal, and the dataset “Eschstruth” is very small and includes several guides that were selected for very high Doench scores. In the Chari et al. study [24], the dataset from K562 cells did not correlate with two replicates of the same assay in HEK293T cells, so we used only the HEK293T dataset, as did Chari et al. themselves. These five datasets are shown not in Fig. 4 but in Additional file 13: Figure S5; the raw data are included in Additional file 14.
Figure 4 shows that scores trained on mammalian cell lines work surprisingly well in other organisms, even non-vertebrate ones such as Ciona intestinalis, C. elegans, and, to some extent, Drosophila, though for the latter only limited data are available. In contrast, the Moreno-Mateos score, an algorithm trained on zebrafish assays, does not translate well to the other datasets, and vice versa. This is consistent with previous reports that the Doench score is not accurate in zebrafish [31, 40]. In this organism, guides are made by in vitro transcription from the T7 promoter and injected into eggs, rather than expressed in cells from a U6 promoter on exogenous DNA. Without constant expression of the guide from a plasmid, the stability of the guide RNA plays a bigger role [39]. A possible explanation for the difference in algorithm performance is therefore that RNA stability or the promoter leads to differences in guide activity. Excluding the overfitting artifacts (grey) in Fig. 4 and taking this separation into account, one can hypothesize that the Fusi/Doench score performs best in U6 promoter-based assays and the Moreno-Mateos score best in assays based on delivery of guide RNAs produced by T7 in vitro transcription.
To confirm this observation and to rule out an influence of the organism or the assay itself, we analyzed data from our own labs in the same way (Fig. 5). We tested two series of guides in cell culture with two different assays (“K562-lacZ rank” and “U2OS/MEF/C6-T7 endo”, 24 and 49 guides, respectively), injected one series of guides into zebrafish one-cell embryos (“Zebrafish-seq”, 163 guides) and another into mouse embryos (“Mouse in vivo Seq”, 30 guides) (Additional file 15: Table S7). The data confirmed that zebrafish and cell culture results differ and, most importantly, showed that the mouse in vivo data, obtained with in vitro transcribed guide RNA, correlate best with the zebrafish-based predictor (Spearman P value 0.019; see Additional file 12 for P values and Additional file 14 for all frequencies and prediction scores, where these datasets are called “Schoenig”, “Concordet”, “Shkumatava”, and “Teboul”, respectively).
Correlating the prediction score with observed absolute activity may disadvantage some algorithms. We therefore performed precision-recall curve analysis (Additional file 16: Figure S6) and also calculated precision/recall based on the overlap of the top quartile of the predictions with the top quartile of measured activity (Additional file 17: Figure S7). For the latter, we added two heuristics described by [6, 25]: the GC content of the last four base pairs and whether the guide ends with -GG. The results overall match the performance as measured by correlation values: the Fusi/Doench and Moreno-Mateos scores perform best on the large datasets, with the better of the two depending on the expression system.
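A sketch of the top-quartile overlap computation (our illustration; with equally sized sets, precision and recall coincide, so one value is returned):

```python
import numpy as np

def top_quartile_precision_recall(measured, predicted):
    """Overlap of the top predicted quartile with the top measured quartile.
    Ties are broken arbitrarily by argsort."""
    measured, predicted = np.asarray(measured), np.asarray(predicted)
    n_top = max(1, len(measured) // 4)
    top_meas = set(np.argsort(measured)[-n_top:])   # best-measured guides
    top_pred = set(np.argsort(predicted)[-n_top:])  # best-predicted guides
    return len(top_meas & top_pred) / n_top

# Usage: top_quartile_precision_recall(activities, scores)
```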
Two prediction schemes can reach relatively high precision: the Wong score [45] for cell cultures (U6 promoter) and the -GG rule for T7 in vitro transcription. However, their recall is relatively low; in the Doench 2014 dataset, for example, only 12.8 % of guides have a non-zero Wong score and only 13.2 % end with -GG.
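Both sequence heuristics are trivial to compute per guide (a sketch; the Wong score itself requires the published model and is not reproduced here):

```python
def ends_with_gg(guide: str) -> bool:
    """-GG rule: efficient in vitro transcribed guides often end in GG."""
    return guide.upper().endswith("GG")

def gc_last4(guide: str) -> int:
    """Number of G/C bases in the four PAM-proximal guide positions."""
    tail = guide.upper()[-4:]
    return tail.count("G") + tail.count("C")
```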
CRISPOR calculates all currently available scores and lets the user select the most suitable one for the particular assay and model organism. Based on Fig. 4, we recommend the Fusi/Doench score for guides expressed from a U6 promoter and the Moreno-Mateos score for experiments where guides are produced by T7 in vitro transcription. When there is a large set of possible guides to pick from, the Wong score and the -GG rule serve as additional ranking criteria, as good predictors of efficient U6-expressed and in vitro-transcribed guides, respectively.
Are correlations of around 0.4 high enough to reduce the number of guides to test in practice? To demonstrate that the efficiency scores are useful not only when designing thousands of guides for genome-wide screens [38] but also in a more common genome editing project targeting just a few loci, we evaluated the prediction performance on the data from our labs shown in Fig. 5. For two datasets, we had screened multiple guides per locus to select the most efficient one, and we evaluated post hoc how much time could have been saved by using the appropriate prediction algorithm.
In the K562 cell culture dataset, three guides each from eight loci in human, mouse, and rat were tested with an in vitro assay [46] (dataset “Schönig” in Additional file 14). For six of the eight loci, the guide with the highest Fusi/Doench score was indeed the one with the strongest cleavage (P = 0.032). In another set of 104 guides from 11 zebrafish loci (“Shkumatava” in Additional file 14), taking only the two guides with the highest Moreno-Mateos score from each locus would have reduced the number of injections from 104 to 22 and still identified one of the top two guides for nine of the 11 loci (P = 0.024; no other score was significant). In both cases, a second round of screening would have been required, but the number of guides to screen could have been reduced by a third. For the zebrafish screen, which is typically more time-consuming than cell culture assays, we estimate that using the Moreno-Mateos score would have saved 250 h of work. In addition, and especially in mice, the ability to predict guide RNA activity is a significant advance in terms of animal welfare, as fewer animals will be required to create mutants.