Network analysis of gene essentiality in functional genomics experiments

Many genomic techniques have been developed to study gene essentiality genome-wide, such as CRISPR and shRNA screens. Our analyses of public CRISPR screens suggest protein interaction networks, when integrated with gene expression or histone marks, are highly predictive of gene essentiality. Meanwhile, the quality of CRISPR and shRNA screen results can be significantly enhanced through network neighbor information. We also found network neighbor information to be very informative on prioritizing ChIP-seq target genes and survival indicator genes from tumor profiling. Thus, our study provides a general method for gene essentiality analysis in functional genomic experiments (http://nest.dfci.harvard.edu). Electronic supplementary material The online version of this article (doi:10.1186/s13059-015-0808-9) contains supplementary material, which is available to authorized users.

For gene essentiality prediction, the outcome from CRISPR experiments is used as gold standard. Significant gene hits are called by MAGeCK 0.5 with default parameters and FDR threshold 0.05. Only negatively selected gene hits were considered as gold standard, because most significant gene hits are negatively selected in collected CRISPR experiments (Supplementary Figure  S1). For gold standard control set, we extracted the same number of genes ranked on bottom by MAGeCK in negative selection.
For comparison with DREAM, we used the top 3 ranked methods in Sub-challenge 1, where any features can be used for gene essentiality prediction. The best-ranked method used kernelbased representation learning method to reflect dependency among essentiality scores of different genes. This team optimized the kernel and prediction parameter to best reflect the gene-wise similarity in DREAM challenge data. The second ranked method used multiple kernel learning, random forest and kernel ridge representation to boost the ensemble classification accuracy beyond each individual component. This method also incorporated prior knowledge regarding gene interaction networks to improve prediction performance. The third ranked method used Gaussian process regression with carefully tuned parameters to fit the DREAM challenge gene essentiality. For all cell lines included in DREAM challenge, only A375 cell line has CRISPR data collected. So, we only compared the performance on A375 cell.
For comparison with PinnacleZ [1], we only used STRING interactions with confidence score larger than 0.6, because the software crashed after loading the entire STRING network. We took all default parameters in running and grouped all subnetwork markers together as predicted essential genes.
For method of network smoothing, the authors code release (called NBS) depends on an older version of MATLAB 2012 and we cannot run the code with our MATLAB 2015. Thus, we reimplemented the procedure in the same way as described [2] (source code available at: http://nest.dfci.harvard.edu/download/network smoothing.R).
For comparison with NePhe on CRISPR screen quality enhancement [3], we only used STRING interactions with confidence larger than 0.6, because the software takes more than 60G memory and crash if the whole STRING network is used. Since the input of NePhe is a gene list, we used the negative selection hits from MAGeCK. All default parameters were used in running.
For each gene, we first estimate the study bias of KEGG by counting the number of times that this gene is included in any KEGG pathway annotations. For a give profile of gene scores, we run a logistic regression for each KEGG pathway annotation. The response variable is 1 or 0, indicating whether a gene is annotated in that KEGG pathway. The covariates matrix included the gene scores and gene study bias as background. The Ward test is used to test the significance of given gene scores. The Ward test P-values are converted to FDRs by Benjamini-Hochberg procedure and 0.05 is used as threshold to find significantly enriched KEGG pathways. The NEST scores are converted to relative rank percentiles from 0 to 1. The rank percentile values are shown for essential genes and non-essential genes. (C) The HL60 expression profile was collected from the ENCODE, the A375 expression profile was from the CCLE project and K562 H3K27ac profile was collected from Roadmap project. In each cohort, we calculated the NEST scores for all cell lines profiled and computed their prediction power of gene essentiality in CRISPR screen by Wilcoxon rank-sum test. The rank-sum Z-scores were ranked and the relative rank of each cell line is shown.  (Table 1). Based on the above logistic regression, the essentiality probability is predicted for each gene as fitted value, which combined effects from all covariates. Using CRISPR outcome as gold standard, the prediction performance is tested by ROC curves and shown for  Figure 4C in main text. The GBM patient expression and clinical data are from Gavendeel et al [6]. For each gene, we calculated a death risk Z-score by Cox-PH model from either gene expression or NEST scores. We compared the distribution of Z-scores for oncogenes (Onco) and tumor suppressor genes (TSG) based on the annotation from Vogelstein et al [7]. The bottom and top of the boxes are the 25th and 75th percentiles (interquartile range). Whiskers on the top and bottom represent the maximum and minimum data points within the range represented by 1.5 times the inter-quartile range. The distribution of Z-scores is compared between Oncogene and TSG category by Wilcoxon rank-sum test and three stars represents P -value < 0.001.