The software used in this publication is available on GitHub (https://github.com/igm-team/subrvis), released under the MIT license.
Defining the domains
We define each gene’s protein-coding sequence based on its Consensus Coding Sequence (CCDS release 15, accessed November 2013) entry [5]. In order to avoid multiple gene definitions, genes with multiple CCDS transcripts are assigned the gene’s canonical transcript, as this is the longest and most encompassing transcript. Next, each gene’s CCDS entry is translated into protein sequence and aligned to CDD (version 3.11) [6] using RPS-BLAST (version 2.2.28+). No multi-domains in CDD were used, to avoid grouping multiple single domains into one sub-region in our domain definitions. Only alignments with maximal E-value of 1e-2 were considered. When two domains overlapped in the RPS-BLAST results, the domain with the better alignment score was kept.
Calculating subRVIS Scores
subRVIS scores were calculated for the domains, exons, 100 permuted domains, and 100 permuted exons. The scores were calculated using the NHLBI Exome Sequencing Project (ESP) [8], as described in [1]. Each position in each sub-region is first checked for adequate coverage. Only positions with ≥10× average coverage in the ESP were considered. ESP variant calls were further filtered to only retain variants with a ‘PASS’ filter status. Following this, using the ESP variant calls that qualify based on both these criteria, the tally of all variants per each sub-region is regressed against the count of common (>0.1 % minor allele frequency) non-synonymous variants in the sub-region, as per [1]. The studentized residual of each sub-region is its score. Thus, the subRVIS score quantifies the departure of the observed number of common non-synonymous variants from the expectation given the total number of variants in each genomic region. One of the outcomes of using the residuals from this regression as the score is that they are, by definition, orthogonal to the tally of all variants in their corresponding sub-region and thus orthogonal to the overall distribution of variants in that genomic region. A total of 89,335 domain subRVIS scores (Additional file 10) and 185,355 exon subRVIS scores (Additional file 11) were generated. Y and MT chromosome genes were not assessed.
Score prediction test dataset
To test the utility of subRVIS scores in predicting which regions are more likely to carry pathogenic variants, we required a database of known pathogenic variants. We combined data from ClinVar (accessed June 2015) [10] and HGMD (release 2015.1) [11], filtering for ClinVar entries labeled ‘Pathogenic’ and HGMD entries tagged as ‘DM’ (disease causing mutation).
We then ran Variant Effect Predictor (version 73) [27] and filtered for canonical variants that were labeled as ‘missense_variant’ and were not labeled as any of the following: ‘incomplete_terminal_codon_variant’, ‘splice_region_variant’, ‘stop_gained’, and ‘stop_lost’.
Calculating mutation rates
As we are testing against raw counts of pathogenic mutations, we required a covariate to account for the difference in counts that are due to sequence mutability and unrelated to intolerance. For this, we calculated the mutation rates for each sub-region (Additional file 10 and Additional file 11) based on its sequence composition [23]. This calculation is not based on any other data used in this manuscript.
Regional score gene-specific prediction test model
To test whether the subRVIS scores are predictive at the single gene level, we designed and implemented a permutation test that predicts the distribution of reported pathogenic variants within a single gene using a set of scores. Genes with less than two regions or less than one reported pathogenic variant were not assessed.
Let n
g
be the number of regions in gene \( \mathit{\mathsf{g}} \). Let \( {\mathit{\mathsf{Y}}}_{\mathit{\mathsf{g}}} \) be a vector of length n
g
containing the counts of reported pathogenic variants across sub-regions in gene \( \mathit{\mathsf{g}} \). Let \( {\mathit{\mathsf{Z}}}_{\mathit{\mathsf{g}}} \) be a vector of length n
g
containing the mutation rates across sub-regions in gene \( \mathit{\mathsf{g}} \), based on sequence composition [23]. Let \( {\mathit{\mathsf{X}}}_{\mathit{\mathsf{g}}} \) be a vector of length n
g
containing the intolerance scores across sub-regions in gene \( \mathit{\mathsf{g}} \).
For each gene, we then calculated the expected distribution of pathogenic variants within the gene based on the sub-region mutation rates and the total number of reported pathogenic variants within the gene:
$$ {E}_{\mathsf{g}i}=\left(\sum_i^{n_g}{Y}_{\mathsf{g}i}\right)\ast \left(\frac{Z_{\mathsf{g}i}}{{\displaystyle {\sum}_i^{n_g}}{Z}_{\mathsf{g}i}}\right). $$
Thus, \( {\mathit{\mathsf{E}}}_{\mathit{\mathsf{g}}} \) is a vector of the expected number of reported pathogenic variants in each sub-region based on the gene’s mutation rates and total number of reported pathogenic variants.
For each sub-region we subtracted the expected number of pathogenic variants \( \left({\mathit{\mathsf{E}}}_{\mathit{\mathsf{g}}}\right) \) from the observed number of pathogenic variants \( \left({\mathit{\mathsf{Y}}}_{\mathit{\mathsf{g}}}\right) \). This vector denotes the departure of each sub-region’s tally of reported pathogenic variants from its expected tally. This vector is designated \( {\mathit{\mathsf{D}}}_{\mathit{\mathsf{g}}} \).
We calculated a score per gene, designated \( {\mathit{\mathsf{C}}}_{\mathit{\mathsf{g}}} \), denoting both the departure of the reported pathogenic variants distribution from the expectation and the relationship with the intolerance score:
$$ {C}_{\mathsf{g}}=cov\left({D}_{\mathsf{g}},{X}_{\mathsf{g}}\right). $$
We expect that a lower intolerance score will correspond with higher pathogenic variant counts than expected, and therefore we expect the covariance to be negative. To test this, we performed a permutation test, where each permutation has a different distribution of the reported pathogenic variants.
For each permutation, we drew the distribution of the reported pathogenic variants from a multinomial distribution, where the number of trials is the total tally of reported pathogenic variants for the gene and the probability for each sub-region is the fraction of the gene’s mutation rate that it occupies \( \left(\frac{Z_{\mathsf{g}i}}{{\displaystyle {\sum}_i^{n_g}}{Z}_{\mathsf{g}i}}\right) \). Following this, we calculated the departure from expectation and the covariance as described above.
We repeated this n
p
= 20,000 times. To test how the intolerance score prediction for the true distribution of pathogenic variants compares to the prediction for the permuted distributions, we counted how many times out of the n
p
= 20,000 permutations the permuted covariance is smaller than or equal to the true covariance. This count is designated G. The permutation P value is calculated by the following equation: (G + 1)/(n
p
+ 1).
Regional score genome-wide prediction test model
To test whether the regional scores are predictive at the level of the genome, we designed and implemented a model that predicts the presence or absence of reported pathogenic variants at each sub-region using a set of scores.
As the presence or absence of reported pathogenic variants in a sub-region will depend greatly on that sub-region’s mutability, what we are trying to determine in this model is whether our scores can predict the presence of pathogenic variation in a sub-region after accounting for the region’s mutability. We divide any score predictors (subRVIS, subGERP, genic RVIS) within this model by their standard deviation in order to allow for the interpretation of the effect sizes in terms of standard deviations.
Previously, we were considering a single gene indexed by g. Here, we are considering genome-wide analysis, therefore we dropped the g from notation. Specifically, let n be the number of sub-regions across all the genes in the genome. Let Y be a vector of length n with components (Y
i
, i = 1,…, n) corresponding to each sub-region taking on either a 1 or a 0, respectively, denoting presence or absence of at least one non-LoF pathogenic variant within the sub-region. Let Z be a vector of length n containing the sub-regions’ mutation rates, based on sequence composition [23]. Let X be a vector of length n containing the sub-regions’ intolerance scores, scaled across all sub-regions by dividing each intolerance score by the standard deviation of all the sub-regions’ intolerance scores.
To evaluate the relationship between the score and the presence of pathogenic variants we fit the following logistic regression model:
$$ logit\left( \Pr \left({Y}_i=1\right)\right)=\alpha +{\beta}_1\ast \log \left({Z}_i\right)+{\beta}_2{X}_i, $$
where i=1,...,n.
Note that β
2 captures the strength of the relationship between X, the intolerance scores, and Y, while adjusting for regional mutability. We refer to β
2 as the ‘score effect size’ and report it in the text as a metric for how well a given model performs.
Next, we wanted to assess how the model performs on negative test sets. For each intolerance score genome-wide test we generated 1,000 resampled response vectors. To create the resampled response vectors, we resampled the 0 s and 1 s within the true response vector (Y) so that sub-regions with a larger mutation rate (Z) were more likely to be assigned a 1. Specifically, we resampled Y without replacement with sampling probabilities given by \( \frac{Z}{{\displaystyle {\sum}_i^n}{Z}_i} \). The idea is that under neutrality the more mutable a region is the more likely it is to carry mutations. By using this resampling method, we preserve the number of regions containing pathogenic variants and therefore the number of zeros and ones in our response vector remains the same.
Using the same genome-wide assessment that we used for the observed data, we tested the intolerance scores’ prediction against each of the 1,000 resampled response vectors. This resulted in a vector of negative test set P values. To test how the observed set of reported pathogenic variants compares to that obtained by resampling, we enumerated, across all (R = 1,000) resampled datasets the number of times (C) the P value from the resampled data analysis is larger than the P value obtained from the observed data analysis. We define our resampling P value as: (R – C + 1)/(R + 1).
AIC comparisons
To compare AICs between two models, we first identify the model with the lower AIC, representing the model estimated to have less information loss. We designated this AICmin and we designated the other AIC as AICmax. To calculate the relative probability that AICmax is the model that minimizes information loss (designated here as p), we calculate:
$$ p= \exp \left(\frac{AI{C}_{min}- AI{C}_{max}}{2}\right). $$
The resulting value indicates that the probability that AICmax minimizes the information loss from AICmin is p. A high p indicates that AICmax may have less information loss than AICmin, while a low p indicates that it is unlikely that AICmax minimizes information loss in comparison to AICmin.
Region permutation test
To test our model on randomly permuted regions, we performed the following:
-
1.
For each gene we took into account the sizes of each of its sub-regions.
-
2.
We permuted the sizes of each gene’s sub-regions, resulting in a set of the same sub-regions in a random order.
-
3.
We then re-divided each gene based on the permuted set of sub-regions. Thus, after the permutation each gene maintains the same number and size distribution of sub-regions as in the biological division.
-
4.
For this permuted set of sub-regions, we generated sub-region intolerance scores, calculated the sub-region mutation rates and counted the number of pathogenic variants in each sub-region.
-
5.
Following this, we tested prediction across the permuted set of sub-regions using the same genome-wide assessment that we used for the biological division.
-
6.
We recorded the effect size of the intolerance scores in this assessment.
-
7.
We repeated steps (1) through (6) 100 times.
-
8.
This resulted in a vector of effect sizes that constitutes our null distribution of effect sizes for the permutation test.
To test how the biological division compares to the permuted divisions, we counted how many times out of the n
p
= 100 permutations the absolute value of the permuted division score effect size is smaller than the absolute value of the biological division score effect size. This count is designated X. The permutation P value is calculated by the following equation: (n
p
− X + 1) /(n
p
+ 1).
GERP scores
To quantify phylogenetic conservation across sub-regions, we generated a novel vector for each sub-region division that simply reflects the average GERP++ [7] score (where available) for those coordinates (Additional file 10 and Additional file 11).
MutationTaster scores
We ran MutationTaster’s QueryEngine (http://www.mutationtaster.org/StartQueryEngine.html, accessed September 2015) [2] with default options, outside of the option to filter against the 1000 Genomes project. By default, this option is selected. As we wanted analysis results for all variants, we deselected this option.
MutationTaster uses a Bayes classifier to determine whether a variant is a polymorphism or disease causing. The classifier has four output options:
-
1.
disease_causing: probably deleterious.
-
2.
disease_causing_automatic: known to be deleterious based on existing databases.
-
3.
polymorphism: probably harmless.
-
4.
polymorphism_automatic: known to be harmless based on existing databases.
Along with the prediction, for each variant MutationTaster outputs an estimated probability for the prediction. More information on MutationTaster can be found at http://www.mutationtaster.org/info/documentation.html or at [2].
To convert these results into a score between 0 and 1, we devised the following criteria:
-
1.
If the prediction is polymorphism, use the probability as the score. This will always be above 0.5. Thus, predicted polymorphisms receive scores in the range of 0.5 to 1.
-
2.
If the prediction is disease_causing, the score is the probability subtracted from 1. As the probability will always be above 0.5, the predicted disease causing variants receive scores in the range of 0 to 0.5.
-
3.
If the prediction is either polymorphism_automatic or disease_causing_automatic, this indicates that the variant’s prediction is based on a database entry, not the Bayes classifier. If the Bayes classifier disagrees with the automatic prediction, the probability will be less than 0.5. In these instances, we reassigned the variant’s prediction to match the Bayes classifier’s and reassigned the probability to 1 minus the originally reported probability. Following this, we treated the variant as described above.
Applying the hot zone approach
For both the autism [15] and epileptic encephalopathies [16] de novo mutations data we limited to single-nucleotide variants, falling in regions for which we have both subRVIS and genic RVIS scores. We calculated each variant’s PolyPhen-2 score using PolyPhen-2 HumVar [4]. Synonymous variants were assigned 0 while canonical splice, stop gain, and stop loss variants were assigned 1. Mutations present as variants in the NHLBI ESP exome variant calls [8] were excluded.
All the mutations in the epileptic encephalopathies data from the Epi4K study [16] are Sanger validated.
For the autism data from [15], we required that the mutations not be called in both siblings. We additionally required that either: (1) at least one of the institutes analyzing the data (Cold Spring Harbor Laboratory, Yale School of Medicine, University of Washington) had validated the mutation; or (2) at least one of the institutes labeled the mutation as a ‘strong’ variant call while no other institute labeled the mutation as ‘not called’ or ‘weak’.
Estimate of the disease risk per protein domain
Given the potential interest in whether some CDD protein domain types are more likely to carry reported pathogenic mutations than others, we have created a table including the tally of reported pathogenic mutations and the cumulative mutation rate for each CDD protein domain type across genes (Additional file 12). This table also includes the tally divided by the cumulative mutation rate. This is meant to serve as an approximate estimate denoting the number of reported pathogenic mutations after controlling for mutation rate. For comparative purposes, a higher value indicates more reported mutations given the sequence context.
Ethical approval
No ethical approval was required.
Availability of data and materials
The data and materials used in this manuscript are either previously published or are available in this publication as Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12. The software used in this publication is available on GitHub (https://github.com/igm-team/subrvis), released under the MIT license.