Category | Sequence context | Evolutionary scale | Model | Feature abbreviation used in this work | Definition | References |
---|---|---|---|---|---|---|
Position-level and window-level features | Position-level | Interspecies (mammals) | A | GerpS | Base-wise rejected substitution (RS) score from Genomic Evolutionary Rate Profiling (GERP++), computed from mammalian alignments excluding humans | [63] |
Position-level and window-level features | Position-level | Interspecies (mammals) | A | GerpN | Neutral evolution score defined by GERP++, excluding humans | [63] |
Position-level and window-level features | Position-level | Recent and ongoing in humans | B | bStatistic | B statistic: background selection score indicating the expected fraction of neutral diversity that is present at a site, with values close to 0 representing near-complete removal of diversity as a result of selection and values near 1 indicating little effect. The B statistic was based on human single-nucleotide polymorphism data from Perlegen Sciences, HapMap phase II, the SeattleSNPs NHLBI Program for Genomic Applications, and the NIEHS Environmental Genome Project | [20] |
Position-level and window-level features | Position-level | Interspecies (primates) | A | priPhCons | Primate PhastCons conservation score, excluding humans | |
Position-level and window-level features | Position-level | Interspecies (mammals) | A | mamPhCons | Mammalian PhastCons conservation score, excluding humans | |
Position-level and window-level features | Position-level | Interspecies (vertebrates) | A | verPhCons | Vertebrate PhastCons conservation score, excluding humans | |
Position-level and window-level features | Position-level | Interspecies (primates) | A | priPhyloP | Primate PhyloP conservation score, excluding humans | [44] |
Position-level and window-level features | Position-level | Interspecies (mammals) | A | mamPhyloP | Mammalian PhyloP conservation score, excluding humans | [44] |
Position-level and window-level features | Position-level | Interspecies (vertebrates) | A | verPhyloP | Vertebrate PhyloP conservation score, excluding humans | [44] |
Position-level and window-level features | 1000-bp window | Recent and ongoing in humans | B | meanDaf1000G | Mean derived allele frequency of variants in a 1-kb window region, calculated from the 1000 Genomes Project (excluding the query variant) | [11] |
Position-level and window-level features | 1000-bp window | Recent and ongoing in humans | B | meanHet1000G | Mean heterozygosity in a 1-kb window region, calculated from the 1000 Genomes Project (excluding the query variant) | [11] |
Position-level and window-level features | 1000-bp window | Recent and ongoing in humans | B | meanMAF1000G | Mean minor allele frequency of variants in a 1-kb flanking region, calculated from the 1000 Genomes Project (excluding the query variant) | [35] |
Position-level and window-level features | 1000-bp window | Recent and ongoing in humans | B | meanMAFGnomAD, meanMAF_AFRGnomAD, meanMAF_AMRGnomAD, meanMAF_EASGnomAD, meanMAF_FINGnomAD, meanMAF_NFEGnomAD, meanMAF_OTHGnomAD, meanMAF_ASJGnomAD | Mean minor allele frequency (MAF) of variants in a 1-kb window region, calculated from gnomAD genome data (excluding the query variant). Mean MAF was assessed for the global population and for each population separately: Africans and African Americans (AFR), Admixed Americans (AMR), East Asians (EAS), Finnish (FIN), non-Finnish Europeans (NFE), Ashkenazi Jewish (ASJ), and other populations (OTH) | [35] |
Position-level and window-level features | 30-kb window | Recent and ongoing in humans | B | TajimasD_CHB_pvalue, TajimasD_CEU_pvalue, TajimasD_YRI_pvalue | Tajima’s D p value: neutrality test that compares estimates of the number of segregating sites and the mean pairwise difference between sequences. The test is performed within 3 subpopulations of the 1000 Genomes Project, producing population-specific scores. | [65] |
Position-level and window-level features | 30-kb window | Recent and ongoing in humans | B | FuLisD_CEU_pvalue, FuLisD_CHB_pvalue, FuLisD_YRI_pvalue, FuLisF_CEU_pvalue, FuLisF_CHB_pvalue, FuLisF_YRI_pvalue | Fu and Li’s F* p value: neutrality test that compares the number of singletons with the average number of nucleotide differences between pairs of sequences. Fu and Li’s D* p value: neutrality test that compares the number of singletons with the total number of mutations in a genomic region within a group. These tests are performed within 3 subpopulations of the 1000 Genomes Project, producing population-specific scores. | [35] |
Position-level and window-level features | 10-bp window | Recent and ongoing in humans | B | CDTS | The Context-Dependent Tolerance Score (CDTS) represents the difference between observed and expected variation in humans. The expected variation is computed genome-wide as the probability of variation of each nucleotide given its heptanucleotide context. CDTS was computed on 11,257 unrelated individuals. | [36] |
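Position-level and window-level features | 75-bp flanking region | N/A | D | GC | Percent GC in a window of ±75 bp | [63] |
Position-level and window-level features | 75-bp flanking region | N/A | D | CpG | Percent CpG in a window of ±75 bp | [63] |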
Gene-level features | Non-coding region of the closest gene (a) | Recent and ongoing in humans | C | ncRVIS | Non-coding RVIS: a measure of the departure from the genome-wide average number of common variants found in the non-coding sequence of genes with a similar amount of non-coding mutational burden in humans. ncRVIS was computed on an in-house collection of whole-genome sequences from 690 individuals. | [72] |
Gene-level features | Non-coding region of the closest gene (a) | Interspecies (mammals) | C | ncGERP | Average GERP++ score across a gene’s non-coding sequence | [46] |
Gene-level features | Coding region of the closest gene | Interspecies (primates) | C | dN/dS | Primate dN/dS ratio, providing a measure of coding-sequence conservation across primates | [68] |
Gene-level features | Coding region of the closest gene | Recent and ongoing in humans | C | pLI | Probability of being loss-of-function intolerant (intolerant of heterozygous and homozygous loss-of-function variants), assessed from the ExAC database | [35] |
Gene-level features | Coding region of the closest gene | Recent and ongoing in humans | C | RVIS percentile | Residual Variation Intolerance Score (RVIS) percentile, a measure of the departure from the average number of common functional mutations in genes with a similar amount of mutational burden in humans. RVIS was assessed on sequence data from 6,503 whole exomes from the NHLBI Exome Sequencing Project (ESP) | [71] |
Gene-level features | Coding region of the closest gene | Recent and ongoing in humans | C | GDI | Gene Damage Index, a gene-level metric of the mutational damage that has accumulated in the general population, based on CADD scores and on 1000 Genomes Project data | [70] |
Gene-level features | Phylogenetic gene features | N/A | C | familyMemberCount | Number of human paralogs of the gene: family member count (FMC) in the OGEE database | [74] |
Gene-level features | Phylogenetic gene features | N/A | C | gene_age | Gene age: an estimate of the origination time of a gene, inferred from the presence or absence of orthologs in the vertebrate phylogeny | [73] |
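
To make the window-level definitions in the table concrete, the following minimal Python sketch computes percent GC and percent CpG in a ±75-bp window, and the mean minor allele frequency in a 1-kb window excluding the query variant. It is an illustration under stated assumptions, not the pipeline used in this work: the function names `gc_and_cpg_percent` and `mean_maf_in_window` are hypothetical, the 1-kb window is assumed to be centered on the query position (±500 bp), and percent CpG is assumed to be the share of dinucleotide positions in the window that are CG.

```python
from typing import Iterable, Optional, Tuple


def gc_and_cpg_percent(seq: str, center: int, flank: int = 75) -> Tuple[float, float]:
    """Percent GC and percent CpG in the window center ± flank (0-based center)."""
    start = max(0, center - flank)
    end = min(len(seq), center + flank + 1)
    window = seq[start:end].upper()
    if not window:
        raise ValueError("empty window")
    gc_count = sum(base in "GC" for base in window)
    # Assumption: CpG percent = CG dinucleotides / dinucleotide positions in the window.
    n_dinuc = len(window) - 1
    cpg_count = sum(window[i:i + 2] == "CG" for i in range(n_dinuc))
    gc_pct = 100.0 * gc_count / len(window)
    cpg_pct = 100.0 * cpg_count / n_dinuc if n_dinuc > 0 else 0.0
    return gc_pct, cpg_pct


def mean_maf_in_window(
    variants: Iterable[Tuple[int, float]],
    query_pos: int,
    window: int = 1000,
) -> Optional[float]:
    """Mean minor allele frequency of variants near query_pos.

    variants: (position, alternate allele frequency) pairs on one chromosome.
    The query variant itself is excluded, as in the table definitions.
    Assumption: the window is centered on the query position (± window // 2).
    Returns None when no other variant falls inside the window.
    """
    half = window // 2
    mafs = [
        min(af, 1.0 - af)  # fold the allele frequency to the minor allele
        for pos, af in variants
        if pos != query_pos and abs(pos - query_pos) <= half
    ]
    return sum(mafs) / len(mafs) if mafs else None
```

The same windowing logic would apply to the mean derived allele frequency (using the unfolded frequency of the derived allele) and to mean heterozygosity (for example 2p(1 − p) per site under Hardy-Weinberg proportions, which is itself an assumption here).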