Table 1 Natural selection features associated with non-coding single-nucleotide variants mined in this work. Features are classified under different categories depending on the sequence context (i.e., position level, window level, and gene level) and evolutionary scale: interspecies (vertebrates, mammals, and primates, excluding humans), or recent and ongoing natural selection in humans. The query variant was excluded from the calculations of mean allele frequencies and mean heterozygosity of a given region, and the variant allele frequency itself was not used as a feature in any training or pathogenicity prediction throughout the study.

From: NCBoost classifies pathogenic non-coding variants in Mendelian diseases through supervised learning on purifying selection signals in humans

Category Sequence context Evolutionary scale Model Feature abbreviation used in this work Definition References
Position-level and window-level features Position-level Interspecies (mammals) A GerpS Base-wise Rejected Substitution (RS) score defined by Genomic Evolutionary Rate Profiling (GERP++ scores) from mammalian alignments, excluding humans [63]
Interspecies (mammals) A GerpN Neutral evolution score defined by GERP++, excluding humans [63]
Recent and ongoing in humans B bStatistic B statistic: background selection score indicating the expected fraction of neutral diversity that is present at a site, with values close to 0 representing near complete removal of diversity as a result of selection and values near 1 indicating little effect. The B statistic was based on human single-nucleotide polymorphism data from Perlegen Sciences, HapMap phase II, the SeattleSNPs NHLBI Program for Genomic Applications, and the NIEHS Environmental Genome Project [20]
Interspecies (primates) A priPhCons Primate PhastCons conservation score, excluding humans [42, 43]
Interspecies (mammals) A mamPhCons Mammalian PhastCons conservation score, excluding humans [13, 14]
Interspecies (vertebrates) A verPhCons Vertebrate PhastCons conservation score, excluding humans [42, 43]
Interspecies (primates) A priPhyloP Primate PhyloP conservation score, excluding humans [44]
Interspecies (mammals) A mamPhyloP Mammalian PhyloP conservation score, excluding humans [44]
Interspecies (vertebrates) A verPhyloP Vertebrate PhyloP conservation score, excluding humans [44]
1000-bp window Recent and ongoing in humans B meanDaf1000G Mean derived allele frequency of variants in 1-kb window region calculated from the 1000 Genomes Project (excluding the query variant) [11]
Recent and ongoing in humans B meanHet1000G Mean heterozygosity of 1-kb window region calculated from the 1000 Genomes Project (excluding the query variant) [11]
Recent and ongoing in humans B meanMAF1000G Mean minor allele frequency of variants in 1-kb flanking region calculated from the 1000 Genomes Project (excluding the query variant) [35]
Recent and ongoing in humans B meanMAFGnomAD, meanMAF_AFRGnomAD, meanMAF_AMRGnomAD, meanMAF_EASGnomAD, meanMAF_FINGnomAD, meanMAF_NFEGnomAD, meanMAF_OTHGnomAD, meanMAF_ASJGnomAD Mean minor allele frequency of variants in 1-kb window region calculated from GnomAD genome data (excluding the query variant from the calculation). Mean MAF was assessed for the global population and for population-specific frequencies: Africans and African Americans (AFR), Admixed Americans (AMR), East Asians (EAS), Finnish (FIN), non-Finnish Europeans (NFE), Ashkenazi Jewish (ASJ), and other populations (OTH) [35]
30-kb window Recent and ongoing in humans B TajimasD_CHB_pvalue, TajimasD_CEU_pvalue, TajimasD_YRI_pvalue Tajima’s D p value: neutrality test that compares estimates of the number of segregating sites and the mean pair-wise difference between sequences. The test is performed within 3 subpopulations of the 1000 Genomes Project, producing population-specific scores. [65]
30-kb window Recent and ongoing in humans B FuLisD_CEU_pvalue, FuLisD_CHB_pvalue, FuLisD_YRI_pvalue, FuLisF_CEU_pvalue, FuLisF_CHB_pvalue, FuLisF_YRI_pvalue Fu and Li’s F* p value: neutrality test that compares the number of singletons with the average number of nucleotide differences between pairs of sequences. Fu and Li’s D* p value: neutrality test that compares the number of singletons with the total number of mutations in a genomic region within a group. These tests are performed within 3 subpopulations of the 1000 Genomes Project, producing population-specific scores. [35]
10-bp window Recent and ongoing in humans B CDTS The Context-Dependent Tolerance Score (CDTS) represents the difference between observed and expected variation in humans. The expected variation is computed for each nucleotide genome-wide as the probability of variation of each nucleotide depending on its heptanucleotide context. CDTS was computed on 11,257 unrelated individuals. [36]
75-bp flanking region N/A D GC Percent GC in a window of ± 75 bp [63]
N/A D CpG Percent CpG in a window of ± 75 bp [63]
Gene-level features Non-coding region of the closest gene (a) Recent and ongoing in humans C ncRVIS Non-coding RVIS is a measure of the departure from the genome-wide average number of common variants found in the non-coding sequence of genes with a similar amount of non-coding mutational burden in humans. ncRVIS was computed on an in-house collection of whole genome sequencing of 690 individuals. [72]
Interspecies (mammals) C ncGERP Average GERP++ score across a gene’s non-coding sequence [46]
Coding region of the closest gene Interspecies (primates) C dN/dS Primate dN/dS ratio, providing a measure of the coding-sequence conservation across primates [68]
Recent and ongoing in humans C pLI Probability of being loss-of-function intolerant (intolerant of heterozygous and homozygous loss-of-function variants), assessed from the ExAC database. [35]
Recent and ongoing in humans C RVIS percentile Residual Variation Intolerance Score (RVIS) percentile, a measure of the departure from the average number of common functional mutations in genes with a similar amount of mutational burden in humans. RVIS was assessed on sequence data from 6503 whole exomes from the NHLBI Exome Sequencing Project (ESP) [71]
Recent and ongoing in humans C GDI Gene Damage Index, a gene-level metric of the mutational damage that has accumulated in the general population, based on CADD scores and on the 1000 Genomes Project data [70]
Phylogenetic gene features N/A C familyMemberCount Number of human paralogs of the gene: family member count (FMC) in the OGEE database [74]
N/A C gene_age Gene age: an estimate of the origination time of a gene, inferred from the presence or absence of orthologs in the vertebrate phylogeny. [73]
  1. bp base pairs, GERP Genomic Evolutionary Rate Profiling, RS Rejected Substitution, N/A not applicable
  2. (a) Non-coding region of the closest gene, defined in the original publications of ncRVIS and ncGERP as the collection of 5′UTR, 3′UTR, and an additional non-exonic 250 bp upstream of the transcription start site (TSS)
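
As an illustration of how two of the window-level definitions in the table could be computed, the following is a minimal sketch rather than the pipeline used in the study. The function names, the tuple layout of the variant records, the centering of the 1-kb window on the query position, and the choice to express CpG content as the share of dinucleotide positions that read CG are all assumptions made for this example.

```python
from typing import Iterable, Optional, Tuple


def gc_and_cpg_percent(seq: str, center: int, flank: int = 75) -> Tuple[float, float]:
    """Percent GC and percent CpG in a window of +/- `flank` bp around `center`.

    `seq` is a plain nucleotide string and `center` a 0-based index into it
    (both hypothetical inputs). The window is clipped at the sequence ends.
    """
    start = max(0, center - flank)
    end = min(len(seq), center + flank + 1)
    window = seq[start:end].upper()
    if not window:
        raise ValueError("empty window: check `center` against the sequence length")
    gc_count = sum(base in "GC" for base in window)
    # Assumption: CpG percent = CG dinucleotides / number of dinucleotide positions.
    cpg_count = sum(window[i:i + 2] == "CG" for i in range(len(window) - 1))
    gc_pct = 100.0 * gc_count / len(window)
    cpg_pct = 100.0 * cpg_count / max(1, len(window) - 1)
    return gc_pct, cpg_pct


def mean_maf_in_window(
    variants: Iterable[Tuple[int, str, float]],
    query_pos: int,
    query_alt: str,
    window: int = 1000,
) -> Optional[float]:
    """Mean minor allele frequency of variants in a window around the query.

    Each variant is a hypothetical (position, alt_allele, allele_frequency)
    tuple. The query variant itself is skipped, mirroring the exclusion stated
    in the table caption. Returns None when no other variant falls in the
    window (an assumption; a pipeline might impute a default instead).
    """
    half = window // 2
    mafs = []
    for pos, alt, af in variants:
        if abs(pos - query_pos) > half:
            continue  # outside the window
        if pos == query_pos and alt == query_alt:
            continue  # exclude the query variant from its own feature
        mafs.append(min(af, 1.0 - af))  # fold allele frequency into a MAF
    return sum(mafs) / len(mafs) if mafs else None
```

Skipping the query variant keeps the feature independent of the allele frequency of the variant being scored, which is the property the caption emphasizes.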