Table 1. Natural selection features associated with non-coding single-nucleotide variants mined in this work. Features are classified by sequence context (position level, window level, and gene level) and by evolutionary scale: interspecies (vertebrates, mammals, and primates, excluding humans) or recent and ongoing natural selection in humans. The query variant was excluded from the calculations of mean allele frequencies and mean heterozygosity of a given region, and the variant allele frequency itself was not used as a feature in any training or pathogenicity prediction throughout the study.

From: NCBoost classifies pathogenic non-coding variants in Mendelian diseases through supervised learning on purifying selection signals in humans

Category

Sequence context

Evolutionary scale

Model

Feature abbreviation used in this work

Definition

References

Position-level and window-level features

Position-level

Interspecies (mammals)

A

GerpS

Base-wise Rejected Substitution (RS) score defined by Genomic Evolutionary Rate Profiling (GERP++ scores) from mammalian alignments, excluding humans

[63]

Interspecies (mammals)

A

GerpN

Neutral evolution score defined by GERP++, excluding humans

[63]

Recent and ongoing in humans

B

bStatistic

B statistic: background selection score indicating the expected fraction of neutral diversity that is present at a site, with values close to 0 representing near-complete removal of diversity as a result of selection and values near 1 indicating little effect. The B statistic was based on human single-nucleotide polymorphism data from Perlegen Sciences, HapMap phase II, the SeattleSNPs NHLBI Program for Genomic Applications, and the NIEHS Environmental Genome Project

[20]

Interspecies (primates)

A

priPhCons

Primate PhastCons conservation score, excluding humans

[42, 43]

Interspecies (mammals)

A

mamPhCons

Mammalian PhastCons conservation scores, excluding humans

[42, 43]

Interspecies (vertebrates)

A

verPhCons

Vertebrate PhastCons conservation score, excluding humans

[42, 43]

Interspecies (primates)

A

priPhyloP

Primate PhyloP conservation score, excluding humans

[44]

Interspecies (mammals)

A

mamPhyloP

Mammalian PhyloP conservation score, excluding humans

[44]

Interspecies (vertebrates)

A

verPhyloP

Vertebrate PhyloP conservation score, excluding humans

[44]

1000-bp window

Recent and ongoing in humans

B

meanDaf1000G

Mean derived allele frequency of variants in a 1-kb window region calculated from the 1000 Genomes Project (excluding the query variant)

[11]

Recent and ongoing in humans

B

meanHet1000G

Mean heterozygosity of a 1-kb window region calculated from the 1000 Genomes Project (excluding the query variant)

[11]

Recent and ongoing in humans

B

meanMAF1000G

Mean minor allele frequency of variants in a 1-kb window region calculated from the 1000 Genomes Project (excluding the query variant)

[35]

Recent and ongoing in humans

B

meanMAFGnomAD

meanMAF_AFRGnomAD

meanMAF_AMRGnomAD

meanMAF_EASGnomAD

meanMAF_FINGnomAD

meanMAF_NFEGnomAD

meanMAF_OTHGnomAD

meanMAF_ASJGnomAD

Mean minor allele frequency (MAF) of variants in a 1-kb window region calculated from gnomAD genome data (excluding the query variant from the calculation). Mean MAF was assessed for the global population and for specific populations: Africans and African Americans (AFR), Admixed Americans (AMR), East Asians (EAS), Finnish (FIN), non-Finnish Europeans (NFE), Ashkenazi Jewish (ASJ), and other populations (OTH)

[35]

30-kb window

Recent and ongoing in humans

B

TajimasD_CHB_pvalue

TajimasD_CEU_pvalue

TajimasD_YRI_pvalue

Tajima’s D p value: neutrality test that compares estimates of the number of segregating sites and the mean pairwise difference between sequences. The test is performed within three populations of the 1000 Genomes Project (CEU, CHB, and YRI), producing population-specific scores.

[65]

30-kb window

Recent and ongoing in humans

B

FuLisD_CEU_pvalue

FuLisD_CHB_pvalue

FuLisD_YRI_pvalue

FuLisF_CEU_pvalue

FuLisF_CHB_pvalue

FuLisF_YRI_pvalue

Fu and Li’s F* p value: neutrality test that compares the number of singletons with the average number of nucleotide differences between pairs of sequences. Fu and Li’s D* p value: neutrality test that compares the number of singletons with the total number of mutations in a genomic region within a group. These tests are performed within three populations of the 1000 Genomes Project (CEU, CHB, and YRI), producing population-specific scores.

[35]

10-bp window

Recent and ongoing in humans

B

CDTS

The Context-Dependent Tolerance Score (CDTS) represents the difference between observed and expected variation in humans. The expected variation is computed for each nucleotide genome-wide as the probability of variation of that nucleotide given its heptanucleotide context. CDTS was computed on 11,257 unrelated individuals.

[36]

75-bp flanking region

N/A

D

GC

Percent GC in a window of ± 75 bp

[63]

N/A

D

CpG

Percent CpG in a window of ± 75 bp

[63]

Gene-level features

Non-coding region of the closest gene (a)

Recent and ongoing in humans

C

ncRVIS

Non-coding RVIS (ncRVIS) is a measure of the departure from the genome-wide average number of common variants found in the non-coding sequence of genes with a similar amount of non-coding mutational burden in humans. ncRVIS was computed on an in-house collection of whole-genome sequences from 690 individuals.

[72]

Interspecies (mammals)

C

ncGERP

Average GERP++ score across a gene’s non-coding sequence

[46]

Coding region of the closest gene

Interspecies (primates)

C

dN/dS

Primate dN/dS ratio, providing a measure of the coding-sequence conservation across primates

[68]

Recent and ongoing in humans

C

pLI

Probability of being loss-of-function intolerant (intolerant of heterozygous and homozygous loss-of-function variants), assessed from the ExAC database.

[35]

Recent and ongoing in humans

C

RVIS percentile

Residual Variation Intolerance Score (RVIS) percentile, a measure of the departure from the average number of common functional mutations in genes with a similar amount of mutational burden in humans. RVIS was assessed on sequence data from 6503 whole exomes from the NHLBI Exome Sequencing Project (ESP)

[71]

Recent and ongoing in humans

C

GDI

Gene Damage Index, a gene-level metric of the mutational damage that has accumulated in the general population, based on CADD scores and on the 1000 Genomes Project data

[70]

Phylogenetic gene features

N/A

C

familyMemberCount

Number of human paralogs of the gene: family member count (FMC) in the OGEE database

[74]

N/A

C

gene_age

Gene age: an estimate of the origination time of a gene, inferred from the presence or absence of orthologs in the vertebrate phylogeny.

[73]

  1. bp base pairs, GERP Genomic Evolutionary Rate Profiling, RS Rejected Substitution, N/A not applicable
  2. (a) Non-coding region of the closest gene, defined in the original publication of ncRVIS and ncGERP as the collection of the 5′UTR, the 3′UTR, and an additional non-exonic 250 bp upstream of the transcription start site (TSS)
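
The window-level frequency features in Table 1 (for example meanMAF1000G, meanHet1000G, and the meanMAF*GnomAD family) average per-variant statistics over a 1-kb region while leaving out the query variant. The following Python sketch illustrates that exclusion step only; it is not the pipeline used in the study. The `Variant` record, the `window_means` function, the choice of a window centered on the query (±500 bp), averaging over variant sites rather than over all positions, and the use of expected heterozygosity 2p(1 − p) for a biallelic site are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal record for one biallelic variant."""
    chrom: str
    pos: int    # 1-based genomic position
    ref: str
    alt: str
    af: float   # alternate allele frequency in [0, 1]


def window_means(
    query: Variant,
    variants: Iterable[Variant],
    half_width: int = 500,  # assumption: 1-kb window centered on the query
) -> Tuple[Optional[float], Optional[float]]:
    """Return (mean MAF, mean heterozygosity) over the variants lying within
    +/- half_width bp of the query, skipping the query variant itself.
    Returns (None, None) when no other variant falls inside the window."""
    mafs, hets = [], []
    for v in variants:
        if v.chrom != query.chrom or abs(v.pos - query.pos) > half_width:
            continue  # outside the window
        if (v.pos, v.ref, v.alt) == (query.pos, query.ref, query.alt):
            continue  # exclude the query variant from the averages
        mafs.append(min(v.af, 1.0 - v.af))      # minor allele frequency
        hets.append(2.0 * v.af * (1.0 - v.af))  # assumed: expected heterozygosity
    if not mafs:
        return None, None
    return sum(mafs) / len(mafs), sum(hets) / len(hets)


# Usage with made-up frequencies: the query and the variants at positions
# 1600 (outside the window) and on chromosome 2 do not contribute.
query = Variant("1", 1000, "A", "G", 0.4)
neighbours = [
    query,
    Variant("1", 900, "C", "T", 0.1),
    Variant("1", 1400, "G", "A", 0.8),
    Variant("1", 1600, "T", "C", 0.5),
    Variant("2", 1000, "A", "G", 0.3),
]
mean_maf, mean_het = window_means(query, neighbours)
```

Matching the query on position and alleles, rather than on position alone, keeps other alleles observed at the same site in the average; whether the original feature computation treats multi-allelic sites this way is not stated in the table.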