Compiling a list of candidate SNPs from the ASE gene list
We first matched the ASE gene names to the genes in the GENCODE comprehensive gene annotation file (release 19, GRCh37.p13) to pinpoint the locations of the exons of these genes. For each gene, we found all the SNPs that overlapped its exons by using a VCF file of 2504 individuals from the 1000 Genomes database. We then calculated the heterozygous genotype frequency of each of these SNPs as \( f\left(\mathrm{genotype}_{\mathrm{SNP}_i}=1\right)=\frac{\#\ \mathrm{of}\ \mathrm{individuals}\ \mathrm{with}\ \mathrm{genotype}_{\mathrm{SNP}_i}=1}{\mathrm{total}\ \#\ \mathrm{of}\ \mathrm{individuals}} \). We removed the SNPs that had \( 0.1 < f\left(\mathrm{genotype}_{\mathrm{SNP}_i}=1\right) < 0.5 \) from the overlap list (see Supplementary Information for the rationale) and added the remaining SNPs to the candidate SNP list. We repeated this procedure for all of the genes in the list to obtain one final candidate SNP list.
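The frequency filter can be illustrated with a minimal Python sketch. The genotype encoding (0, 1, 2, with 1 meaning heterozygous), the SNP identifiers, and the function names are assumptions made for illustration only; reading the VCF file and overlapping SNPs with exons are not shown.

```python
def heterozygous_frequency(genotypes):
    """Fraction of individuals whose genotype equals 1 (heterozygous)."""
    return sum(1 for g in genotypes if g == 1) / len(genotypes)


def candidate_snps(exonic_genotypes, low=0.1, high=0.5):
    """Return the SNP ids that remain after removing SNPs with low < f < high."""
    kept = []
    for snp_id, genotypes in exonic_genotypes.items():
        f = heterozygous_frequency(genotypes)
        if not (low < f < high):
            kept.append(snp_id)
    return kept


# Toy panel of 10 individuals (hypothetical data, assumed 0/1/2 encoding).
toy = {
    "rs_a": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # f = 0.1, not strictly inside (0.1, 0.5)
    "rs_b": [1, 1, 1, 0, 0, 2, 0, 0, 0, 0],  # f = 0.3, removed
    "rs_c": [1, 1, 1, 1, 1, 1, 0, 0, 2, 2],  # f = 0.6, kept
}
print(candidate_snps(toy))
```

In this sketch the bounds are treated as strict inequalities, following the text; a SNP with f exactly equal to 0.1 or 0.5 is therefore kept.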
Linking attacks
Let us assume that there are n SNPs in total that can be observed in humans (e.g., all of the SNPs observed in the 1000 Genomes Project). We can represent an individual’s genome as a set \( S = \{g_1, g_2, \dots, g_n\} \), where \( g_i \) is the genotype of the ith SNP. The candidate SNPs obtained using ASE genes form a subset of S, and their genotypes are assumed to be heterozygous (\( g_i = 1 \)), i.e., \( S_{can} = \{g_1 = x, g_2 = 1, \dots, g_n = x\} \), where \( g_i = x \) means that SNP i is not in the candidate list and hence its genotype is unknown.
Scenario 1
Let us assume we have an ASE gene list of a known individual. This means we can compile a list of heterozygous SNPs for this known individual. In this case, \( S_{can} = \{g_1 = x, g_2 = 1, \dots, g_n = x\} \) is the set of candidate genotypes for the known individual. The goal is to recover the genotypes for all of the SNPs in the set. Let us assume we have access to a database D of anonymized genomes. Each anonymized genome j in the database can be represented as \( S_j^D=\left\{g_1, g_2, \dots, g_n\right\} \), where each genotype \( g_i \) is known.
Best match approach
For each individual i in the database, we find the intersection \( S_{can}\cap S_i^D \) and calculate a linking score \( L\left(i, can\right)=\sum_{t=1}^{\left|S_{can}\cap S_i^D\right|}\frac{1}{\log_2 f\left(g_t=1\right)} \), where \( f(g_t = 1) \) is the ratio of the number of individuals whose tth SNP has the heterozygous genotype (\( g_t = 1 \)) to the total number of individuals in D (previously defined in [12]). To recover the genome of the known individual, we rank the \( L(i, can) \) scores of all genomes in D in decreasing order and take the genome with the highest score as the genome of the known individual with the candidate SNPs. To assess the statistical robustness of this prediction, we use our previously defined gap measure, which is the ratio between the score of the first-ranked individual, \( max = L(i, can)_1 \), and that of the second-ranked individual, \( max_2 = L(i, can)_2 \), i.e., \( gap = max/max_2 \). We further calculate the statistical significance of the gap by repeating the above attack one thousand times with randomly generated candidate SNPs (as many as in the original candidate list) and comparing the real gap value against the distribution of random gap values.
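The ranking and gap steps can be sketched as follows. This is an illustrative outline, not the implementation used in the study: the per-SNP term of the linking score is passed in as a precomputed `weights` mapping (a hypothetical name), and genomes are assumed to be dictionaries from SNP id to a 0/1/2 genotype code.

```python
def linking_score(candidate, genome, weights):
    """Sum the per-SNP weights over candidate SNPs that are heterozygous in genome."""
    return sum(weights[snp] for snp in candidate if genome.get(snp) == 1)


def best_match_with_gap(candidate, database, weights):
    """Rank genomes by linking score; return (best genome id, gap = max / max2)."""
    scores = {gid: linking_score(candidate, g, weights) for gid, g in database.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, second = ranked[0], ranked[1]
    # Guard against division by zero when the runner-up has no matching SNPs.
    gap = scores[best] / scores[second] if scores[second] != 0 else float("inf")
    return best, gap


# Hypothetical toy data: three candidate SNPs and three anonymized genomes.
candidate = ["s1", "s2", "s3"]
weights = {"s1": 2.0, "s2": 1.0, "s3": 3.0}
database = {
    "A": {"s1": 1, "s2": 1, "s3": 1},  # all three candidates heterozygous
    "B": {"s1": 1, "s2": 0, "s3": 0},  # only s1 heterozygous
    "C": {"s1": 0, "s2": 1, "s3": 2},  # only s2 heterozygous
}
print(best_match_with_gap(candidate, database, weights))
```

For the significance test described above, the same function could be called on randomly drawn candidate lists of the same size, collecting the resulting gap values into an empirical distribution.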
Entropy approach
The goal of this approach is to assign to each genome in D a probability of being correctly linked to the ASE list, which yields a distribution over the genomes in D. This approach is adapted from Narayanan and Shmatikov [14]. We calculate the probability of linking the candidate SNP list to a genome i in D as \( \pi \left(i, can\right)=c\cdot e^{\frac{L\left(i, can\right)}{\sigma }} \), where c is a constant chosen to satisfy \( \sum_i \pi(i, can) = 1 \), \( L(i, can) \) is the linking score described above, and σ is the standard deviation of the linking scores (Additional file 5: Fig. S4).
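As a minimal sketch, the normalization can be written in Python as below. The text does not specify whether σ is the population or the sample standard deviation; the population form is assumed here. Subtracting the maximum score before exponentiating is a numerical-stability choice that is absorbed by the constant c and does not change the result.

```python
import math
import statistics


def linking_probabilities(scores):
    """Map linking scores L(i, can) to probabilities c * exp(L / sigma) that sum to 1."""
    values = list(scores.values())
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all linking scores are identical; sigma is zero")
    top = max(values)
    # Shifting by the maximum avoids overflow; the shift cancels in the normalization.
    unnormalized = {gid: math.exp((s - top) / sigma) for gid, s in scores.items()}
    total = sum(unnormalized.values())
    return {gid: u / total for gid, u in unnormalized.items()}


# Hypothetical linking scores for three genomes in D.
probs = linking_probabilities({"A": 6.0, "B": 2.0, "C": 1.0})
print(probs)
```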
Scenario 2
The mathematical formulation of scenario 2 is the same as that of scenario 1. The only difference is that we have the genome of the known individual and try to link this known genome to an anonymized ASE gene list, which is connected to a potentially private phenotype.
Identification of the top 20 common genes
After linking each of the 382 ASE gene lists to a genome in D, we calculated the accuracy of the linking. We then separated the gene lists into two categories: (1) lists that led to correct re-identification and (2) lists that led to misidentification. We identified the genes that were shared across many ASE gene lists in both categories. Among the top 20 shared genes, we found that HLA genes were in the lists of >90% of both correctly re-identified and misidentified individuals. We then selectively removed different groups of genes (HLA genes, and genes at the intersection of both groups) and performed the linking attacks again.
Usage of auxiliary data
We added one or two more features to our sets \( S_j^D \) (the genotypes of genome j in database D) and \( S_{can} \) (the candidate SNP genotype list) so that the new sets include not only genotypes but also biological sex and/or ancestry features. \( {S'}_{can} = \{g_1 = x, g_2 = 1, \dots, g_n = x, \mathrm{sex} = \mathrm{M/F}, \mathrm{ancestry} = \mathrm{EUR/AFR/AMR/EAS/SAS}\} \) and \( {S'}_j^D=\left\{g_1, g_2, \dots, g_n, \mathrm{sex}, \mathrm{ancestry}\right\} \) are our new sets, and we use the intersection \( {S'}_{can}\cap {S'}_j^D \) to calculate the linking scores. Here, M and F are used for biologically male and female individuals, respectively. EUR, AFR, AMR, EAS, and SAS correspond to European, African, Admixed American, East Asian, and South Asian ancestries, respectively.
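One way to picture the extended intersection is the sketch below, in which each set is a dictionary from feature name to value and unknown genotypes (x) are stored as `None`. The dictionary representation and the function name are assumptions for illustration. How the sex and ancestry matches are weighted in the linking score is not specified here, so the sketch only returns the matching features.

```python
def extended_intersection(candidate, genome):
    """Return the features (SNPs, sex, ancestry) whose known candidate value matches the genome."""
    return {
        feature
        for feature, value in candidate.items()
        if value is not None and genome.get(feature) == value
    }


# Hypothetical extended candidate set and one extended database genome.
candidate_ext = {"g1": None, "g2": 1, "g5": 1, "sex": "F", "ancestry": "EUR"}
genome_ext = {"g1": 0, "g2": 1, "g5": 2, "sex": "F", "ancestry": "AFR"}
print(sorted(extended_intersection(candidate_ext, genome_ext)))
```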