Meta-analysis
Summary statistics for sex-combined autosomal analyses were previously published [24]. Following the same procedure, we carried out meta-analyses stratified by sex for 5 lipid traits (HDL-C, LDL-C, TG, nonHDL-C, and TC) for both the autosomes and chromosome X. The sample size for chromosome X (Total N = 1,238,180; males = 749,391; females = 562,410) was lower than available for autosomes as not all participating biobanks submitted results for chromosome X. Quality control of summary statistics from contributing cohorts was performed using EasyQC [87]. Prior to meta-analysis, we removed variants with low imputation info scores (r2 < 0.3), those with minor allele count < 3, and those with Hardy–Weinberg equilibrium p-value < 1 × 10−8. Variants on the X chromosome were filtered using the female imputation info scores and Hardy–Weinberg equilibrium p-values. Summary statistics were corrected by the genomic-control factor calculated from the median p-value of variants with minor allele frequency > 0.5%. For cohorts that contributed summary statistics imputed both on the Haplotype Reference Consortium (HRC) and 1000 Genomes Population v3 (1KGP3) panels, we generated a single file containing all possible variants, favoring those imputed from the HRC imputation panel due to generally higher imputation quality of these variants. Multi-ancestry meta-analysis was performed with MR-MEGA [88] with 5 principal components and using the inverse-variance weighted method in METAL to estimate effect sizes [89]. Independent loci were defined with physical distance > 500 kb or genetic distance > 0.25 cM, whichever one would result in a larger window, followed by a conditional analysis using rareGWAMA [90] as previously described [24], to identify index variants that were shadows of nearby, more-significant associations. Conditional analysis for chromosome X used a female-only UK Biobank LD reference (N = 21,510). In line with the analysis in the autosomes, a locus was identified as dependent if the effect size after conditioning on the most significant variant in the area was more than 1.43 times smaller than the original (95th percentile of the effect size ratios for chromosome X).
Differences in effect size between males and females were tested within each cohort using [91]:
$$Z=\frac{{B}_{m}-{B}_{f}}{\sqrt{{se}_{m}^{2}+{se}_{f}^{2}-2*r*{se}_{f}*{se}_{m}}}$$
and were then meta-analyzed across studies using METAL, to account for cohort-specific ascertainment (e.g., enrichment of cases for type 2 diabetes), or demographics, such as age.
Replication
We collected summary statistics from 8 cohorts across 6 ancestry groups, including African or African American, East Asian, European, Hispanic, Middle Eastern, and South Asian. Each cohort provided sex-stratified and X chromosome association results for the tested traits, as available. The difference in effect sizes between males and females was calculated within each cohort as described above and then meta-analyzed across studies using METAL. X chromosome association results were meta-analyzed using METAL with weighting by sample size.
Gene prioritization methods
Closest gene
We annotated the closest gene to the lipid multi-ancestry index variants [24] by identifying the closest gene transcript on either side (500 kb) of the index variant [92].
Colocalization with GTEx eQTLs
For each of the five lipid phenotypes, we first lifted over GWAS summary statistics from the multi-ancestry meta-analysis [24] to GRCh38 using the UCSC liftOver tool. Then, we defined a set of approximately independent windows across the genome within which colocalization with eQTLs was run. To define these, we first identified all genome-wide significant variants (p-value < 5e − 08) from the meta-analysis for each lipid trait and sorted them by significance, from most significant to least. Starting with the most significant variant, we aimed to define a window defining independent genetic signals; we define a variant’s window as a region within the greater of 500 kb or 0.25 cM on either side of this “sentinel variant.” Genetic distances were defined using reference maps from HapMap 3. All other genome-wide significant variants within this window were discarded from the list of sentinel variants, and similar windows were defined for the remaining genome-wide significant variants.
We ran an eQTL colocalization using GTEx v8 eQTL summary statistics within each of our defined windows for all lipid traits. For each of the 49 GTEx tissues, we first identified all genes within 1 Mb of the sentinel variant, and then restricted analysis to those genes with eQTLs (“eGenes”) in that tissue (FDR < 0.05). We used the R package “coloc” (run on R version 3.4.3, coloc version 3.2.1) [93] with default parameters to run colocalization between the GWAS signal and the eQTL signal for each of these cis-eGenes, using as input those variants in the defined window, i.e., all variants present in both datasets. A colocalization posterior probability of (PP3 + PP4) > 0.8 was used to identify loci with enough colocalization power, and PP4/PP3 > 0.9 was used to define those loci that show significant colocalization [94].
Transcriptome-wide association studies (TWAS)
For our transcriptome-wide association analysis (TWAS), we integrated the results of our GWAS with eQTL summary statistics from GTEx v8. The S-PrediXcan software [95] allows us to integrate these two datasets using only summary statistics from GWAS without needing individual-level genotype data. We used the multi-ancestry lipid GWAS summary statistics [24] and harmonized them with the GTEx summary statistics. Then we performed the TWAS using the eQTL models estimated on GTEx v8 expression data. For each of the 49 GTEx tissues, we identified “significant genes” those genes with p-values more significant than an FDR threshold of 0.05.
Genes with coding variants
We determine the coding variants within 99% credible sets and the coding variants in LD > 0.8 with variants in the 99% credible sets with the credible sets as defined here [24]. We define regions for construction of the credible sets as ± 500 kb around each index variant. We used Bayes factors (BFs) for each variant from the MR-MEGA output and generated the credible sets within each region by ranking all variants by BF and calculating the number of variants required to reach a cumulative probability of at least 99%. Additionally, we used previously established gene-based associations [96] to determine whether rare coding variation in a gene were associated with blood lipid levels (p < 0.001). We labeled a gene as having coding variants if any of these criteria were met.
DEPICT
We used Data-driven Expression-Prioritized Integration for Complex Traits (DEPICT, v1 beta version rel194 for 1 KG imputed GWAS) to prioritize genes at our index variants, on the assumption that truly associated genes share functional annotations [27]. Index variants [24] with p-value < 5 x 10−8 were retained as input. We implemented the DEPICT analysis with the default settings of 500 permutations for bias adjustment and 20 replications for FDR estimation. DEPICT prioritizes genes by calculating the similarity of a given gene to genes from other associated loci across 14,461 reconstituted gene sets and estimates the nominal gene prioritization p-value and the estimated false discovery rate of each gene. The prioritized genes at FDR < 0.05 were considered significant.
PoPS
We used the PoPS method to prioritize genes in the previously reported [24] multi-ancestry index variants for all lipid traits. The PoPS method [28] is a new gene prioritization method that identifies the causal genes by integrating GWAS summary statistics with gene expression, biological pathway, and predicted protein–protein interaction data. First, as part of the PoPS analysis, we used MAGMA to compute gene association statistics (z-scores) and gene–gene correlations from GWAS summary statistics and LD information from a multi-ancestry reference panel (1000 Genomes). Next, PoPS performs marginal feature selection by using MAGMA to perform enrichment analysis for each gene feature separately. The model is fitted by generalized least squares (GLS), and MAGMA results are used to perform marginal feature selection, retaining only features that pass a nominal significance threshold (p < 0.05). Then PoPS computes a joint enrichment of all selected features simultaneously in a leave one chromosome out (LOCO) framework. The gene features employed by PoPS are listed here: https://github.com/FinucaneLab/gene_features. Finally, PoPS computes polygenic priority scores for each gene by fitting a joint model for the enrichment of all selected features. The PoPS score for a gene is independent of the GWAS data on the chromosome where the gene is located. The PoPS analysis returned scores for a total of 18,383 genes per lipid trait. We only kept the top 20% genes among all 18,383 genes. We then annotated our index variants with the nearest ENSEMBL genes in a 500-kb window (either side) and selected the highest PoPS score gene in the locus as the prioritized one.
We performed the PoPS analysis on our lipid-specific multi-ancestry meta-analysis results, using all populations from 1000G as the reference for the LD information in MAGMA. As a sensitivity step, we also repeated the same analysis using only the European population from 1000G as the reference. We observed high concordance in the top two PoPS prioritized genes from both reference panels. In detail, the same 2119 genes (89%) were prioritized as the top genes from both panels, a further 203 genes were prioritized as a top gene with one panel and as the second top with the other and only 7 genes were completely mismatched between the two reference panels.
Monogenic genes
We annotated genes known to cause Mendelian lipid disorders based on proximity with identified GWAS loci [97, 98]. GWAS index variants within ± 500 kb of the transcription start and end positions from the USCS genome browser annotations were annotated as nearby known monogenic dyslipidemia genes.
Mouse knockout lipid phenotype silver set genes
Human gene symbols (9557 unique genes) were mapped to gene identifiers (HGNC) and their corresponding mouse ortholog genes were obtained using Ensembl (www.ensembl.org). Phenotype data for single-gene knockout mouse models were obtained from the International Mouse Phenotyping Consortium (IMPC) (www.mousephenotype.org) latest data release 12.0 (www.mousephenotype.org/data/release). The knockout mouse models were primarily produced by IMPC but also include some models that have been reported from the relevant literature and were curated by Mouse Genome Informatics (MGI) data release 6.16 (www.informatics.jax.org). For each mouse model, reported phenotypes were grouped using the mammalian phenotype ontology hierarchy into broad categories relevant to lipids: growth and body weight (MP:0001259), lipid homeostasis (MP:0002118), cholesterol homeostasis (MP:0005278), and lipid metabolism (MP:0013245). This resulted in mapping of human genes to significant phenotypes in animals.
For each of the multi-ancestry lipid index variant [24], we mapped the closest gene to the knockout mouse phenotypes and curated the set to only include mouse phenotypes strictly relating to lipid metabolism. That resulted in our silver set of 740 genes with mouse lipid phenotypes (Additional file 33: Table S22).
Overlap between methods
We standardized the gene names across different methods using the R/geneSynonym package, a wrapper to gene synonym information in ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz. We also quantified how often the same gene was prioritized by multiple methods for each index variant and determined scores that ranged from 1 to 6 (S1-S6), based on the number of methods that prioritized the gene.
We integrated multiple gene prioritization methods to identify likely causal genes in the latest global lipid genetics consortium GWAS results. In total, we have implemented the 6 individual gene prioritization methods above that utilize the GWAS summary statistics from meta-analysis. Our gene prioritization methods can be placed into two broad categories, the locus-specific methods and the genome-wide methods. The locus-specific methods leverage local GWAS data by connecting the causal variants to the causal gene(s) using genomic distance, eQTLs, or protein-coding variants.
More specifically, there are four locus-specific methods that have been implemented including: (1) The closest protein-coding gene around the index variants based on the genomic distance, (2) eQTL colocalization using r COLOC package, (3) TWAS using S-PrediXcan, (4) coding variants which have been identified in 99% credible sets OR in LD > 0.8 with coding variants OR from gene-based tests (p < 0.001) of rare coding variants. For the eQTL and TWAS, we first used all the 49 GTEx tissues and then restricted to only 5 lipid-specific tissues: liver, adipose subcutaneous, adipose visceral, whole blood, and small intestine. In addition, two genome-wide methods were employed: (1) DEPICT (FDR < 0.05), (2) PoPS (Top 1 gene). It is reasonable to combine similarity-based methods with locus-based methods since they use two different sources of information.
To determine the relative performance of each prioritization method and their combined scores for lipid loci, we used 21 genes known to cause Mendelian dyslipidemias as a gold standard set (ABCA1, ABCG5, ANGPTL3, APOA5, APOB, APOE, CETP, CYP27A1, GPD1, GPIHBP1, LCAT, LDLR, LDLRAP1, LIPA, LIPC, LMF1, LPL, MTTP, PCSK9, SAR1B, SCARB1), and 740 mouse knockout genes causing lipid phenotypes as a silver standard set (Additional file 33: Table S22). We examined two metrics for each gene prioritization approach: (1) the proportion of prioritized genes in the gold/silver standard set, and (2) the proportion of correctly identified genes among all prioritized genes (Additional file 3: Figure S1). Note that out of the 2286 lipid associations, 97 fell within 500 kb of a Mendelian gene and 1280 within 500 kb of a mouse knockout gene with a lipid phenotype. We observed that the TWAS results yielded a high number of prioritized genes, but lead to a low proportion correctly identified. The TWAS approach had a much smaller proportion of genes correctly prioritized among all the prioritized genes, given it prioritized a total of 3511 genes, which was 3.5-fold greater than the other methods (~ 1000 genes). Notably, PoPS provided a similar proportion of correctly identified genes (78%) as of TWAS, while retained a high proportion of prioritized genes in the gold standard set (67%). Compared with PoPS, PoPS + (PoPS plus one of the local methods) slightly sacrificed the proportion of correctly identified genes from 78 to 71%, but improved the proportion of prioritized genes in the gold standard set from 67% to 79%. Overall, PoPS/PoPS + outperform other gene prioritization methods on both metrics for our gold (Additional file 3: Figure S1A) and silver (Additional file 3: Figure S1B) standard gene sets. We also assessed lipid-relevant tissue (liver, subcutaneous and visceral adipose, whole blood, and small intestine) expression QTLs (lipid eQTLs) and transcriptome-wide association (lipid TWAS) and found that the expression results from all tissues performed slightly better at recovering the reference gene sets compared with limiting to the lipid-relevant tissues (Additional file 3: Figure S1).
Text mining analysis
We retrieved the whole MEDLINE/PubMed titles and abstracts as of March 06, 2022, from National Library of Medicine (https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/; https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/). We then examined whether a list of genes prioritized by PoPS + and any one of the lipid-related keywords (lipid, lipids, triglyceride, triglycerides, fatty acid, cholesterol, dyslipidemias, hyperlipidemia, hypercholesteremia, diabetes, type 2 diabetes, type II diabetes, heart, cardiovascular, artery, coronary, coronary artery, coronary heart, atherosclerosis, peripheral vascular, PAD, stroke) occurred in the same abstract. We counted how many lipid-related publications that have a specific gene co-occurred with at least one lipid-related keyword. The same text mining approach was also implemented to a set of randomly selected genes from the 18,383 protein-coding genes used by the PoPS. We estimated the number of lipid-related publications we would expect to see by chance. A Mann–Whitney U test was performed to show whether there was a significant difference between the number of lipid-related publications of the PoPS + gene set and reference gene set.
Drug target mining analysis
To gain therapeutic insights from our gene prioritization results, we performed a lookup in Therapeutic Target Database (TTD) 2022 [99] (http://db.idrblab.net/ttd/). Specifically, we cross-referenced 466 unique lipid-associated genes prioritized by PoPS + (Additional file 2: Table S2) with 1563 genes corresponding to at least one drug (either under development or approved) with known clinical indication in TTD 2022. As a quality control for this lookup, we excluded all TTD entries related to drugs that were discontinued, terminated, or withdrawn from the market. The full lookup results are available in Additional file 8: Table S6.
Driver tissues for lipid levels
We performed phenotype-tissue association analysis using DESE (driver-tissue estimation by selective expression) [40]. DESE estimates the causal tissues by selective expression of phenotype-associated genes in GWAS. We used the GWAS summary statistics from the five lipid traits and the GTEx v8 normalized gene-level and transcript-level expression datasets as input. SNPs inside a gene and its ± 5 kb adjacent regions were first mapped to the gene, and then DESE ran iteratively to produce a list of driver tissues and the corresponding p-values of the associations. We used a Bonferroni-corrected significance threshold of 0.05/54 = 9.3 x 10−4.
PheWAS analysis
Construction of lipid PGSs
We had previously developed a multi-ancestry PGS for LDL-C that was demonstrated to perform well across multiple ancestry groups [24]. In a similar manner, we also generated PGS for HDL-C, nonHDL-C, TC, and triglycerides. First, multi-ancestry meta-analysis results were generated with METAL [89] after excluding individuals from the Michigan Genomics Initiative and the UK Biobank. The set of variants used to construct the PGS was limited to those that were well-imputed (R2 > 0.3) in MGI, UK Biobank, and MVP. Risk scores based on PRS-CS [100] or pruning and thresholding with Plink [101] across several r2 (0.1, 0.2), distance (250 kb, 500 kb), and p-value thresholds (5 × 10−10, 5 × 10−9, 5 × 10−8, 5 × 10−7, 5 × 10−6, 5 × 10−5, 5 × 10−4, 5 × 10−3, 0.05) were developed. For each trait, the single best score was selected based on the adjusted r2 calculated in the UK Biobank of the linear model for the lipid trait with the risk score and age, sex, batch, and PC1-4 as covariates. This corresponded to PRS-CS for HDL-C and nonHDL-C and pruning and thresholding for LDL-C (r2 = 0.1, p-value = 5 × 10−4, 500 kb), TG (r2 = 0.1, p-value = 5 × 10−3, 500 kb), and TC (r2 = 0.1, p-value = 5 × 10−4, 500 kb). The variance explained by the risk score among the UK Biobank participants was similar across traits (adjusted r2 of the full model-adjusted r2 of covariates: HDL-C = 0.13; LDL-C = 0.15; nonHDL-C = 0.14; TC = 0.14; TG = 0.10) and validated the ability of the risk score to predict genetically increased lipid levels.
PheWAS of lipid PGSs and index lipid variants in the UK Biobank and MVP
We used the European ancestry subset of individuals from the UK Biobank (408,886 samples) and the European samples from MVP (69,670 samples) to perform the PheWAS analysis.
We constructed a weighted PGS for each of the lipid traits, based on the corresponding genome-wide significant multi-ancestry index variants. We used the PheWAS package in R [102] to map ICD-10 codes from hospital records into clinically relevant phenotypes (phecodes) and to implement these association analyses, while adjusting for sex, age, 10 genetic principal components, and genotyping array (for the UK Biobank only) in each cohort. For the lipid-PGS PheWAS, each PGS was inverse normalized prior to analysis and lipid levels were corrected for statin use. The MVP samples used for the PheWAS analysis were not included in the GWAS meta-analysis [24].
Similarly, we extracted all multi-ancestry autosomal index variants for all lipid traits from the same European ancestry subset of the UK Biobank and MVP and performed a single-variant PheWAS association analysis per cohort. Additionally, we performed a single-variant PheWAS association analysis in the UK Biobank only with the sex-stratified and X chromosome index variants from the multi-ancestry analysis.
Meta-analysis of MVP and the UK Biobank PheWAS results
We combined, via meta-analysis, PheWAS lipid-specific PGS results for all intersecting phecodes and biomarkers between the UK Biobank and MVP (Europeans only) per lipid trait. We used ICD10-based phecodes and manually matched biomarkers to identify intersecting phenotypes between the two datasets. We restricted our meta-analysis to phenotypes that had at least 100 samples (total number for continuous traits or number of cases for binary traits) in each cohort. After the meta-analysis, we excluded phenotypes that had less than 500 combined samples (total number for continuous traits or number of cases for binary traits), to avoid reporting spurious results [103]. That resulted in a total of 773 phenotypes (739 phecodes and 34 biomarkers/measurements). We used both fixed and random effects model for the meta-analysis. We assessed heterogeneity using the p-value for Cochran’s q and set the level for significant heterogeneity at a Bonferroni threshold (p-value ≤ 6.5 × 10−5, to account for multiple testing of 773 phenotypes). We report the results from the fixed-effects model for the phenotypes with non-significant heterogeneity and the results from the random effects model for all others. Similarly, we meta-analyzed all index-variant PheWAS results between the UK Biobank and MVP and obtained results for 811 phenotypes and 1750 lipid multi-ancestry index variants, after excluding instances with a combined sample size < 500.
Lipid index variants with CAD, T2D, and NAFLD datasets
The GWAS meta-analysis results of CAD and T2D were acquired from MVP [62] and DIAGRAM Consortium [61], respectively. For variant rs1229984, the CAD result is from CARDIoGRAMPlusC4D meta-analysis [104], as it was not present in the MVP results. The NAFLD GWAS and meta-analysis was performed in the UK Biobank and Michigan Genomics Initiative (MGI). We determined the association of the lipid index variants with CAD, T2D, and NAFLD and aligned the alleles across all the traits to the LDL-lowering allele. We then highlighted the protective lipid coding alleles associated with CAD.
GWAS and meta-analysis of NAFLD in the UK Biobank and Michigan Genomics Initiative (MGI)
Individuals with NAFLD were identified using ICD-9 571.8 and ICD-10 K76.0. Individuals with hepatitis, liver cirrhosis, liver abscess, ascites, a liver transplant, hepatomegaly, jaundice, or with abnormal result of serum enzyme levels or a function study of the liver were excluded (exclusion phecodes 70.2, 70.3, 571.51, 571.6, 571.8, 571.81, 572, 573, 573.2, 573.3, 573.5, 573.7, 573.9) [105]. Analysis was performed using SAIGE v43.3 [106]. Analysis in the UK Biobank included white British individuals with batch, sex, birth year, and the first 4 genetic principal components as covariates. A total of 1122 cases and 399,900 controls were included in the analysis. Analysis in MGI included only European-ancestry participants with array version, sex, birth year, and the first 4 genetic principal components as covariates. A total of 2901 cases and 49,098 controls were analyzed. Meta-analysis was performed using METAL with weighting based on the effective sample size calculated as 4/((1/Ncases) + (1/Ncontrols)).
CAD/T2D colocalization analysis with lipid traits
We used R package coloc v3.2.1 [93] to perform summary statistics-based colocalization via a Bayesian approach and test whether the 5 lipid traits share common genetic causal variants with CAD or T2D. We first defined a window of ± 100 kb around each index variant [24]. Then for each window of the 10 pairs of traits, we ran colocalization with default parameters using those SNPs present in both datasets. A colocalization posterior probability of PP4 > 0.8 was used to define those loci that show significant colocalization.