### Step 1: Subsampling GWAS association statistics

#### Step 1-a: Specify sampling distribution of summary statistics

We assume the quantitative trait *Y* follows a linear model:

$$ Y=\mathbf{X}\boldsymbol{\upbeta } +\epsilon $$

(1)

where **X** *=* (*X*_{1}, …, *X*_{p}) denotes the random vector of *p* SNPs; **β** *=* (β_{1}, …, β_{p})^{T} is a *p*-dimensional vector representing fixed SNP effect sizes; *ϵ* is the error term following a normal distribution with zero mean. Let **y** and **x** = (**x**_{1}, …, **x**_{p}) denote the observed *N* × 1 phenotypic and *N* × *p* genotypic data of *N* independent individuals. For simplicity, we assume **y** and the **x**_{j}’s are centered. The summary association statistics in GWAS are obtained from marginal linear regressions. Then, for *j* = 1, …, *p*, we can denote the regression coefficients and their standard errors as follows:

$$ {\hat{\beta}}_j={\left({\mathbf{x}}_j^{\mathrm{T}}{\mathbf{x}}_j\right)}^{-1}\left({\mathbf{x}}_j^{\mathrm{T}}\mathbf{y}\right) $$

(2)

$$ \mathrm{SE}\left({\hat{\beta}}_j\right)=\sqrt{\frac{{\hat{\varepsilon_j}}^{\top}\hat{\varepsilon_j}}{\left(N-1\right){\mathbf{x}}_j^{\top }{\mathbf{x}}_j}} $$

(3)

where ϵ_{j} is the error term for the marginal linear regression of phenotype on the *j*th SNP and \( \hat{\varepsilon_j}=\mathbf{y}-{\mathbf{x}}_j{\hat{\beta}}_j \) is the observed residual from the *j*th marginal regression. If we have access to the full data set, most model-tuning approaches involve randomly sampling a subset of *N* − *n* individuals as the training set, i.e., **y**^{(tr)} and **x**^{(tr)}. Naturally, the remaining subset of *n* individuals will be the validation dataset, denoted as **y**^{(v)} and **x**^{(v)}. When only the summary statistics file based on the full dataset is provided, traditional model-tuning approaches cannot be implemented. Instead, we propose a method to generate marginal summary statistics for the training and validation datasets from summary statistics of the full dataset. By the central limit theorem, as sample size *N* → ∞, we have
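The marginal regressions in Eqs. (2)-(3) are straightforward to compute when individual-level data are available. The sketch below is ours (function and variable names are illustrative, not from the original software) and assumes centered genotype and phenotype arrays:

```python
import numpy as np

def marginal_summary_stats(x, y):
    """Per-SNP marginal regression coefficients (Eq. 2) and standard
    errors (Eq. 3) from a centered N x p genotype matrix x and a
    centered length-N phenotype vector y."""
    N = x.shape[0]
    xtx = np.sum(x * x, axis=0)                    # x_j^T x_j, one entry per SNP
    beta_hat = (x.T @ y) / xtx                     # Eq. (2)
    resid = y[:, None] - x * beta_hat              # residuals of each marginal fit
    se = np.sqrt(np.sum(resid**2, axis=0) / ((N - 1) * xtx))  # Eq. (3)
    return beta_hat, se
```

Each SNP is regressed on the phenotype separately, so the computation is vectorized across columns rather than fitting a joint model.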

$$ {\mathbf{x}}^{\mathrm{T}}\mathbf{y}\sim \mathbf{N}\left(N\mathrm{E}\left({\mathbf{X}}^{\mathrm{T}}Y\right),N\mathrm{Var}\left({\mathbf{X}}^{\mathrm{T}}Y\right)\right) $$

(4)

$$ {\mathbf{x}}^{(tr)^{\mathrm{T}}}{\mathbf{y}}^{(tr)}\sim \mathbf{N}\left(\left(N-n\right)\mathrm{E}\left({\mathbf{X}}^TY\right),\left(N-n\right)\mathrm{Var}\left({\mathbf{X}}^TY\right)\right) $$

(5)

$$ {\mathbf{x}}^{(v)^{\mathrm{T}}}{\mathbf{y}}^{(v)}\sim \mathbf{N}\left(n\mathrm{E}\left({\mathbf{X}}^TY\right),n\mathrm{Var}\left({\mathbf{X}}^TY\right)\right) $$

(6)

where **x**^{T}**y** is the observed *p* × 1 summary statistics for *N* individuals and *p* genetic markers that can be directly calculated or approximated from the full GWAS summary statistics, and **x**^{(tr)⊤}**y**^{(tr)} and **x**^{(v)⊤}**y**^{(v)} represent summary statistics for two partitions of full GWAS samples, which are the training set and validation set, respectively. In the following derivation, we use superscripts (*tr*) and (*v*) to indicate whether any summary statistics are computed from **x**^{(tr)⊤}**y**^{(tr)} or **x**^{(v)⊤}**y**^{(v)}. It can be shown that

$$ {\mathbf{x}}^{(tr)^{\mathrm{T}}}{\mathbf{y}}^{(tr)}\mid {\mathbf{x}}^{\mathrm{T}}\mathbf{y}\sim \mathbf{N}\left(\frac{\left(N-n\right)}{N}{\mathbf{x}}^{\mathrm{T}}\mathbf{y},\frac{\left(N-n\right)}{N}\boldsymbol{\Sigma} \right) $$

(7)

where · ∣ · denotes the conditional distribution and **Σ** is the observed covariance matrix of **x**^{⊤}**y** from the GWAS data. A detailed derivation of this conditional distribution is included in Additional file 1. Note that, up to this point, our framework does not rely on the assumption of linkage equilibrium.
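Once **Σ** has been estimated, drawing training-set summary statistics from the conditional normal in Eq. (7) amounts to a single Gaussian sample. A minimal sketch, assuming pruned (independent) SNPs so that **Σ** can be treated as diagonal (names are ours):

```python
import numpy as np

def subsample_xty(xty_full, sigma_diag, N, n, rng=None):
    """Draw x^(tr)T y^(tr) from the conditional normal in Eq. (7),
    treating Sigma as diagonal (pruned SNPs)."""
    rng = rng or np.random.default_rng()
    frac = (N - n) / N                             # (N - n) / N scaling in Eq. (7)
    return rng.normal(frac * xty_full, np.sqrt(frac * sigma_diag))
```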

#### Step 1-b: Estimate covariance matrix of summary statistics

To subsample summary statistics, we now estimate the covariance matrix of **x**^{⊤}**y**. Under simple scenarios where the SNPs are independent (i.e., GWAS summary statistics are pruned), **Σ** is a symmetric matrix whose diagonal and off-diagonal elements can be denoted as

$$ {\Sigma}_j={\beta}_j^2\mathrm{Var}\left({X}_j^2\right)+E\left({\epsilon}_j^2\right)E\left({X}_j^2\right) $$

(8)

$$ {\Sigma}_{ji}={\beta}_j{\beta}_iE\left({X}_j^2\right)E\left({X}_i^2\right) $$

(9)

for *j* = 1, …, *p* and *i* ≠ *j*. Here, \( E\left({\epsilon}_j^2\right) \) can be estimated by the mean squared error in marginal regressions, which can be further approximated by \( N{\left[\mathrm{SE}\left(\hat{\beta_j}\right)\right]}^2E\left({X}_j^2\right) \). In addition, each SNP’s effect size (i.e., *β*_{j}) is typically very small in GWAS, and \( E\left({X}_j^2\right) \) depends only on each SNP’s minor allele frequency (MAF), which is commonly provided in GWAS summary statistics or can be estimated from a reference panel such as the 1000 Genomes Project [44]. Taken together, **Σ** can be estimated with

$$ {\hat{\Sigma}}_j=N{\left[\mathrm{SE}\left({\hat{\beta}}_j\right){\hat{\sigma}}_j^2\right]}^2 $$

(10)

$$ {\hat{\Sigma}}_{ji}={\hat{\beta}}_j{\hat{\beta}}_i{\hat{\sigma}}_j^2{\hat{\sigma}}_i^2 $$

(11)

where \( {\hat{\sigma}}_j^2 \) is an MAF-based estimator of \( E\left({X}_j^2\right) \).
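The diagonal estimator in Eq. (10) uses only quantities reported in (or derivable from) standard GWAS summary files. The sketch below assumes genotypes are allele counts, so that under Hardy-Weinberg equilibrium \( E\left({X}_j^2\right) \) for a centered genotype equals 2·MAF·(1 − MAF); this choice of \( {\hat{\sigma}}_j^2 \) is a common convention, stated here as an assumption:

```python
import numpy as np

def estimate_sigma_diag(se, maf, N):
    """Diagonal of Sigma-hat via Eq. (10), using the Hardy-Weinberg
    variance 2*MAF*(1-MAF) as the MAF-based estimator of E(X_j^2)."""
    var_x = 2 * maf * (1 - maf)                    # sigma_j^2 under HWE
    return N * (se * var_x) ** 2                   # Eq. (10)
```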

#### Step 1-c: Partition summary statistics for training and validation sets

After generating **x**^{(tr)⊤}**y**^{(tr)} from the conditional distribution as described above, we can obtain the validation summary statistics by

$$ {\mathbf{x}}^{(v)^{\mathrm{T}}}{\mathbf{y}}^{(v)}={\mathbf{x}}^{\mathrm{T}}\mathbf{y}-{\mathbf{x}}^{(tr)^{\mathrm{T}}}{\mathbf{y}}^{(tr)} $$

(12)

Consequently, subsampled GWAS summary statistics for the training set can be estimated by

$$ {{\hat{\beta}}_j}^{(tr)}={\left[\left(N-n\right){\hat{\sigma}}_j^2\right]}^{-1}{\mathbf{x}}^{(tr)^{\mathrm{T}}}{\mathbf{y}}^{(tr)} $$

(13)

$$ \mathrm{SE}\left({\hat{\beta_j}}^{(tr)}\right)=\sqrt{\frac{N}{N-n}}\mathrm{SE}\left(\hat{\beta_j}\right) $$

(14)
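Eqs. (12)-(14) can be combined into a small routine that recovers the validation statistics by subtraction and rescales the training-set estimates. A sketch under the same pruned-SNP assumption, with illustrative names:

```python
import numpy as np

def partition_summary_stats(xty_full, xty_tr, se_full, var_x, N, n):
    """Validation statistics by subtraction (Eq. 12) plus subsampled
    training-set effect sizes and standard errors (Eqs. 13-14).
    var_x holds the MAF-based estimates of E(X_j^2)."""
    xty_v = xty_full - xty_tr                      # Eq. (12)
    beta_tr = xty_tr / ((N - n) * var_x)           # Eq. (13)
    se_tr = np.sqrt(N / (N - n)) * se_full         # Eq. (14)
    return xty_v, beta_tr, se_tr
```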

### Step 2: Evaluate model performance using GWAS summary data

#### Step 2-a: Calculate PRS prediction accuracy with summary statistics

Being able to generate summary statistics for the training and validation datasets resolves a critical issue in model tuning. However, challenges remain in evaluating PRS performance on the testing or validation set without individual-level data. Almost all the PRS approaches in the literature use a linear prediction model as follows:

$$ \hat{\mathbf{Y}}=\mathbf{Xw} $$

(15)

where **w**^{⊤} = (*w*_{1}, …, *w*_{p}) is the vector of SNP weights in the PRS. In a traditional PRS, marginal regression coefficients from GWAS are used as the weights, i.e., \( \mathbf{w}=\hat{\boldsymbol{\upbeta}} \), while in other PRS models the weights can be more sophisticated. Here, we demonstrate how to calculate *R*^{2}, a commonly used metric to quantify PRS predictive performance, from subsampled GWAS summary data, but our method can be extended to other metrics (e.g., AUC [45]) as well. *R*^{2} on the validation dataset (**y**^{(v)}, **x**^{(v)}) can be calculated as

$$ {R}^2=\frac{{\left({\sum}_{i=1}^n{y}_i^{(v)}{\hat{y}}_i^{(v)}-n\overline{{\mathbf{y}}^{(v)}}\overline{{\hat{\mathbf{y}}}^{(v)}}\right)}^2}{\sum \limits_{i=1}^n{\left({y}_i^{(v)}-\overline{{\mathbf{y}}^{(v)}}\right)}^2{\sum}_{i=1}^n{\left({\hat{y}}_i^{(v)}-\overline{{\hat{\mathbf{y}}}^{(v)}}\right)}^2} $$

(16)

where \( {\hat{\mathbf{y}}}^{(v)}={\mathbf{x}}^{(v)}\mathbf{w}\ \mathrm{and}\ \overline{{\hat{\mathbf{y}}}^{(v)}} \) is the sample mean of \( {\hat{\mathbf{y}}}^{(v)} \). If the SNPs are pruned, it can be shown that the empirical variance of \( \hat{\mathbf{Y}} \) can be approximated by

$$ \frac{1}{n}{\sum}_{i=1}^n{\left({\hat{y}}_i^{(v)}-\overline{{\hat{\mathbf{y}}}^{(v)}}\right)}^2\approx {\sum}_{j=1}^p{w}_j^2{\hat{\sigma}}_j^2 $$

(17)

Although the empirical variance of *Y* does not affect model tuning, it affects the scale of *R*^{2} and is thus critical for interpreting the results. This term can be approximated by

$$ \frac{1}{n}{\sum}_{i=1}^n{\left({y}_i^{(v)}-\overline{{\mathbf{y}}^{(v)}}\right)}^2\approx {\beta}_j^2E\left({X}_j^2\right)+E\left({\epsilon}_j^2\right),\forall j $$

(18)

Although Var(*Y*) is always greater than \( E\left({\epsilon}_j^2\right) \) for any *j*, the gap between these two is negligible in real GWAS due to the small effect size of each individual SNP. Thus, a simple estimator for Var(*Y*) can be

$$ \frac{1}{n}{\sum}_{i=1}^n{\left({y}_i^{(v)}-\overline{{\mathbf{y}}^{(v)}}\right)}^2\approx {\max}_j\left[\frac{1}{N}{\hat{\varepsilon}}_j^T{\hat{\varepsilon}}_j\right]\approx N\kern0.28em {\max}_j\left[\mathrm{SE}{\left({\hat{\beta}}_j\right)}^2{\hat{\sigma}}_j^2\right] $$

(19)

Additionally, since we assume the data are centered, the mean values in the numerator can be dropped. Taken together, *R*^{2} can be estimated as

$$ {R}^2\approx \frac{{\left(\frac{1}{n}{\sum}_{j=1}^p{w}_j{\mathbf{x}}_j^{(v)^{\mathrm{T}}}{\mathbf{y}}^{(v)}\right)}^2}{N\kern0.28em {\max}_j\kern0.28em \left[\mathrm{SE}{\left({\hat{\beta}}_j\right)}^2{\hat{\sigma}}_j^2\right]\kern0.28em {\sum}_{j=1}^p{w}_j^2{\hat{\sigma}}_j^2} $$

(20)

In practice, we use the 90% quantile of \( \frac{{\hat{\varepsilon}}_j^{\top }{\hat{\varepsilon}}_j}{N-1} \), *j* = 1, 2, …*p*, as a more robust estimator for Var(*Y*).
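Putting Eqs. (17)-(20) together, the summary-statistics *R*^{2} reduces to a few array operations. The sketch below is ours and follows the in-practice variant described above, using the 90% quantile of the per-SNP residual variances rather than the maximum in Eq. (19):

```python
import numpy as np

def summary_r2(w, xty_v, se, var_x, N, n):
    """Approximate validation R^2 (Eq. 20) from summary statistics.
    w: SNP weights; xty_v: validation x^T y; se: full-data standard
    errors; var_x: MAF-based estimates of E(X_j^2)."""
    numerator = (np.sum(w * xty_v) / n) ** 2       # squared covariance term
    var_y = np.quantile(N * se**2 * var_x, 0.9)    # robust Var(Y) estimate
    var_yhat = np.sum(w**2 * var_x)                # Eq. (17)
    return numerator / (var_y * var_yhat)
```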

#### Step 2-b: Model-tuning strategies

So far, we have introduced strategies to subsample association statistics on training and validation sets and evaluate model performance using GWAS summary statistics. Combining these two key steps, we will be able to perform model tuning using GWAS summary data. Suppose a PRS model uses GWAS marginal estimates \( \hat{\boldsymbol{\upbeta}} \) as input and generates SNP weights \( {w}_j\left(\hat{\boldsymbol{\upbeta}},\lambda \right) \) for each SNP. The goal is to find the optimal value of tuning parameter *λ* that maximizes the predictive accuracy. In the simple setting we introduced above, we will generate summary statistics for training and validation datasets. After specifying a tuning parameter *λ*, SNP weights in PRS can be trained by applying the model to the training summary statistics. Then, the prediction accuracy *R*^{2} on the validation summary statistics will be a function of *λ*. Therefore, we can select *λ* so that it maximizes model performance.

$$ \hat{\lambda}={\mathrm{argmax}}_{\lambda}\left({R}^2\left(\lambda \right)\right) $$

(21)
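For a concrete instance of Eq. (21), consider a simple *p*-value thresholding PRS where *λ* is the significance cutoff. The grid search below is a hedged sketch (assuming pruned SNPs; all names are illustrative), scoring each candidate cutoff with the summary-statistics *R*^{2}:

```python
import numpy as np

def tune_pvalue_cutoff(beta_tr, pval_tr, xty_v, se, var_x, N, n, cutoffs):
    """Select the p-value cutoff (tuning parameter lambda, Eq. 21)
    maximizing the summary-statistics R^2 on the validation partition."""
    var_y = np.quantile(N * se**2 * var_x, 0.9)    # robust Var(Y) estimate
    best_lam, best_r2 = None, -np.inf
    for lam in cutoffs:
        w = np.where(pval_tr <= lam, beta_tr, 0.0) # zero out non-significant SNPs
        var_yhat = np.sum(w**2 * var_x)
        if var_yhat == 0:                          # no SNP passes this cutoff
            continue
        r2 = (np.sum(w * xty_v) / n) ** 2 / (var_y * var_yhat)
        if r2 > best_r2:
            best_lam, best_r2 = lam, r2
    return best_lam, best_r2
```

The same loop structure applies when comparing distinct models (Eq. 22): only the rule producing **w** changes between candidates.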

More generally, if the goal is to compare different models, both the summary statistics subsampling and performance evaluation steps remain unchanged. In this case, *R*^{2} will be a function of the model and we can choose the best-performing model by optimizing *R*^{2}

$$ \hat{m}={\mathrm{argmax}}_{m=1,2,\dots, M}\left({R}^2\left(\mathrm{model}\ m\right)\right) $$

(22)

Furthermore, this framework can be used to conduct various types of model-tuning procedures. What we have laid out above is the simple training-validation data split approach. If one is interested in applying repeated learning, they can simply repeat the procedure (i.e., resampling training/validation datasets and evaluating *R*^{2} on the validation set) *K* times. The average *R*^{2} across *K* folds can be used to select the best model. Similarly, if *K*-fold cross-validation needs to be implemented, we can first independently simulate *K* − 1 sets of training subsample **x**^{(tr, k)⊤}**y**^{(tr, k)} with sample size \( \frac{N}{K} \). Then, we can obtain the *K*^{th} subsample by

$$ {\mathbf{x}}^{{\left( tr,K\right)}^{\mathrm{T}}}{\mathbf{y}}^{\left( tr,K\right)}={\mathbf{x}}^{\mathrm{T}}\mathbf{y}-{\sum}_{k=1}^{K-1}{\mathbf{x}}^{{\left( tr,k\right)}^{\mathrm{T}}}{\mathbf{y}}^{\left( tr,k\right)} $$

(23)

Finally, we rotate each of the *K* subsamples as the validation sample, use the rest as the training sample, and select the best model based on the average *R*^{2} across folds. Taken together, PUMAS is a general framework that can perform a variety of model-tuning tasks.
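The *K*-fold subsampling step can be sketched as follows: *K* − 1 folds of size *N*/*K* are drawn independently, and the last fold is obtained by subtraction per Eq. (23). This assumes pruned SNPs (diagonal **Σ**), and names are illustrative:

```python
import numpy as np

def kfold_subsamples(xty_full, sigma_diag, N, K, rng=None):
    """Generate K summary-statistic folds of size N/K. The first K-1
    folds are independent Gaussian draws (Eq. 7 with n = N/K); the
    K-th fold is recovered by subtraction (Eq. 23)."""
    rng = rng or np.random.default_rng()
    folds = [rng.normal(xty_full / K, np.sqrt(sigma_diag / K))
             for _ in range(K - 1)]
    folds.append(xty_full - np.sum(folds, axis=0))  # Eq. (23)
    return folds
```

By construction, the folds sum exactly to the full-data statistics, mirroring a true partition of the samples.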

### Simulation settings

We conducted simulations using real genotype data from WTCCC. The WTCCC dataset contains 15,918 samples with 393,273 genotyped SNPs across the whole genome. We removed SNPs that are not available in 1000 Genomes Project Phase III European samples from the simulations since 1000 Genomes data were used as the LD reference panel. We excluded individuals with genotype missingness rate higher than 0.01 and removed SNPs that satisfy any of the following conditions: (i) having minor allele frequency less than 0.01, (ii) having missing rate higher than 0.01 among all subjects, and (iii) having *p*-value from the Hardy-Weinberg equilibrium test lower than 1e−6. After quality control, 322,235 variants and 15,567 samples remained in the analyses. We first simulated effect sizes *β*_{j} from a normal distribution \( N\left(0,\frac{h^2}{Mp}\right) \) where *h*^{2} is the heritability (fixed at 0.5), *M* is the total number of SNPs, and *p* is the proportion of causal variants. We chose two values of *p* (i.e., 0.001 and 0.1) to represent sparser and more polygenic genetic architectures. Following the LDAK paper [46], we then replaced the raw effect sizes by \( {\beta}_j^{\ast }={\beta}_j{\left[2{p}_j\left(1-{p}_j\right)\right]}^{\alpha } \) where *α* = − 2, − 1, 0, 1, 2 to better evaluate the performance of PUMAS under various genetic architectures. Thus, in total, we conducted simulations under 10 different settings. In each setting, causal SNPs were randomly selected across the genome and the effect sizes of non-causal SNPs were set to be 0. Using these simulated effect sizes, we generated continuous trait values in GCTA [47]. We then performed marginal linear regressions and obtained GWAS summary statistics using PLINK [48]. These summary statistics were used as input for PUMAS.
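The effect-size generation described above (sparse causal set, normal effects, MAF-dependent rescaling) can be sketched as follows; this is our illustration of the scheme, not the original simulation code:

```python
import numpy as np

def simulate_effects(maf, h2=0.5, prop_causal=0.1, alpha=0.0, rng=None):
    """Simulate SNP effects: beta_j ~ N(0, h2/(M*p)) for randomly chosen
    causal SNPs (zero otherwise), then apply the LDAK-style scaling
    beta_j* = beta_j * [2 p_j (1 - p_j)]^alpha."""
    rng = rng or np.random.default_rng()
    M = maf.shape[0]
    causal = rng.choice(M, size=int(prop_causal * M), replace=False)
    beta = np.zeros(M)
    beta[causal] = rng.normal(0.0, np.sqrt(h2 / (M * prop_causal)),
                              size=causal.size)
    return beta * (2 * maf * (1 - maf)) ** alpha   # MAF-dependent rescaling
```

Setting `alpha=0` recovers the unscaled architecture; negative values up-weight rare variants, matching the *α* grid used in the simulations.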

We compared PUMAS with repeated learning (i.e., Monte Carlo cross-validation). Instead of partitioning *N* samples into *k* non-overlapping folds, which is what a *k*-fold cross-validation does, repeated learning randomly selects \( \frac{N\left(k-1\right)}{k} \) samples to form the training dataset and evaluates the model performance on the remaining \( \frac{N}{k} \) samples. This procedure is then repeated *k* times to obtain an averaged prediction accuracy (i.e., *R*^{2}) across *k* folds of analysis for each prediction model. Here, we implemented a 4-fold repeated learning approach. In each fold, we randomly selected 75% of WTCCC samples (i.e., \( \frac{3}{4}\times 15567\approx 11675 \)) to perform GWAS and evaluated the predictive performance of PRS on the remaining 25% of individuals (i.e., 15567 − 11675 = 3892). We repeated this process 4 times and reported the average predictive *R*^{2} for each PRS model with different *p*-value cutoffs. For comparison, we implemented PUMAS in a similar fashion. Based on the GWAS summary data computed from all WTCCC samples, PUMAS generates a set of summary statistics for 75% of samples as the training data for PRS and evaluates the predictive performance of PRS on the corresponding validation summary statistics (i.e., summary statistics for the remaining 25% of samples). We repeated this procedure 4 times and reported the average *R*^{2} for each PRS model. In this simulation, we considered PRS models with *p*-value cutoffs ranging from 1e−5 to 1.

To show that PUMAS can be applied to binary traits, we conducted additional simulations under settings described above. For each simulation setting, we kept the same SNP effects, heritability, and proportion of causal variants. However, instead of generating quantitative phenotypes, we simulated binary phenotypes using a population prevalence of 50% and case-control ratio of 1:1 in GCTA [47]. We performed marginal logistic regressions in PLINK to obtain GWAS summary statistics. Like the simulation for quantitative traits, we compared the performance between PUMAS and 4-fold repeated learning using individual-level data for binary traits. We used the area under the ROC curve (AUC) as the metric to assess prediction accuracy in repeated learning.

Finally, to investigate the accuracy of our approximation for the covariance of **x**^{T}**y**, we performed additional simulations in WTCCC. We implemented a total of 8 settings with different sample sizes, heritability, and proportion of causal variants. We simulated quantitative trait values and performed marginal linear regression to obtain GWAS summary statistics. Then, we calculated and compared diagonal elements Σ_{j} and \( {\hat{\Sigma}}_j \), and off-diagonal elements Σ_{ji} and \( {\hat{\Sigma}}_{ji} \), respectively. Note that the theoretical values of elements in the covariance matrix are shown in Eqs. (8–9), while the approximated values are shown in Eqs. (10–11). We assessed the approximation of both diagonal and off-diagonal elements. We found a very high correlation between \( {\hat{\Sigma}}_j \) and Σ_{j} (greater than 0.99 in all of 8 simulation settings) and negligible off-diagonal elements (Additional file 1: Fig. S15; Additional file 13: Table. S12), which justifies our approximation procedure.

### GWAS data

GWAS summary statistics on EA were shared with us by Dr. Aysu Okbay. In this dataset, samples from Add Health, HRS, 23andMe, and Wisconsin Longitudinal Study were excluded (*N* = 742,903; number of SNPs = 10,824,042). Imputed genotype data for Add Health (*N* = 9974; number of SNPs = 9,664,514) and HRS (*N* = 15,567; number of SNPs = 18,144,468) were accessed through dbGaP (phs001367 and phs000428) and the EA phenotypes were defined following the SSGAC GWAS [19]. Both Add Health and HRS genotype data were imputed using 1000 Genomes Project data as reference. The comprehensive data cleaning procedure is documented on the Add Health website at https://addhealth.cpc.unc.edu/wp-content/uploads/docs/user_guides/AH_GWAS_QC.pdf. For HRS, SNPs with imputation quality score < 0.8 were removed from the dataset. After matching samples with accessible phenotypic information, 4775 Add Health samples and 10,214 HRS samples with self-reported European ancestry were used to validate EA PRS. The IGAP 2013 AD GWAS (*N* = 54,162; number of SNPs = 7,055,881) dataset was accessed through the IGAP website (http://web.pasteur-lille.fr/en/recherche/u744/igap/igap_download.php). GWAS summary statistics (number of SNPs = 9,037,014) for 7050 ADGC samples can be accessed through the NIAGADS database (NG00076). Predictive performance on ADGC samples was assessed using summary statistics-based *R*^{2}. Following a recent paper, we constructed the AD-proxy phenotype in the UK Biobank based on each sample’s AD status, AD history of parents, whether parents are still alive, and parental age (or age at death) [24]. Imputed genotype data (*N* = 355,583; number of SNPs = 9,605,099) were accessed through the UKB. The UKB genotype data were imputed to the Haplotype Reference Consortium reference. We removed samples that are not of European ancestry and SNPs with minor allele frequency < 0.01 or imputation *R*^{2} < 0.9. In addition, we applied PUMAS to benchmark PRS performance on 65 GWASs.
Details on these studies are summarized in Additional files 7, 8, and 9: Table. S6-S8.

For all PUMAS analyses throughout the paper, we first extracted SNPs intersected with the 1000 Genomes Phase III data of European ancestry [44]. Then, we pruned GWAS summary statistics using an LD window size of 100 variants, a step size of 5 variants to shift windows, and a pairwise LD (i.e., *r*^{2}) threshold of 0.1 in PLINK [48]. We used samples of European ancestry in the 1000 Genomes Project Phase III as the reference panel to estimate LD. For GWASs that do not report MAF in the summary statistics, we estimated MAF from 1000 Genomes Project European samples. In addition, for the analysis of EA and AD, we also intersected GWAS summary statistics with SNPs in the validation set before LD-pruning. A *p*-value grid was used to search for the optimal *p*-value cutoff (Additional file 4: Table. S3).

### Identifying neuroimaging traits associated with AD

GWAS results for imaging traits were accessed from https://med.sites.unc.edu/bigs2/data/. The IGAP 2019 AD GWAS summary statistics were accessed via NIAGADS (NG00075). We constructed the AD-proxy phenotype in the UK Biobank following a recent paper [24]. To avoid sample overlap between GWASs, we identified individuals in the UK Biobank who had undergone brain MRI scans and removed them from the AD-proxy GWAS. All individuals who have visited at least one of the UKB imaging centers were removed from the analysis. 318,773 independent samples remained after removing imaging samples from the data. We performed GWAS with the first 12 principal components [49], age, sex, genotyping array, and assessment center as covariates. To test whether our approach for removing overlapping samples between the neuroimaging GWAS and the AD-proxy analysis was effective, we used cross-trait LD score regression to estimate the intercepts between 211 imaging traits and the AD-proxy GWAS (Additional file 1: Fig. S13) [50]. BADGERS software was used to conduct the imaging trait PRS-AD association analysis [30]. Meta-analysis was conducted using the sample size-weighted approach [51].