 Method
 Open access
 Published:
A fast nonparametric test of association for multiple traits
Genome Biology volume 24, Article number: 230 (2023)
Abstract
The increasing availability of multidimensional phenotypic data in large cohorts of genotyped individuals requires efficient methods to identify genetic effects on multiple traits. Permutational multivariate analysis of variance (PERMANOVA) offers a powerful nonparametric approach. However, it relies on permutations to assess significance, which hinders the analysis of large datasets. Here, we derive the limiting null distribution of the PERMANOVA test statistic, providing a framework for the fast computation of asymptotic p values. Our asymptotic test presents controlled type I error and high power, often outperforming parametric approaches. We illustrate its applicability in the context of QTL mapping and GWAS.
Background
In the past years, the availability of deep phenotype data in large cohorts of genotyped individuals has dramatically increased [1]. In addition, recent technological developments have enabled genomewide profiling of a variety of molecular traits [2, 3]. The vast majority of genomewide association studies (GWAS) and molecular quantitative trait loci (QTL) mapping analyses test for association with genetic variants using a single trait at a time, even when multiple traits have been measured [2, 4,5,6,7]. Multivariate methods, however, present several advantages over the standard univariate strategy. Many phenotypes share genetic and environmental influences, which may be reflected in their correlation structure [8]. Hence, multivariate analysis offers increased statistical power to detect genetic associations [9, 10]. The multivariate setting is particularly suitable to investigate pleiotropy [11] and can be advantageous even when only a small subset of the traits is affected by the genotype [12]. Additionally, it provides a unique framework to study the molecular mechanisms through which genetic variants act, allowing joint analyses across multiple phenotypic layers [13]. When the same trait is measured in different conditions (e.g., across tissues or environments) or over time, multivariate analyses can be used to characterize contextdependent genetic effects [14, 15]. As multivariate analysis requires fewer individual tests, the multiple testing burden is also reduced.
Several methods have been proposed for multitrait genetic association studies. MVPLINK (canonical correlation analysis) [16] and MultiPhen (ordinal regression) [17] model the genotype as a dependent variable and test for association with a weighted sum of phenotypes. However, this approach hinders the assessment of complex designs (e.g., interactions between the genotype and other covariates). In contrast, multivariate regression methods treat the phenotypes as dependent variables, offering more flexible modeling. This is the case of multivariate analysis of variance (MANOVA) or multivariate linear mixed models (LMMs) [18]. The latter have become very popular, as they can naturally account for genetic relatedness among individuals. However, fitting multivariate LMMs is computationally intensive and can be very slow in large datasets, despite continuous implementation enhancements [18, 19]. A major drawback of most multivariate regression methods is the assumption that the model errors follow a multivariate normal distribution, which often does not hold (consider, for instance, multivariate proportions, or count data). In practice, normalization of each individual trait, e.g., via rankbased inverse normal transformations, is commonly applied. This, however, does not guarantee multivariate normality [20], and may reduce statistical power compared to modeling the untransformed traits [21]. Asymptotic multivariate normality is also required by approaches that leverage univariate summary statistics, such as MTAR [22] or MOSTest [23]. In addition, estimating trait correlations from summary statistics is not straightforward and can be affected by trait heritability or linkage disequilibrium patterns, among other factors, that need to be accounted for to avoid potential biases [22]. While several Bayesian approaches have been proposed for multitrait association studies [12, 24], they often suffer from a large computational cost.
Altogether, this highlights the need of a fast nonparametric method suitable for multitrait GWAS and QTL mapping. Anderson introduced a distancebased approach, known as permutational multivariate analysis of variance (PERMANOVA), that extends the univariate factorial linear model to multiple dimensions without requiring a known probability distribution of the dependent variables [25]. In PERMANOVA, the hypothesis of noeffects is tested by a permutation procedure based on a pseudoF statistic, where the sums of squares in ANOVA are replaced by sums of interdistances between observations. We previously employed PERMANOVA to study alternative splicing (AS) across different human populations, using the Hellinger distance between vectors of relative transcript abundances as dissimilarity metric [26]. However, while this approach remains conceptually appealing, the increased size and complexity of current datasets requires a precision for p value calculation that turns the permutational procedure infeasible. Only for oneway fixed designs, Anderson and Robinson showed the asymptotic distribution of the numerator of the test statistic [27]. We used this result, implemented in sQTLseekeR, to identify genetic effects on alternative splicing across human populations [28] and tissues [29]. Nevertheless, p value calculation in more complex designs still relied on permutations.
To overcome this limitation, here we obtain the asymptotic distribution of the PERMANOVA test statistic for complex designs in the Euclidean distance case. Our result also holds after any transformation of the data that preserves the independence of the observations. We develop a procedure to compute asymptotic p values that we implement in the MANTA (Multivariate Asymptotic Nonparametric Test of Association) R package (available at https://github.com/dgarrimar/manta and in the Comprehensive R Archive Network (CRAN) at https://cran.rproject.org/package=manta). We also provide a containerized Nextflow pipeline for multivariate GWAS using MANTA, named mvgwasnf (available at https://github.com/dgarrimar/mvgwasnf). In a typical GWAS setting (e.g., 5 traits measured in 10,000 individuals tested vs the genotype plus ten additional covariates), our asymptotic approach achieves over a 10\(^\text {6}\)fold reduction in the running time per test compared to computing 10\(^\text {4}\) permutations, while producing p values down to 10\(^{\text {14}}\). Through a comprehensive set of simulations, we evaluate the type I error, statistical power, and running time of the asymptotic test, in comparison with MANOVA and multivariate LMMs. Overall, our method presents controlled type I error and high power to detect genetic associations, comparable to the parametric approaches, and outperforming them in several settings (particularly with correlated traits when genetic effects and traittotrait correlations are concordant). It is also computationally more efficient than these methods, especially compared with LMMs.
We illustrate our approach in a number of realcase scenarios. First, we carry out the first populationbiased splicing QTL (pbsQTL) mapping analysis across multiple human tissues, using the GenotypeTissue Expression (GTEx) project cohort, and identify genetic variants that affect alternative splicing differently in distinct ethnic groups. We show, in particular, the case of a pbsQTL that is potentially a risk factor for kallikrein5 (KLK5)related diseases in European American, but not in African American individuals. Second, we perform a GWAS of the magnetic resonance imaging (MRI)derived volumes of hippocampal subfields in the UK Biobank cohort. This is the largest GWAS of hippocampal subfield volumes carried out so far. We identify 41 genomewide significant loci, dwarfing previous collections based on univariate approaches [4]. Most of these loci have not been previously related to the hippocampus. However, many of them have been associated with traits such as intellectual ability or neurodegenerative and neuropsychiatric disorders. This highlights the importance of analyzing MRIderived endophenotypes to gain insight into the morphological alterations that mediate the impact of genetic variation in brain complex traits and diseases [30].
Results
The asymptotic null distribution of the PERMANOVA test statistic
Consider a vector of q response variables and a vector of p predictor variables, both observed on n individuals. For example, response variables could be different brain measurements, and predictors may include the patient’s age, disease status, and the genotype at a given SNP. We define \({\textbf {Y}}\) as the \(n \times q\) matrix of response variables, and \({\textbf {X}}\) as the \(n \times p\) matrix of predictors. The multivariate multiple linear regression (MMR) regresses \({\textbf {Y}}\) on \({\textbf {X}}\) as:
where \(\varvec{\beta }\) is a \(p \times q\) matrix of parameters, and \({\textbf {E}}\) a \(n \times q\) matrix of random errors. Usually, model (1) includes several predictors of different types (e.g., main factors: genotype; interactions: genotype \(\times\) disease status; continuous covariates: age). Anderson’s approach [25], also known as PERMANOVA, can be used to assess the effect of these predictors by imposing conditions on a subset of parameters \(\varvec{\beta }_0\) (i.e., \(\varvec{\beta }_0={\textbf {0}}\)). For instance, when assessing the effect of the genotype in our example above, \(\varvec{\beta }_0\) would be the subset of parameters corresponding to the genotype term. From the general PERMANOVA pseudoF statistic, the following expression can be derived (see the “Methods” section and Additional file 1: Supplementary Note 1):
where tr denotes the trace, and \({\textbf {H}} = {\textbf {X}}({\textbf {X}}^\text {T}{} {\textbf {X}})^{1}{} {\textbf {X}}^\text {T}\) is the usual projection matrix (or hat matrix) in linear models. Analogously, \({\textbf {H}}_0\) is the projection matrix corresponding to \({\textbf {X}}_0\), which is \({\textbf {X}}\) dropping the columns associated with the subset of parameters \(\varvec{\beta }_0\) in the hypothesis.
We show that the numerator in (2) has the following limiting distribution (see Lemma 2 in Additional file 1: Supplementary Note 1):
therefore, the trace converges to a weighted sum of q independent chisquare variables with \(pp_0=\text {rank}({\textbf {H}})\text {rank}({\textbf {H}}_0)\) degrees of freedom. The coefficients \(\lambda _j\) are the eigenvalues of \(\varvec{\Sigma }\), estimated in practice by the eigendecomposition of the sample covariance matrix of the residuals of the full model. The denominator in (2) converges in probability to the constant \(\sum_{j=1}^q \lambda _j\) (see Additional file 1: Supplementary Note 1).
In order to obtain the asymptotic distribution of (2), we apply some conditions to model (1). Mainly, the rows of \({\textbf {E}}\) must be independent with identical \(q \times q\) covariance matrix \(\varvec{\Sigma }\) (see Theorem 1 in Additional file 1: Supplementary Note 1). Mutual independence of the rows of \({\textbf {E}}\) is guaranteed if the observations are independently sampled. Hence, any transformation of \({\textbf {Y}}\) that preserves the independence of the observations results in the same type of limiting distribution (see Additional file 1: Supplementary Note 1). This includes, among others, the logarithm or the square root. In the specific case of proportion data, a square root transformation in (2) is equivalent to using the Hellinger distance between observations in Anderson’s pseudoF general expression (see (9) in Additional file 1: Supplementary Note 1). Notably, our approach does not make any assumption about the distribution model in E, and presents several practical advantages over standard PERMANOVA (see below).
The cumulative distribution function (CDF) of the asymptotic distribution in (3) can be used to compute p values for the predictors in custom MMR models. Although the CDF of a weighted sum of chisquare random variables does not have a closed form, it can be approximated with high accuracy, and several algorithms have been developed for this purpose [31,32,33] (Additional file 1: Fig. S1). We have implemented this approach in the MANTA R package, available at https://github.com/dgarrimar/manta and in the Comprehensive R Archive Network (CRAN) at https://cran.rproject.org/package=manta (see the “Methods” section).
Comparison between asymptotic and permutational approaches
To study the properties of our asymptotic approach, we first considered a model with two categorical predictors (A, B) plus their interaction (AB). We simulated \(n =\) 300 observations of \(q =\) 3 multivariate normal response variables, with B under the alternative hypothesis, \(H_1\) (see the “Methods” section). We used our result in (3) to obtain the asymptotic null distribution of the test statistic in (2) for the interaction term (\(\tilde{\text {F}}_{AB}\)), and compared it with the distribution of permuted \(\tilde{\text {F}}^{\pi }_{AB}\) values (equivalently, we could have studied A instead of AB). Permutations were restricted to occur within the levels of factor B (see the “Methods” section). As shown in Fig. 1a, the distribution that we derived matches exactly the one obtained by permutations, even in the upper tail region. We also provide empirical evidence that our theoretical result holds regardless of the distribution of Y (Additional file 1: Fig. S2). This contrasts to the result obtained by McArtor et al. [34], who suggested a different asymptotic distribution for the pseudoF statistic. Their proposal departs substantially from the distribution obtained by permutations (Additional file 1: Fig. S3), unless all the model parameters are assumed to be zero in the null hypothesis (i.e., the omnibus test).
Provided that our result in (3) only holds asymptotically, the accuracy of the proposed asymptotic p values will depend on the total sample size (n). In addition, other factors such as the number of response variables (q) or their correlation structure may also have an impact. Thus, we evaluated the relative difference between asymptotic and permutationbased p values across a broad range of values of n and q (Fig. 1b). We considered independent response variables. Other situations with increasing degrees of correlation (in absolute value) between the response variables would be equivalent to scenarios with fewer independent response variables. When the n/q ratio is small, asymptotic and permutationbased p values may differ substantially. As the n/q ratio increases, this bias converges to 0. Overall, when the asymptotic null does not hold, asymptotic p values tend to be conservative. As a result, they can still be used without inflated type I error rates (Additional file 1: Fig. S4). When the asymptotic null holds (note that this occurs even for relatively small values of the n/q ratio, i.e., \(n/q \approx\) 20), we observe an almost perfect correlation between asymptotic and permutationbased p values (Pearson’s \(r>\) 0.9999), even for small p values (Fig. 1c). Analogous results were obtained with different distributions of Y and data transformations (Additional file 1: Fig. S5).
Overall, our asymptotic approach produces highly accurate p values, while dramatically reducing the computational cost with respect to the permutation test (Fig. 1d). In addition, it allows to compute p values down to a precision limit of 10\(^{14}\), difficult to achieve through permutations when the number of observations is relatively large (the smallest p value that can be achieved via permutation is \(\frac{\text {1}}{P+\text {1}}\), where P is the number of permutations). It is also particularly advantageous in complex designs where selecting the permutation schema that ensures an exact test is not trivial, or even not possible (see the “Methods” section).
Simulation study
We designed a comprehensive set of simulations to evaluate the performance of our method in the context of phenotypegenotype association testing. We obtained genotype data from the 1000 Genomes Project (1KGP, Phase 3) [35], and generated cohorts with different structures (unrelated individuals, population stratification, relatedness, actual 1KGP structure) [36]. We simulate phenotypes as the sum of the contribution of the effect of a genetic variant, population structure and residual noise, using an additive model. We consider a variety of scenarios, modifying the number of traits, their distribution and correlation structure, the minimum allele frequency (MAF) of the genetic variant, the structure of the population, and the fraction of phenotypic variance explained by each term in the model (see the “Methods” section). For a given scenario, we obtain 10,000 phenotypegenetic variant pairs and evaluate type I error and power of asymptotic PERMANOVA (as implemented in MANTA), and compare with those of MANOVA and multivariate LMMs (as implemented in GEMMA [18]). To ensure highly parallel, portable, and reproducible simulations, we embedded our simulation framework in a containerized Nextflow [37] computational workflow, available at https://github.com/dgarrimar/mantasim.
Type I error
We simulated phenotypes from a cohort of unrelated individuals using model (8) in the “Methods” section, without a causal variant effect term. We set the fraction of variance explained by the genetic background, \(h^2_g\), to 0.2 (see the “Methods” section). MANTA displays controlled type I error across different numbers of traits (Fig. 2a). As the test does not assume any probabilistic distribution for the residuals, this does not affect type I error (Additional file 1: Fig. S6). However, like its parametric counterparts, MANTA is sensitive to heterogeneity in multivariate dispersion [25]. We observed that type I error rates can be substantially inflated when there are differences in the trait variances between genotype groups (Fig. 2c, upper panel). This is particularly problematic for lower MAFs. MANOVA and GEMMA display an analogous behavior, with slightly larger type I errors. However, in contrast to the parametric approaches, MANTA controls well type I error when the trait correlation structure differs between genotype groups [25] (Fig. 2c, lower panel). Heterogeneity in variances or correlations may arise, for instance, in the presence of epistasis or gene by environment effects [38]. In addition, several studies have reported genetic variants associated with differences in phenotype variances (i.e., variance QTLs) [38, 39]. We also simulated a scenario with strong outliers (see the “Methods” section), known to be challenging for type I error control when testing low MAF variants with methods assuming normally distributed traits [17, 20]. Although outliers affect all the methods, across different numbers of traits, MANTA is more robust to its presence than MANOVA or GEMMA, specially when many traits are analyzed and traittotrait correlations are relatively large (Additional file 1: Fig. S7).
In the context of GWAS studies, population stratification (systematic differences in ancestry) and relatedness (either family structure or cryptic relatedness among individuals with no known family relationships) are known to result in inflated type I errors [40]. When simulating phenotypes from a synthetic cohort formed by a small number of distinct populations, or from the actual 1KGP dataset (see the “Methods” section), MANTA generates calibrated p values, across different values of \(h^2_g\), by including the top principal components (PC) of the genotype matrix as covariates in the model (Fig. 2b and Additional file 1: Fig. S8a). MANOVA behaves similarly. When simulating strongly related individuals, however, only LMMbased approaches such as GEMMA yield calibrated p values (Additional file 1: Fig. S8a). Still, when the fraction of variance explained by relatedness is relatively low, our approach (MANTA+PC) performs better than parametric MANOVA+PC (Additional file 1: Fig. S8b).
Given our previous work with multivariate proportions in alternative splicing analyses [26, 28, 29], we were also interested in evaluating the performance of MANTA, MANOVA and GEMMA in this scenario. With this aim, we dropped the genetic background term in model (8), simulating trait vectors as points in the simplex (see the “Methods” section). However, while MANTA can deal directly with linearly dependent phenotypes, this is not the case of MANOVA, which requires an additional transformation. Similarly, numerical optimization in GEMMA tends to fail with this type of traits, which deviate substantially from multivariate normality (the software produces an error and does not generate any result; this is particularly striking when very few traits are analyzed, e.g., with \(q =\) 3, only 83 out of 10,000 tests completed correctly), or results in markedly deflated p values. After transforming the traits (square root), the three methods control well type I error, although GEMMA still displays slightly deflated p values (Additional file 1: Fig. S9).
Power
To evaluate power, we simulated phenotypes from the actual 1KGP dataset, given that MANTA+PC, MANOVA+PC and GEMMA displayed controlled type I error in this scenario. Figure 2d displays power curves in different settings. First, we considered uncorrelated traits with unit variance, and causal variants affecting any number of traits and explaining, on average, a fraction of the phenotypic variance (\(h^2_v\)) ranging from 0.1 to 1%. Overall, the three methods behave similarly across trait distributions and numbers of traits, and display large power to detect small differences (e.g., a causal variant explaining 0.5% of the variance of 1 out of 5 uncorrelated normal traits can be detected with power 0.95 after Bonferroni correction, see Additional file 1: Figs. S10 and S11). In addition, we simulated different trait variances, correlation structures and types of effects. As previously reported, the power of multivariate methods is highly sensitive to the specific combination of genetic effects and phenotype correlations [10]. When the causal variant has similar effects across traits, as traittotrait correlation increases, MANTA outperforms MANOVA and GEMMA (Additional file 1: Fig. S12). Although the scenario of concordant genetic effects and trait correlations seems more likely [10], we also simulated different effects across traits (see the “Methods” section). In this case, MANOVA and GEMMA present higher power than MANTA with increasing traittotrait correlations (Additional file 1: Fig. S13).
We also observed power differences regarding the variances of the traits, with MANTA providing higher power when the trait variances are relatively similar, and MANOVA and GEMMA displaying the opposite behavior (Additional file 1: Fig. S14). For each method, we barely observed differences in power when increasing the proportion of phenotypic variance explained by the genetic background at low traittotrait correlations. However, a drop in power was observed (more marked for MANOVA and GEMMA) when both the proportion of phenotypic variance explained by the genetic background and the traittotrait correlations are large (Additional file 1: Fig. S15). In the context of multivariate proportion traits, when simulating genetic effects using a different approach (vectors of traits are points in the simplex with different locations, depending on the genotype at the causal variant and on a parameter \(\Delta\), see the “Methods” section), MANTA performs well with both untransformed and transformed (square root) traits. As stated above, MANOVA and GEMMA present problems with linearly dependent traits. Nevertheless, the three methods show similar power with transformed traits (Additional file 1: Fig. S16).
Overall, MANTA presents high power to detect genetic effects on multiple traits, comparable to MANOVA and GEMMA, and outperforms them in several scenarios, particularly when genetic effects and traittotrait correlations are concordant.
Running time and RAM usage
We evaluated the running time and RAM usage of MANTA with respect to the sample size (up to n = 100,000 individuals) and the number of traits analyzed (up to q = 20 traits), in comparison to MANOVA and GEMMA (see the “Methods” section). Results are summarized in Fig. 2e and Additional file 1: Fig. S17. The running time and RAM usage of MANTA scale approximately linearly with the number of individuals analyzed, and sublinearly with the number of responses. MANOVA displays a similar behavior, with slightly increased running times for small sample sizes. However, GEMMA’s running time and RAM usage increase dramatically with both n and q. As a result, its application needs to be restricted to the analysis of modest numbers of phenotypes in cohorts of moderate size.
Real data applications
Populationbiased splicing QTL mapping
Alternative splicing (AS), the process through which multiple transcript isoforms are produced from a single gene, is subject to a tight regulation, often tissue, celltype, or conditionspecific [41]. However, while populationbiased genetic effects on gene expression have been explored [2], little is known about the populationdependent genetic control of AS across human tissues. AS can be treated as a multivariate phenotype, built from the relative abundances of a gene’s transcript isoforms [29]. Thus, here we apply asymptotic PERMANOVA to identify cis populationbiased splicing quantitative trait loci (pbsQTLs), across a panel of human tissues. The complete pbsQTL catalog generated is available in Zenodo (https://doi.org/10.5281/zenodo.8349415) [42].
We obtained transcript expression data from the V8 release of the GTEx Project [2], corresponding to up to 54 tissues from 838 deceased donors, and matched genotypes. We restricted our analyses to 818 individuals of European or African ancestry (referred to as European American (EA) and African American (AA), respectively), and to 31 tissues with more than 20 individuals of each ancestry (see the “Methods” section and Additional file 1: Fig. S18). For cis pbsQTLs mapping, we developed sQTLseekeR2.int, a slightly modified version of sQTLseekeR2, which implements asymptotic PERMANOVA to assess the significance of the effect on AS of the interaction between the donor’s genotype and ancestry (see the “Methods” section).
There are several factors, such as differences in linkage disequilibrium (LD) or allele frequencies between populations, that could bias pbsQTL discovery. However, we did not find evidence of genetic variants with large differences in MAF or genes in regions with very different LD structure between EA and AA groups, being more (or less) likely identified as pbsQTLs and pbsGenes, respectively (see the “Methods” section). Overall, the distribution of MAF and LD differences between EA and AA groups was not significantly different for pbsQTLs/pbsGenes and for other tested variants/genes (Additional file 1: Fig. S19).
At a 0.1 false discovery rate (FDR), we identified a total of 7719 cis pbsQTLs affecting 938 genes (i.e., pbsGenes: 913 proteincoding genes and 25 long intergenic noncoding RNAs, lincRNAs) (Additional file 1: Table S1). Among them, only one (NQO2) is also a pbeGene in GTEx [2]. These numbers are substantially smaller than the ones reported for regular cis sQTLs in the same dataset [29]. This could be explained by the more stringent preprocessing applied here (e.g., at least 10 individuals per level of the interaction are required, see the “Methods” section), which resulted in a smaller number of variantgene pairs tested (see Additional file 1: Table S1), as well as by the larger statistical power required to detect significant interactions [43]. However, the major driver of this difference is probably the pronounced sample size imbalance between EA and AA groups, which substantially reduces statistical power to identify pbsQTLs, regardless of the group where the sQTL effect is present (Additional file 1: Fig. S20). Our observations are consistent with the numbers reported by the GTEx Consortium on regular and populationbiased eQTLs [2].
As expected, the ratio between the number of genes with pbsQTLs and the number of tested genes grows with the tissue sample size (Spearman’s \(\rho =\) 0.89, Additional file 1: Fig. S21). Skin (sunexposed) and skeletal muscle present the largest values of this ratio. Both tissues are known to have structural and functional differences between EA and AA individuals [44, 45]. In addition, skin (not sunexposed), displays a comparable—or even larger—value of this ratio with respect to other tissues with larger sample sizes (e.g., lung, thyroid, nerve). We observed that pbsQTLs are enriched, with respect to matched non pbsQTLs (see the “Methods” section), in functional annotations related to AS, such as splice sites or RNAbinding protein (RBP) binding sites, as well as in GWAS hits and transcription factor binding sites (Additional file 1: Fig. S22a). pbsQTLs are also closer to splice sites than non pbsQTLs (twosided Wilcoxon RankSum test p value < 10\(^{16}\), Additional file 1: Fig. S22b). Altogether, this suggests that our pbsQTLs are indeed capturing the biology underlying AS.
Among the pbsQTLs identified, we found rs2739412, a pbsQTL for the Kallikrein related peptidase 5 (KLK5) gene in skin (both sunexposed and not sunexposed). rs2739412 affects the relative abundances of the three main AS isoforms of KLK5 in EA individuals, but not in AA individuals. In EA individuals, the abundance of isoform KLK5203 increases with the number of copies of the C allele, while isoforms KLK5201 and KLK5202 display the opposite behavior (Fig. 3a). The three isoforms differ in the structure of the 5′ untranslated region (5′ UTR), with KLK5203 having a shorter 5′ UTR than KLK5201 and KLK5203 (Fig. 3b). We specifically analyzed the reads mapping in this region across genotype groups and ancestries, which confirmed our results (Fig. 3c, Additional file 1: Table S2).
KLK5 is a serine protease that plays a central role in normal stratum corneum shedding. High KLK5 activity has been related to pathological desquamation in Netherton syndrome and atopic dermatitis [47]. Given its function in extracellular matrix degradation, it has also been proposed as a candidate biomarker for epithelialcell carcinomas [48, 49]. Specifically, higher expression of KLK5 isoforms with short 5′ UTRs, including KLK5203, has been reported in ovarian cancer with respect to normal ovarian cells [48, 50]. Since the 5′ UTR is an important region for posttranscriptional regulation, and its sequence is a key determinant of translation efficiency [51], we investigated the impact of isoform usage on the levels of the KLK5 protein. Using a convolutional neural network model (https://optimus5.cs.washington.edu/MRL), trained on the results of a massively parallel reporter assay [52], we predicted the mean ribosomal load (MRL) of the 5′ UTR of KLK5 isoforms. We found that isoform KLK5203 presents almost a 60% larger predicted MRL than KLK5201 and KLK5202 (6.42 vs 4.10 and 3.99, respectively), suggesting that its preferential usage could lead to higher levels of the KLK5 protein.
In GTEx V8 skin tissues, rs2739412 is not identified as an eQTL for KLK5, but it is an sQTL for intron I3 (chr19:50,952,668–50,952,747) according to LeafCutter + FastQTL [2], with the C allele associated with a decreased intronexcision ratio (Additional file 1: Fig. S23). This is consistent with our results, given that this intron is present in the 5’ UTR of isoforms KLK5201 and KLK5202, but absent from isoform KLK5203. However, the LeafCutter + FastQTL pipeline did not identify other introns as significant, such as I1 (chr19:50,952,668–50,952,950), which displays clear differences between genotype groups in EA individuals (Fig. 3c, Additional file 1: Table S2). In addition, the differences in I3 between the heterozygous and CChomozygous individuals at rs2739412 are obscured (Additional file 1: Fig. S23), likely by the effect of pooling EA and AA individuals, ignoring the effect of ethnicity. This highlights the increased power of our multivariate approach over univariate strategies, as well as the value of the analysis of contextdependent genetic effects on AS.
We show additional examples of pbsQTLs identified by our approach in Additional file 1: Fig. S24, affecting the lincRNA SNHG8 and the proteincoding gene RPUSD4. Regular eQTLs and sQTLs have been identified for both genes in multiple tissues [2, 29], but no populationbiased effects were reported.
GWAS of hippocampal subfield volumes
The hippocampus is a critical structure for memory, spatial navigation and cognition [53], which has been related to several major brain disorders, including Alzheimer’s disease [54] and schizophrenia [55]. Multiple GWAS have identified genetic variants associated with whole hippocampal volume [56, 57]. However, the hippocampus is a heterogeneous structure, with different subregions that carry out distinct functions [53] and may be differentially affected by disease [58]. While a recent large GWAS (\(n =\) 21,297) reported genetic associations with individual subfield volumes [4], this study analyzed each subfield separately, likely resulting in reduced statistical power. In contrast, here we apply asymptotic PERMANOVA (as implemented in MANTA) to identify genetic variants affecting jointly the volumes of hippocampal subfields in the UK Biobank cohort (\(n =\) 41,414). To the best of our knowledge, this is the largest GWAS of hippocampal subfield volumes performed to date. Summary statistics are available in Zenodo (https://doi.org/10.5281/zenodo.8349443) [59].
Briefly, we first obtained magnetic resonance imaging (MRI)derived volumes of 12 hippocampal subfields (from anterior to posterior, approximately: parasubiculum, presubiculum, subiculum, cornu ammonis (CA) 1, CA 2/3, CA 4, granule cell layer of the dentate gyrus (DG), hippocampusamygdala transition area (HATA), fimbria, molecular layer of the DG, hippocampal fissure and hippocampal tail) and genotype data from the UK Biobank, corresponding to 41,414 individuals (we summed the corresponding volumes in left and right brain hemispheres, given that they are highly correlated, Additional file 1: Fig. S25). Then, we used MANTA, embedded in a Nextflow pipeline for multivariate GWAS that we named mvgwasnf (available at https://github.com/dgarrimar/mvgwasnf), to test for association between hippocampal subfield volumes and a total of 8,197,132 genetic variants (see the “Methods” section).
Our GWAS identified 41 genomewide significant loci associated with hippocampal subfield volumes (Fig. 4, Additional file 1: Table S3). Overlap with the GWAS Catalog revealed that, out of them, only 10 (24%) have been previously related to hippocampal volumes. Thirty (73%) have been associated with other MRIderived brain endophenotypes, and 20 (49%) with cognition, intellectual ability or neurodegenerative and neuropsychiatric disorders. 8 (20%) have not been associated with brainrelated traits before. Remarkably, we replicated 8 out of 10 loci associated with individual hippocampal subfields previously identified using a univariate approach in [4]. For most of these loci, the affected subfields were identical in both studies. However, in some cases, our multivariate approach identified associations with subfields that were not captured by the univariate strategy (Additional file 1: Fig. S26).
Overall, the Experimental Factor Ontology (EFO) terms which correspond to the most commonly associated traits with the SNPs in the identified loci are related to brain morphology and cognition (Additional file 1: Fig. S27). In addition, GO enrichment of genes across the 41 loci revealed functions related to neuronal development and differentiation, as well as processes such as locomotion and exploratory behavior (Additional file 1: Fig. S28). The former observation has also been reported in GWAS of cortical and subcortical brain measurements [23], while the latter was recently described for genes prioritized from loci associated with the volume of the whole hippocampus and the hippocampal tail [4]. We also analyzed our results jointly with summary statistics from GWAS of Alzheimer’s disease [60] and schizophrenia [61], two diseases for which the hippocampus is known to be relevant [54, 55]. In both cases, we observed stronger associations with the disease among the SNPs affecting hippocampal volumes (Additional file 1: Fig. S29). This highlights the importance of analyzing multidimensional MRIderived endophenotypes to gain insight into complex brain disorders [30].
As an example of the identified loci, Fig. 4b displays the chr3:190,591,315–190,836,742 locus, which contains the GMNC gene and its upstream region. GMNC, encoding the Geminin coiledcoil domaincontaining protein, plays an important role in the generation of neural stem cells, which are responsible for proliferation and neurogenesis in the adult brain [62]. We show the effect of rs9877502 (chr3:190,669,518, G/A), a SNP located in this locus, across several hippocampal subfields (Fig. 4c). In particular, the A allele is associated with increased CA13 volumes and decreased volumes of other subfields, including subiculum and presubiculum. rs9877502 has not been previously related to hippocampal volumes, but it is associated with Alzheimer’s disease risk, tangle pathology and cognitive decline [63]. Another example is the chr16:70,658,224–70,692,574 locus. The most associated SNPs overlap the IL34 gene, which encodes interleukin 34 (IL34). This cytokine acts as a cell typespecific ligand for the colony stimulating factor 1 receptor. In the brain, IL34 is key for the maintenance and differentiation of the microglia [64]. Indeed, modulating IL34 levels has been proposed as a selective approach to control microglial proliferation in neurodegenerative disorders [65]. IL34 seems also to play a relevant role in the clearance of amyloid \(\beta\)protein, a major hallmark of Alzheimer’s disease [66]. An intronic variant of IL34 (rs78927322, chr16:70,636,538 C/G, not tested in our analysis due to low MAF), located 22 Kb upstream from the locus that we defined, had been previously associated with hippocampal volume in Alzheimer’s disease patients (Alzheimer Disease Neuroimaging Initiative (ADNI) cohort) [67]. Additional file 1: Fig. S30 shows the regional plot corresponding to the IL34 locus, as well as the effects of the lead SNP rs68049363 (chr16:70,662,480, C/G), mainly affecting CA1 volume. In addition to the identification of genetic associations, as we used age as an independent variable in our MANTA model, we further recapitulated the wellknown effect of aging [54] across hippocampal subfields (p value for age 1 \(\cdot\) 10\(^{14}\), Additional file 1: Fig. S31).
Finally, to complement our comprehensive benchmark based on simulations, we aimed at comparing MANTA with multivariate LMMs (as implemented in GEMMA) in the context of real data. However, running GEMMA in the complete hippocampal subfield dataset (41,414 individuals, 12 traits, over 8 million genetic variants) would be computationally infeasible. Hence, we downsampled the dataset to 10,000 individuals, and performed GWAS with the two methods. Both approaches identified three genomewide significant loci, two of them in common (Additional file 1: Fig. 32a and Table S4). The three loci identified by MANTA (but not the one identified only by GEMMA) are also significant at genomewide level in the analysis of the complete dataset. Overlap with the GWAS Catalog shows that all the loci, except for the one identified only by GEMMA, are enriched in variants associated with hippocampal volumes and other brainrelated traits (Additional file 1: Fig. 32b). In particular, the locus on chromosome 5 found by MANTA replicates one of the loci identified in a recent large univariate GWAS of hippocampal subfields [4], despite using only half of their sample size. Overall, despite small differences, the results obtained by MANTA and GEMMA are comparable. This is consistent with our simulation results, which show that both methods present similar power in the majority of scenarios, although in some cases the specific combination of genetic effects and traittotrait correlations may result in power differences. However, the computation time required by the two methods is very different (from a few hours in the case of MANTA to almost 6 days in the case of GEMMA).
Discussion
In this work we obtain the limiting distribution of the PERMANOVA test statistic under the null hypothesis, for complex designs and Euclidean distances. Our result also holds for relatively small values of the ratio between the sample size and the number of dependent variables, and after any data transformation that preserves the independence of the observations. We provide an efficient approach to compute asymptotic p values for the association between any quantitative multivariate response and a set of predictors of interest that we have implemented in the MANTA R package (available at https://github.com/dgarrimar/manta and in the Comprehensive R Archive Network (CRAN) at https://cran.rproject.org/package=manta). MANTA produces highly accurate p values, down to a precision limit of 10\(^{14}\), while dramatically reducing the running time with respect to the permutation test. This also avoids having to select the appropriate permutation schema, which may not be straightforward [68]. Our comprehensive simulation study in the context of genotypephenotype association testing, and our analyses using real datasets, demonstrate that asymptotic PERMANOVA is a valuable nonparametric alternative to identify genetic effects on multiple traits in the context of GWAS and QTL mapping.
In extensive simulations, we show that the asymptotic test yields calibrated p values independently of the number and distribution of the traits of interest. In contrast to parametric approaches (i.e., MANOVA, multivariate LMMs), the test is robust to differences in the trait correlation structure between genotype groups, and displays smaller type I error rates in the presence of outliers and heterogeneity in trait variances. In addition, it presents large power to identify genetic associations, comparable to the parametric tests, and outperforms them in several scenarios. Particularly, our test seems more powerful when genetic effects and trait correlations are concordant. Regarding empirical running time and RAM usage as a function of the number of observations and traits analyzed, our method behaves similarly to MANOVA, and performs orders of magnitude better than multivariate LMMs. Among the three methods evaluated, only asymptotic PERMANOVA can deal with untransformed linearly dependent phenotypes, such as proportions. We also highlight the value of providing the community with a reproducible and portable simulation pipeline to evaluate multivariate methods in the context of phenotypegenotype association studies (available at https://github.com/dgarrimar/mantasim).
To illustrate the versatility and power of asymptotic PERMANOVA, we have employed it specifically to identify genetic associations with intrinsically multivariate phenotypic traits. First, we used it to generate the first catalog of populationbiased cis genetic effects on alternative splicing across human tissues. In particular, we identified 7719 pbsQTLs in the GTEx V8 cohort, mainly in tissues with known differences between individuals of European and African ancestry (i.e., skin [44] or muscle [45]), but also in others such as thyroid or blood. There are multiple reasons, other than true different genetic effects, why an association could be observed in one population but not in another, such as different LD structure, different allele frequencies, different power due to unbalanced sample sizes, etc. However, we have specifically looked at these factors, and they do not seem to have a significant impact. In addition, as with populationbiased eQTLs, identifying these associations with GTEx sample sizes is challenging [2], especially given the large sample size imbalance between ancestry groups. However, in a scenario with highly correlated traits that depart substantially from normality (i.e., relative transcript abundances), our multivariate, nonparametric method offers increased power. Despite its modest size, the generated catalog can help to gain insights into the populationspecific regulation of alternative splicing, both in health and disease contexts, as we show for the KLK5 gene. In this case, we have identified a genetic variant that is associated with the alternative usage of KLK5 isoforms in European Americans, but not in African Americans. Since this alternative usage may be linked to disease risk [48, 50], the responsible allele would be a risk factor only in individuals of European ancestry. This is an example of how the European bias of genetic association studies could lead to inaccurate risk assessment in nonEuropean populations, and strongly argues for increasing the diversity of such studies, by including individuals of underrepresented ancestries [69].
Second, we employed our asymptotic approach to perform the largest GWAS of hippocampal subfield volumes to date (UK Biobank cohort, \(n =\) 41,414). Due to the highly shared genetic architecture of brainrelated traits and disorders, this is a paradigmatic scenario where multivariate approaches are better tailored to capture underlying biology than the conventional univariate strategy [23]. Indeed, we identified 41 (31 novel) genomewide significant loci, over four times more than the largest previous study on individual subfields [4]. Subsequent functional enrichment revealed a large overlap with loci associated with a variety of brainrelated traits, gene sets with functions in neurodevelopment and links to disease (schizophrenia, Alzheimer’s disease). We also identified some candidate genes, such as GMNC or IL34, with key roles in brain pathophysiology. Overall, these results contribute to understand the mechanisms through which genetic variants impact complex organismal traits, via their effects on molecular and higher order endophenotypes, such as organ morphology.
Beyond the specific casestudies above, a wide variety of traits in Biology are intrinsically multivariate (e.g., size and connectivity of brain regions, cardiometabolic traits, facial and allometric measurements, neuropsychiatric symptoms, composition of the gut microbiota, gene networks, singlecell multiomics, etc.) and—often—not normally distributed. Hence, these traits are wellsuited for the analysis with asymptotic PERMANOVA. Consider also the measurements obtained from wearable devices [70] or the information that can be automatically extracted using deep learning techniques, such as features obtained from histological images via convolutional autoencoder networks [71]. In addition, fast multitrait interaction tests offer the opportunity to investigate (pleiotropic) contextdependent genetic effects and epistasis [72] genomewide. Although in this work we have focused on testing the association between genetic variants and biological traits, note that our approach, as implemented in MANTA, can be applied to any set of predictor and (quantitative) response variables of interest, expanding its usage to multiple fields (Biology, Chemistry, Economics, Epidemiology, Psychology, Physics, etc.).
Nevertheless, our method also presents some limitations. First, while it does not make any assumption on the distribution of the traits, it still requires homoscedasticity (homogeneity of trait dispersions) and independence between the observations. Both are common assumptions in most linear modeling strategies, although they can be relaxed in generalized linear models or mixed models. However, these require either defining a priori the variance structure, which can be particularly difficult in large and complex biological datasets, or inferring it from the actual data, which is often highly inefficient. Our assumption of independence can be violated in the presence of population stratification or genetic relatedness between individuals [40]. Although we can account for the former by including the top genotype PCs as covariates in the model, the latter can only be corrected via mixed models. Nevertheless, as we show here, multivariate LMMs can have prohibitive running times even when analyzing few traits in mediumsized cohorts. In addition, asymptotic PERMANOVA seems more robust than MANOVA to this assumption, and genetic relatedness can be reliably inferred [73], being straightforward to discard related individuals prior to GWAS. Recently, PERMANOVA was modified to incorporate information of genetic relatedness in a mixed model manner [74]. Evaluating whether our asymptotic result can be applied in this context would be a potential avenue for future research. A second limitation is that asymptotic PERMANOVA, as with most multivariate methods, does not provide an estimate of the effect size that can be directly employed in downstream analyses, unlike univariate approaches. Finally, our statistical framework is currently limited to Euclidean distances, which may be too restrictive for some applications.
Conclusions
We propose asymptotic PERMANOVA as a fast, powerful exploratory method, that can be followed up by more detailed analyses to further characterize the relationship between predictors and dependent variables. We provide an efficient, userfriendly implementation in the MANTA R package, and a containerized Nextflow pipeline for highly parallel, portable, and reproducible multivariate GWAS analyses (mvgwasnf). As the size, the multidimensionality and the overall complexity of biological data continues to grow, our approach will enable nonparametric, multivariate analyses of millions of individuals in reasonable running times.
Methods
The PERMANOVA test statistic
Following the notation of the “Results” section, the \(n \times q\) matrix of response variables, \({\textbf {Y}} =(y_{ij})\) collects n independent observations of a vector of q random variables, and the \(n \times p\) matrix \({\textbf {X}}\), the values of p predictor variables. Anderson proposed a geometric, permutationbased method (permutational multivariate analysis of variance or PERMANOVA) [25] in order to study the effects of \({\textbf {X}}\). This approach uses a \(n \times n\) suitable distance matrix between the n individuals based on the \({\textbf {Y}}\) outcomes, allowing the computation of a pseudoF statistic. If the Euclidean distance is used, some properties can be studied in the context of the standard multivariate multiple linear regression (MMR) (see Additional file 1: Supplementary Note 1). The aim of MMR is to regress \({\textbf {Y}}\) on \({\textbf {X}}\) following the model in (1) and generalizes some of the multiple regression results (that is, when \(q=\) 1). For instance, the ordinary least squares (OLS) estimation of the \(\hat{\varvec{\beta }}\) parameters is \(\hat{\varvec{\beta }} = ({\textbf {X}}^\text {T}{} {\textbf {X}})^{1}{} {\textbf {X}}^\text {T} {\textbf {Y}}\), provided that \({\textbf {X}}\) has full rank. \(\hat{\varvec{\beta }}\) is the solution of the q simultaneous multiple linear regressions on each column of \({\textbf {Y}}\). Each column in \(\hat{\varvec{\beta }}\) corresponds exactly to the individual multiple regression of the associated column in \({\textbf {Y}}\).
In the Euclidean distance case, if the null hypothesis of interest is \(\varvec{\beta } = {\textbf {0}}\) (all the parameters of every predictor are null, i.e., the omnibus test), the PERMANOVA and MMR test statistics are equivalent with expression:
where \({\textbf {H}}\) is the usual projection matrix in linear models and tr denotes the trace. Expression (2) in the “Results” section shows the test statistic used when testing hypotheses on a subset of parameters.
Null distribution of the test statistic under permutation
The empirical null distribution of the PERMANOVA test statistic (\(\tilde{\text {F}}\)) can be characterized using permutations, that is, by recomputing \(\tilde{\text {F}}\) after random shuffling of the data. Then, p values are obtained by comparing the observed value of \(\tilde{\text {F}}\) to the distribution of permuted \(\tilde{\text {F}}^{\pi }\) values. The only assumption of the permutation test is that the observations are exchangeable under the null hypothesis (\(H_0\)). In complex designs, however, it is unclear how to ensure this in order to obtain an exact test (i.e., a test with a type I error rate exactly equal to the significance level selected a priori) [68].
In the case of a model with two main factors (i.e., A and B) and an interaction term (i.e., AB), observations are exchangeable between the different levels of A and B only under the global null hypothesis. However, in the presence of main effects (A or B under the alternative hypothesis, \(H_1\)) observations are exchangeable only within levels of other main factors. For example, if B is under \(H_1\), an exact permutation test for A that controls for the effect of B requires permutations to be restricted to the levels of B. In this scenario, unrestricted permutation of raw data yields an approximate test. See [68] for a detailed discussion. Notably, there is no exact test for the interaction term controlling for the effect of both main factors, as in this case the only possible value of the permuted test statistic is the one obtained on the original data.
Computation of asymptotic p values and MANTA R package
As we describe in this work, the null distribution of the test statistic converges to a weighted sum of independent chisquare variables (see the “Results” section). To compute asymptotic p values, we can rely on its cumulative distribution function (CDF). Although such distribution does not have a closed form, it can be approximated with high accuracy, and several approaches are available. We focused on three of these algorithms: Imhof [31], Davies [32] and Farebrother [33], as implemented in the CompQuadForm R package [75]. While the first two rely on the numerical inversion of the characteristic function, the third takes advantage of the fact that the CDF can be expressed as an infinite series of central chisquare distributions [75].
To compare their performance, we simulated uniform sets of weights (\(\lambda _j \sim U(a = 0, b = 1)\), with \(j \in \{1, \ldots , q\}\)), considering different values of q and degrees of freedom for the chisquare distribution. Then, for a range of values of the test statistic we evaluated the obtained p values and the computation time. Note that any set of weights can be scaled to obtain values in the interval [0,1], and that scaling both the weights and the test statistic results in identical asymptotic p values. The typical behavior of each algorithm is shown in Additional file 1: Fig. S1.
Overall, we found almost identical p values between the three methods down to a precision of \(10^{10}\). However, while Farebrother p values decreased monotonically with the value of the test statistic, down to the precision limit (\(\approx 10^{14}\)), Imhof and Davies p values below \(10^{10}\) displayed an erratic behavior, with values \(\le 0\). In addition, regarding speed, Farebrother outperformed Imhof and Davies in the majority of scenarios. Hence, we selected the Farebrother method for p value calculation. Only when \(\lambda _j/\sum_{j=1}^q\lambda _j \approx 0\), for one or more j in \(\{1, \ldots , q\}\), this approach displayed longer running times, especially for large values of the test statistic. To solve this problem, we dropped the weights for which \(\lambda _j/\sum_{j=1}^q\lambda _j < t\). We tried several values of t, and found that \(t = 10^{3}\) provides a good balance between speed and accuracy.
We have implemented the asymptotic PERMANOVA test in the MANTA R package, available at https://github.com/dgarrimar/manta and in the Comprehensive R Archive Network (CRAN) at https://cran.rproject.org/package=manta. MANTA enables asymptotic p value calculation for the predictors in userdefined MMR models, using the Farebrother method. It allows to select different types of sums of squares (i.e., I, II, or III), as well as logarithm and square root data transformations.
Simulations to compare asymptotic and permutation tests
Model
We considered the following MMR model, in which the response variables are regressed on two categorical predictors (i.e., factors A and B) and their interaction (AB):
where \(\varvec{y}_{kli}\) is the qdimension vector corresponding to the k, l, i observation, \(\varvec{\mu }\) the vector of means, \(\varvec{\alpha }_k\) the q vector of parameters (one component per response variable) associated with level k of factor A and, similarly, \(\varvec{\beta }_l\) the vector of parameters of level l of factor B, \(\varvec{\alpha \beta }_{kl}\) the vector corresponding to level k, l of the interaction AB, and \(\varvec{\epsilon }_{kli}\) the vector of random errors. We selected 2 and 3 levels for A and B, respectively, in a completely crossed, balanced design.
Data generation
Observations of the vector of response variables (\(\varvec{y}\), a row of \({\textbf {Y}}\)) were generated by random sampling from a given multivariate distribution. We considered several distributions, varying the total sample size (n) and the number of response variables (q). We simulated both the null hypothesis (\(H_0\)) of no association between predictors and responses and the alternative hypothesis (\(H_1\)) of factor B associated with all the responses. To illustrate that transformations of \({\textbf {Y}}\) that preserve the independence of the observations result in the same limiting distribution, in some scenarios we applied square root and logarithm transformations.

Multivariate normal. We considered first this scenario, as it is assumed in many multivariate linear modeling approaches. Under \(H_0\), the vector of responses was simulated as \(\varvec{y} \sim \mathcal {N}({\textbf {0}}, {\textbf {I}}_q)\), where \({\textbf {I}}_q\) is the \(q \times q\) identity matrix. Under \(H_1\), we generated \(\varvec{y}\) in the first and second levels of B with means 1 and −1, respectively, ensuring that \(E(\varvec{y}) = {\textbf {0}}\).

Vectors of proportions. Our interest in this scenario is related to our previous work with multivariate proportion data for the study of alternative splicing [26, 28, 29], and corresponds to the generation of points in the \(q1\) simplex. Here, \(\varvec{y} \sim \mathcal {S}(\varvec{p}, \sigma _g)\), where \(\varvec{p}\) is a given point in the simplex and \(\sigma _g\) is the standard deviation of the generator model. We obtained \(\varvec{p}\) so that \(p_1 = \frac{L}{(q+L1)}\) and \(p_j = \frac{1}{(q+L1)}, \forall j \in \{2, \ldots , q\}\), with \(L \in \{1,2, \ldots , 10\}\). Note that \(L = 1\) corresponds to the center of the simplex, while \(L> 1\) to locations that range from the center of the simplex to one of its vertices, \(\varvec{e}_{\varvec{1}} = (1, 0, \ldots , 0)\). Unless stated otherwise, we set \(L = 1\). To generate observations in the \(q1\) simplex with certain variability (given by \(\sigma _g\)) around \(\varvec{p}\), ensuring that \(E(\varvec{y}) = \varvec{p}\), we implemented an approach that performs small random displacements from \(\varvec{p}\) towards the simplex vertices (see Additional file 1: Supplementary Note 2). Under \(H_1\), we generated observations of the responses in the first level of factor B as \(\varvec{y} \sim S(\varvec{p}_\Delta , \sigma _g)\), where \(\varvec{p}_{\Delta }\) is obtained from \(\varvec{p}\) advancing along the geodesic that joins \(\varvec{p}\) with the simplex vertex \(\varvec{e}_1 = (1, 0 \ldots , 0)\). This displacement depends on a parameter \(\Delta\) (see Additional file 1: Supplementary Note 2). Analogously, in the second level of B, we obtained \(\varvec{y} \sim S(\varvec{p}_{\Delta }, \sigma _g)\) to ensure that \(E(\varvec{y}) = \varvec{p}\). Once \({\textbf {Y}}\) was obtained, we applied a square root transformation. This is equivalent to using the Hellinger distance on the untransformed data, as pointed out in the “Results” section.

Gaussian copula with uniform marginals. In this scenario, \(\varvec{y} \sim \mathcal {C}(\varvec{R})\), where \(\varvec{R}\) is the correlation matrix of \(\varvec{y}\). We set \(\varvec{R}\) to \({\textbf {I}}_q\). We used the normalCopula function from the copula R package [76], with uniform \(U(a = 0, b = 1)\) marginals, to generate \({\textbf {Y}}\), which was eventually centered and scaled. We also simulated observations of the responses under \(H_1\) by adding (subtracting) 1 to the rows of \({\textbf {Y}}\) corresponding to the first (second) level of factor B.

Multinomial. In this scenario, \(\varvec{y} \sim \mathcal {M}(N, \varvec{p})\), where N and \(\varvec{p}\) are the number of trials and the vector of event probabilities, respectively. We set N to 1000 and simulated \(\varvec{p}\) as in the multivariate proportion scenario (see above). Under \(H_1\), we simulated observations of the response variables in the first level of B as \(\varvec{y} \sim \mathcal {M}(N, \varvec{p}_\Delta )\), where \(\varvec{p}_{\Delta }\) is obtained from \(\varvec{p}\) as in the multivariate proportion scenario (see Additional file 1: Supplementary Note 2). Likewise, in the second level of B, \(\varvec{y} \sim \mathcal {M}(N, \varvec{p}_{\Delta })\) to ensure that \(E(\varvec{y}) = N\varvec{p}\). Once \({\textbf {Y}}\) was obtained, we applied a logarithm transformation.
Evaluation of running time
To compare the running time of asymptotic vs standard PERMANOVA, we simulated multivariate normal data with increasing size (from 1000 to 10,000 observations, \(q =\) 3) following model (5), under \(H_0\) (see above). We measured the running time to perform a single test, repeated 5 times with different input data. In the case of standard PERMANOVA, we considered 10\(^3\), 10\(^4\), and 10\(^5\) permutations. All running times were measured on a single core of an Intel Xeon Platinum 8160 CPU (2.10GHz).
Simulations in the context of genotypephenotype association studies
Model
We employed the following MMR model, in which multiple phenotypes are regressed on a set of continuous covariates, in addition to the genetic variant of interest:
where \(\varvec{y}_i\) is the vector of q traits measured in the ith individual, \(\varvec{\mu }\) the vector of intercepts, \(\varvec{c}_i\) the vector of covariates corresponding to the ith individual, \(\varvec{W}\) the \(k \times q\) matrix of covariate effects on the q traits, \(x_i\) the genotype (scalar, i.e., 0, 1, or 2) of variant X in this individual, \(\varvec{\beta }\) the vector of genetic effects of X on each of the q traits, and \(\varvec{\epsilon }_i\) the vector of random errors.
We further considered a second model, without additional covariates, which incorporates a random term to account for population structure:
where \(\varvec{u}_i\) is the vector of random effects due to population structure.
Here, we refer to population structure, not only as largescale, systematic differences in ancestry, but also as other forms of relatedness between individuals. Note that the former may be also accounted for by including the top k principal components of the genotype matrix as covariates in model (6) [40].
Data generation
Genotypes and population structure
We obtained genotype data from the 1000 Genomes Project (1KGP, Phase 3). We considered 8,046,946 biallelic SNPs and short indels with minimum allele frequency (MAF) \(\ge\) 0.05, measured in 2504 individuals. In addition to the real 1KGP dataset, we simulated cohorts of \(n =\) 1000 individuals with different population structures (unrelated individuals, population stratification, relatedness) as described in [36]. In short, we first assigned randomly A ancestors from the real 1KGP dataset to each new individual. Then, we simulated the individual’s genotype as a mosaic of blocks of 1000 variants, each coming from one of the ancestors selected at random. Genetic relatedness in the simulated cohort depends on the number of ancestors A (e.g., \(A = 2\) for highly related individuals, \(A = 10\) for approximately unrelated individuals). Population stratification can also be simulated by sampling the ancestors of an individual from the same subpopulation. To simulate unrelated and related individuals we set \(A = 10\) and \(A = 2\), respectively, with ancestors sampled from all European populations (CEU, FIN, GBR, IBS, TSI). To simulate population stratification, we set \(A = 10\), with each individual’s ancestors sampled from the same European population.
For running time and RAM usage evaluation (see below), we simulated cohorts mimicking the 1KGP data structure (\(A = 10\), each individual’s ancestors sampled from the same subpopulation, considering all subpopulations available), with sample sizes in the range 1000−100,000. To study the impact of heteroscedasticity and outliers on type I error as a function of MAF (see below), we simulated biallelic SNPs with a given MAF under a binomial model, with \(x_i \sim B(N, p)\), where \(N = 2\) and \(p =\) MAF. We considered MAFs \(\in \{0.2, 0.1, 0.05, 0.01, 0.005\}\).
Phenotypes
We simulated different numbers (q) of phenotypes, as the sum of the contribution of the effect of a causal variant, population structure and residual noise, following an additive model analogous to (7):
equivalently, in matrix form:
where Y is the \(n \times q\) matrix of phenotype values, \(\varvec{x}\) the vector of genotypes at variant X for n individuals, \(\varvec{\beta }\) the vector of genotype effects on the q traits, \({\textbf {U}}\) the \(n \times q\) matrix of random effects due to population structure, and E the \(n \times q\) matrix of random errors.
Random errors were simulated by random sampling from a multivariate distribution:

Multivariate normal. Here, \(\varvec{\epsilon } \sim \mathcal {N}({\textbf {0}}, \varvec{\Sigma })\). We simulated structured covariance matrices as follows. First, we set all pairwise trait correlations to r, where \(r \in \{0, 0.2, 0.5, 0.8\}\). Then, we obtain \(\varvec{\Sigma } = \varvec{V}^\frac{1}{2}\varvec{RV}^\frac{1}{2}\), where \(\varvec{R}\) is the trait correlation matrix and \(\varvec{V}\) a \(q \times q\) diagonal matrix containing the trait variances. We considered either unit variances or equally spaced variances in the range \(\{1, \ldots , \sigma ^2_{max} \}\), with \(\sigma ^2_{max} > 1\). We also generated random covariance matrices as \(\varvec{\Sigma } = {\textbf {AA}}^T\), where \({\textbf {A}}\) is a \(q \times q\) matrix with elements \(a_{jk} \sim \mathcal {N}(0, 1)\). Note that we obtained \(\varvec{\Sigma }\) so that it is positive definite. In this scenario, we simulated heteroscedasticity by setting covariance matrices to \(\varvec{\Sigma }\), \(\frac{\tau + 1}{2}\varvec{\Sigma }\) and \(\tau \varvec{\Sigma }\), respectively, depending on whether the genotype at the variant is 0, 1, or 2. We further simulated differences in the correlation structure (but not in the trait variances) between genotype groups, by scaling covariances accordingly to obtain correlation matrices \(\varvec{R}\), \(\frac{2}{\tau+1} \varvec{R}\), and \(\frac{1}{\tau} \varvec{R}\), respectively, depending on whether the genotype at the variant is 0, 1, or 2. We set \(\tau\) \(\in \{2, 3, 4\}\).

Multivariate \(\varvec{t}\). Here, \(\varvec{\epsilon } \sim t_\nu ({\textbf {0}}, \varvec{\Sigma })\). We set \(\nu = 3\) degrees of freedom. This distribution is similar to multivariate normal but presents very heavy tails. We simulated unit variances and different trait correlations as in the multivariate normal case.

Vectors of proportions. Here, \(\varvec{\epsilon } \sim \mathcal {S}(\varvec{p}, \sigma _g)\), where \(\varvec{p}\) is a given point in the simplex and \(\sigma _g\) is the standard deviation of the generator model. Details on the simulation are given in a previous section.

Gaussian copulas with different marginals. Here, \(\varvec{\epsilon } \sim \mathcal {C}(\varvec{R})\), where \(\varvec{R}\) is the trait correlation matrix. We considered unit variances and different trait correlation structures as in the multivariate normal case. We simulated different marginal distributions: either uniform \(U(a = 0, b = 1)\), beta \(Beta(\alpha = 0.5, \beta = 0.5)\), or gamma \(\Gamma (k = 1, \theta = 10)\). Details on the simulation are given in a previous section.

Multinomial. Here, \(\varvec{\epsilon } \sim \mathcal {M}(N, \varvec{p})\), where N and \(\varvec{p}\) are the number of trials and the vector of event probabilities, respectively. Details on the simulation are given in a previous section.
Population structure effects were simulated by random sampling from a matrix normal distribution [18, 36]:
where K is the known \(n \times n\) relatedness matrix, obtained as \({\textbf {K}} = \frac{1}{g}{} {\textbf {GG}}^T\), where G is the centered and scaled genotype matrix corresponding to g genomewide variants observed in n individuals. \(\varvec{\Sigma }_{\textbf {U}}\) is the traittotrait covariance matrix due to population structure, obtained as \(\varvec{\Sigma }_{\textbf {U}} = {\textbf {AA}}^T\), where A is a \(q \times q\) matrix with elements \(a_{jk} \sim \mathcal {N}(0, 1)\).
We simulated the null hypothesis (\(H_0\)) of no association between the genetic variant and the traits, by setting null genetic effects (\(\varvec{\beta } = {\textbf {0}}\), i.e., dropping the variant term in (8)). Additionally, we simulated the alternative hypothesis (\(H_1\)) where a causal variant affects \(t \le q\) traits with equal (\(\varvec{\beta } = {\textbf {1}}\)) or different effects (\(\beta _j\) equally spaced in the range \(\{1, \ldots , \beta _{max} \}\), with \(\beta _{max} > 1\)). Note that the former and the latter correspond, respectively, to concordant and discordant genetic effects and traittotrait correlations. When generating multivariate proportion or multinomial traits, we simulated concordant genetic effects as \(\beta _1 = 1\), \(\beta _j = \frac{1}{(q  1)}, \, j > 1\). We rescaled matrices \({\textbf {x}}\varvec{\beta }^T\), \({\textbf {U}}\) and \({\textbf {E}}\) so that the fractions of variance explained by the genetic variant and population structure were \(h_v^2\) and \(h_g^2\), respectively. We selected \(h_v^2\) ranging from 0.001 to 0.01, and \(h_g^2 \in \{0,0.2,0.4,0.6,0.8\}\).
Finally, to generate traits (rather than residuals), that are vectors of proportions (i.e., sum up to 1), we followed the steps depicted in a previous section to obtain points in the simplex. To simulate \(H_1\) (i.e., a causal variant associated with the traits), we generated values of the traits in individuals with genotypes 0, 1, and 2 at the causal variant as \(\varvec{y} \sim S(\varvec{p}_{\Delta }, \sigma _g)\), \(\varvec{y} \sim S(\varvec{p}, \sigma _g)\) and \(\varvec{y} \sim S(\varvec{p}_\Delta , \sigma _g)\), respectively (see above for details). Note that in this case population structure is not taken into account.
Evaluation of type I error and power
We selected a significance level of \(\alpha =\) 0.05. For each combination of conditions, we simulated \(m =\) 10,000 phenotypegenotype pairs (Y, \(\varvec{x}\)). Under \(H_0\), we evaluated the type I error for asymptotic PERMANOVA test, MANOVA (Pillai’s trace), and the multivariate LMM—implemented in GEMMA [18]—(Wald test), as follows:
where p is the p value of the association and \(\varvec{I}\) the indicator function. We employed an analogous setting to estimate power when simulating phenotypegenotype pairs under \(H_1\). We evaluated power after adjusting p values for multiple hypothesis testing (Bonferroni).
Evaluation of running time and RAM usage
To evaluate the running time and RAM usage of MANTA, MANOVA and GEMMA as a function of sample size, we simulated cohorts mimicking the population structure of the 1KGP dataset (see above), with increasing size (from 1000 to 100,000 individuals). We fixed the fraction of variance explained by population structure at 0.2 and the number of traits at 3. We measured the running time to perform 10,000 tests on the same number of phenotypegenotype pairs, simulated under \(H_0\) (no causal variant effects). In the case of MANTA and MANOVA, 20 genotype principal components (PC) were included as covariates in the model. In the case of GEMMA, we also considered the time required to build the relatedness matrix. Each run was repeated 5 times with different input data. Analogously, we evaluated the running time of the three methods as a function of the number of traits (from 2 to 20), simulating cohorts with sample size 1000. All running times were measured on a single core of an Intel Xeon Platinum 8160 CPU (2.10GHz). Regarding RAM usage, we report the peak RSS (resident set size) for the corresponding processes, estimated by Nextflow via ps o rss.
Implementation
We used the implementation of standard PERMANOVA available in the vegan R package (adonis method). Asymptotic p values were computed using our implementation of asymptotic PERMANOVA test, available at https://github.com/dgarrimar/manta (MANTA v1.0.0). MANOVA p values were computed using the manova method in the stats R package, with default parameters. We employed the implementation of the multivariate LMM available in GEMMA [18] v0.98.3 (https://github.com/geneticsstatistics/GEMMA), which relies on Wald test for p value calculation (default). Model (5) was assessed by MANTA with default parameters. Both MANTA (type I sums of squares, p values computed only for the genotype term via the option subset) and MANOVA assessed model (6). GEMMA assessed model (7). Numerical optimization in GEMMA failed (and consequently, the software produced an error and did not generate any result) for a small fraction of the tests performed. However, the number of failures was large in certain scenarios, e.g., when the trait distribution deviated substantially from multivariate normality or the number of traits analyzed was relatively large. All the simulations were performed using R v3.5.2 and Python v3.5.3. For parallelization and portability purposes, we embedded our code in a pipeline (available at https://github.com/dgarrimar/mantasim) built using Nextflow (v20.04.1), a framework for computational workflows [37]. We also used Docker container technology (https://www.docker.com) to ensure the reproducibility of our results.
Populationbiased splicing QTL mapping
GTEx data and population definition
Transcript expression (transcripts per million, TPM) and variant calls were obtained from the V8 release of the GTEx Project (dbGaP accession phs000424.v8.p2, https://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000424.v8.p2). These correspond to 15,253 samples from 838 deceased donors with both RNAseq in up to 54 tissues and Whole Genome Sequencing (WGS) data available. Metadata at donor and sample level was also retrieved. GTEx V8 uses the hg38/GRCh38 human reference genome assembly and the GENCODE v26 annotation (https://www.gencodegenes.org/human/release_26.html). Further details on GTEx data preprocessing and quality control (QC) pipelines can be found in [2].
We considered individuals of European and African ancestry (97.6% of all individuals). We defined European Americans (EA, n = 715) and African Americans (AA, n = 103) as the subset of selfreported White and Black individuals, respectively (field “RACE” in the GTEx metadata). Selfreported ancestry was confirmed via genotype PCA (Additional file 1: Fig. S18). Following [2], we restricted our analysis to 31 tissues with sample sizes >20 in both populations.
pbsQTL mapping
For cis populationbiased sQTL (pbsQTL) mapping, we developed a slightly modified version of sQTLseekeR2, named sQTLseekeR2.int, which implements asymptotic PERMANOVA test to assess the significance of the association between alternative splicing (AS) on one side, and the genotype, the condition of interest (i.e., ancestry), and the interaction between the two on the other. In sQTLseekeR2.int, as in sQTLseekeR2, AS is modeled as a multivariate outcome, composed by the relative abundances of the alternative transcript isoforms of a gene (splicing ratios), after a square root transformation. In this context, a significant interaction should be interpreted as a genetic effect on the AS phenotype that differs between EA and AA individuals. However, sQTLseekeR2.int does not provide information about the specific transcript isoforms affected or the size of the multitrait effect. To help with interpretation, it reports the absolute maximum difference (MD) in mean adjusted transcript relative expression between genotype groups [29], for both EA and AA individuals (although the two transcripts involved in MD calculation for EA and AA individuals may not coincide). sQTLseekeR2.int was run using a containerized Nextflow pipeline. The software and the pipeline are available at the interaction branch of the https://github.com/guigolab/sQTLseekeR2 and https://github.com/guigolab/sqtlseeker2nf repositories, respectively.
Under the assumption that most variants with cis effects on alternative splicing are likely to be carried on the sequence of the primary transcript or its close vicinity [29], the cis window was defined as the gene body plus 5 Kb upstream and downstream the gene boundaries. We considered protein conding genes and long intergenic noncoding RNAs (lincRNAs) expressed \(\ge\) 1 TPM in at least 80% of the samples (samples with lower gene expression were removed from the analysis of the gene), with at least two isoforms and a minimum isoform expression of 0.1 TPM (transcripts with lower expression in all samples were removed). These filters correspond to the default parameters of sQTLseekeR2.int. We analyzed only biallelic SNPs and short indels (autosomal + X) with MAF \(\ge\) 0.01. We required at least 10 samples per observed level of the interaction term. Both genotype and population were treated as categorical variables. Donor ischemic time, gender and age, as well as the sample RIN (RNA integrity number) and genotyping platform were regressed out from the splicing ratios prior to association testing.
In total, 687,737 variants and 14,122 genes were analyzed. To correct for the fact that multiple variants are tested per gene, we used eigenMT [77]. eigenMT estimates the effective number of independent tests (\(M_{eff}\)) per gene, considering the LD structure among the tested variants. \(M_{eff}\) is then used instead of the total number of tests (M) in Bonferroni correction. We set \(\alpha =\) 0.05. As our test statistic is sensitive to the heterogeneity of the splicing ratios’ variability between the levels of the interaction term, a permutationbased (10\(^4\) permutations) multivariate homoscedasticity test [78] was also performed for each genevariant pair. Pairs failing this test after multiple testing correction by eigenMT were not reported as significant pbsQTLs. eigenMT allows to compute a genelevel p value (corresponding to the smallest—corrected—p value per gene). To account for the fact that multiple genes are tested genomewide, we applied BenjaminiHochberg false discovery rate (FDR) to genelevel p values [77]. We set a FDR threshold of 0.1.
We evaluated the impact that differences in allele frequency or linkage disequilibrium (LD) between EA and AA individuals could have in pbsQTL discovery. Regarding allele frequencies, we estimated the MAF of all the tested genetic variants, separately in the EA and AA cohorts. Then, for each variant, we obtained \(\Delta \text {MAF} = \text {MAF}_{\text {EA}}\text {MAF}_{\text {AA}}\). The percentage of pbsQTLs and not pbsQTLs (other tested variants not identified as pbsQTLs) with large differences in MAF (\(\Delta \text {MAF}>\) 0.3) was not significantly different (2.4% vs 2.2%, respectively, \(\chi ^2\) test p value 0.28). As for LD, we computed the average \(r^2\) (i.e., \(\overline{r^2}\)) between all pairs of variants tested for association with a given gene, separately in the EA and AA cohorts. Then, for each gene, we obtained \(\Delta \overline{r^2} = \overline{r^2}_{\text {EA}}  \overline{r^2}_{\text {AA}}\). The percentage of pbsGenes and not pbsGenes (other tested genes not identified as pbsGenes) with large differences in LD (\(\Delta \overline{r^2}>\) 0.3) was not significantly different (3.4% vs 3.7%, respectively, \(\chi ^2\) test p value 0.66).
We also investigated the effect of sample size imbalance between EA and AA cohorts on statistical power to identify pbsQTLs. Particularly, we were interested in evaluating whether sample size imbalance could result in differences in power to detect EAspecific vs AAspecific effects. We simulated the splicing ratios of 1000 genes with \(q=3\) splicing isoforms, measured in \(n=\) 1000 individuals (as in the vectors of proportions scenario depicted above). We generated a factor with two levels, mimicking the two ancestry groups, with decreasing proportions of the less represented ancestry, and 1000 binomial SNPs with MAF = 0.3, matching the median MAF of the variants tested in the GTEx dataset analysis. We simulated populationspecific (present in one of the ancestry groups but absent in the other), additive sQTL effects of arbitrary size (\(\delta =\) 0.005). We used MANTA to assess the significance of the interaction term, as in the analysis of the real GTEx dataset. We estimated power as previously described. To account for the variability in power estimates, we repeated the simulation and power calculation procedure 100 times.
Functional enrichment of pbsQTLs
We obtained eCLIP peaks in HepG2 and K562 cell lines for 113 RBPs [79] from the ENCODE Portal (https://www.encodeproject.org, accessed 20190704). For each RBP, we selected the peaks significant at FDR < 0.01 and with a foldchange (FC) with respect to the mock input \(\ge\) 2. We further required a minimum overlap between replicates (50% of the length of the union of a given pair of peaks). Splice donor and acceptor sites from proteincoding and lincRNA genes were derived from the GENCODE v26 annotation (https://www.gencodegenes.org/human/release_26.html). Disease and complextrait associated variants were retrieved from the GWAS Catalog (https://www.ebi.ac.uk/gwas, accessed 20210129), extended to include GTEx variants in high linkage disequilibrium (\(r^2>\) 0.8) with the GWAS hits. We obtained ChIPseq peaks for 66 transcription factors from the Ensembl Regulation dataset (ftp://ftp.ensembl.org/pub/release86/regulation/homo_sapiens/AnnotatedFeatures.gff.gz). The coordinates of these functional elements were overlapped with all the tested variants (either pbsQTLs or not) to obtain a functional annotation per variant. Then, pbsQTLs were compared to a null distribution of 1000 sets of randomly sampled variants not identified as pbsQTLs (i.e., non pbsQTLs), of the same size of the pbsQTL set. Non pbsQTLs were matched to pbsQTLs in terms of relative location within the gene and MAF. The enrichment was calculated as the odds ratio (OR) of the frequency of a certain annotation among pbsQTLs to the mean frequency of the same annotation across the 1000 non pbsQTLs sets. The significance of each enrichment was assessed using a twosided Fisher’s exact test.
GWAS of the volumes of hippocampal subfields
UK Biobank data
We obtained magnetic resonance imaging (MRI)derived hippocampal subfield volumes, corresponding to 41,414 unrelated (pairwise KING [73] kinship coefficient < 0.0884) individuals with genotype data available, from the UK Biobank. Hippocampal subsegmentation from T1weighted structural images was performed with FreeSurfer v6.0 (http://surfer.nmr.mgh.harvard.edu), based on the atlas described in [80], in the framework of an imageprocessing pipeline developed and run on behalf of the UK Biobank [81]. We also obtained metadata at individual level. Further details on image acquisition, quality control, and analysis are available at Resource 1977 (https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=1977). Additional information on genotyping, genotype QC, and imputation can be found at Category 100319 (https://biobank.ndph.ox.ac.uk/ukb/label.cgi?id=100319) and at [82] (human reference genome assembly: hg19/GRCh37). The complete list of DataFields retrieved can be found in Additional file 1: Table S5.
Genomewide association analysis
The traits of interest were the volumes of 12 hippocampal subfields. From anterior to posterior, approximately: parasubiculum, presubiculum, subiculum, cornu ammonis (CA) 1, CA 2/3, CA 4, granule cell layer of the dentate gyrus (DG), hippocampusamygdala transition area (HATA), fimbria, molecular layer of the DG, hippocampal fissure, and hippocampal tail. We obtained the total volume of each subfield by summing its corresponding volume in (i) hippocampal tail and body and (ii) left and right brain hemispheres. The correlation of subfield volumes between hemispheres was large (Additional file 1: Fig. S25). We considered individual’s age, gender, total hippocampal volume (sum of all subfields minus the hippocampal fissure), and the first five genotype principal components (PCs) as relevant covariates. We analyzed only biallelic SNPs and short indels in autosomes with MAF \(\ge\) 0.01, missingness < 0.05 and at least 10 individuals per genotype group.
We used MANTA v1.0.0 (https://github.com/dgarrimar/manta) (with type I sums of squares) to test for association between genetic variants and the volumes of the 12 hippocampal subfields. We defined a model that included the covariates plus the genotype. Except for gender, all predictors were treated as continuous variables. p values were computed only for the genotype term via the option subset in MANTA. In total, 8,197,132 variants were tested for association. The analysis was run within a containerized Nextflow pipeline, that we named mvgwasnf (v1.0.0), available at https://github.com/dgarrimar/mvgwasnf. We adopted the common 5 \(\cdot\) 10\(^{8}\) threshold for genomewide significance.
Functional analysis
Loci definition was carried out by the Functional Mapping and Annotation of GenomeWide Association Studies (FUMA) platform [83], with default parameters. The identified loci, as well as the actual genomewide significant variants, were overlapped with the GWAS Catalog (including the Experimental Factor Ontology (EFO) annotations for the GWAS terms, https://www.ebi.ac.uk/gwas, accessed 20210129). We identified the closest genes to genomewide significant variants, and performed hypergeometric tests to assess Gene Ontology (GO) Biological Process term overrepresentation. We selected as gene universe all proteincoding genes, and set a FDR threshold of 0.1. GWAS summary statistics for Alzheimer’s disease—or family history of Alzheimer’s disease— [60] and schizophrenia [61] were also retrieved and overlapped with our set of genomewide significant variants, extended to include variants in high linkage disequilibrium (\(r^2 \ge\) 0.8).
Comparison between MANTA and GEMMA
We randomly downsampled the hippocampal subfield dataset to \(n=\) 10,000 individuals, and performed a GWAS with both MANTA and GEMMA. Given the slightly different filtering strategies, we focused on the common set of 7,077,024 tested variants. We used mvgwasnf (v1.0.0) to run MANTA as above. To run GEMMA, we implemented gemmanf (v1.0.0), a containerized Nextflow pipeline available at https://github.com/dgarrimar/gemmanf. We used the same covariates as above, except for genotype PCs, which were excluded from GEMMA’s analysis (LMMs can naturally account for population structure). We compared p value inflation between MANTA and GEMMA. As MANTA p values do not come from a normal distribution, we used an equivalent to \(\lambda _G\) computed on p values, \(\lambda _X = \text {median}(\log _{10}p_{\text {observed}})/\text {median}(\log _{10}p_{\text {null}})\) [84]. Inflation factors for MANTA and GEMMA were \(\lambda _X=\) 1.07 and \(\lambda _X=\) 1.02, respectively. GEMMA seems to do only slightly better accounting for population stratification than 5 genotype PCs. We used FUMA for loci definition, and the resulting loci were overlapped with the GWAS Catalog, as described above. We also obtained the complete EFO ontology (https://www.ebi.ac.uk/efo) in Open Biomedical Ontologies (OBO) format. We used the ontologySimilarity R package [85] to compute the pairwise semantic similarity (method = resnik) between the GWAS terms identified, and built a similarity matrix, S. From it, we derived a distance matrix, D, as \(max(S)  S\), which allowed to perform a hierarchical clustering of the terms.
Availability of data and materials
All the data employed in this study is publicly available. GTEx data was obtained from dbGaP (https://www.ncbi.nlm.nih.gov/gap), accession phs000424.v8.p2 (https://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000424.v8.p2). Brain MRI and genotype data were obtained from the UK Biobank (https://www.ukbiobank.ac.uk/) under Application Number 50141. The list of DataFields retrieved is available in Additional file 1: Table S5. ENCODE eCLIP data was obtained from the ENCODE Portal (https://www.encodeproject.org). The Ensembl Regulation dataset was obtained from ftp://ftp.ensembl.org/pub/release86/regulation/homo_sapiens/AnnotatedFeatures.gff.gz. The GWAS Catalog, including the Experimental Factor Ontology (EFO) annotations for the GWAS terms, was obtained from https://www.ebi.ac.uk/gwas. The GENCODE v26 annotation was obtained from https://www.gencodegenes.org/human/release_26.html. A detailed description of the data can be found in the “Methods” section. The pbsQTL catalog and the hippocampal subfield GWAS summary statistics generated are available in Zenodo (https://doi.org/10.5281/zenodo.8349415 [42] and https://doi.org/10.5281/zenodo.8349443 [59], respectively).
The MANTA R package is available at https://github.com/dgarrimar/manta and in the Comprehensive R Archive Network (CRAN) at https://cran.rproject.org/package=manta. The Nextflow pipelines employed to perform the simulations are available at https://github.com/dgarrimar/mantasim. The Nextflow pipeline for multivariate GWAS using MANTA is available at https://github.com/dgarrimar/mvgwasnf. sQTLseekeR2int and its companion Nextflow pipeline for conditionbiased sQTL mapping are available at the interaction branch of the https://github.com/guigolab/sQTLseekeR2 and https://github.com/guigolab/sqtlseeker2nf repositories, respectively. The Nextflow pipeline for multivariate GWAS using GEMMA is available at https://github.com/dgarrimar/gemmanf. All the source code is also available in Zenodo (https://doi.org/10.5281/zenodo.8349555) [86] and released under the GPL3 license.
References
Bycroft C, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–9.
The GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318–30.
Moore JE, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583:699–710.
van der Meer D, et al. Brain scans from 21,297 individuals reveal the genetic architecture of hippocampal subfield volumes. Mol Psychiatry. 2018;25:3053–65.
Natarajan P, et al. Deepcoverage whole genome sequences and blood lipids among 16,324 individuals. Nat Commun. 2018;9:3391.
Hughes DA, et al. Genomewide associations of human gut microbiome variation and implications for causal inference analyses. Nat Microbiol. 2020;5:1079–87.
Li YI, et al. RNA splicing is a primary link between genetic variation and disease. Science. 2016;352:600–4.
Korte A, et al. A mixedmodel approach for genomewide association studies of correlated traits in structured populations. Nat Genet. 2012;44:1066–71.
Galesloot TE, van Steen K, Kiemeney LALM, Janss LL, Vermeulen SH. A Comparison of Multivariate GenomeWide Association Methods. PLoS ONE. 2014;9:e95923.
Porter HF, O’Reilly PF. Multivariate simulation framework reveals performance of multitrait GWAS methods. Sci Rep. 2017;7:38837.
Pickrell JK, et al. Detection and interpretation of shared genetic influences on 42 human traits. Nat Genet. 2016;48:709–17.
Stephens M. A Unified Framework for Association Analysis with Multiple Related Phenotypes. PLoS ONE. 2013;8:e65245.
Giambartolomei C, et al. A Bayesian framework for multiple trait colocalization from summary association statistics. Bioinformatics. 2018;34:2538–45.
Moore R, et al. A linear mixedmodel approach to study multivariate geneenvironment interactions. Nat Genet. 2019;51:180–6.
Ning C, et al. Efficient multivariate analysis algorithms for longitudinal genomewide association studies. Bioinformatics. 2019;35:4879–85.
Ferreira MAR, Purcell SM. A multivariate test of association. Bioinformatics. 2009;25:132–3.
O’Reilly PF, et al. MultiPhen: Joint Model of Multiple Phenotypes Can Increase Discovery in GWAS. PLoS ONE. 2012;7:e34861.
Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genomewide association studies. Nat Methods. 2014;11:407–9.
Furlotte NA, Eskin E. Efficient multipletrait association and estimation of genetic correlation using the matrixvariate linear mixed model. Genetics. 2015;200:59–68.
Ray D, Chatterjee N. Effect of nonnormality and low count variants on crossphenotype association tests in GWAS. Eur J Hum Genet. 2020;28:300–12.
Beasley TM, Erickson S, Allison DB. RankBased Inverse Normal Transformations are Increasingly Used, But are They Merited? Behav Genet. 2009;39:580–95.
Guo B, Wu B. Integrate multiple traits to detect novel traitgene association using GWAS summary data with an adaptive test approach. Bioinformatics. 2019;35:2251–7.
van der Meer D, et al. Understanding the genetic determinants of the brain with MOSTest. Nat Commun. 2020;11:3512.
Greenlaw K, et al. A Bayesian group sparse multitask regression model for imaging genetics. Bioinformatics. 2017;33:2513–22.
Anderson M. A new method for nonparametric multivariate analysis of variance. Aust Ecol. 2001;26:32–46.
GonzàlezPorta M, Calvo M, Sammeth M, Guigó R. Estimation of alternative splicing variability in human populations. Genome Res. 2012;22:528–38.
Anderson MJ, Robinson J. Generalized discriminant analysis based on distances. Aust N Z J Stat. 2003;45:301–18.
Monlong J, Calvo M, Ferreira PG, Guigó R. Identification of genetic variants associated with alternative splicing using sQTLseekeR. Nat Commun. 2014;5:4698.
GarridoMartín D, Borsari B, Calvo M, Reverter F, Guigó R. Identification and analysis of splicing quantitative trait loci across multiple tissues in the human genome. Nat Commun. 2021;12:727.
VilorTejedor N, et al. Multivariate Analysis and Modelling of multiple Brain endOphenotypes: Let’s MAMBO! Comput Struct Biotechnol J. 2021;19:5800–10.
Imhof JP. Computing the distribution of quadratic forms in normal variables. Biometrika. 1961;48:419–26.
Davies RB. Algorithm AS 155: The Distribution of a Linear Combination of χ^{2} Random Variables. Appl Stat. 1980;29:323–33.
Farebrother RW. Algorithm AS 204: The Distribution of a Positive Linear Combination of χ^{2} Random Variables. Appl Stat. 1984;33:332–9.
McArtor DB, Lubke GH, Bergeman CS. Extending multivariate distance matrix regression with an effect size measure and the asymptotic null distribution of the test statistic. Psychometrika. 2017;82:1052–77.
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74.
Casale FP, Rakitsch B, Lippert C, Stegle O. Efficient set tests for the genetic analysis of correlated traits. Nat Methods. 2015;12:755–8.
Di Tommaso P, et al. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35:316–9.
Brown AA, et al. Genetic interactions affecting human gene expression identified by variance association mapping. eLife. 2014;3:e01381.
Dumitrascu B, Darnell G, Ayroles J, Engelhardt BE. Statistical tests for detecting variance effects in quantitative trait studies. Bioinformatics. 2019;35:200–10.
Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genomewide association studies. Nat Rev Genet. 2010;11:459–63.
Chen M, Manley JL. Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nat Rev Mol Cell Biol. 2009;10:741–54.
GarridoMartín D, Reverter F, Calvo M, Guigó R. A fast nonparametric test of association for multiple traits. Populationbiased sQTL catalog. Zenodo; 2023. https://doi.org/10.5281/zenodo.8349415.
Leon AC, Heo M. Sample sizes required to detect interactions between two binary fixedeffects in a mixedeffects linear regression model. Comput Stat Data Anal. 2009;53:603–8.
Rawlings AV. Ethnic skin types: are there differences in skin structure and function? Int J Cosmet Sci. 2006;28:79–93.
Ceaser T, Hunter G. Black and White Race Differences in Aerobic Capacity, Muscle Fiber Type, and Their Influence on Metabolic Processes. Sports Med. 2015;45:615–23.
GarridoMartín D, Palumbo E, Guigó R, Breschi A. ggsashimi: Sashimi plot revised for browser and annotationindependent splicing visualization. PLOS Comput Biol. 2018;14:e1006360.
Briot A, et al. Kallikrein 5 induces atopic dermatitislike lesions through PAR2mediated thymic stromal lymphopoietin expression in Netherton syndrome. J Exp Med. 2009;206:1135–47.
Dong Y, Kaushal A, Brattsand M, Nicklin J, Clements JA. Differential splicing of KLK5 and KLK7 in epithelial ovarian cancer produces novel variants with potential as cancer biomarkers. Clin Cancer Res. 2003;9:1710–20.
Figueroa CD, Molina L, Bhoola KD, Ehrenfeld P. Overview of tissue kallikrein and kallikreinrelated peptidases in breast cancer. Biol Chem. 2018;399:937–57.
Kurlender L, et al. Differential expression of a human kallikrein 5 (KLK5) splice variant in ovarian and prostate cancer. Tumor Biol. 2004;25:149–56.
Hinnebusch AG, Ivanov IP, Sonenberg N. Translational control by 5’untranslated regions of eukaryotic mRNAs. Science. 2016;352:1413–6.
Sample PJ, et al. Human 5’ UTR design and variant effect prediction from a massively parallel translation assay. Nat Biotechnol. 2019;37:803–9.
Zeidman P, Maguire EA. Anterior hippocampus: the anatomy of perception, imagination and episodic memory. Nat Rev Neurosci. 2016;17:173–82.
van de Pol LA, et al. Hippocampal atrophy in Alzheimer disease: age matters. Neurology. 2006;66:236–8.
Lieberman JA, et al. Hippocampal dysfunction in the pathophysiology of schizophrenia: a selective review and hypothesis for early detection and intervention. Mol Psychiatry. 2018;23:1764–72.
Enhancing Neuro Imaging Genetics through MetaAnalysis (ENIGMA) Consortium & the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium. Common variants at 12q14 and 12q24 are associated with hippocampal volume. Nat Genet. 2012;44:545–51.
Hibar DP, et al. Novel genetic loci associated with hippocampal volume. Nat Commun. 2017;8:13624.
Small SA, Schobel SA, Buxton RB, Witter MP, Barnes CA. A pathophysiological framework of hippocampal dysfunction in ageing and disease. Nat Rev Neurosci. 2011;12:585–601.
GarridoMartín, D., Reverter, F., Calvo, M. & Guigó, R. A fast nonparametric test of association for multiple traits. Multitrait GWAS of hippocampal subfields (summary statistics). Zenodo. 2023. https://doi.org/10.5281/zenodo.8349443.
Jansen IE, et al. Genomewide metaanalysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat Genet. 2019;51:404–13.
Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophreniaassociated genetic loci. Nature. 2014;511:421–7.
Lalioti ME, et al. GemC1 is a critical switch for neural stem cell generation in the postnatal brain. GLIA. 2019;67:2360–73.
Cruchaga C, et al. GWAS of cerebrospinal fluid tau levels identifies risk variants for Alzheimer’s disease. Neuron. 2013;78:256–68.
Wang Y, et al. IL34 is a tissuerestricted ligand of CSF1R required for the development of Langerhans cells and microglia. Nat Immunol. 2012;13:753–60.
Obst J, et al. Inhibition of IL34 Unveils TissueSelectivity and Is Sufficient to Reduce Microglial Proliferation in a Model of Chronic Neurodegeneration. Front Immunol. 2020;11:579000.
Mizuno T, et al. Interleukin34 Selectively Enhances the Neuroprotective Effects of Microglia to Attenuate Oligomeric Amyloidβ Neurotoxicity. Am J Pathol. 2011;179:2016–27.
Chung J, et al. Genomewide association study of Alzheimer’s disease endophenotypes at prediagnosis stages. Alzheimers Dement. 2018;14:623–33.
Anderson MJ. Permutation tests for univariate or multivariate analysis of variance and regression. Can J Fish Aquat Sci. 2001;58:626–39.
Sirugo G, Williams SM, Tishkoff SA. The Missing Diversity in Human Genetic Studies. Cell. 2019;177:26–31.
Doherty A, et al. GWAS identifies 14 loci for devicemeasured physical activity and sleep duration. Nat Commun. 2018;9:5257.
Ash JT, Darnell G, Munro D, Engelhardt BE. Joint analysis of expression levels and histological images identifies genes associated with tissue morphology. Nat Commun. 2021;12:1609.
Wei WH, Hemani G, Haley CS. Detecting epistasis in human complex traits. Nat Rev Genet. 2014;15:722–33.
Manichaikul A, et al. Robust relationship inference in genomewide association studies. Bioinformatics. 2010;26:2867–73.
Joo JWJ, et al. Efficient and Accurate MultiplePhenotype Regression Method for High Dimensional Data Considering Population Structure. Genetics. 2016;204:1379–90.
Duchesne P, Lafaye De Micheaux P. Computing the distribution of quadratic forms: Further comparisons between the LiuTangZhang approximation and exact methods. Comput Stat Data Anal. 2010;54:858–62.
Kojadinovic I, Yan J. Modeling Multivariate Distributions with Continuous Margins Using the copula R Package. J Stat Softw. 2010;34:1–20.
Davis JR, et al. An Efficient MultipleTesting Adjustment for eQTL Studies that Accounts for Linkage Disequilibrium between Variants. Am J Hum Genet. 2016;98:216–24.
Anderson MJ. Distancebased tests for homogeneity of multivariate dispersions. Biometrics. 2006;62:245–53.
Van Nostrand EL, et al. Robust transcriptomewide discovery of RNAbinding protein binding sites with enhanced CLIP (eCLIP). Nat Methods. 2016;13:508–14.
Iglesias JE, et al. A computational atlas of the hippocampal formation using ex vivo, ultrahigh resolution MRI: Application to adaptive segmentation of in vivo MRI. NeuroImage. 2015;115:117–37.
AlfaroAlmagro F, et al. Image processing and Quality Control for the first 10,000 brain imaging datasets from UK Biobank. NeuroImage. 2018;166:400–24.
Bycroft C, et al. Genomewide genetic data on \(\sim\)500,000 UK Biobank participants. bioRxiv. 2017;166298. https://doi.org/10.1101/166298.
Watanabe K, Taskesen E, Van Bochoven A, Posthuma D. Functional mapping and annotation of genetic associations with FUMA. Nat Commun. 2017;8:1826.
Manrai AK, Ioannidis JP, Patel CJ. Signals Among Signals: Prioritizing Nongenetic Associations in Massive Data Sets. Am J Epidemiol. 2019;188:846–50.
Greene D, Richardson S, Turro E. ontologyX: a suite of R packages for working with ontological data. Bioinformatics. 2017;33:1104–6.
GarridoMartín D, Reverter F, Calvo M, Guigó R. A fast nonparametric test of association for multiple traits. Source code. Zenodo. 2023. https://doi.org/10.5281/zenodo.8349555.
Acknowledgements
We thank Manuel Muñoz, Emilio Palumbo, Beatrice Borsari, Natalia Vilor and Juan D. Gispert for useful discussions. We thank Romina Garrido for administrative support. The GenotypeTissue Expression (GTEx) project is supported by the Common Fund of the Office of the Director of the National Institutes of Health (https://commonfund.nih.gov/GTEx). This research has been conducted using the UK Biobank Resource (https://www.ukbiobank.ac.uk) under Application Number 50141. We also acknowledge support of the Spanish Ministry of Economy, Industry and Competitiveness (MEIC) to the EMBL partnership, ‘Centro de Excelencia Severo Ochoa’, the CERCA Programme / Generalitat de Catalunya and the European Regional Development Fund (ERDF).
Review history
The review history is available as Additional file 2.
Peer review information
Tim Sands was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Funding
This work was partially funded by grant CZF2019002436 from the Chan Zuckerberg Initiative (CZI).
Author information
Authors and Affiliations
Contributions
D.GM., M.C., and R.G. conceived and designed the study. M.C. developed the theoretical framework, with the help of D.GM. and F.R. D.GM. implemented the software, performed the simulations and analyzed the data. M.C. and F.R. contributed ideas and statistical advice, helping with the design of the software. D.GM. and M.C. wrote the original draft. All the authors reviewed the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1.
Supplementary Tables, Figures and Notes.
Additional file 2.
Review history.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
GarridoMartín, D., Calvo, M., Reverter, F. et al. A fast nonparametric test of association for multiple traits. Genome Biol 24, 230 (2023). https://doi.org/10.1186/s13059023030768
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13059023030768