MegaLMM: Mega-scale linear mixed models for genomic predictions with thousands of traits

Large-scale phenotype data can enhance the power of genomic prediction in plant and animal breeding, as well as human genetics. However, the statistical foundation of multi-trait genomic prediction is based on the multivariate linear mixed effect model, a tool notorious for its fragility when applied to more than a handful of traits. We present MegaLMM, a statistical framework and associated software package for mixed model analyses of a virtually unlimited number of traits. Using three examples with real plant data, we show that MegaLMM can leverage thousands of traits at once to significantly improve genetic value prediction accuracy.

Supplementary Information: The online version contains supplementary material available at (10.1186/s13059-021-02416-w).

Following Eq. (1) in the main text, a multivariate linear mixed effect model can be specified as:

Y = XB + ZU + E,    (S1)

where the n × t matrix Y represents observations on t traits for n observational units, X is an n × b matrix of "fixed" effect covariates with effect size matrix B, U is an r × t matrix of random effects for each of the t traits with corresponding random effect design matrix Z, and E is an n × t matrix of residuals for each of the t traits. The random effects term ZU may represent several independent components:

ZU = Σ_{m=1}^{M} Z_m U_m,

where each Z_m is an n × r_m design matrix for a set of related parameters with corresponding coefficient matrix U_m. The distribution of each random effect coefficient matrix is U_m ∼ N(0, K_m, G_m), where N(M, Σ, Ψ) denotes the matrix normal distribution with mean matrix M, among-row covariance K_m, and among-column (i.e., among-trait) covariance G_m. The residual matrix E has distribution E ∼ N(0, I_n, R), where I_n is the n × n identity matrix and R is an unknown t × t positive-definite covariance matrix.
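As a concrete illustration, the model S1 can be simulated directly: a draw from the matrix normal N(M, Σ, Ψ) is M + L_Σ E L_Ψ', where the L's are Cholesky factors of the row and column covariances and E is a matrix of independent standard normals. All dimensions and variable names below are illustrative only, not taken from the MegaLMM package:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, b, r = 50, 4, 2, 30   # units, traits, fixed effects, random-effect levels

def rmatnorm(M, rowcov, colcov, rng):
    """Draw from the matrix normal N(M, rowcov, colcov)."""
    Lr = np.linalg.cholesky(rowcov)
    Lc = np.linalg.cholesky(colcov)
    E = rng.standard_normal(M.shape)
    return M + Lr @ E @ Lc.T

X = rng.standard_normal((n, b))
B = rng.standard_normal((b, t))
Z = np.eye(r)[rng.integers(0, r, n)]           # maps units to random-effect levels

A = rng.standard_normal((r, r))
K = A @ A.T + r * np.eye(r)                    # among-row (e.g., kinship) covariance
G = 0.5 * np.eye(t) + 0.5                      # among-trait genetic covariance
R = np.eye(t)                                  # residual among-trait covariance

U = rmatnorm(np.zeros((r, t)), K, G, rng)      # U ~ N(0, K, G)
E = rmatnorm(np.zeros((n, t)), np.eye(n), R, rng)  # E ~ N(0, I_n, R)
Y = X @ B + Z @ U + E                          # the model S1
```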
The MegaLMM re-parameterization of S1 is given as:

Y = X₁B₁ + X₂B₂R + ZU_R + FΛ + E_R
F = X₂B₂F + ZU_F + E_F,    (S3)

where F is an n × K matrix of latent factors, Λ is a K × t factor loadings matrix, X = [X₁, X₂] is a partition of the n × b fixed effect covariate matrix between the b₁ covariates with improper priors and the b₂ = b − b₁ covariates with proper priors, and the U_R and U_F coefficient matrices are partitioned by random effect as:

U_R = [U_R1' , . . . , U_RM']',  U_F = [U_F1' , . . . , U_FM']',  Z = [Z₁, . . . , Z_M].

The distributions of the random effects are specified as:

U_Rm ∼ N(0, K_m, Ψ_Rm),  U_Fm ∼ N(0, K_m, Ψ_Fm),
E_R ∼ N(0, I_n, Ψ_RE),  E_F ∼ N(0, I_n, Ψ_FE),

where Ψ_Rm, Ψ_Fm, Ψ_RE, and Ψ_FE are all diagonal matrices. Diagonal elements of Ψ_Fm and Ψ_FE are non-negative, while diagonal elements of Ψ_Rm and Ψ_RE are strictly positive.

Prior Parameterization
In the MegaLMM software, we have chosen functional forms for each prior parameter that balance interpretability (for accurate prior elicitation) with compatibility with efficient computational algorithms. We detail these choices below.
Variance Components. The MegaLMM model (see S3 and Eq. (2) in the main text) has (t + K)(M + 1) variance component parameters that need to be estimated (i.e., the diagonal elements of the Ψ matrices). Most Bayesian linear mixed models (LMMs) place independent inverse-Gamma priors on each variance component. This is a convenient choice due to conjugacy with a Gaussian likelihood. However, inverse-Gamma priors can cause mixing problems in Gibbs samplers, particularly when a variance component is close to zero (Gelman 2006). Default hyperparameters for inverse-Gamma distributions also lead to non-intuitive surfaces for the joint distribution of two or more variance components, which favor models where one random effect dominates the others (Runcie and Crawford 2019). In our experience, eliciting priors for the proportion of variance attributable to each random effect is more intuitive than eliciting priors for the absolute value of each variance. Below, we use the symbol h²_m to represent the proportion of total variance attributable to random effect m, because of the parallel between this term and the concept of heritability.
For each of the t observed traits, let the variance parameters be denoted by ψ_Rj = (ψ_R1j, . . . , ψ_RMj, ψ_REj). We specify priors on these parameters indirectly by re-parameterizing them as:

ψ_Rj = σ²_Rj h²_Rj,

where σ²_Rj = Σ_m ψ_Rmj + ψ_REj denotes the total variance and h²_Rj = (h²_R1j, . . . , h²_RMj, h²_REj), with each individual proportion equal to h²_Rmj = ψ_Rmj/σ²_Rj. We do use an inverse-Gamma prior for the total variance σ²_Rj, because its estimates are generally not near zero (unless all variation in a trait is explained by X or Fλ_j) and it is the only variance parameter for each trait in the re-parameterized model.
We allow virtually any prior specification for h²_Rj (including non-parametric distributions) on the M-dimensional simplex (such that the vector sums to one). We do this by approximating the prior surface over a pre-defined grid of points. This follows our earlier work with single-trait linear mixed models (Runcie and Crawford 2019), where we showed that such grid interpolations can be leveraged for massive computational gains without appreciable loss of accuracy in estimating moments of posterior distributions. In the following sections, we describe how the Gibbs sampler implemented in MegaLMM takes advantage of this discretized prior surface for improved MCMC sampling and faster computation.
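As a sketch of what such a discretized prior support looks like, the following enumerates grid cells for M variance proportions with step 1/l, where the residual proportion h²_e = 1 − Σ h²_m takes up the remainder; the exact grid construction inside the package may differ:

```python
import itertools
import numpy as np

def h2_grid(M, l):
    """Enumerate valid grid cells for M variance proportions with step 1/l.

    Components lie in {0, 1/l, ..., (l-1)/l}; a cell is valid when the
    components sum to less than one, leaving a positive residual proportion.
    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    cells = [c for c in itertools.product(range(l), repeat=M) if sum(c) < l]
    return np.array(cells) / l

grid1 = h2_grid(M=1, l=20)   # 20 cells for a single random effect
grid2 = h2_grid(M=2, l=20)   # fewer than 20^2 cells: invalid cells are dropped
```

This matches the note in Table S1 that, when M > 1, the number of valid grid cells is smaller than l^M.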
For the K latent factor traits, the variance parameters are denoted by ψ_Fk = (ψ_F1k, . . . , ψ_FMk, ψ_FEk), which we again re-parameterize as ψ_Fk = σ²_Fk h²_Fk. However, to ensure identifiability, we set σ²_Fk = 1. We then allow the same discretized prior for h²_Fk as just described.
Factor Loadings Matrix. Rows of Λ hold the regression coefficients that describe the relationship between the observed traits and the latent factor traits. With K factors and t traits, there are Kt regression coefficients, so strong regularization is required when t and/or K is large. We use a two-dimensional global-local shrinkage approach based on the horseshoe prior to achieve both regularization and interpretability of the factor traits without having to carefully specify K itself. Within each row of Λ, the local shrinkage part of the prior pushes unimportant coefficients strongly towards zero. This step is exchangeable across traits, reflecting a lack of prior information on which traits may be correlated. Global shrinkage is induced across rows, where regularization of all coefficients for the (k + 1)-th latent trait is set relative to the k-th. Here, we draw on the "sparse infinite factor" prior of Bhattacharya and Dunson (2011), which induces an ordering of the latent traits from most to least important by requiring that the magnitude of the global shrinkage parameter stochastically increases from one latent trait to the next. This enforces that the expected number of nonzero elements in row k + 1 is smaller than in row k. High-order factors are therefore less important, and we can choose a threshold beyond which higher-order factors can be safely ignored. We parameterize this two-dimensional prior in terms of the expected proportion of approximately zero regression coefficients in the first (i.e., most important) factor, and the expected change in this proportion as we move from factor k to factor k + 1.
Our prior for Λ has the following form:

λ_kj | φ_kj, τ_k ∼ N(0, σ²_Rj φ²_kj τ²_k),  φ_kj ∼ C⁺(0, 1),
τ_k = τ Π_{l=1}^{k} δ_l^{1/2},  τ ∼ C⁺(0, τ₀),  δ_l ∼ iG(a_δ, b_δ),    (S5)

where we fix δ₁ = 1, C⁺(0, s) is the half-Cauchy distribution with scale s, iG(a, b) is the inverse-Gamma distribution with shape a and scale b, and φ_kj provides the local shrinkage on each λ_kj, while τ_k provides global shrinkage for the k-th row of Λ (which we denote as (Λ')_k).
To choose the hyperparameters τ₀, a_δ, and b_δ, we consider how each controls the expected number of effectively nonzero coefficients in each row of Λ (no parameter will be exactly zero, but the horseshoe prior shrinks non-informative coefficients close to zero). Using the approximate shrinkage-factor calculation from Piironen and Vehtari (2017) and assuming that columns of F have unit variance, we can estimate the effective number of nonzero coefficients in row k, given the global shrinkage factor τ_k, as:

m_k = Σ_{j=1}^{t} (1 − κ_kj),

where κ_kj = (1 + n φ²_kj τ²_k)⁻¹ is the shrinkage factor for each coefficient. Note: this equation differs slightly from equation 2.4 in Piironen and Vehtari (2017) because, in their formulation, the prior variance of the regression coefficients is not proportional to σ² (their equation 2.2). Therefore, the σ⁻² term cancels in our calculation of κ_kj. Piironen and Vehtari (2017) showed that the expectation of m_k | τ_k simplifies to:

E[m_k | τ_k] = t × τ_k√n/(1 + τ_k√n).

Thus, the quantity τ_k√n can be interpreted as the odds of each coefficient in (Λ')_k being nonzero. If we knew that there were exactly t*_k non-zero coefficients in (Λ')_k, we could solve for τ_k = t*_k/(√n(t − t*_k)). We start with a prior guess of t₀ non-zero regression coefficients for the first and most important factor (Λ')₁. Since δ₁ = 1, τ₁ = τ, and so we set the hyperparameter τ₀ to the value t₀/(√n(t − t₀)). Since τ ∼ C⁺(0, τ₀), the median of the prior distribution of τ₁√n is t₀/(t − t₀). Next, for (Λ')₂, τ₂ = τ₁√δ₂, so the median of τ₂√n (i.e., the prior guess for t₂/(t − t₂)) is √δ₂ × t₀/(t − t₀), which is √δ₂ times the odds for each coefficient of (Λ')₁. The same pattern repeats for each succeeding factor, with the odds of each coefficient being distinguishable from zero changing by a factor of √δ_k relative to the previous factor.
As long as E[δ_k] < 1, these odds decrease for higher-order factors, eventually approaching zero. We use this to calibrate the hyperparameters a_δ and b_δ, either by simulating from the prior or by using the approximation E[1/√δ_k] ≈ √(a_δ/b_δ). For example, setting a_δ = 3 and b_δ = 1 gives a mean decrease in odds of ≈1.7× per factor and a 91% probability of the change in odds lying between 1× and 3×.
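This calibration can be checked numerically. The sketch below assumes δ_k ∼ iG(a_δ, b_δ), so that the per-factor decrease in odds is 1/√δ_k; this parameterization is our reading of the prior above, not code from the package:

```python
import numpy as np

rng = np.random.default_rng(1)
a_delta, b_delta = 3.0, 1.0

# If delta ~ inverse-Gamma(a, b), then 1/delta ~ Gamma(a, scale=1/b).
inv_delta = rng.gamma(a_delta, 1.0 / b_delta, size=200_000)
decrease = np.sqrt(inv_delta)          # factor by which the odds shrink per factor

mean_decrease = decrease.mean()                       # ~1.7x
p_between = np.mean((decrease > 1) & (decrease < 3))  # ~0.91
```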
It is important to note that we do not try to learn the number of factors K. The advantage of the "infinite factor model" prior is that higher-order factors are shrunk strongly towards zero in a data-dependent way. Therefore, the actual value of K tends to be unimportant as long as it is large enough that all important factors can be included. Making K larger than this has little impact on posterior predictions or on inferences for other parameters. We can thus save computation by pruning factors whose coefficients are all small.
Note that our prior here differs from the priors proposed by Bhattacharya and Dunson (2011) and Runcie and Mukherjee (2013) because of our decision to use a half-Cauchy distribution for the local-shrinkage parameters instead of the inverse-Gamma distribution. The inverse-Gamma distribution induces a t-distribution on the regression coefficients. While this prior generally performs well, and we include it as an option in our R package, we find that eliciting priors for the local and global shrinkage with this distribution is not intuitive. The horseshoe prior is generally better at separating signal from noise, shrinking only unimportant coefficients towards zero, and thus leads to more interpretable factors.
Fixed Effects. As noted above, we partition the covariates in X into the b₁ covariates with improper priors on their coefficients and the remaining b₂ = b − b₁ covariates with proper priors on their coefficients. B₁ is the matrix of regression coefficients for the covariates with improper priors, where we assume each element b_ij ∼ N(0, ∞).
When included, covariates in X₂ are generally high-dimensional (e.g., genetic markers), often with b₂ > n. Therefore, we use priors that regularize estimates of the coefficients B₂R and B₂F and favor sparse, interpretable solutions. B₂R plays a very similar role to the factor loadings matrix Λ in MegaLMM (see Eq. (2) in the main text), as the model involves the combined effect:

X₂B₂R + FΛ,

where X₂ and F are both covariate matrices (with X₂ known a priori and F unknown), and B₂R and Λ are the corresponding coefficient matrices. Due to this connection, our default prior for B₂R is also the horseshoe prior. However, in this case we apply independent global shrinkage parameters to each trait j (i.e., to each column of B₂R).
We also use a horseshoe prior for each column of B₂F. A key feature of MegaLMM is that we split the total effect of covariate x_2l into two components: x_2l(B₂R)_l + x_2l(B₂F)_lΛ, such that the K-dimensional row-vector (B₂F)_l partially accounts for the effects of x_2l through the common factors, and the t-dimensional row-vector (B₂R)_l accounts for the remaining effects. Without regularization on B₂F and B₂R, either could provide an equivalent explanation of the data, and the two would not be simultaneously identifiable. However, when the effects of x_2l on Y can be explained as effectively with the K values of (B₂F)_l (conditional on Λ) as with the t values of (B₂R)_l, our prior favors the former solution because it is sparser, yielding a more parsimonious model.

MCMC Algorithm
Our hybrid Gibbs and Metropolis-Hastings sampler uses the following steps:
1. Sample {b_1j, λ_j, b_2Rj, u_Rj, h²_Rj, σ²_Rj} for each trait j, given F. The priors for B₁, B₂R, Λ, and U_Rm factorize into independent distributions for each of their t columns (conditional on δ). The priors for h²_Rj and σ²_Rj are also independent for j = 1 . . . t. Therefore, conditional on F (and δ), we can simplify Eq. (2) in the main text into t independent univariate linear mixed models of the form:

ỹ_j = X̃₁b_1j + F̃λ_j + X̃₂b_2Rj + Z̃u_Rj + ẽ_Rj,
λ_j ∼ N(0, σ²_Rj D_λj),  b_2Rj ∼ N(0, σ²_Rj D_bj),
u_Rj ∼ N(0, σ²_Rj K(h²_Rj)),  ẽ_Rj ∼ N(0, σ²_Rj h²_REj Ĩ),

where y_j is the jth column of Y, λ_j is the jth column of Λ, ỹ_j denotes the elements of y_j that are non-missing (tildes over other matrix and vector terms are defined analogously), D_λj = diag(φ²_kj τ²_k) and D_bj are diagonal matrices composed of the prior variances of each element of λ_j or b_2Rj (excluding the term σ²_Rj), and K(h²) is a matrix-valued function returning an r × r matrix as a function of h², generally a block-diagonal matrix with each block the product of an element of h² and a pre-defined covariance matrix K_m. Only the non-missing elements ỹ_j contribute to the likelihood, so the remainder of y_j can be ignored.
We collect samples of the coefficients {b_1j, λ_j, b_2Rj, u_Rj} and variance parameters {h²_Rj, σ²_Rj} using a set of highly optimized Gibbs-type updates, exploiting the fact that many of the same intermediate calculations can be re-used across different columns of Y. Full derivations and sampling distributions for each step are described below in Supplementary Section 1.4.
2. Sample {b_2Fk, u_Fk, h²_Fk} for each factor k, given F. Given the parallelism between the models for F and Y, the sampling steps for the parameters of the F model are analogous to those described above. Again, the models for the K columns of F are independent, and each has the form:

f_k = X₂b_2Fk + Zu_Fk + e_Fk,
b_2Fk ∼ N(0, D_bk),  u_Fk ∼ N(0, K(h²_Fk)),  e_Fk ∼ N(0, h²_FEk I_n).

The main difference from above is the absence of the term σ²_Fk in each prior distribution, because σ²_Fk = 1 for identifiability.
3. Sample F given all other parameters. To sample F, we transpose Eq. (2) in the main text to:

Y' = Λ'F' + M_R' + E_R',  M_R = X₁B₁ + X₂B₂R + ZU_R,

so that Λ' is simply a set of linear regression coefficients. Furthermore, by conditioning on B₂R, U_R, B₂F, and U_F, the columns of F and M_R are uncorrelated, and we can represent this as a set of n simple linear regressions, one per individual i:

(Ỹ')_i = Λ̃'(F')_i + (M̃_R')_i + (Ẽ_R')_i,  (F')_i ∼ N((M_F')_i, D_f),  M_F = X₂B₂F + ZU_F,

where D_f = diag(h²_FEk) is a diagonal matrix holding the residual variances of each of the K columns f_k, and D_(Ỹ)i = diag(h²_REj σ²_Rj) is the analogous diagonal matrix with the residual variances of (Ỹ')_i. Therefore, the conditional posterior is:

(F')_i | · ∼ N(μ_(F)i, Σ_(F)i),
Σ_(F)i = (Λ̃ D_(Ỹ)i⁻¹ Λ̃' + D_f⁻¹)⁻¹,
μ_(F)i = Σ_(F)i (Λ̃ D_(Ỹ)i⁻¹ ((Ỹ')_i − (M̃_R')_i) + D_f⁻¹ (M_F')_i).

In the case where we wish to impute (F')_i* for an individual with no observations (i.e., when (Ỹ')_i* is empty), we simply draw (F')_i* ∼ N(μ_(F)i, D_f).
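The two equivalent forms of this conditional posterior (the precision form above versus generic Gaussian conditioning) can be cross-checked numerically; all dimensions and values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
K, t = 3, 6
Lam = rng.standard_normal((K, t))        # factor loadings Lambda (K x t)
D_y = np.diag(rng.uniform(0.5, 2.0, t))  # residual variances of the traits
D_f = np.diag(rng.uniform(0.5, 2.0, K))  # residual variances of the factors
m_F = rng.standard_normal(K)             # prior mean of f_i (row of X2 B2F + Z U_F)
m_R = rng.standard_normal(t)             # non-factor part of E[y_i]
y_i = rng.standard_normal(t)

# Precision form of the conditional posterior f_i | y_i:
Prec = Lam @ np.linalg.inv(D_y) @ Lam.T + np.linalg.inv(D_f)
Sigma = np.linalg.inv(Prec)
mu = Sigma @ (Lam @ np.linalg.inv(D_y) @ (y_i - m_R) + np.linalg.inv(D_f) @ m_F)

# Cross-check via joint-Gaussian conditioning on y_i = Lam' f_i + m_R + e:
S_yy = Lam.T @ D_f @ Lam + D_y
mu2 = m_F + D_f @ Lam @ np.linalg.inv(S_yy) @ (y_i - m_R - Lam.T @ m_F)
Sigma2 = D_f - D_f @ Lam @ np.linalg.inv(S_yy) @ Lam.T @ D_f

assert np.allclose(mu, mu2)
assert np.allclose(Sigma, Sigma2)
```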
4. Sample missing elements of Y given all other parameters. Although missing observations are generally not needed for sampling the model parameters (except in the cases described below), imputing them is useful for predicting unmeasured phenotypes. The conditional posterior of each missing element y_ij follows a univariate Gaussian distribution with mean equal to the corresponding element of X₁B₁ + X₂B₂R + ZU_R + FΛ and variance h²_REj σ²_Rj.

Gibbs Sampler Updates
We now detail the specific Gibbs sampler updates used in MegaLMM. Steps 1 and 2 above both involve sets of parallel linear regression models. Although the design matrices may differ for columns of Y and F, the form of both sets of conditional models (replacing specific variable names and dropping subscripts) is:

y = X₁α + X₂β + Zu + e,
β ∼ N(0, σ²D_β),  u ∼ N(0, σ²K(h²)),  e ∼ N(0, σ²h²_e I_n),

where y is an n-dimensional vector of observations; {α, β, u} are vectors of unknown coefficients of length a, b, and r, respectively, with known design matrices of appropriate size; and e is an n-dimensional vector of independent and identically distributed residuals. We require D_β to be diagonal, while K(h²) is a matrix-valued function returning an r × r matrix as a function of h². In Step 1: y = y_j, α = b_1j, β = [λ_j', b_2Rj']', e = e_Rj, and σ² = σ²_Rj. In Step 2: y = f_k, α is empty, β = b_2Fk, e = e_Fk, and σ² = 1. Lastly, we use iG(α, β) to denote the inverse-Gamma distribution with probability density function p(x) = (β^α/Γ(α)) x^{−(α+1)} exp(−β/x).

Our goal is to draw samples for all unknown parameters as efficiently as possible while minimizing autocorrelation in the MCMC chain. We accomplish this by blocking (or collapsing) many sampling steps: we first integrate out certain regression coefficients and then draw samples for them in subsequent steps. Note that, in a single iteration, we draw samples for all parameters of many (say p) y vectors. Since {X₁, X₂, Z, K(·)} are constant across the columns of Y or F, we cache as many of the intermediate calculations as possible to avoid unnecessary operations. Our sampling strategy has the following steps:
1. Sample α | D_β, h², h²_e, σ², integrating out β and u.
2. Sample σ² | α, D_β, h², h²_e, integrating out β and u.
3. Sample β | α, σ², h², h²_e, integrating out u.
4. Sample D_β, h², and h²_e | α, β, σ², integrating out u.
5. Sample u | α, β, h², h²_e, σ².
By caching intermediate calculations, the most expensive operation across all five steps is a single Cholesky decomposition of a square matrix with min(b, n) rows.
We describe each step in detail below. First, we define several quantities that are re-used in multiple steps: V(h²) = ZK(h²)Z' + h²_e I is the covariance of y arising from the random effects and residuals, as a function of h²; and V_β = X₂D_βX₂' + V(h²) is the covariance of y after additionally integrating out the regularized regression coefficients β. The sampling steps of the Gibbs sampler are as follows:
1. We assume that a ≪ n. The conditional posterior of α is:

α | · ∼ N(A_α⁻¹X₁'V_β⁻¹y, σ²A_α⁻¹),  A_α = X₁'V_β⁻¹X₁.

Inverting the n × n matrix V_β is expensive; we describe tricks to speed up this calculation below. Note that A_α is a small a × a matrix, so its Cholesky decomposition is inexpensive.
2. This step follows Makalic and Schmidt (2016), modified for correlated residuals and the un-regularized coefficients α. With prior σ² ∼ iG(a_σ, b_σ), the conditional posterior of σ² is:

σ² | · ∼ iG(a_σ + n/2, b_σ + ε'V_β⁻¹ε/2),

where ε = y − X₁α and V_β⁻¹ is re-used from the previous step.
3. This step also follows Makalic and Schmidt (2016):

β | · ∼ N(A_β⁻¹X₂'V(h²)⁻¹ε, σ²A_β⁻¹),  A_β = X₂'V(h²)⁻¹X₂ + D_β⁻¹,

where ε is defined as above. This step requires calculating V(h²)⁻¹ and A_β⁻¹. The former is an n × n matrix and the latter a b × b matrix. We discuss below how these calculations can be accelerated, particularly when b > n.
4. The updates for D_β and h² are independent. Once h² is updated, h²_e is calculated as 1 − Σ_i h²_i.
(a) We use the horseshoe prior as the default for D_β = diag(φ²_i τ²), specified as:

β_i | φ_i, τ ∼ N(0, σ²φ²_iτ²),  φ_i ∼ C⁺(0, 1),  τ ∼ C⁺(0, τ₀).    (S12)

We sample the parameters φ_i and τ using the algorithm of Makalic and Schmidt (2016), introducing two auxiliary variables ν_i and ξ such that:

φ²_i | ν_i ∼ iG(1/2, 1/ν_i),  ν_i ∼ iG(1/2, 1),  τ² | ξ ∼ iG(1/2, 1/ξ),  ξ ∼ iG(1/2, 1/τ₀²).

Now, we can sample these parameters using the following steps:
i. φ²_i | · ∼ iG(1, 1/ν_i + β²_i/(2σ²τ²))
ii. ν_i | · ∼ iG(1, 1 + 1/φ²_i)
iii. τ² | · ∼ iG((b + 1)/2, 1/ξ + Σ_i β²_i/(2σ²φ²_i))
iv. ξ | · ∼ iG(1, 1/τ² + 1/τ₀²)
However, recall that when sampling using columns of Y, β contains a column of Λ and a column of B₂R (i.e., β = [λ_j', b_2Rj']'), while when sampling using columns of F, β contains only a column of B₂F (i.e., β = b_2Fk). The global shrinkage parameter τ differs between elements of Λ and elements of B₂R and B₂F: global shrinkage applies to the whole matrix Λ, but is unique to individual columns of B₂R and B₂F. Also, the elements of Λ have the additional factors δ_k in their prior variances, which increase the shrinkage from row k to row k + 1 (equation S5). Therefore, we sample a single τ for all elements of Λ, independent of the τ_{b_2Rj} and τ_{b_2Fk}. Sampling this parameter requires all elements of Λ (not just those corresponding to the current column of Y). The update for τ is:

τ² | · ∼ iG((Kt + 1)/2, 1/ξ + (1/2) Σ_{k=1}^{K} Σ_{j=1}^{t} λ²_kj/(σ²_Rj φ²_kj Π_{l=1}^{k} δ_l)).

Finally, the update for δ_h is:

δ_h | · ∼ iG(a_δ + t(K − h + 1)/2, b_δ + (1/2) Σ_{k=h}^{K} Σ_{j=1}^{t} λ²_kj/(σ²_Rj φ²_kj τ² Π_{l=1, l≠h}^{k} δ_l)),

which also depends on the whole matrix Λ. The Gibbs updates for τ and δ_h tend to mix poorly because of their strong dependence. Since these updates are relatively inexpensive, we repeat the two steps 100 times in a row per overall iteration of the MCMC chain to improve mixing.
(b) The prior for h² is discrete on the M-dimensional simplex (for M random effects). We use a Metropolis-Hastings step to update h². We propose a new value h²(1) uniformly from the set of all grid values with ||h²(1) − h²(0)||₂ < ε_h, and then calculate the acceptance ratio:

r = [N(ε*; 0, σ²V(h²(1))) p(h²(1)) g(h²(0) | h²(1))] / [N(ε*; 0, σ²V(h²(0))) p(h²(0)) g(h²(1) | h²(0))],

where the transition probability g(h²(i) | h²(j)) is inversely proportional to the number of grid cells within ε_h of h²(j), and ε* = y − X₁α − X₂β. We then accept h²(1) with probability min(1, r). The most expensive part of this calculation is evaluating the determinant and inverse of the n × n matrix V(h²(i)), which we obtain through a Cholesky decomposition. Since h² can take only a discrete set of values, we simply pre-calculate and cache all possible decompositions before starting the MCMC and re-use them throughout the chain for all j = 1, . . . , p y vectors.
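The caching idea behind this step can be sketched as follows. The grid, proposal, and prior below are illustrative placeholders rather than the package's actual data structures, and the log-likelihood is computed up to an additive constant, which cancels in the ratio:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
Z = np.eye(n)
A = rng.standard_normal((n, n))
Kmat = A @ A.T / n + np.eye(n)               # a single random-effect covariance

# Discrete grid of h2 values (single random effect, step 1/l).
grid = np.arange(1, 20) / 20

# Pre-compute one Cholesky factor of V(h2) = h2 * Z K Z' + (1 - h2) * I per cell;
# these are re-used for every trait and every MCMC iteration.
chol_cache = {h2: np.linalg.cholesky(h2 * Z @ Kmat @ Z.T + (1 - h2) * np.eye(n))
              for h2 in grid}

def log_lik(eps, h2, sigma2=1.0):
    """log N(eps; 0, sigma2 * V(h2)) up to a constant, using the cached factor."""
    L = chol_cache[h2]
    half = np.linalg.solve(L, eps)           # L^{-1} eps
    logdet = 2 * np.sum(np.log(np.diag(L))) + n * np.log(sigma2)
    return -0.5 * (logdet + half @ half / sigma2)

eps = rng.standard_normal(n)                 # stands in for y - X1 a - X2 b
i = 10                                       # index of the current grid cell
j = int(np.clip(i + rng.integers(-2, 3), 0, len(grid) - 1))  # local proposal
log_r = log_lik(eps, grid[j]) - log_lik(eps, grid[i])  # flat prior, symmetric proposal assumed
accept = np.log(rng.uniform()) < log_r
```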
(c) The update for u is:

u | · ∼ N(A_u⁻¹Z'ε*/h²_e, σ²A_u⁻¹),  A_u(h²) = Z'Z/h²_e + K(h²)⁻¹,

where ε* is defined as above and A_u(h²) is an r × r matrix with r = Σ_m r_m, which is a function of h² and can be much larger than n × n if there are multiple full-rank random effects. Again, rather than calculating the Cholesky decomposition of A_u during each iteration for each trait, we note that there is a finite number of these matrices (indexed by h²), so we pre-calculate and cache all of them before running the MCMC. These matrices are generally sparse and can be stored efficiently.

Opportunities for Caching Expensive Calculations
We noted above that we can cache Cholesky decompositions of V(h²) and A_u(h²), indexed by h², and re-use them for all p traits and all iterations of the MCMC chain. Additionally, in Steps 1-3, the matrices V_β⁻¹ and A_β⁻¹ are closely related through the Binomial Inverse Theorem:

V_β⁻¹ = V(h²)⁻¹ − V(h²)⁻¹X₂A_β⁻¹X₂'V(h²)⁻¹,    (S18)

where A_β = X₂'V(h²)⁻¹X₂ + D_β⁻¹. Since V(h²)⁻¹ is pre-calculated, we only need to calculate the smaller of V_β⁻¹ and A_β⁻¹ and can then form the other through matrix multiplications. Because we need the Cholesky decomposition L_βL_β' = A_β to sample β in Step 3, if b < n, we calculate this directly and then use L_β to form V_β⁻¹ for Steps 1 and 2. However, if b > n, we instead sample β using a modified version of the algorithm of Bhattacharya et al. (2016). Let LL' be the Cholesky decomposition of V(h²). Then:
1. Sample a ∼ N(0, σ²D_β) and d ∼ N(0, I_n) independently.
2. Set v = L⁻¹X₂a/σ + d.
3. Solve A_w w = L⁻¹ε/σ − v, where A_w = L⁻¹X₂D_βX₂'L⁻ᵀ + I_n.
4. Set β = a + σD_βX₂'L⁻ᵀw.
Note that LA_wL' = V_β. If we have already calculated V_β⁻¹, we can simplify these steps to:
1. Sample a ∼ N(0, σ²D_β) and d ∼ N(0, I_n) independently.
2. Set β = a + D_βX₂'V_β⁻¹(ε − X₂a − σLd).
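A minimal numpy sketch of the b > n sampler (our reading of the Bhattacharya et al. (2016) steps, with made-up dimensions and a stand-in for V(h²)) can be checked against the closed-form posterior mean D_β X₂' V_β⁻¹ ε:

```python
import numpy as np

rng = np.random.default_rng(4)
n, b, sigma = 30, 50, 1.0                    # b > n regime
X2 = rng.standard_normal((n, b))
d_beta = rng.uniform(0.1, 1.0, b)            # diagonal of D_beta
D_beta = np.diag(d_beta)
V = np.eye(n) + 0.5                          # stand-in for V(h2); positive definite
L = np.linalg.cholesky(V)
eps = rng.standard_normal(n)                 # eps = y - X1 alpha

N = 20_000                                   # many draws, vectorized over columns
a = rng.standard_normal((b, N)) * (sigma * np.sqrt(d_beta))[:, None]
d = rng.standard_normal((n, N))
Linv_X2 = np.linalg.solve(L, X2)             # L^{-1} X2, re-usable across traits
v = Linv_X2 @ a / sigma + d                  # step 2
A_w = Linv_X2 @ D_beta @ Linv_X2.T + np.eye(n)
w = np.linalg.solve(A_w, (np.linalg.solve(L, eps) / sigma)[:, None] - v)  # step 3
beta = a + sigma * D_beta @ (Linv_X2.T @ w)  # step 4: each column is one draw
```

Each column of `beta` is one draw from the conditional posterior of β; only n × n systems are solved even though b > n.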
Finally, if X₂ has row-rank m less than n, we can factor X₂ = UV, where U and V are n × m and m × b matrices, respectively, with m < n. This often occurs when m genotypes are repeated multiple times in the same dataset. In this case, we can speed up the calculation of V_β⁻¹ using the Binomial Inverse Theorem:

V_β⁻¹ = V(h²)⁻¹ − V(h²)⁻¹U[(VD_βV')⁻¹ + U'V(h²)⁻¹U]⁻¹U'V(h²)⁻¹.

This involves only two m × m matrix inversions, which is beneficial if m ≪ n. Further caching is possible to prevent redundant matrix-matrix multiplications. Since X₂ is constant across the p traits, the terms L⁻¹X₂ and X₂'V(h²)⁻¹X₂ (or L⁻¹U and U'V(h²)⁻¹U) are also constant for any two traits that share the same value of h². This can dramatically reduce the computational complexity when p is large.

Blocking missing data
Most existing Gibbs samplers for factor models require complete observations because they condition on the whole trait vectors for each individual and, therefore, must impute any missing data. While data imputation is straightforward in Bayesian models (missing data is treated as an unknown parameter and included in the MCMC), conditioning on imputed values in a Gibbs sampler leads to very long-term autocorrelations in MCMC chains. MegaLMM largely avoids this by dropping unobserved values from the likelihood. However, there is a trade-off between simplifying the likelihood and computational efficiency.
In particular, the pre-cached Cholesky decompositions of the random effect covariance matrices and other intermediate calculations can only be applied to sets of traits with complete observations across the same individuals. If every trait had a unique pattern of missing observations, pre-caching would be extremely memory-intensive and inefficient. In many datasets, however, distinct subsets of traits share approximately the same set of missing observations. For example, all agronomic traits may be measured on a subset of lines in a breeding trial, while hyperspectral reflectance is measured on all lines; or a similar subset of all possible lines may be grown in nearby fields of a multi-environment trial.
In these cases, we partition the full matrix of traits Y into a list of smaller trait matrices {Ỹ₁, . . . , Ỹ_S}, where each sub-matrix Ỹ_s contains only those individuals with observations for its subset of the traits. We select the partitions by attempting to minimize the number of unobserved values within each Ỹ_s, for a given number of partitions, using a sequence of binary partitions of the original trait matrix. We impute values for all missing data in Y during Step 4, but only condition on the imputed values within each Ỹ_s during Steps 1 and 3. This greatly reduces autocorrelation in the MCMC while minimizing the number of pre-cached intermediate calculations that must be stored.

Further Mixing Issues
As in any factor model, the factor loadings in our model are not identifiable in the likelihood. However, the horseshoe prior on elements of Λ does help make these parameters identifiable in the posterior (up to sign flips). In general, we find that the coefficients of each row (Λ')_k mix reasonably well. However, the ordering of the rows of Λ from most to least important does not mix effectively. While the correct ordering is important to ensure that important factors are not shrunk too strongly, we are generally less interested in the order per se than in the identities of each factor, and we find that genomic prediction outcomes are relatively robust to mixing issues of factor order. To improve model convergence, during the burn-in period we periodically stop the MCMC chain, assess the order of the factors, and manually re-order them based on current estimates of m_k | τ_k. We also prune factors when the pairwise correlations among columns of F are too high (ρ > 0.6).

Additional strategies for computational efficiency
When there is only one random effect in our model, we can gain further computational efficiency by diagonalizing the model: we pre-multiply both sides of Eq. (2) by the transpose of the eigenvector matrix of K. This makes V(h²) diagonal for all values of h². By storing the Cholesky decompositions of these diagonal matrices as sparse objects, we can take advantage of efficient sparse linear algebra libraries for dramatic gains in efficiency. This strategy only works when Y has no missing data. However, we can modify it slightly by calculating eigenvalue decompositions of K̃_s for each of the partitioned trait sub-matrices Ỹ_s, and then pre-multiplying each sub-matrix by the corresponding set of eigenvectors.
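The effect of the rotation can be verified directly: with a single random effect and Z = I_n, rotating by the eigenvectors of K makes V(h²) exactly diagonal for every value of h² (the matrices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 30
A = rng.standard_normal((n, n))
K = A @ A.T / n + np.eye(n)                  # single random-effect covariance, Z = I
S, Q = np.linalg.eigh(K)                     # K = Q diag(S) Q'

for h2 in (0.2, 0.5, 0.8):
    V = h2 * K + (1 - h2) * np.eye(n)        # V(h2) for this grid cell
    V_rot = Q.T @ V @ Q                      # covariance after rotating y -> Q' y
    off_diag = V_rot - np.diag(np.diag(V_rot))
    assert np.max(np.abs(off_diag)) < 1e-10  # diagonal for every h2
    assert np.allclose(np.diag(V_rot), h2 * S + (1 - h2))
```

One eigendecomposition thus replaces a separate Cholesky factorization per grid cell, and the resulting diagonal matrices can be stored sparsely.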
When more than one random effect is included, complete diagonalization is no longer possible and so we resort to pre-caching Cholesky decompositions of V(h 2 ) for each value of h 2 . When random effects are low-rank, Cholesky decompositions of their covariance matrices can sometimes be stored as sparse matrices to reduce memory and computational demands. We check the number of zero-elements in each Cholesky decomposition to determine whether to store it as a sparse or dense matrix.
Finally, we code nearly all expensive linear algebra calculations in our R package in C++ using the Eigen library with RcppEigen (Bates and Eddelbuettel 2013), and parallelize calculations across traits whenever possible using OpenMP.

Supplementary Figures

Accuracy of covariance estimates in simulated data. We generated 20 simulated datasets, each with 128 genes, based on the Arabidopsis gene expression data. In each dataset, we sampled two sets of 128 genes and used each to calculate an empirical correlation matrix. We converted the correlation matrices into genetic (G) and residual (R) covariance matrices by sampling an h² value independently from [0.1, 0.8] for each gene, and then used these matrices and the genomic relationship matrix K to generate genetic and residual values for 128 traits. Using subsets of these simulated data with varying numbers of genes, we estimated genetic and residual covariances with each of the four comparison methods, using the same parameters as in Figure 2, and then assessed the accuracy of covariance estimation by calculating the root mean squared error (RMSE) between covariance values (excluding variances).

The analysis of Figure 3 was repeated for the 19 other treatment:year datasets reported by Krause et al. (2019). In each panel, bars show the mean ± SE over five cross-validation runs, each with a unique 50:50 testing:training partition. The same partitions were used for each method in each run. MegaLMM methods are colored in blue. The hashed lines indicate a multivariate prediction using the CV1 approach (i.e., no secondary traits included in the testing data for prediction). The dataset in Figure 3

Relationship between genetic correlation and benefit of MvLMM across Genomes2Fields site-years. In each panel, each point represents a single site-year. The x-axis shows the maximum genetic correlation between the trait values for that site-year and all other site-years.
The y-axis is the difference in the Fisher Z-transformed estimates of genomic prediction accuracy between the full MvLMM (implemented in MegaLMM) and a univariate GBLUP model (implemented in rrBLUP). Traits included: days to silking (DTS), anthesis-silking interval (ASI), grain yield, and plant height.

Supplementary Tables

Table S1. Default hyperparameters for user-customizable prior distributions. The default values provided in the MegaLMM R package are shown; these values were used for the reported analyses unless otherwise noted.
- a_δ, b_δ: The expected odds of being non-zero for an element of (Λ')_{h+1} decreases by a factor of ≈ √(a_δ/b_δ) relative to (Λ')_h.
- π₀: The proportion of effectively non-zero elements in the first row of B₂R or B₂F is π₀.
- h²_Rj, h²_Fj: 1/l, with l = 20. Uniform over l equally spaced grid cells on the unit simplex. When M > 1, l counts the number of valid grid cells and so is less than l^M.