Skip to main content

Next-Gen GWAS: full 2D epistatic interaction maps retrieve part of missing heritability and improve phenotypic prediction


The problem of missing heritability requires the consideration of genetic interactions among different loci, called epistasis. Current GWAS statistical models require years to assess the entire combinatorial epistatic space for a single phenotype. We propose Next-Gen GWAS (NGG) that evaluates over 60 billion single nucleotide polymorphism combinatorial first-order interactions within hours. We apply NGG to Arabidopsis thaliana providing two-dimensional epistatic maps at gene resolution. We demonstrate on several phenotypes that a large proportion of the missing heritability can be retrieved, that it indeed lies in epistatic interactions, and that it can be used to improve phenotype prediction.


During the past decade, genome-wide association studies (GWAS), an approach used in genetics to find genetic associations with observable traits, have allowed the discovery of many genetic variants associated with human [1,2,3], plant [4], and animal [5] phenotypic traits. GWAS success is thus a reality and many discoveries made with this technique led to disruptive insights in biology, impacting basic knowledge as well as translational approaches to agronomy and medicine [6]. However, when we observe on the one hand the striking resemblance of human twins, and on the other hand the amount of variation explained by GWAS signals, we are inclined to admit that “mono-dimensional GWAS,” which studies genetic variation effect taken one at a time, is somehow limited. The missing heritability [7, 8], defined as the unexplained variance of a trait, is probably at least in part attributable to interactions among variants, ie: epistasis.

Epistasis refers to how genes interact to affect a particular trait [9]. In simple terms, it can be seen as when the effect of one gene is influenced, or masked, by one or more other genes. This interplay can considerably add complexity to our comprehension of how the combined influence of genes shapes traits. Addressing epistasis is a difficult problem given that current mathematical models linking genetic variations to phenotypes exhibit high sensitivity to False Discovery Rate corrections and to up-scaling, in particular to the number of individuals in the study [10].

Performing such combinatorial studies is challenging because the number of tested interactions grows to the square of the marker number (for 1st order of interaction). The most recent developments to approach epistasis consist of genetic variable pre-selection [11,12,13,14] or algorithmic acceleration [12, 15]. However, to our knowledge an attempt to solve large epistatic maps, without variable selection, is still lacking.

The recent development of signal processing has traditionally focused on the reconstruction of signals from a sub-sampling action. A key milestone in the field is the formulation of the Nyquist–Shannon sampling theorem, which proposes that a signal can be perfectly reconstructed if its highest frequency is inferior to half the sampling rate. This theorem underscored the importance of prior knowledge about the signal's frequency constraints in reducing the needed sample count for signal reconstruction. A significant advancement in this domain was achieved in 2006 [16]. In this work, Candes et al. demonstrated that by understanding the signal’s sparsity, the signal can be reconstructed with a sample count lower than that stipulated by the sampling theorem. This principle constitutes the foundation of compressed sensing (CS).

CS represents a paradigm shift in signal processing for the efficient acquisition and reconstruction of signals via solutions to underdetermined linear systems. This methodology is predicated on the premise that optimization can exploit the inherent sparsity of a signal, facilitating its reconstruction from a quantitatively lesser number of samples than those necessitated by the Nyquist–Shannon sampling theorem. The operational efficacy of compressed sensing is contingent upon two conditions: the sparsity of the signal, necessitating that the signal is predominantly constituted of zero elements in a given domain, and incoherence, which involves the application of the isometric property essential for sparse signals.

Through our analysis, we hypothesized that genetic data exhibit such properties enabling strong compression of the epistatic problem leading to an important acceleration of the process.

In this work, we apply CS to GWAS analysis, using machine learning approaches, reaching an acceleration that makes it possible to provide, for the first time, the full epistatic maps (with > 60 billion combinations) with a gene resolution in Arabidopsis thaliana. This analysis retrieves part of the missing heritability and largely improves phenotypic predictions.


The NGG model and its fast resolution

To attempt to make full epistatic maps a reality, we decided to use a different mathematical formalism combined with solving systems meant to take advantage of Graphic Processing Units (GPUs) being increasingly popular thanks to the rise of gaming and deep learning [17]. Our solution, named NGG for Next-Generation GWAS, is based on the massive use of modern acceleration architecture (GPU, see Fig. 1 and Additional file 1: Text 1, Material for details) enabled by the use of recent mathematics techniques (Compressed Sensing) for regularized least square estimation in a sparse linear model paradigm, that we rewrote in a new way, called “model compression”. This achieves linear scaling with the number of SNP and interactions, instead of an exponential complexity (Fig. 1). The outcome is a sparse estimate collecting the effects of each variant and each SNP interaction, instead of retrieving p-values as regular GWAS. As such, the NGG algorithm can be seen as a sparse signal detection analysis and does not use multiple statistical testing which precludes the use of FDR correction. This classical correction is replaced here by a drastic procedure for variable selection and extensive simulation and testing.

Fig. 1
figure 1

Scheme depicting the role of compressed sensing in NGG and computational complexity with big O notation. In both cases, we compare our method to EMMA with Kinship matrix computation, but we take into account only the matrix inversion part in terms of computational complexity. A is the classic 1D GWAS with at the top the EMMA path, with a final complexity of at least O(p^3), and at the bottom, our 2D GWAS solution, with a final complexity of O(p log(p) n). B is the case of interactions modeling using the proposed model, the computational complexity is shown in terms of the basic p size of input, before the interactions. In this case, the naïve standard approach becomes O(p^6) in computational complexity, whereas our method is now O(p^2 log(p) n), and presents a clear gain in terms of computational time. It is noticeable that, because the compression is in log(p^q), for q whatever the number of interactive SNPs we are interested in, our method will result in a O(p^q log(p) n) algorithm, hence the fact that we say we linearize the computational complexity in term of q

Hereby, we established the NGG model that first states and defines heritability in this framework as done before by Zuk et al. [18] and others: We define \(X\) the matrix with n rows and p columns containing the genetic information (Fig. 1). Each column displays the coded genetic variants (here SNP) for the n individuals. We also define \(Y\) a vector containing the phenotype. The broad-sense heritability \(H\) may be defined via the following nonparametric “random signal plus noise” model: \(Y=f(X)+\varepsilon\) (NP: stands for non-parametric.) where the function \(f\) is unknown and general and \(\varepsilon\) is a random noise, independent from \(X\) that collects all other effects (other than genetic) on the phenotype \(Y\), such as environmental effects for instance. Thus, the broad-sense heritability is expressed as \(H=var(f(X))/var(Y)\). The narrow-sense heritability \(h\) also sometimes called additive heritability accounts for part of the variance explained by genetics in the linear model \(Y=X{\theta }+\varepsilon\) (L: stands for linear). The definition is \(h=var(X{\theta })/var(Y)\). We note that, of course, \(model(L)\subset model(NP)\). Notice for further use that since the slope parameter \({\theta }\) is unknown, the narrow sense heritability cannot be computed but only estimated (for example, via a plug-in estimator \(\widehat{h}=var(X\widehat{\theta })/\widehat{var}(Y))\). At last when the estimation method is Ordinary Least Squares (OLS) or some of its (regularized/penalized) variants, the definition above matches the classical \({R}^{2}\) and adjusted \({R}^{2}\). Below the adjusted \({R}^{2}\) is preferred for reasons related to both the dimensionality of the data (usually p is much larger than n) and the well-known inflation of \({R}^{2}\).

We further consider two models:

  • Model 1:\(Y=X{{\theta }_{1}}+\varepsilon\)

  • Model 2:\(Y=X{{\theta }_{1}}+Z{\theta }_{2}+\varepsilon\)

Where \(Z=X\star X\) is the partial face-splitting (or transposed Khatri Rao product) of matrices [19]. For self-containedness notice that when:

$$X=\left[\begin{array}{ccc}{a}_{1}& {a}_{2}& {a}_{3}\\ {b}_{1}& {b}_{2}& {b}_{3}\end{array}\right] then \; X*X=\left[\begin{array}{ccc}{a}_{1}{a}_{2}& {a}_{1}{a}_{3}& {a}_{2}{a}_{3}\\ {b}_{1}{b}_{2}& {b}_{1}{b}_{3}& {b}_{2}{b}_{3}\end{array}\right]$$

Matrix \(Z\) contains all the pairwise Kronecker products of columns of \(X\), excluding the products of a column with itself. This matrix \(Z\) will be referred to as the matrix of interactions or shortly 2D matrix as performed before by others [20]. When \(X\) has p columns, \(Z\) has p(p-1)/2 columns. The matrix Z captures all interactions between the SNP’s. Although Model 2 remains linear, it is not additive anymore and bridges between Model L (or Model 1) and Model NP.

Algorithmic validation of NGG on simulated data

Now that the model has been established, we need to evaluate its performance in retrieving epistatic signals. For this, we first worked on simulated data (see methods and repository for details). The first simulation has been performed in two steps.

First, we simulated \(X\) and \(Y\) (Fig. 2A and C); second, we simulated \(Y\) for real \(X\) (SNP matrix) retrieved from the Arabidopsis related 1001 genome project [21] (Fig. 2B and D; Additional file 1: Fig. S1). These simulations are built to control narrow sense heritability (h2) of the trait (Fig. 2). Using simulations, we show that the NGG formalism is able to capture simulated epistatic events for a wide range of model parametric values (Fig. 2E, F and Additional file 1: Fig. S1). We found that NGG is quite resilient to noise but sensitive to the number of individuals used for the analysis (as discussed further, see remarks on Very High Dimension), as it radically improves for larger numbers of individuals (Fig. 2E, F; Additional file 1: Fig. S1). For instance, for a 500k SNP epistatic landscape (1000 × 1000), for which we implemented 10 non-null epistatic signals, 50% of these are found in the top 10 NGG predicted signals (Fig. 2E) for a h2 = 0.2 when using 10,000 individuals. This number is maintained to 27% when the number of individuals is reduced to only 1000.

Fig. 2
figure 2

Next-Gen GWAS retrieves simulated epistatic interactions. Var1 (x-axis) and Var2 (y-axis) are a series of 100 SNPs. The triangle corresponds to SNP combinations when the diagonal contains simple SNP effects. Z-axis reports NGG estimated \(\widehat{\theta }\) values of simple SNPs (diagonal) and combinations (rest of the triangle). Purple points correspond to simulated support of simple (diagonal) and epistatic signals (in the triangle). The sample size is 5000. A and B Genotype and phenotype data are simulated using specific and modulable parameters (see Additional file 1. Material for details on the simulation). Random noise is added. NGG retrieves the 5 simulated signals (purple) including the pure epistatic effects (outside of the diagonal). C and D Phenotype data only have been simulated while genotypes are from the Arabidopsis genome (SNPs are sampled from X matrix). Again epistatic interactions (purple points) are retrieved by NGG. E and F report heatmaps for Recall and % of recovery in an analysis of 1000 × 1000 SNP interactions (0.5 M interactions) where 10 non-null signals have been simulated. Recall in E is calculated for the top 10 stronger signals. The percent of true signal recovery in F is calculated for the top 100 signals. We have found that NGG is quite resilient to noise in the data (on phenotypes Additional file 1: Fig. 1) and the power of NGG increases quickly with the number of individuals and for stronger heritabilities

We also measure the NGG ability to detect epistatic signals when the interacting SNPs are not randomly selected. Indeed, we simulated a scenario whereby a SNP can have a simple (1D) effect combined with interaction effects (2D). Again, NGG is able to retrieve the simulated epistatic interactions in an even more complex mixture of simple and combinatorial effects (Additional file 1: Fig. S2).

In this first pass of validation procedure on simulated data, the epistatic effects were computed to reflect the Arabidopsis genome structure that contains a very high homozygosity (simulation code is available in the GitHub repository). However, epistatic interactions, in particular in heterozygous organisms, can be of different kinds as described earlier by Marchini et al. (2005) [22] (Additional file 1: Fig. S3A). Thus, to analyze further the potential of the NGG algorithm to retrieve a certain diversity of interactions, a second simulation was performed including now heterozygosity and three sorts of epistatic interactions [22] (simulation code is available in Git repository as well). We measured NGG capacity to detect 3 different types of epistatic interactions fully described by Marchini et al. (2005) [22], namely Type 1: multiplicative within and between loci, Type 2: Two-locus interaction multiplicative effect, Type 3: Two-locus interaction threshold effect. We show that the sparsity of the signal is important (although not crucial) for NGG to detect epistatic signals (Additional file 1: Fig. S3). This can be explained by the mathematical construction of the compress sensing problem. We also demonstrate that Type 1 and Type 2 interactions are easier to discover than Type 3 interactions and that having different types of interactions in the same simulation run does not affect NGG detection capacities (Additional file 1: Fig. S3).

Having defined the potential of NGG to discover epistatic signals on simulated data we then moved to compare NGG with previously benchmarked results of regular 1D GWAS analyses.

Estimation of NGG efficiency for 1D GWAS on real data

We further benchmark our method on state-of-the-art available datasets and modeling approaches [4, 23]. For this we first compared unidimensional (i.e., 1D or classical) GWAS results using the 107 Arabidopsis phenotypes studied in the landmark paper Atwell et al. [4]. We observed that major signals retrieved with the EMMA algorithm [4, 23] are also retrieved by NGG (Fig. 3A and Additional file 1: Fig. S5 for the 107 phenotypes). For instance, EMMA and NGG methods both identify a major peak for the phenotype 88: bacterial disease resistance (Fig. 3A). This peak directly identifies the resistance gene RESISTANCE TO P. SYRINGAE PV MACULICOLA 1 (RPM1) [24]. It is worth noting that for this particular phenotype, some signals emerge in NGG that are not detected by EMMA (Fig. 3) and that for certain phenotypes, NGG and EMMA converge towards a \(x\)2 relationship (Additional file 1: Fig. S6 S7). The opposite is also true although less frequent (see for the 107 phenotypes Additional file 1: Fig. S6 S7). Similar analyses have been realized on the 18 phenotypes from the Campos et al. paper [25], leading to the same conclusions (Additional file 1: Fig. S6 S7). Interestingly, NGG clearly identifies in the top hits the effect of FLOWERING LOCUS C (FLC), a major gene in the control of flowering time [26, 27] in contrast to EMMA. Here we took this gene as an example of which NGG may be good at retrieving such important signals since it is intrinsically built to retrieve \(\widehat{\theta }\), considering the other SNP effects (Model 1 and Model 2). For this, we compared the capacity of NGG and EMMA to detect signals in the vicinity of the FLC locus (20 kb window). Interestingly, Fig. 2B shows that the NGG model indeed retrieves FLC as being the second strongest signal when EMMA reports it as the 30th signal (Fig. 3B). Finally, for every single phenotype, we quantified the overlap between the increasing k-top \(\widehat{\theta }\) and the EMMA signal. For the vast majority of signals, we found a good congruence between signals varying between 40 and 100% for Atwell et al. phenotypes and between 9 and 59% for Campos et al. phenotypes (Additional file 1: Fig. 5).

Fig. 3
figure 3

Next-Gen GWAS retrieves 1D-GWAS signal in Arabidopsis comparable to routinely used MML (EMMA [4, 23, 28]) and points to the FLC locus for flowering phenotypes. A Data from Atwell et al. (2010) have been used to compare the efficiency of our algorithm to the standards of GWAS in Arabidopsis. NGG and EMMA algorithms largely retrieve similar signals. B The phenotype 48 (days to flowering trait [8W]) NGG results are displayed. SNPS in the close vicinity of the FLC locus (a major component of flowering in plants) is represented by black dots. The scatter plot presents the fact that NGG better detects the FLC effect as compared to EMMA

It is noteworthy that the observed good correlation between EMMA and NGG results may indicate that NGG performs genome population structure correction comparably to mixed models. Population structure correction is explained in the light of the NGG procedure (fully described in Additional file 1: Text 1) and as follows. Equation #16, in Additional file 1: Text 1, solves the compressed problem by utilizing a compressed version of the kinship matrix estimation (the AXtXAt matrix), which automatically incorporates a renormalization via this estimated projected kinship matrix during resolution. Furthermore, the final algorithm (Additional file 1: Text 1 and provided code), which has a specific piecewise structure and involves averaging, enables the estimation of effects by breaking any connections that may exist between the coordinates. However, further simulations incorporating diverse genetic architectures and models of population structure would be required to fully validate this observation.

Having shown that NGG is able to retrieve GWAS signals on original datasets we decided to evaluate its speed in comparison to other algorithms.

Estimating acceleration towards 2D-GWAS

We then compared the runtime of NGG with permGWAS which is the fastest GWAS to date [29]. PermGWAS was itself challenged on the same dataset, containing 1 million SNPs and 1000 individuals, to GEMMA and SNPTESTv2. Their respective runtimes were 6, 29, and 23 min. Regarding these results then further report only the comparison of NGG to permGWAS only.

All runtime experiments were measured on the same server using Ubuntu 20.04.3 LTS with 40 CPUs, 377 GB of memory, and 4 Quadro RTX 6000 GPU, each with 24 GB of memory.

First we estimate the effect of the number of markers on the runtime (Additional file 1: Fig. S4a), we fixed the number of samples to 1000 and varied the marker between 50 K and 2 M SNPs. As summarized in Additional file 1: Fig. 4, both permGWAS and NGG are proportional to the number of SNPs with a logarithmic relationship. The NGG is more than two orders of magnitude faster than permGWAS. For the maximum marker size (2 M) NGG took approximately 5 s, while for permGWAS the runtime took more than 460 s (~ 7 min).

Then we estimated the effect of the number of samples on the runtime (Additional file 1: Fig. S4b), we fixed the number of markers to 1 M and varied the number of samples between 1 and 10 K. Additional file 1: Fig. 4 allows us to observe that NGG outperforms again permGWAS by at least two orders of magnitude. For the maximum number of samples (10 K), NGG took approximately 6 s, while for permGWAS the runtime was more than 985 s (~ 16 min).

In conclusion, these results demonstrate that NGG delivers similar and potentially more accurate results in regards to other regular GWAS techniques and is a hundred times faster. This speed improvement brings the calculation for ~ 60 billion SNP or combination of SNPs below an hour making possible the computation of entire epistatic maps.

2D GWAS on real data

Being confident that NGG has the potential to point to true epistatic effects (Fig. 2) and having in mind that the number of individuals greatly improves the detection capacity of our model (Fig. 2E, F; Additional file 1: Fig. S1), we tested NGG analysis on the dataset with the greater number of Arabidopsis ecotypes extracted from work by Campos et al. 2021 [25]. In this work, Campos et al. (2021) provide the elementary composition (18 different elements) for > 1100 different Arabidopsis ecotypes having been fully sequenced by the 1001 genome project [21].

Figure 4 reports results of unidimensional NGG for Phosphorus content (noted P31) of Arabidopsis leaves, that can be displayed at the same time as (i) \(support \,for \,the \,model \,x \,|SNP \,effect|\) or as (ii) pure \(SNP \,effect\) (\(\theta\)). The latter provides a Manhattan plot with negative values that can be interpreted as the SNP having a negative effect on the phenotype as compared to the reference genome (here Columbia-0 ecotype) (Fig. 4). Also, the effect reported in this Manhattan plot is now expected to be directly proportional to the effect of the genetic variation as compared to Col-0 phenotype, helping to choose for the best variant or gene to study.

Fig. 4
figure 4

NGG provides a direct estimation of the SNP effect (\(\theta\)) on the phenotype (Col-0 being the reference genome). The upper plot presents the NGG signal combining support (effect or not) × absolute value of the estimated effect of the genetic variation. The lower plot reports the estimated effect of each SNP (θ). Colored data points (according to chromosome number) are emerging from the noise in a bootstrap procedure comparable to the permGWAS procedure (18)

We further proceeded with the computation of full epistatic maps or 2D-NGG for phenotypes retrieved from the Campos et al. [25] and Atwell et al. [4] datasets. We focused on these datasets as they present a relatively high number of ecotypes (> 1000), and an important diversity of well-known phenotypes respectively. To do this, we prefiltered SNPs having a particular Minor Allele Frequency (MAF) because the probability for the combination of SNPs (that we call MIAF for Minor Interaction Allele Frequency) to be of interest for epistatic measurements directly depends on the MAF as \(Z=X\star X\) (above). For this we prefiltered 346,094 SNPs, for Campos et al., and between 341,067 and 371,956 SNPs, for Atwell et al., having a MAF greater than 0.3. The full epistatic landscape is thus 59.890 billion interactions for Campos et al. [25] and between 58.163 and 69.175 billion interactions for Atwell et al. [4].

Nowadays, this quantity of data represents a challenge on its own to compute, store, and display the results as it relates to a “Very High-Dimensional” (VHD) framework [30]. VHD is mathematically defined in terms of the size of the genotypic matrix \(X\) (n rows and p columns) and in terms of sparsity of the unknown parameter to be estimated or tested, here the number of “active” SNPs and interactions for a given phenotype: k. In this framework [30], we can evaluate the effects of VHD genotypic input matrices on the performance of several popular methodologies (for hypothesis testing, support estimation, and prediction) and show that when k log(p/k) is large with respect to n then statistical estimation and testing errors inflates dramatically. We believe that this is at least in part the reason for which full epistatic maps (2D-GWAS) were so far out of reach.

Following this line, in our study (Fig. 5), n = 999, and p is around 60 billion. Following a reasonable estimate for sparsity k, granting satisfactory and reliable outputs is estimated to be not more than 50. This is why, in our forthcoming study we mainly consider and analyze in a final stage around 30 significant SNPs interactions or composite components.

Fig. 5
figure 5

2D-NGG results provide an estimation of 61.2 billion SNP combination effect for A Atwell et al. phenotype ID:31, days to flowering time FT10 and B Phosphorus content Campos et al. [25] measured by ICP-MS. The results are presented as heatmaps and histograms to observe the epistatic interactions between SNPs

Figure 4 displays 2D-NGG results for (i) Arabidopsis flowering times (Fig. 5A–C) [4] and (ii) Arabidopsis phosphorus (P31) leaf content (Fig. 5D–F) [25]. Results are displayed as a square heatmap triangle for which ~ 60 billion signals \(|\widehat{\theta }|\) are provided. One 2D-NGG result dataset represents ~ 500 Go of data. To navigate through this large dataset, a visualization tool named Luciol has been developed that can be understood as a “Google Earth” for full epistatic maps. Briefly, results are organized in layers as such the max intensity of a genomic region is reported on the higher layers. Here in Fig. 5, layer 11 represents our maximum zoom-out condition. A zoom between layer 11 to layer 0 (the layer for which a given pixel represents a direct SNP/SNP combination) corresponds to a 4.2 million times zoom. In other words, a pixel in layer 11 reports the max intensity of 4.2 million SNP/SNP interactions underlying layer 0. Observation of full epistatic maps as well as local signals informs on the genetic architecture underlying a given phenotype (Fig. 5).

In the case of the flowering trait (Fig. 5A), around 6 major epistatic signals emerge where 2 of them are close to the diagonal. The proximity of the diagonal refers to potential epistatic interactions of neighboring genes (although a few Mbp away). To display unambiguous epistasis we thus decided to report here the fourth stronger effect that lies very far from the diagonal. A zoom at the 2D-locus reveals the structure of a 2D-GWAS peak that appears bi-modal (i.e., supported by at least 2 distant SNP combinations, 2 local bright spots in the epistatic map, Fig. 5B). This peak points to 2 loci predicted to be epistatic. The first locus is at position CHR4:6,524,710, and the second one is at CHR1:6,243,417. Using these coordinates, the matrix X and phenotype Y are parsed to plot the phenotypic distribution following the combination of SNPs (a sort of 2D-haplogroup). This is reported by the box plot in Fig. 5C. Herein we observe that this epistatic effect involves 2 loci having a moderate effect individually as reported to the SNPi and SNPj boxplots (left and right panels). However, the combination of the simple effect cannot predict the effect of the combination since the positive effect of SNPj, from 0 to 1 modality, seems enhanced by the SNPi (0) modality and totally repressed by the SNPi (1) modality. This clearly indicates the potential presence of an epistatic effect between these 2 loci.

We also report (Fig. 5D to F) the epistatic interactions in the control of plant leaf phosphorus content. This epistatic map reports around 8 strong epistatic signals. As an example, we zoomed into 2 of them being the stronger ones with respect to their predicted value (\(|\widehat{\theta }|\)). The first one is relatively close to the diagonal although both epistatic SNPs lie in chromosome 1 but eleven Mbp away (Fig. 5F top panel). The second one concerns an epistatic effect predicted to involve SNPs on 2 different chromosomes namely CHR5 and CHR3 (Fig. 5F bottom panel). These 2 epistatic signals are built upon the effect of a strong combination of SNP effects as it appears impossible to predict the combinatorial output of these SNPs by solely analyzing the effect of the simple SNP modalities (compare box plots of SNPi and SNPj to box plot of SNPi:SNPj). Here, these effects can totally be missed by previous studies of epistasis that, to date, necessarily implied a selection of genetic variables [12].

We succeed in demonstrating that the NGG algorithm allows the computation of full epistatic maps with a gene resolution at least for high gene density genomes such as that of Arabidopsis.

Missing heritability recovery and phenotype prediction

We set out to understand the “missing heritability” explained by our recovered epistatic interactions. To do so, we calculated the increased variance explained (as an \({h}^{2}\) proxy) being retrieved from 2D-GWAS as compared to regular 1D-GWAS (Fig. 6). The differential heritability between 1 and 1D + 2D GWAS was estimated by Principal Component Regression (PCR) [31], carried out on a set of selected SNP and SNP-interactions (Fig. 6A). The principle of PCR dates back to the late 50’s [31]. PCR combines Principal Component Analysis (PCA) on the input features of a model followed by linear regression [31]. First, a PCA of X provides a low number of principal components and a dimension reduction by selecting fewer components associated with the highest eigenvalues modulus of X. Regression is then performed on this reduced set of components (related to the VHD problem that we described above) that play the role of new synthetic inputs. PCA concentrates the information of the large matrix X or Z in a smaller matrix, removing collinearity as well because the components are, by definition, not correlated.

Fig. 6
figure 6

Estimation of retrieved missing heritability. A Analysis scheme employed to estimate retrieved heritability and phenotypic predictions from 1D signals (blue diagonal), and from 2D NGG signals orange triangle representing > 59 billion interactions. B Heritability (h2 seen as adjusted R.2) is measured for an increasing number of PCA components and for signal retrieved only from 1D-GWAS or 1D-GWAS (V data points) + 2D-NGG (W data points)

Here, PCR is carried out (i) on a set of p SNPs and then (ii) on a set of the same SNPs as in (i) + q SNP/SNP-interactions (Fig. 6B). The plots show the retrieved “missing heritability” (difference between blue [1D signal] or red lines [1D signal plus 2D_random], the controls, and the green line [1D signal plus 2D]) as measured by the adjusted R2 when the number of selected components increases (x-axis Fig. 6B). For the vast majority (16 of the 18 phenotypes), a good proportion of heritability is retrieved in the 2D signals. Only, Cobalt or Selenium do not display a radical improvement in the explained variance. By applying this method, we observed that information in the epistatic landscape indeed contains a good proportion of the missing heritability (Fig. 6B). For the Phosphorus content of Arabidopsis leaves, for instance, the heritability measures in the 1D GWAS ranges around 22%. Estimated h2 then increases to 33% when the information in the 2D-GWAS is considered. We note here that we do not strictly evaluate the missing heritability recovered but rather a proxy of it.

Having a comprehensive view of gene resolution epistatic maps opens up possibilities for at least two developments. The first is experimental validation. This process is extremely labor-intensive and could take a considerable amount of time to precisely dissect epistatic interactions. Although these are currently under investigation, we chose to publish our findings primarily due to the second potential development. The second area of advancement is in phenotypic prediction. Essentially, NGG can be viewed as a highly effective variable selection process (Fig. 6A), which could significantly benefit precision medicine and various agronomic selection programs for plants and animals.

We thus further evaluate the role of NGG signals for phenotypic predictions through the use of a broad set of machine learning algorithms including Deep Neural Networks (DNN), Support Vector Machine (SVM), Gaussian Processes (GP), Gradient boosting (GB), Random Forest (RF), Linear regression, Lasso, Elastic Net. These techniques were used to predict the 18 phenotypes from the Campos et al. [25] work (described above). We also crossed these machine learning techniques with an increasing number of 1D and 2D signals/SNPs (10, 100, 500, 1000, 5000, 10,000). To perform a proper control, we repeated this in silico experiment but instead of providing the models with proper 2D signals, we randomly sampled epistatic signals (named 2D_random) to evaluate our capacity to predict plant mineral content. As classification problems are easier to solve and that the number of individuals is still a bit limited for regression approaches, we also used quantiles to rank phenotypes into 5 or 3 classes (Fig. 7A). By crossing all these parameters, we ended up with 1728 different models for 1D + 2D signals (y-axis of the plot Fig. 7B) and the same number of models for 1D + 2D_random (x-axis Fig. 7B). Our capacity to predict phenotype is performed on 50% of the dataset (validation set) that were not used to (i) perform the NGG analysis, (ii) neither to fit or train the models. The quality of the models is evaluated through classical precision/recall curves and F scores. Figure 7A presents the F1 scores of best-predicted classes (measure a precision/recall compromise) for 1D + 2D random against 1D + 2D models. All the models lying above the diagonal (x = y) correspond to models for which predictive power is improved by epistatic signals (Fig. 7A, Additional file 1: Table 1).

Fig. 7
figure 7

NGG retrieves genetic markers in epistatic signals improving machine learning procedures. A Analysis scheme employed to measure the effect of the 2D GWAS signal to improve phenotypic predictions. The dataset is divided into a train set (50%) and a test set (50%). The train set is used to perform 1D and 2D GWAS and retrieve stronger GWAS signals. The SNP (1D) and SNP combinations (2D) positions are used to predict phenotype classification from the test set that did not serve to identify the SNPs. Phenotype prediction is performed on the test set. B In this plot, each dot corresponds to a combination of (i) given machine learning model (among SVM, RF, DNN, Gaussian processes, LASSO, and Elastic Classifier) trying to predict (ii) a given phenotype (18 elemental concentrations of Arabidopsis leaves represented with different colors) combined with different learning data format including (iii) a different number of classes (3 or 5 classes) and (iv) different number of SNPs (30, 100, 500, 1000, 5000, 10,000). The x-axis reports the max F1 score for the model provided with SNPs simple 1D signals and randomly picked 2D epistatic SNP combinations (our control). The y-axis reports the max F1 score for the model provided with SNPs simple 1D signals and 2D epistatic SNP combinations. B We observe an improvement (above the y = x line) of > 57% of the models provided with 2D epistatic signals. Arrows point to the two best models (max F1 score). C Prediction improvement is even more dramatic (80%) for models predicting phenotypes from 30 top 1D plus 30 top 2D signals. D, E Examples of the two best predictions of the Molybdenum (Mo98 phenotype) classified concentrations are provided as confusion matrices

We observe that 2D epistatic signals improve phenotypic classification (Fig. 7B) as 57% of the models are improved. Interestingly, models having a low F1 score and models having a high F1 score tend to beneficiate most of the epistatic signals. Furthermore, this improvement is even more dramatic when we consider the models having a lower number of SNP and SNP interactions (30 + 30) (Fig. 7C). In this particular case with a low number (60) of explanatory variables 80% of the machine learning models are improved by 2D signals as compared to randomly picked ones (Fig. 7C). We wish to highlight some particular points for which we observe an increase in our capacity to predict phenotypic classification and an overall good classification outcome (Fig. 7B, D, E). The arrows in Fig. 7B point to 2 models for which the 1D signal alone does not allow a very good classification (xD = 0.534, xE = 0.535), when the same models with an epistatic signal reach an F score of 0.732 (yD) and 0.728 (yE), respectively. We also observe some phenotypes such as Na23 (sodium leaf content) for which most models and parametric values of the machine learning procedures globally beneficiate the epistatic signal showing that retrieved 2D signals are globally bringing new information (purple circle Fig. 7B). Figure 6D and E display an example of our capacity to predict phenotype classification for molybdenum leaf concentrations. This level of precision and recall opens avenues for plant selection procedures assisted by epistatic markers.

In summary, our study successfully presents the creation of comprehensive epistatic 2D maps with sufficient SNP density to achieve gene-level resolution. We applied our method to the model organism Arabidopsis thaliana, leveraging its readily available dataset. Importantly, our approach is universally applicable and can be readily adapted to other biological models, especially in the context of human genetics. As hypothesized before [18], we demonstrate that a substantial part of the missing heritability lies in epistatic interactions (Fig. 6). Finally, we show that this never observed fine grained 2D epistatic signal brings us a bit closer to the prediction of phenotypic values by machine learning procedures on plants, but we hope soon, in other biological models as well.


In this study, we introduce NGG, a method that recovers some of the missing heritability. The signal we have uncovered appears to hold epistatic information, which is not fully captured by traditional variable selection methods, as the individual impact of SNPs is frequently not significant (Fig. 5B). This two-dimensional signal has been demonstrated to enhance genetic prediction through the application of machine learning techniques (Fig. 7).


One limitation of this approach has to do with the VHD problem. Indeed, it has been stated by Candes et al. [16] that the compression works only for sparse signals. This compression allows very important acceleration but it comes at the expense of our capacity to retrieve the entire signal. As stated before for 60 billion interactions we can expect to retrieve around 50 signals given the compression that we apply. However, this conclusion has to be moderated as our algorithm is built to solve the problem by pieces being defined to fit GPU architectures. Thus, we also think that more effects can be retrieved as each piece will be under the VHD constraint but that this constraint can be at least partially canceled out by the reconstruction. As such we think that more work is needed that may improve our capacity to further access epistatic signals in particular for the smaller effects.

Interpretation of results as compared to standard GWAS methods

The NGG algorithm only provides effect sizes, not p-values, so the next question would be how to select a cut-off point for association, given that we observe that the highest signals are very likely to contain true signals (Fig. 2E, F). Since NGG does not multiply statistical tests, it does not in itself require FDR correction. So in the first instance, we advocate to select the stronger signals as the most important and to study 2D haplogrouping (Fig. 5) to determine the sort of the underlying epistatic signal. Furthermore, as has been done for years, we can apply an empirical analysis of the signal "shape" to select peaks of interest. Indeed, for 1D-GWAS, the most interesting signals usually require several variants supporting the same peak. The same logic can be applied to 2D-GWAS peaks, which appear in the epistatic map in the form of an island (Fig. 5). This island is made up of several combinations of variants likely to support an epistatic signal.

It is important to note that an interesting phenomenon emerges when EMMA and NGG are compared. For some phenotypes such as the ones described in Fig. 3, the correlation between both signals is not perfect. However, for other phenotypes (Additional file 1: Fig. 6), we observe a perfect convergence of both algorithms on an x2 relationship. We did not find any valid explanation for this phenomenon. This perfect correlation does not seem to be explained by the types of phenotype (continuous or discrete) or the explained heritability. So more work will be needed to understand the rules of convergence of both techniques.

Opportunities for future work

The next challenge will then be to study higher orders (3D and more) of epistatic levels. However, we believe that even with NGG this is still far from reach mainly for 2 reasons. The first one is that the complexity of the interactions in heterozygous organisms will attain 27 cases to study (33) instead of 9. This will increase the number of individuals to genetically characterize and reach a good enough statistical power. Secondly, the MAF cutoff will need to be even more increased to allow the observation of the different SNP combinations. This cutoff is stringent enough that it may remove rare sequence variations having a potential impact on the phenotype. A new route of investigation for this kind of variation will need to be opened.

Despite this situation, it is worth noting that machine learning procedures that we used in the present work to predict phenotypes (such as DNN for instance) may already exploit some higher level of non-linear interaction between explanatory variables but only between variants having 2D interactions.


We believe that our technique is a valuable tool for recovering some of the missing heritability hidden in epistatic interactions (Fig. 6). Furthermore, its adaptability to both existing and forthcoming datasets suggests promising avenues for genetic exploration.



Arabidopsis dataset corresponds to data issued from the 1001 genome project [21] and kindly provided by Arthur Korte lab. It consists of a genotype matrix above mentioned as a genotype or X matrix containing 9,124,892 SNPs and 1135 ecotypes. For NGG analysis MAF is controlled (0.3 < MAF) resulting in a MAFed X’ matrix containing 346,094 SNPs for Campos et al. [25] and between 341,067 and 371,956 SNPs, for Atwell et al. [4].

The phenotype dataset corresponds to the 18 phenotype from Campos et al. [25] and the 107 phenotype from Atwell et al. [4].


The simulations (Fig. 1) are performed on R. Code can be found at ( Mathematics supporting NGG and algorithmic logic are provided in the Additional file 1: Text 1.

The genetic model simulating the SNP and the phenotype is generated from a binomial matrix with a 0.5 ratio. From this matrix, the interaction matrix is built. The resulting matrix has a simple effect, interaction effect, and pure quadratic effect. Then a sparse parameter is built by dispatching non-null coordinates between simple effect, interaction effect, and quadratic effect. The resulting theta is used to get the support vector, which is a boolean vector indicating where signals need to be found. Finally, two kinds of noise can be added: a fixed noise or a random noise. The code is available on the repository (


The algorithm described in the result and discussion section (fully detailed in its mathematical innovation and algorithmic processing in Additional file 1: Text 1) consists in applying compress sensing techniques to accelerate calculation in the form of GPU accelerated code (Fig. 1). A python version of the code is provided at NGG provides \(\widehat{\uptheta }\) values for each pair of variables (here SNPs). The variable selection procedure is made of two steps: (i) the effects collected in are ranked in decreasing order and (ii) only the N* first largest effects are retrieved. Here the choice of N* follows the lines of [22] (see, e.g., proposition 6.2: our N* stands for the k in the cited article) since we are in a “very high dimensional “ framework in the sense of data scientists and statisticians.

Computer power

This work has been performed on a PowerEdge T640 DELL Server, RAM 377 Go, 4 NVIDIA Quadro RTX 6000 (24Go).

Availability of data and materials

Phenotypic and genotypic data have been retrieved online. Full interaction model generator and Heritability estimation code can be found at, Machine learning predictor from NGG data can be found at, and a GPU parallelized code in python at Raw data (~ 500–600 Go per epistatic map) results of 2D-NGG for the phenotypes presented in Fig. 4 are available without condition upon request to any of the corresponding authors because they are too large for being hosted on All codes can be retrieved at reference [32] and are published under Open Software License 3.0.


  1. Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, et al. The genetic architecture of type 2 diabetes. Nature. 2016;536:41–7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–78.

    Article  Google Scholar 

  3. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 2017;45:D896-901.

    Article  CAS  PubMed  Google Scholar 

  4. Atwell S, Huang YS, Vilhjálmsson BJ, Willems G, Horton M, Li Y, et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature. 2010;465:627–31.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Tian D, Wang P, Tang B, Teng X, Li C, Liu X, et al. GWAS Atlas: a curated resource of genome-wide variant-trait associations in plants and animals. Nucleic Acids Res. 2019;48:D927–32.

    Article  PubMed Central  Google Scholar 

  6. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am J Hum Genet. 2017;101:5–22.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Chatelain C, Durand G, Thuillier V, Augé F. Performance of epistasis detection methods in semi-simulated GWAS. BMC Bioinformatics. 2018;19:231.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–53.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Phillips PC. Epistasis–the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet. 2008;9:855–67.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Hind J, Lisboa P, Hussain AJ, Al-Jumeily D. A Novel Approach to Detecting Epistasis using Random Sampling Regularisation. IEEE/ACM Trans Comput Biol Bioinform. 2020;17:1535–45.

    CAS  PubMed  Google Scholar 

  11. Cordell HJ. Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet. 2009;10:392–404.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Niel C, Sinoquet C, Dina C, Rocheleau G. A survey about methods dedicated to epistasis detection. Front Genet. 2015;6:285.

    Article  PubMed  PubMed Central  Google Scholar 

  13. Slim L, Chatelain C, Azencott C-A, Vert J-P. Novel methods for epistasis detection in genome-wide association studies. PLoS One. 2020;15:e0242927.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Snaebjarnarson AS, Helgadottir A, Arnadottir GA, Ivarsdottir EV, Thorleifsson G, Ferkingstad E, et al. Complex effects of sequence variants on lipid levels and coronary artery disease. Cell. 2023;186:4085-99.e15.

    Article  CAS  PubMed  Google Scholar 

  15. Koo CL, Liew MJ, Mohamad MS, Salleh AHM, Deris S, Ibrahim Z, et al. Software for detecting gene-gene interactions in genome wide association studies. Biotechnol Bioprocess Eng. 2015;20:662–76.

    Article  CAS  Google Scholar 

  16. Candès EJ, Romberg JK, Tao T. Stable signal recovery from incomplete and inaccurate measurements. Commun Pure Appl Math. 2006;59:1207–23.

    Article  Google Scholar 

  17. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.

    Article  CAS  PubMed  Google Scholar 

  18. Zuk O, Hechter E, Sunyaev SR, Lander ES. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc Natl Acad Sci U S A. 2012;109:1193–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Slyusar VI. A family of face products of matrices and its properties. Cybern Syst Anal. 1999;35:379–84.

    Article  Google Scholar 

  20. Martini JWR, Crossa J, Toledo FH, Cuevas J. On Hadamard and Kronecker products in covariance structures for genotype × environment interaction. Plant Genome. 2020;13:e20033.

    Article  PubMed  Google Scholar 

  21. 1001 Genomes Consortium. Electronic address:, 1001 Genomes Consortium. 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana. Cell. 2016;166:481–91.

    Article  Google Scholar 

  22. Marchini J, Donnelly P, Cardon LR. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005;37:413–7.

    Article  CAS  PubMed  Google Scholar 

  23. Korte A, Vilhjálmsson BJ, Segura V, Platt A, Long Q, Nordborg M. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nat Genet. 2012;44:1066–71.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Grant MR, Godiard L, Straube E, Ashfield T, Lewald J, Sattler A, et al. Structure of the Arabidopsis RPM1 gene enabling dual specificity disease resistance. Science. 1995;269:843–6.

    Article  CAS  PubMed  Google Scholar 

  25. Campos ACAL, van Dijk WFA, Ramakrishna P, Giles T, Korte P, Douglas A, et al. 1,135 ionomes reveals the global pattern of leaf and seed mineral nutrient and trace element diversity in Arabidopsis thaliana. Plant J. 2021.

  26. Michaels SD, Amasino RM. FLOWERING LOCUS C encodes a novel MADS domain protein that acts as a repressor of flowering. Plant Cell. 1999; Available from:

  27. Sheldon CC, Burn JE, Perez PP, Metzger J, Edwards JA, Peacock WJ, et al. The FLF MADS box gene: a repressor of flowering in Arabidopsis regulated by vernalization and methylation. Plant Cell. 1999;11:445–58.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Segura V, Vilhjálmsson BJ, Platt A, Korte A, Seren Ü, Long Q, et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat Genet. 2012;44:825–30.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. John M, Ankenbrand MJ, Artmann C, Freudenthal JA, Korte A, Grimm DG. Efficient Permutation-based Genome-wide Association Studies for Normal and Skewed Phenotypic Distributions. bioRxiv. 2022 p. 2022.04.05.487185. Available from:, [Cited 2022 Jul 13].

  30. Verzelen N. Minimax risks for sparse regressions: Ultra-high dimensional phenomenons. EJSS. 2012;6:38–90.

    Google Scholar 

  31. Park SH. Collinearity and Optimal Restrictions on Regression Parameters for Estimating Responses. Technometrics. 1981;23:289–95.

    Article  Google Scholar 

  32. Carré C, Carluer JB, Mas A, Krouk G.. Next Gen GWAS. Zenodo; 2024..

Download references


We wish to thank Dr Sandrine Ruffel and Dr Benoit Lacombe for their valuable remarks on the manuscript. We are also grateful to Dr. Julie Clar for her assistance in project management throughout the development of the initial version of the algorithm (PAM).

Peer review information

Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history

The review history is available as Additional file 2


Research on phenotype prediction is supported by the CNRS and by Montpellier University (ISITE MUSE project AI3P) to G.K. and A.M. This work/project was publicly funded through ANR (the French National Research Agency) under the “Investissements d’avenir” program with the reference ANR-16-IDEX-0006. The development of a first version of the computer algorithms was supported by SATT-AxLR and by Agropolis Fondation (SATT prematuration project) to G.K. A different version of the NGG resolution algorithm and the visualization algorithm named Luciol allowing data exploration has been developed by BionomeeX™. Luciol software development has been supported by the Banque Publique d’Investissement (BPI- Emergence project to BiononeeX).

Author information

Authors and Affiliations



Conceptualization: C.C., G.K., and A.M. Methodology: C.C., J.B.C. (performed prediction, machine learning, and simulations), G.K., A.M., C.E-S, E.H, N.R., Ch.Ch. Writing — original draft: C.C, J.B.C, A.M, G.K.

Authors’ Twitter handles

Twitter handle: @GabKrouk (Gabriel Krouk).

Corresponding authors

Correspondence to Clément Carré, André Mas or Gabriel Krouk.

Ethics declarations

Ethics approval and consent to participate

Not applicable for this publication.

Competing interests

C.C., G.K., and A.M. are co-founders of a CNRS and University of Montpellier spin-off company BionomeeX™ (

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Additional text, figures and table.

Additional file 2.

Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Carré, C., Carluer, J.B., Chaux, C. et al. Next-Gen GWAS: full 2D epistatic interaction maps retrieve part of missing heritability and improve phenotypic prediction. Genome Biol 25, 76 (2024).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: