Genome-wide association studies in plants: the missing heritability is in the field
© BioMed Central Ltd 2011
Published: 28 October 2011
Genome-wide association studies (GWAS) have been even more successful in plants than in humans. Mapping approaches can be extended to dissect adaptive genetic variation from structured background variation in an ecological context.
The genetic sources of phenotypic variation have been a major focus of both plant and animal studies aimed at identifying the causes of disease, improving agriculture and understanding adaptive processes. In plants, quantitative trait loci (QTL) were originally mapped in biparental crosses, but these crosses were restricted in allelic diversity and had limited genomic resolution. The genome-wide association study (GWAS) approach overcomes several limitations of traditional gene mapping by (i) providing higher resolution, often to the gene level, and (ii) using samples from previously well-studied populations in which commonly occurring genetic variations can be associated with phenotypic variation. The advent of high-density single-nucleotide polymorphism (SNP) typing allowed whole-genome scans to identify often small haplotype blocks that are significantly correlated with quantitative trait variation. These approaches have enabled both large studies of human disease, which have identified important loci, and recent plant studies that have been successful in identifying loci that explain large portions of phenotypic variation.
Significant associations between genetic variations and phenotypic diversity have been found in some human studies, but they explain only a few percent of the phenotypic diversity, leading many geneticists to ask 'Where is the missing heritability?' [3, 4]. This question has several possible answers. First, rare variants [3–5], major alleles that are unique to local families, can be detected only when sampling is adequate at the local level. Second, allelic heterogeneity, the phenomenon in which multiple functional alleles of the same gene exist and are associated with different phenotypes, is common, especially in wide population samples [6–8]. Third, single-marker approaches suffer from genetic heterogeneity when multiple major loci are involved and are in linkage disequilibrium (LD) with each other. Fourth, variation resulting from epistatic interactions between genes might go undiscovered because epistasis can only be investigated practically in a sequential scan of major common loci and the genome. Finally, epigenetic variation, which requires sophisticated genotyping, is likely to be a source of missing heritability. The influence of each of these factors on heritability strongly depends on the population sampled. Thus, even true positives will often fail to replicate across populations. Owing to the confounding effect of population structure, true causative SNPs are difficult to identify because they are in LD (that is, in non-random association) with many loci in the genome.
When human GWAS find associations that have genome-wide significance, the SNPs explain only a tiny fraction of the phenotypic variation revealed by family-based studies. But recent GWAS in plants (in Arabidopsis thaliana, rice, and maize) have explained a much greater proportion of the phenotypic variation than human GWAS have. It seems that, in plants at least, the assumption that common genetic variation explains common phenotypic variation holds. In plants, rare variation can become sufficiently common in large families or populations to be identifiable by GWAS. For example, GWAS have identified SNPs and population structure that can explain up to 45% of the phenotypic variation in flowering time. However, flowering time has even higher heritability (approximately 90%), leaving an additional 45% of heritable variation unexplained.
In this review, we consider why GWAS in plants have been successful, focusing on the experimental designs and sampling strategies used in these studies. Those working on GWAS in human genetics and in plants have much to learn from each other. We then discuss future developments for generalized GWAS in plants, taking on board the lessons learned in model species. Empirical geographic knowledge of gene flow and population structure, together with hypotheses about the ecological zones that have imposed selection, enables the sampling of different populations in which the same or different adaptive traits are inherited. A general population re-structuring approach can then be used to uncouple adaptive variation from the genomic background through synthetic outcrossing among lines that have balanced genetic diversity.
Finding the genetic basis of complex traits in plants, such as flowering time, growth rate, and yield, has been a major focus of attempts to improve crops and understand plant adaptation. A. thaliana has long been an attractive model for the study of natural variation and adaptation because of its wide distribution, the diversity of its habitats, and the unequaled genomic resources available for this species. GWAS requires a genomic map in which markers are spaced more closely than the distance over which LD decays. This, in turn, depends on the population sample, specifically the standing genetic diversity and the number of recombination events that shuffle that diversity. In a global set of A. thaliana accessions, LD was shown to decay within 10 kb on average, so the optimal number of SNPs necessary to cover the whole genome was estimated to be 140,000. A genotyping array, designed to type 250,000 SNPs, was used to genotype an initial set of 192 natural accessions. In this seminal study, an extensive set of 107 phenotypes was used to run GWAS in A. thaliana. To test the ability of GWAS to detect the genetic basis of natural variation efficiently, the power to detect previously identified candidate genes was assessed through the calculation of enrichment ratios. In most cases, large enrichment ratios were found [16, 17], meaning that SNPs with high association scores were more likely to be close to previously identified candidate genes than random loci. Furthermore, some of the alleles identified in GWAS overlapped with lower-resolution QTL identified with recombinant inbred line (RIL) mapping [13, 17, 18]. Together, this evidence conclusively demonstrates that GWAS can identify many true genotype-phenotype associations.
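The idea behind an enrichment ratio can be illustrated with a minimal sketch. The exact definition used in the cited studies is not given here, so the version below is an assumption: the fraction of top-scoring SNPs that lie near a candidate gene, divided by the same fraction among all SNPs. The function name, the toy scores, and the candidate flags are all hypothetical.

```python
def enrichment_ratio(scores, near_candidate, top_n):
    """Fraction of candidate-proximal SNPs among the top_n highest-scoring SNPs,
    divided by the fraction among all SNPs (assumed definition, for illustration)."""
    if len(scores) != len(near_candidate):
        raise ValueError("scores and near_candidate must have the same length")
    if not 0 < top_n <= len(scores):
        raise ValueError("top_n must be between 1 and the number of SNPs")
    background = sum(near_candidate) / len(near_candidate)
    if background == 0:
        raise ValueError("no SNP is flagged as close to a candidate gene")
    # Rank SNP indices from highest to lowest association score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_fraction = sum(near_candidate[i] for i in ranked[:top_n]) / top_n
    return top_fraction / background


# Toy data (hypothetical): association scores such as -log10(P) for ten SNPs,
# and whether each SNP lies close to a previously identified candidate gene.
scores = [9.1, 0.3, 7.4, 1.2, 0.8, 6.5, 0.1, 2.0, 0.5, 1.1]
near_candidate = [True, False, True, False, False, False, False, True, False, False]

ratio = enrichment_ratio(scores, near_candidate, top_n=3)
print(f"enrichment ratio: {ratio:.2f}")
```

A ratio above 1 indicates that high-scoring SNPs sit near candidate genes more often than SNPs drawn at random, which is the pattern described above.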
The potential of GWAS in A. thaliana was demonstrated by the successful functional validation of the gene ACCELERATED CELL DEATH6 (ACD6). Natural variation in ACD6 was shown to underpin differences in vegetative growth and in resistance to microbial infection and herbivory. A Col-0 (reference accession) background with a loss-of-function allele of ACD6 displayed increased leaf necrosis, reduced growth and reduced susceptibility to different pathogens when transformed with the ACD6 allele from the Est-1 accession. GWAS was performed for leaf necrosis on a set of 96 natural accessions. Nine of the fifteen SNPs with the lowest P-values in this scan were located close to or within ACD6. None of the new genes identified by GWAS have been functionally validated to date, but this study confirms the ability of GWAS to detect true positives as ACD6 was previously known from forward-genetic mutant screens.
Allowing for the average LD distance (10 kb) is sufficient to enable the identification of individual genes, but the gene density seen in A. thaliana suggests that some genomic regions display extended clusters of high-scoring SNPs instead of sharp peaks. The broad 'mountain range' of associations makes the selection of candidate genes difficult. The width of the 'mountain' can be broad due to extended LD from a recent selective sweep or because of low recombination. In addition, genetic or allelic heterogeneity can create 'mountain ranges' that have multiple peaks. The sweeps acting on common loss-of-function deletions at FRIGIDA (FRI), along with other linked flowering time loci, probably explain the complex pattern of association with flowering time that was observed at this locus. Tightly linked genes have been shown to underlie a complex association with growth rate variation. Another limitation to the ability of GWAS to identify individual genes is the occurrence of false positives that are an artifact of population structure. The worldwide set of natural A. thaliana accessions is highly structured, and when phenotypic variation for the trait of interest overlaps with patterns of population structure, strong confounding can occur. Statistical methods that have been developed to control for population structure [21, 24–27] produce a P-value distribution that is closer to a uniform distribution, although they can have reduced sensitivity. Nevertheless, GWAS in A. thaliana have been shown to have significant power in detecting previously known candidate genes, and they have also detected hundreds of loci that are involved in the natural variation of complex traits. This new knowledge of the number of genes that underlie adaptive traits, and the size of their effects, allows us to better understand the bases of flowering time, growth rate, and yield.
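How closely a set of GWAS P-values follows a uniform distribution, the property that structure-correction methods aim for, can be checked with a rough sketch such as the one below. It is not the procedure of any cited method: it only builds quantile-quantile points and a crude count of excess small P-values. The function names and the synthetic P-values are hypothetical, and all P-values are assumed to be strictly greater than zero.

```python
import math


def qq_points(pvalues):
    """Pairs of (expected, observed) -log10(P), where expected values come from
    evenly spaced quantiles of a uniform distribution."""
    n = len(pvalues)
    observed = sorted(pvalues)
    expected = [(i + 0.5) / n for i in range(n)]
    return [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]


def excess_small_pvalues(pvalues, threshold=0.05):
    """Observed fraction of P-values below the threshold divided by the fraction
    expected under uniformity (close to 1 when there is no inflation)."""
    observed_fraction = sum(p < threshold for p in pvalues) / len(pvalues)
    return observed_fraction / threshold


# Synthetic illustration (hypothetical data): an evenly spaced "well-behaved" scan
# and an "inflated" scan in which every P-value has been pushed toward zero.
well_behaved = [(i + 0.5) / 1000 for i in range(1000)]
inflated = [p ** 2 for p in well_behaved]

print("well-behaved:", excess_small_pvalues(well_behaved))
print("inflated:    ", excess_small_pvalues(inflated))
```

Plotting the pairs from `qq_points` gives the usual quantile-quantile view: points that stay on the diagonal suggest that confounding is under control, whereas a scan that lifts off the diagonal across most of its range suggests residual population structure.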
Maize and rice, two of the most important crop species in the world, have been the focus of intense efforts to map the ancestral genetic variation that underlies agronomic traits such as grain yield, disease resistance, and plant architecture. Maize is an outcrossing plant, with an LD that decays at approximately 2,000 bp on average (a distance 5-fold shorter than that in A. thaliana). It also has a large genome (2.3 Gb of unique sequence), and thus the typing of many SNPs is required to define a haplotype map for maize. A set of 1.6 million SNPs has been designed for maize GWAS, but the dense genotyping of a large number of lines was initially prohibited by cost.
The approach that was taken instead was to genotype a limited number of lines (25 founders) and to cross them to produce 25 RIL families, known as the nested association mapping (NAM) populations. A total of 5,000 RILs (200 per family) were then genotyped at low density. High-density genotypes were then imputed on the basis of high-density genotypes of the founding lines. The complete set of RILs was phenotyped, and SNP associations were then tested across all the RILs, with the test including a term to account for variation caused by the RIL family effect. The main advantages of this approach are: (i) the imputation of high-density genotypes gives some fine-mapping resolution among the 25 founders; (ii) outcrossing reshuffles variation in the founder genomes and therefore provides some control of population structure effects; (iii) joint-linkage mapping identifies low-resolution QTL across all RIL families, and this genetic background can be controlled while performing nested associations for fine mapping; and (iv) the use of RILs allows repeated measures of phenotypes on the same lines, in common and different environments, allowing precise estimation of variation in traits such as flowering time, leaf architecture, and blight resistance [33, 34]. NAM also has some limitations, primarily that the small number of founders limits genetic diversity and ancestral recombination. One special strength is that high-density genotypes are imputed onto progeny typed with fewer markers, where new recombinations have shuffled SNPs that were previously in LD because of population structure. Many designs of NAM are likely to emerge that fit the particular population biology of the target species.
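The role of the family term in a test across all RILs can be illustrated with a small sketch. It is not the NAM analysis pipeline: it assumes the simplest possible model, a single SNP with a fixed effect per family, and estimates the SNP effect by centering genotypes and phenotypes within each family before regressing one on the other (equivalent to ordinary least squares with family indicator variables). The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def center_within_family(values, families):
    """Subtract each family's mean from the values of its members."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, family in zip(values, families):
        sums[family] += value
        counts[family] += 1
    return [value - sums[family] / counts[family]
            for value, family in zip(values, families)]


def snp_effect_with_family_term(genotypes, phenotypes, families):
    """Least-squares SNP effect in a model that also fits one mean per family."""
    g = center_within_family(genotypes, families)
    y = center_within_family(phenotypes, families)
    sum_gg = sum(x * x for x in g)
    if sum_gg == 0:
        raise ValueError("the SNP does not segregate within any family")
    return sum(a * b for a, b in zip(g, y)) / sum_gg


# Toy data (hypothetical): two RIL families with different mean phenotypes
# (10 and 20) and a true SNP effect of 2 per allele copy, with no noise.
families = ["A", "A", "A", "A", "B", "B", "B", "B"]
genotypes = [0, 1, 0, 1, 0, 1, 1, 1]
phenotypes = [10, 12, 10, 12, 20, 22, 22, 22]

with_family = snp_effect_with_family_term(genotypes, phenotypes, families)
# Treating all lines as one family is the same as ignoring the family term.
without_family = snp_effect_with_family_term(genotypes, phenotypes, ["all"] * 8)
print(f"with family term: {with_family:.2f}, without: {without_family:.2f}")
```

In this toy case the allele is more frequent in the family with the higher mean, so ignoring the family term overstates the SNP effect, whereas the within-family estimate recovers the value used to build the data.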
Rice is a selfing species and, like A. thaliana, a good candidate for GWAS. Huang et al. identified an unbiased set of common SNPs that they used to identify strong associations between genetic loci and 14 agronomic traits, including heading date, grain size, and starch quality. Here, the step forward was to use a strategy based on second-generation sequencing technology to develop a haplotype map for 517 Chinese land races across the indica and japonica subspecies of rice (Oryza sativa). The idea was to perform low depth (1X) whole-genome sequencing, and then take advantage of the > 100 kb LD in rice to impute missing data. This strategy was successful because the imputation algorithm that was developed reduced the missing data from 60% to 3%, with 98% accuracy. GWAS was subsequently performed using 671,355 SNPs in a subset of 373 indica lines to avoid the major confounding of population structure between subspecies. This identified between 1 and 7 loci for each agronomic trait, each of which explained between 6% and 68% of the variation in that trait. A few genes that have large effects in controlling traits that are involved in determining yield, morphology, stress tolerance, and nutritional quality were also identified in recent rice GWAS [37, 38]. Together, these studies establish a research platform that can link genomic variation and germplasm collections to enable molecular breeding.
Controlling for population structure is a standard procedure in GWAS, although doing this when the traits are strongly confounded reduces the power of the analysis and can lead to false negatives. This issue is especially likely to arise when studying traits such as flowering time and cold tolerance, which are filtered by environmental gradients that overlap with patterns of population structure. In this case, controlling for population structure can reduce the association signals around major adaptive genes [6, 17, 39]. In this situation, the only solution is synthetic, that is, to re-structure populations by making crosses. Another weakness of GWAS is its lack of power to detect rare alleles that are involved in natural variation. Parametric tests of association, including efficient mixed-model association (EMMA) [40, 41], are sensitive to SNPs that have low minor-allele frequencies, which can show an artificially increased association score (-log(P-value)). Because of this phenomenon, most studies have not considered SNPs that have minor-allele frequencies under 5% or 10%, although these variants do contribute to phenotypic variation. Balancing samples across population subdivisions can homogenize allele frequencies, elevating globally rare variants that are common in certain subdivisions. Their direct trait association can be detected when they are decoupled from population structure. Allelic heterogeneity is another limitation that applies to GWAS and other multi-parent mapping strategies [42, 43] because GWAS assumes that common (biallelic) genetic variation explains quantitative trait variation [6, 17]. Association tests involving SNPs that tag multiple alleles in LD with each other can therefore be positively misleading.
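The minor-allele frequency cut-off described above can be sketched as follows. This is a minimal illustration rather than the filter of any particular study: genotypes are assumed to be coded as counts of the alternate allele (0, 1 or 2) in diploid lines, missing calls are assumed to be `None`, and the SNP names and data are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """Minor-allele frequency from alternate-allele counts (0, 1, 2) per diploid
    line; missing calls (None) are ignored."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this SNP")
    alt_frequency = sum(called) / (2 * len(called))
    return min(alt_frequency, 1 - alt_frequency)


def filter_by_maf(snps, min_maf=0.05):
    """Keep only SNPs whose minor-allele frequency reaches the cut-off."""
    return {name: genotypes for name, genotypes in snps.items()
            if minor_allele_frequency(genotypes) >= min_maf}


# Toy panel (hypothetical) of 30 inbred lines, so genotypes are 0 or 2.
snps = {
    "snp_common": [0, 2, 2, 0, 2, 0, 0, 2, 0, 0] * 3,  # alternate allele in 12 of 30 lines
    "snp_rare": [0] * 29 + [2],                         # alternate allele in 1 of 30 lines
}

kept = filter_by_maf(snps, min_maf=0.05)
print(sorted(kept))
```

With a 5% cut-off, the variant carried by a single line is dropped before testing, which is how rare alleles that do contribute to phenotypic variation end up excluded from the scan.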
The current collection of more than 1,300 A. thaliana accessions, genotyped at 250,000 SNPs (M Horton, J Bergelson and M Nordborg, personal communication) and eventually the data from the 1001 Genomes Project, are large enough samples to begin to deliver empirical knowledge of the deeper patterns of genomic variation on the landscape. By gleaning the genetic information, one can select a core mapping subset, like the RegMap lines in A. thaliana, that has balanced regional diversity and reduced confounding effects of population structure, but an average length of LD decay that is short enough to allow precise mapping of the underlying genes. The distribution of some phenotypes might, however, overlap with patterns of population structure at a local scale. For example, this could be the case in a newly colonized region where a patchy distribution offers little opportunity for gene flow. Large parts of the genome (or the whole genome with complete isolation) might be selected along with the genes controlling locally adaptive phenotypes. In this case, approaches involving wider crosses seem to be the only way to identify the underlying genes.
New genotyping-by-sequencing (GBS) technologies and bioinformatic methods, based on light shotgun sequencing or reduced representation and multiplexing, have the ability to discover, genotype, and impute near-complete population genomic data in any species [46–49]. For a given sequencing investment, there is a trade-off between the sample sequencing depth and the number of samples; with multiplexing, more samples can be sequenced but with lighter coverage. Importantly for imputation, the sequencing depth required for each individual depends on the extent of LD. The increased LD within families allows the haplotype map to be imputed from lower-coverage data, and this is an important advantage of the NAM design. With moderate LD, the rice haplotype map could be assembled from hundreds of landraces typed at 1X coverage. To integrate linkage-based pedigrees and association studies, GBS can be used to type progeny from several maternal lines of population samples. As has been achieved in rice, high-resolution genotypes of the maternal line could be assembled and near-complete genotypes imputed for the progeny.
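The trade-off between depth and sample number for a fixed sequencing investment comes down to simple arithmetic, sketched below. The sequencing budget and genome size are hypothetical round numbers chosen for illustration, and the calculation assumes whole-genome shotgun sequencing with perfectly even coverage, ignoring multiplexing overhead and read-quality losses.

```python
def per_sample_depth(total_bases, n_samples, genome_size):
    """Average coverage per sample when a fixed yield is split evenly."""
    return total_bases / (n_samples * genome_size)


def samples_at_depth(total_bases, target_depth, genome_size):
    """Number of samples that can be sequenced to at least the target depth."""
    return int(total_bases // (target_depth * genome_size))


# Hypothetical budget: 200 Gb of sequence for a species with a 400 Mb genome.
TOTAL_BASES = 200_000_000_000
GENOME_SIZE = 400_000_000

for depth in (1, 5, 10):
    n = samples_at_depth(TOTAL_BASES, depth, GENOME_SIZE)
    print(f"{depth}X coverage: {n} samples")
```

Under these assumptions, dropping from 10X to 1X coverage allows ten times as many samples, which pays off only if LD in the sample is extensive enough for the missing genotypes to be imputed.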
Genotyping arrays only include a fraction of the SNPs identified in a restricted set of lines. Some missing heritability probably originates from the characterization of the genetic diversity using ascertained SNPs, which reduces the ability to detect rare alleles and causal polymorphisms. This can lead to an underestimation of the diversity and relatedness. This component of missing heritability can be largely overcome by next-generation sequencing technology, but repetitive and highly divergent portions of the genome might remain largely inaccessible. Aligning short reads to a single reference genome might introduce some ascertainment bias, but this should be less of an issue as reads become longer.
The study of adaptation in traditional model plants such as A. thaliana, maize, and rice has been moving back 'into the field', with new wild collections and greater ecological context being introduced. At the same time, model systems of plant adaptation, such as columbine (Aquilegia), monkey flower (Mimulus), and sunflower (Helianthus), can now take advantage of genomic tools that enable association mapping. This convergence of disciplines points towards an emerging synthesis of adaptation genetics.
Studies in model species such as A. thaliana, rice and maize have validated these approaches to identifying the genetic bases of adaptive traits. These methods, when combined with the increasing capacity and decreasing costs of next-generation sequencing, will allow GWAS in non-model species. Ultimately, population genomic studies across multiple species that occupy the same habitats will allow comparative studies of adaptive genetic variation among species that have potentially evolved in parallel under the same selective pressures. A better understanding of adaptive processes at the community level might be obtained by comparing the genetic architectures of adaptive traits among species that may have different life histories. We believe that developing landscape and population genomic resources together in new species will enable high-power association mapping experiments to find the missing heritability underlying the adaptive traits seen in the field.
We thank Paul Grabowski, Alex Platt, and Magnus Nordborg for discussion about the manuscript and the reviewers for their comments. BB is supported by a Dropkin fellowship and an NIH grant to Joy Bergelson; GPM is supported by the Argonne-University of Chicago Strategic Collaborative Initiative, and JOB is supported by NIH RO1 GM073822.