The promise and limitations of population exomics for human evolution studies
© BioMed Central Ltd 2011
Published: 14 September 2011
Exome sequencing is poised to yield substantial insights into human genetic variation and evolutionary history, but there are significant challenges to overcome before this becomes a reality.
For the past few decades, advances in molecular biology have continuously refined our understanding of human evolutionary history. A simple model of expansion and global migrations from a single ancestral human population with adaptation at a few protein polymorphisms has transformed into a complex scenario involving introgression among numerous divergent groups, multiple population-specific bottlenecks, and thousands of candidate genomic sites of possible evolutionary importance [1–6]. Although the broad patterns of demographic trends, geographic population structure, and adaptation have now been well established [1–4], emerging genome-scale datasets will enable detailed inferences about particular populations and genes. Major ongoing goals include inferring intracontinental patterns of migration and admixture, reconstructing the history of human population growth and bottlenecks, and categorizing whether polymorphisms are selectively neutral, deleterious, or adaptive (Box 1). Until recently, such questions could be addressed only with the limited statistical power and precision afforded by single nucleotide polymorphism (SNP) arrays or small sets of sequence data. However, exome sequencing has the potential to address many of these questions.
Exome sequencing is a new and powerful technique in which genomic DNA that binds to a predefined target of known exons is sequenced using next-generation technology, in order to capture the protein-coding portion of the genome. The magnitude and cost-effectiveness of exome datasets vastly overshadow many other methods for studying polymorphism that have recently been popular, such as SNP arrays or single locus resequencing studies. Here, we discuss the application of exome data to human population genetics. We argue that exomes will allow many important and detailed analyses that are not possible with SNP arrays because of ascertainment biases. Moreover, although whole-genome sequencing in large population samples is clearly on the horizon, exomes are the most cost-effective and practical way of obtaining sufficiently high coverage to rigorously characterize the spectrum of rare variation. However, the absence of noncoding data does limit the application of exomes in nontrivial ways and can lead to misleading inferences if research is not carefully conducted. Thus, we are cautiously optimistic that exomes will address many remaining questions about human evolution, if incompletely.
It is important to note that the ideal filtering strategy used to generate an exome dataset differs slightly between population genetics and phenotype association studies. In association analyses, the goal is usually to maximize the number of putatively real variants, any of which could be causal for the trait in question, and to ignore invariant sites. However, for endeavors such as resolution of population structure, it is preferable to discard sites with missing data in a substantial proportion of samples in order to minimize clustering of individuals based on 'missingness', defined here as the proportion and identities of genotypes with missing data, and knowledge of invariant sites is essential. As exome sequencing becomes routine and optimized, it will be important to maintain some flexibility in filtering options based on particular research goals.
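As a rough illustration of the missingness filter described above, the following Python sketch computes per-site missingness and drops sites above a threshold. The data layout (a dict mapping site identifiers to per-sample genotype calls, with None marking a missing call), the function names, and the 10% cutoff are all assumptions made for this example rather than values from any particular pipeline.

```python
def site_missingness(calls):
    """Proportion of samples with no genotype call (None) at one site."""
    if not calls:
        raise ValueError("site has no samples")
    return sum(call is None for call in calls) / len(calls)


def filter_by_missingness(sites, max_missing=0.10):
    """Keep sites whose missingness does not exceed max_missing.

    Invariant sites are retained like any other site, because population
    genetic estimates need them in the denominator.
    """
    return {
        site_id: calls
        for site_id, calls in sites.items()
        if site_missingness(calls) <= max_missing
    }


# Hypothetical toy data: genotypes coded as alternate-allele counts (0, 1, 2).
toy_sites = {
    "chr1:1000": [0, 0, 1, 0, 2, 0, 0, 1, 0, 0],            # variable, fully called
    "chr1:1001": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],            # invariant, fully called
    "chr1:1002": [0, None, None, 1, None, 0, 0, 0, 0, 0],   # 30% missing
}
kept = filter_by_missingness(toy_sites)
print(sorted(kept))
```

In this toy example the invariant site survives the filter while the poorly covered site is dropped, which is the behavior wanted for population structure analyses but not necessarily for association studies.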
One of the most promising applications for exome data is the study of natural selection in humans. Inferring patterns of natural selection on genes is a powerful approach for gauging the functional impact of polymorphisms. Although a nontrivial amount of non-coding DNA is functional, it is clear that exons contain a substantial proportion of the genome's phenotypically relevant sites, subject to strong selective pressures. Natural selection is also easier to study using exons, as many existing statistical tests for estimating selection, such as those based on the ratios of nonsynonymous to synonymous sites, are appropriate only for coding sequence (Box 1). However, most of the signature of a selective event can lie in noncoding regions, even if the target of selection is in an exon. Exome data provide substantial power to detect regions of low polymorphism or high linkage disequilibrium only if exon density in the region is sufficiently high. Even then, estimating the precise length of the region affected by selection is not possible without full sequence data, although sequencing the flanking noncoding areas after identifying an interesting region is always an option.
Positive selection, the fixation of new favorable alleles, is an important evolutionary phenomenon that has proven difficult to thoroughly characterize. Numerous studies have identified genomic regions displaying extreme values in statistical tests of selective neutrality, but the overlap among these lists of candidate regions is often poor, suggesting a high proportion of false positives. In addition, it is often unclear whether outlier SNPs are themselves the targets of selection or merely linked to the true targets [4, 15]. Analysis of exome sequences promises to enhance power for resolving these issues. A typical signature of a positive selective sweep includes low π and an excess of rare variants, which can most directly be identified with sequence data. Assuming that the real causal variant is in an exon, it can be pinpointed with high precision. Owing to their rich information content, even a small sample of exomes can show differences between selected and neutral regions and allow adaptive substitutions to be identified. For example, the causal nonsynonymous polymorphism in SLC24A5, a gene that influences skin pigmentation, is a clear outlier with respect to both interpopulation divergence and patterns of polymorphism in flanking exons, such that its adaptive significance is apparent in a sample of as few as ten exomes [13, 16].
Whether human populations actually harbor a large proportion of adaptive coding variants flanked by regions of low π or skewed site frequency spectra depends on where and how selection usually acts in humans, which is still unresolved. If selection acts primarily on non-coding regions or on standing genetic variation, such that dramatic polymorphism-reducing selective sweeps do not occur [18–20], exomes will have less of an advantage over other methods such as SNP arrays or full genomes for studying positive selection. So far, the clearest example of positive selection inferred from exome data is the hypoxia response gene EPAS1, which has evolved rapidly in high-altitude Tibetan populations. The strongest candidate SNP at EPAS1 is in an intron that happened to be included, and the primary evidence for positive selection is high divergence between populations rather than low polymorphism. The fact that the gene was still identified highlights the versatility of exomes, but SNP-based or noncoding-inclusive approaches might have had similar, if not greater, power to detect selection in this case.
Balancing selection, the maintenance of multiple favorable variants, can also be studied with exome data. Under the classic model of balancing selection, two or more alleles are maintained at intermediate frequency in a population. Most of these cases in humans have probably already been identified because the variants in question would be common, although flanking sequence data can help strengthen or refute the case for balancing selection on a particular SNP, as in the case of the prion protein gene PRNP, in which a widely publicized claim of cannibalism-associated balancing selection was shown to be an artifact of ascertainment bias. Under other forms of balancing selection, one allele might be very rare and therefore as yet undiscovered. For example, under fluctuating selection, a currently deleterious, and therefore rare, allele may have been advantageous in the past and could be again in the future. Similarly, the equilibrium allele frequencies in the case of overdominance, or heterozygote advantage, are proportional to the relative selective disadvantages of each homozygote genotype; thus, if one homozygote is quite deleterious (for example, lethal), whereas the other is only slightly less deleterious than the heterozygote, a highly skewed allele frequency will be maintained by balancing selection. It is unknown whether these more complex forms of balancing selection have an important role in the patterns of human genetic diversity, and exomes are ideal for this line of inquiry because their cost-effectiveness allows even rare alleles to be observed.
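The overdominance argument can be made concrete with the textbook one-locus model. The sketch below assumes relative fitnesses of 1 - s, 1, and 1 - t for genotypes AA, Aa, and aa; under that assumption the equilibrium frequency of allele A is t / (s + t). The function names and the parameter values are illustrative choices, not taken from the studies cited here.

```python
def overdominance_equilibrium(s, t):
    """Equilibrium frequency of allele A under heterozygote advantage.

    Assumed fitnesses: AA = 1 - s, Aa = 1, aa = 1 - t, with s > 0 and t > 0.
    """
    if s <= 0 or t <= 0:
        raise ValueError("both homozygotes must be less fit than the heterozygote")
    return t / (s + t)


def next_generation(p, s, t):
    """Frequency of allele A after one generation of viability selection."""
    q = 1.0 - p
    w_AA, w_Aa, w_aa = 1.0 - s, 1.0, 1.0 - t
    mean_fitness = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    return (p * p * w_AA + p * q * w_Aa) / mean_fitness


def iterate(p0, s, t, generations):
    """Apply the deterministic recursion for a fixed number of generations."""
    p = p0
    for _ in range(generations):
        p = next_generation(p, s, t)
    return p


# One homozygote lethal (t = 1), the other only slightly disadvantaged (s = 0.01):
# the allele carried by the lethal homozygote is maintained, but at low frequency.
rare_allele_freq = 1.0 - overdominance_equilibrium(0.01, 1.0)
print(round(rare_allele_freq, 4))
```

With these values the balanced allele sits at a frequency of about 1%, so it would rarely appear in a small sample even though selection is actively maintaining it.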
Purifying selection, the elimination of deleterious mutations, is by far the most common type of selection. It is also the most relevant to human health because, for the vast majority of functionally relevant polymorphisms in a genome, the derived variant will be deleterious. Distinguishing harmful variants from benign variants is a central goal of disease genetics, and population genetic studies to identify purifying selection are directly relevant to this goal. With a large sample of exomes, it is possible to estimate the probability of deleteriousness for a nonsynonymous variant given its frequency. Assuming that only benign variants ever reach high frequency, the ratio of nonsynonymous to synonymous sites at high frequency can be used to calculate the relative excess of nonsynonymous sites, which are presumed to be deleterious, at lower frequencies. Given the enormous number of variants in an exome dataset, this approach can be tailored to highly specific site classes, based on biochemical properties of the encoded residue or patterns of conservation across species, rather than simply comparing all nonsynonymous and all synonymous polymorphisms. Furthermore, genes with very few nonsynonymous variants overall that do not show evidence of a selective sweep are likely to be under strong purifying selection, so there is an enhanced probability that subsequently discovered rare nonsynonymous variants are deleterious. Such highly conserved genes can be identified only with data on invariant sites from many individuals, which exomes provide.
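The frequency-based estimate of deleteriousness described above amounts to simple arithmetic, sketched here in Python. The sketch assumes, as the text does, that nonsynonymous variants at high frequency are all benign, so their ratio to synonymous variants gives the expected benign ratio in every frequency class. The function name and the variant counts are hypothetical.

```python
def deleterious_fraction(nonsyn_low, syn_low, nonsyn_high, syn_high):
    """Estimated fraction of low-frequency nonsynonymous variants that are deleterious.

    The nonsynonymous:synonymous ratio among high-frequency variants is taken
    as the benign expectation; any excess at low frequency is attributed to
    deleterious variants.
    """
    if syn_high <= 0 or nonsyn_low <= 0:
        raise ValueError("need synonymous high-frequency and nonsynonymous low-frequency variants")
    benign_ratio = nonsyn_high / syn_high
    expected_benign = benign_ratio * syn_low
    # Clamp at zero: sampling noise can push the raw estimate below zero.
    return max(0.0, 1.0 - expected_benign / nonsyn_low)


# Hypothetical counts: 400 nonsynonymous and 800 synonymous common variants,
# 3000 nonsynonymous and 2000 synonymous rare variants.
fraction = deleterious_fraction(nonsyn_low=3000, syn_low=2000,
                                nonsyn_high=400, syn_high=800)
print(round(fraction, 3))
```

The same function could be applied separately to narrower site classes (for example, by residue chemistry or cross-species conservation), provided each class contains enough variants for the ratios to be stable.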
Natural populations are not static and often have complicated demographic histories, including changes in population size and non-random mating leading to geographic structure. Rare variants and unascertained common variants identified from exome sequencing will be a powerful resource for inferences of demographic history. So far, resequencing efforts of smaller subsets of the human genome have already yielded a comprehensive portrait of historical changes in population size, and the relationship between geographically diverse populations, migrations, and admixture [2, 27–29]. For example, both African and non-African populations have experienced bottlenecks followed by an exponential increase in population size, although the magnitude of these events has been greater for non-African populations [2, 28, 29]. Exome sequence data will facilitate more precise estimates of important parameters governing human history, such as the mode and timing of population expansions.
Of particular interest, exome data are well poised to enable new insights into recent demographic events. Because exome sequencing is currently more cost-efficient than whole-genome sequencing, it is possible to study patterns of variation in very large samples. To explore this idea in more detail, we performed a simple coalescent simulation of a population that experienced a bottleneck of moderate intensity 50,000 years ago and a more recent population expansion 2,000 years ago (Figure 2a). The goal here is not to perfectly recapitulate human demography, but to demonstrate how exome sequence data might facilitate inferences of recent events. From this model, we explored how the site frequency spectrum varies as a function of sample size. As shown in Figure 2b, there is a dramatic shift towards rare alleles, particularly singletons (sites where the minor allele is only observed once in the sample), for larger sample sizes. To quantify this effect more rigorously, we calculated Tajima's D statistic (Box 1), which is expected to be negative in cases of an excess of rare variation relative to what is expected in constant-sized populations. For small sample sizes (Figure 2c), the recent population expansion is 'invisible' and Tajima's D is close to zero, which is the expected value in populations of constant size. However, as sample size increases, Tajima's D becomes sharply negative, revealing the recent explosive population growth. Intuitively, these results make sense because the larger sample size provides sufficient numbers of mutations to reveal the recent underlying genealogical structure. Interestingly, in populations of constant size, Tajima's D is not influenced by sample size and stochastically varies close to zero (Figure 2c). Thus, because exome sequencing can be performed in large samples, these simple simulations suggest that there is considerable promise in more detailed and quantitative estimates of recent human demographic history.
Moreover, as described above, because exome data do not suffer from the same ascertainment bias inherent in SNP arrays or small-sample datasets, it will be possible to explore more nuanced questions related to population structure. For example, an interesting hypothesis to test is whether rare variants have signatures of structure that are different from those derived from common variants. Intuitively, as rare variants are predominantly derived from mutations in the recent past, they may be particularly useful in assessing intracontinental, or perhaps even finer-scale, population structure, even if allele frequency differences at common variants are negligible. Similarly, exome data will also be a powerful resource for understanding how the process and dynamics of admixture manifest themselves in patterns of variation across the genome. At the individual level, exome data may allow reconstruction of the mosaic structure of ancestry blocks (stretches of the genome inherited intact from a parental population), which will provide mechanistic insights into the admixture process and the differences in demographic history of the parental populations. As with SNP array datasets and other genomically incomplete data, haplotypes in unsequenced (noncoding) regions must be inferred from the existing data, with a precision that depends on the density of sequenced (coding) genotypes.
An important general caveat of exome data in understanding human demographic history is that purifying selection acting on deleterious variants will complicate inferences of population parameters, such as effective population sizes, and the site frequency spectrum. A simple strategy to attenuate these concerns is to focus analyses on classes of sites that are expected to be less strongly influenced by purifying selection (such as synonymous sites and targeted introns). However, new methodological approaches that jointly estimate demographic parameters and selection are clearly more desirable and important to develop.
Although exome datasets remove many of the biases and limitations that have plagued previous population genetic datasets, they can still be misinterpreted if not analyzed appropriately. One potential challenge is presented by cryptic paralogs. Copy number variation is prevalent and remains poorly characterized in humans. Reads from exons that are absent from the capture target, perhaps because they only occur in some individuals, can map to paralogous exons on the capture target, falsely inflating apparent π in these exons. In many cases, these exons can be removed from analysis by filtering on violations of Hardy-Weinberg equilibrium.
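A Hardy-Weinberg filter of the kind mentioned above can be sketched as follows. This Python example uses a chi-square goodness-of-fit test with one degree of freedom, which is only one common choice (exact tests are often preferred when genotype counts are small). The function name and the genotype counts are assumptions for illustration. Reads from a cryptic paralog tend to produce an apparent excess of heterozygotes, which is the pattern the second example mimics.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions at a biallelic site.

    Returns (chi_square, p_value). The p-value uses the one-degree-of-freedom
    chi-square survival function, which equals erfc(sqrt(x / 2)).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele a
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0, 1.0               # monomorphic site: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical genotype counts for two sites in 100 individuals.
well_behaved = hwe_chi_square(25, 50, 25)   # matches Hardy-Weinberg proportions
all_hets = hwe_chi_square(0, 100, 0)        # every individual heterozygous
print(well_behaved, all_hets)
```

Sites with very small p-values would be candidates for removal; the appropriate cutoff depends on the number of sites tested and is left open here.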
Another concern is missing data. It is common to remove invariant sites from exome files in order to reduce them to a manageable size. However, estimates of π require differentiating between truly invariant sites and sites that might be variable but were not sequenced at high coverage in many individuals. For some analyses, it is sufficient to estimate 'missingness' at invariant sites rather than measure it directly, but doing so carries the important caveat that regions of low π could merely be regions of low coverage.
A third challenge is the difficulty of merging datasets. As yet, there is no one accepted definition of the exome. Rather, there are numerous capture targets with different combinations of exons. Even if two targets share the same exon, coverage may be better in one of the targets for a variety of technical reasons. Thus, when sequences from multiple targets are combined into a single dataset, missing data at some sites will be high and highly correlated with the target used. If different populations were sequenced with different targets, analyses of population structure are then confounded. The use of multiple sequencing platforms could potentially cause a similar pattern. Furthermore, multi-sample calling methods for assigning genotypes are more likely to call a variant if it is also seen in other samples, so calling genotypes in batches can cause artifacts if these batches are then merged with each other or with single-sample called genotypes. These effects can be minimized by excluding sites with a high proportion of missingness, but the best approach is to use the same target and sequencing platform on all samples, and to call genotypes on all samples either all together or else individually.
A fourth caveat is that even with a low overall error rate, the sheer size of the exome means that false positives are inevitable. These can be minimized with strict filters on depth and quality, at the cost of discarding some real variants (that is, increasing the false negative rate). The stringency of filtering depends on the research goal. For most population genetic analyses, a subset of the exome with consistently high-quality data is preferred to a complete exome with a large number of false positives.
A further caveat, perhaps self-evident, is that exomes provide no information about noncoding regions, including many functionally important noncoding sites. Exomics researchers should be careful not to assume that all evolutionarily relevant variation has been captured by exomes. Indeed, some of the most well-documented targets of selection, such as the regulatory region of the lactase gene LCT, may leave little detectable signature in exomes.
Finally, exomes present the difficulty of a deluge of data. Storing and accessing large exome files is a computational challenge, although exomes are easier to work with than whole genomes. In addition, interpreting the functional consequences of one particular variant among hundreds of thousands is a daunting task. Given that even strict filtering does not eliminate error, it is recommended that sites or regions showing unusual polymorphism patterns be validated with Sanger sequencing before drawing any definitive conclusions about these loci.
Exome sequencing represents an important milestone in genomics, and provides a powerful tool for population geneticists that will facilitate estimates of numerous evolutionary parameters with much greater precision than was previously possible. Until large full-genome datasets in all populations of interest are feasible, exomes will represent the best available resource for inferring patterns of human demography and natural selection in an unbiased and comprehensive manner.
Extant patterns of human genetic variation provide information about our demographic and evolutionary history. The goals of population genetics are to infer past events from DNA sequence variation and identify and quantify how evolutionary processes, such as natural selection, population structure, migration, genetic drift, and changes in population size, have shaped human genomic diversity. To this end, numerous population genetics statistics have been developed for analyzing genetic variation. A brief synopsis of population genetic statistics well suited to exome data is as follows.
The expected number of differences between two sequences randomly selected from the same locus in a population is represented as π. If π is calculated per base pair, data on both variable and invariant sites, and therefore sequence data rather than SNP array data, are required. Numerous evolutionary inferences rely on π. Its overall magnitude reflects the mutation rate and effective size of a population. Unusually high or low π at a locus can be a signature of natural selection. Most genes in most human populations have per base π values between 10⁻⁴ and 10⁻³.
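To make the definition concrete, here is a minimal Python sketch of per-base π as the average number of pairwise differences divided by sequence length. It assumes a set of aligned haplotype sequences of equal length with no missing data; the function name and the toy sequences are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Per-site pi: mean pairwise differences divided by alignment length.

    Invariant sites contribute to the length in the denominator, which is why
    sequence data (not just a list of SNPs) are needed.
    """
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned, non-empty, and of equal length")
    pairs = list(combinations(sequences, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(seq1, seq2)) for seq1, seq2 in pairs
    )
    return total_differences / (len(pairs) * length)


# Toy alignment of four haplotypes over ten sites, two of which are variable.
toy_alignment = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAC",
    "ACGAACGTAT",
]
pi = nucleotide_diversity(toy_alignment)
print(round(pi, 4))
```

With real exome data, sites that are uncalled in some individuals would need to be excluded or handled per pair, since treating missing sites as invariant biases π downward.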
A site frequency spectrum represents the relative numbers of variants occurring at all frequencies in a population. The proportion of rare variants as compared with common variants can be used to infer the rate and timing of population growth. Unique spectra for certain genes or certain site classes are thought to reflect variation in the strength and form of natural selection. For example, a selective sweep may eliminate all variation, and all new variants arising after the sweep will be rare initially, resulting in a skewed spectrum with a relative dearth of common variants. Tajima's D is a summary statistic of the site frequency spectrum, with negative values indicating a relative excess of rare variants, positive values indicating a relative excess of common variants, and values near zero indicating mutation-drift equilibrium. Site frequency spectra are most accurately inferred with large amounts of unbiased sequence from numerous individuals, as provided by exomics.
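The following Python sketch shows one way to tabulate an unfolded site frequency spectrum and compute Tajima's D from per-site derived allele counts, using the standard formula for D. The input format (one derived allele count per segregating site in a sample of n chromosomes), the function names, and the toy counts are assumptions for illustration.

```python
import math


def site_frequency_spectrum(derived_counts, n):
    """Unfolded SFS: entry i-1 is the number of sites with derived count i."""
    sfs = [0] * (n - 1)
    for k in derived_counts:
        if 0 < k < n:            # ignore sites that are fixed in the sample
            sfs[k - 1] += 1
    return sfs


def tajimas_d(derived_counts, n):
    """Tajima's D from derived allele counts in a sample of n chromosomes."""
    counts = [k for k in derived_counts if 0 < k < n]
    seg_sites = len(counts)
    if n < 4 or seg_sites == 0:
        raise ValueError("need n >= 4 and at least one segregating site")
    # Mean pairwise differences summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Toy samples of 10 chromosomes with 8 segregating sites each.
d_rare = tajimas_d([1] * 8, 10)     # all singletons: excess of rare variants
d_common = tajimas_d([5] * 8, 10)   # all at 50%: excess of common variants
print(round(d_rare, 2), round(d_common, 2))
```

The all-singleton sample yields a negative D and the intermediate-frequency sample a positive D, matching the interpretation given above.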
Natural selection is expected to act more strongly on nonsynonymous sites than synonymous sites, and there are numerous statistical tests that compare these site classes in order to study selection. Exomes represent the exact portion of the genome where such tests are applicable. For example, the McDonald-Kreitman test compares the ratio of polymorphism at these two site classes with the ratio of interspecies divergence at these two site classes. Under constant purifying selection these two ratios should be similar, so a discrepancy is evidence for adaptive evolution.
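A McDonald-Kreitman comparison reduces to a 2x2 table of counts, as in the Python sketch below. It reports the neutrality index (the polymorphism ratio divided by the divergence ratio), the quantity 1 minus that index, which is often read as the proportion of adaptive nonsynonymous substitutions, and a two-sided Fisher's exact p-value. The function names and the counts are illustrative assumptions; `math.comb` requires Python 3.8 or later.

```python
import math


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = math.comb(row1 + row2, col1)

    def table_probability(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denominator

    observed = table_probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as unlikely as the observed one.
    return sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= observed * (1 + 1e-9)
    )


def mcdonald_kreitman(pn, ps, dn, ds):
    """Return (neutrality_index, alpha, p_value) for polymorphic (pn, ps)
    and divergent (dn, ds) nonsynonymous and synonymous counts."""
    if ps == 0 or dn == 0 or ds == 0:
        raise ValueError("ps, dn, and ds must be non-zero")
    neutrality_index = (pn / ps) / (dn / ds)
    alpha = 1.0 - neutrality_index
    return neutrality_index, alpha, fisher_exact_two_sided(dn, ds, pn, ps)


# Illustrative counts: relatively few nonsynonymous polymorphisms compared
# with nonsynonymous fixed differences.
ni, alpha, p_value = mcdonald_kreitman(pn=2, ps=42, dn=7, ds=17)
print(round(ni, 3), round(alpha, 3), p_value < 0.05)
```

A neutrality index below 1 indicates an excess of nonsynonymous divergence, the pattern expected under adaptive evolution, whereas an index above 1 suggests segregating weakly deleterious variants.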
This work was supported by a research grant (1R01GM076036) from the NIH to JMA and the NHLBI GO Exome Sequencing Project (HL-102923) to JMA and MJB.