A world in a grain of sand: human history from genetic data
© BioMed Central. 2011
Published: 21 November 2011
Genome-wide genotypes and sequences are enriching our understanding of the past 50,000 years of human history and providing insights into earlier periods largely inaccessible to mitochondrial DNA and Y-chromosomal studies.
To see a world in a grain of sand ...
William Blake, Auguries of Innocence
The genome of each individual is a temporary assemblage of DNA segments brought together for a single generation by a combination of chance, ancestry, recombination and natural selection. These segments have different histories because of recombination and can thus provide independent information about ancestry, the focus of this review. However, the ancestry of different segments is not entirely independent. Humans are not a single randomly mating population: we are subdivided, and these subdivisions into bands, tribes, clans, ethnic groups, nations and so on are of great interest to both scientists and non-scientists. Thus the thousands of different genomic segments in any individual do not trace back to ancestors randomly spread around the globe; segment ancestry is constrained by population history. Two non-recombining segments of the genome, mitochondrial DNA (mtDNA) and the Y chromosome, have been used for decades to study genetic histories [1, 2]. Sometimes mtDNA and the Y chromosome share the same history, but often they do not, and such differences alert us to some of the complexities of the human past. But mtDNA and the Y chromosome provide only two perspectives. Recent advances in technology provide access to most of the genome, and increasingly to the genomes of companion species. Here, we consider how this wider perspective is beginning to inform our view of human history. We will see that it is possible to probe much further back into the past, into a period in which the uniparental markers are uninformative yet key evolutionary events took place, and even to speculate about when humans might have begun to wear clothes or to start reconstructing the genomics of former populations before their contact with modern expansions.
Genome-wide data can be obtained by either genotyping samples or re-sequencing them. Genotyping provides information about the allelic state of positions in the genome (currently up to five million, mostly single nucleotide polymorphisms, SNPs) that have prior evidence of variability; such studies are relatively low-cost and routine, and have been performed on massive numbers of samples. Whole-genome sequencing, by contrast, provides information about new as well as known variants; the technologies are still developing rapidly [5, 6] and have provided our first glimpses of population-scale samples of hundreds of individuals. Here, we discuss how samples have been chosen for studies of human history and some of the resulting sample sets, together with a number of developments in analytical approaches. We also focus on a few case studies, chosen to illustrate how whole-genome analyses compare with conclusions from mtDNA and Y-chromosome analysis, how studying other species informs our understanding of humans, and how genetic analyses themselves compare with conclusions from other sources of insights into history: archaeology, language, oral traditions or written records. We begin with examples based entirely on the analysis of modern human populations, and then move to studies that have used more diverse sources of genetic information: ancient DNA and other species associated with humans.
Population genetics aims to understand the observed distribution of genetic variability and to infer histories of populations from genetic data. For this purpose, what really matters are the differences between sequences (variants). Developments in the past few years now enable millions of variants in thousands of individuals to be analyzed.
Differences can be summarized using statistics (such as allele frequencies or genetic distances) and used to quantify relationships among individuals or populations using clustering techniques, such as principal components analysis (PCA). With PCA, the information from n polymorphisms is summarized at the individual or population level with n artificial variables (components). Usually the first two or three (the principal components) capture much of the information and provide a picture of the relationships between the samples (Figure 2a). Model-based clustering techniques (STRUCTURE-like methods; Figure 2b) [10–12] go a little further, estimating the probability of an individual belonging to a certain genetic cluster, the ancestry coefficient. A model with k possible genetic clusters is assumed, and for each individual, k ancestry coefficients (summing to one) are calculated from the genotypes. Patterns of admixture can be visualized at the individual or population level when clusters do not coincide with populations, suggesting past demographic events such as migrations. We see, for example, the admixture in the 1000 Genomes American samples where three components are seen in most individuals (Figure 2b); k1 (light orange) represents a likely African origin, k2 (light blue) a European or West/South Asian contribution, and k5 (yellow) a Native American origin.
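The PCA summary described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in any of the cited studies: the function names (`genotype_pca`, `simulate_two_populations`), the simulated allele frequencies and the simple mean-centring are all hypothetical choices made for the example, and published analyses typically apply additional per-SNP scaling and filtering.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Project individuals onto the leading principal components.

    genotypes: (individuals x SNPs) array of 0/1/2 allele counts.
    Returns the coordinates and the fraction of variance per component.
    """
    g = np.asarray(genotypes, dtype=float)
    # Drop monomorphic SNPs: they carry no information about relationships.
    g = g[:, g.std(axis=0) > 0]
    # Centre each SNP so that components describe differences between samples.
    g = g - g.mean(axis=0)
    # The singular value decomposition gives the principal axes directly.
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return coords, explained[:n_components]


def simulate_two_populations(n_per_pop=20, n_snps=500, seed=0):
    """Hypothetical data: two populations with diverged allele frequencies."""
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.1, 0.9, n_snps)
    freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)
    pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
    pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
    return np.vstack([pop_a, pop_b])


genotypes = simulate_two_populations()
coords, explained = genotype_pca(genotypes)
print("PC1 range, population A:", coords[:20, 0].min(), coords[:20, 0].max())
print("PC1 range, population B:", coords[20:, 0].min(), coords[20:, 0].max())
```

In this toy setting the first component separates the two simulated groups, which is the kind of picture a PCA plot of real samples is intended to convey.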
One way to estimate relationships that makes use of individual whole-genome sequences is through the D statistic. This statistic compares the sharing of derived alleles in two individuals with a third, and thus measures whether the two are equally distant from the third, or whether one is closer. This last situation is taken as evidence for greater genetic exchanges between them - for example, through admixture. More sophisticated inferences about past demographic events can be attained by fitting statistics derived from simulation under specific demographic models (for example, [14–16]) to empirical data. There are fast algorithms that can accommodate the growing mass of empirical data [16–18]. The underlying rationale is that because the parameters (such as population sizes or split times) in the simulation are known, a good fit to the empirical data indicates that the observed pattern of genetic variability might have been produced by that model. This approach, however, remains computationally intensive and it is unclear how fully the simple models used capture key elements of human history.
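The allele-sharing comparison behind the D statistic can be illustrated with a small counting sketch. It assumes the commonly used four-sequence layout (two test individuals P1 and P2, a third individual P3, and an outgroup that defines the ancestral allele) and the sign convention in which a positive value means P2 shares more derived alleles with P3 than P1 does; the function name `d_statistic` and the toy sequences are hypothetical, and a real analysis would also need a significance test (for example, a block jackknife), which is omitted here.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Count ABBA and BABA sites and return (D, n_abba, n_baba).

    Each argument is an aligned sequence of alleles (one haploid allele
    per site). The outgroup allele is treated as ancestral ("A"); a site
    is informative only when P3 carries the derived allele ("B").
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if c == o:
            continue  # P3 is ancestral here: site is uninformative
        if len({a, b, c, o}) != 2:
            continue  # keep strictly biallelic sites only
        if a == o and b == c:
            abba += 1  # P2 shares the derived allele with P3
        elif a == c and b == o:
            baba += 1  # P1 shares the derived allele with P3
    if abba + baba == 0:
        raise ValueError("no informative ABBA/BABA sites")
    return (abba - baba) / (abba + baba), abba, baba


# Toy alignment (hypothetical data, eight sites).
outgroup = "AAAAAAAA"
p3 = "GGGGGGAA"
p1 = "AGAAGGAG"
p2 = "GAGGGAAA"

d, n_abba, n_baba = d_statistic(p1, p2, p3, outgroup)
print(d, n_abba, n_baba)  # 0.2 3 2
```

A value near zero indicates that P1 and P2 are equally distant from P3; a consistent excess in either direction is the signal that is interpreted as greater genetic exchange.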
These analyses extract far more information from genome-wide data than from single loci such as mtDNA or the Y chromosome. Although demographic inferences have been made using single loci - for example, suggesting that 20 to 45 thousand years ago (KYA) most humans lived in South Asia - such conclusions are highly dependent on the sampling strategy. In any case, extant lineages from both mtDNA and the Y chromosome coalesce < 200 KYA [21, 22], which prevents inferences about earlier demography. By contrast, inference about effective population size back to several million years ago (MYA) has been made using whole-genome sequences. This approach estimated the coalescence time of the maternal and paternal copies of each genomic segment and examined the distribution of effective population size at different time intervals inferred from these. It identified a significant reduction in population size in the past 100 KY, more marked in European and East Asian populations than in Africans, and a shared demography before that. The population size was larger between 100 KYA and 200 KYA, which might reflect population substructure at that time. Interestingly, considerable exchange between sub-Saharan Africans and Europeans/Asians was inferred until 20 to 40 KYA, consistent with some more standard models that estimated an unexpectedly low split time between Asians and Europeans of 23 KYA.
Another great advantage of sequence data is that they provide an unbiased estimate of genomic variability. African individuals carry more SNPs than Europeans: 3.3 million each compared with 2.7 million according to one study. However, somewhat counter-intuitively, the number of variants in a population sample of more than a few thousand is actually larger in Europeans [24, 25]. This is because the European (or Asian) populations carry vast numbers of extremely rare variants that are seldom shared between individuals, a consequence of their recent explosive demographic growth.
The broad pattern of global human genetic variation is well established: autosomes, the X chromosome, mtDNA and the Y chromosome all generally show higher genetic diversity in African populations than in non-Africans. In addition, non-African populations carry only a fraction of the common (and hence old) African genetic variants. Furthermore, phylogenetic trees from mtDNA, the Y chromosome and autosomal regions most commonly root in Africa, with non-African populations having subsets of the lineages. These conclusions, derived before the days of large-scale sequencing, were unsurprisingly reinforced by whole-genome sequences. Such observations are readily explained by a recent and predominantly African origin for modern humans and expansion of a subgroup into the rest of the world, but detailed genetic analyses have allowed more sophisticated models to be developed.
Analyses of global genome-wide datasets - initially short tandem repeats (STRs) in the Human Genome Diversity Project (HGDP) panel (Figure 1a) - revealed a strong negative correlation between genetic diversity and migration distance from East Africa [26, 27] (Figure 1d). The authors proposed a serial founder model to account for this relationship: a subgroup set out from a source population to colonize a neighboring region, expanded to form a secondary source population and the process was repeated in successive steps away from the origin. The correlation was further confirmed by whole-genome SNP data in the same panel and by extensive re-sequencing data (thus largely free from ascertainment bias) in an independent sample set. Although the model was established using autosomal variants, Y-STR diversity in the HGDP panel was consistent with it and showed corresponding decreases in Y-chromosomal coalescence time and effective population size, which further supported the model.
An extension of the model would be to identify the region within Africa that provides the best candidate for the origin; the above analyses simply assumed an origin in East Africa. Two such studies have pointed to a south-western African origin [31, 32]: for example, the ≠Khomani Bushmen from South Africa have the lowest linkage disequilibrium among the populations examined (a characteristic of an ancient population). This conclusion must, however, be interpreted with caution because of the limited representation of East African populations in these comparisons, and the admixture and migration that have occurred in Africa since the origin, influencing the genetic properties of the modern populations analyzed.
This genetic model has stimulated new analyses of non-genetic datasets. An examination of 37 morphometric characteristics in 4,666 male skulls drawn from 105 populations worldwide revealed a decrease in phenotypic variability mirroring the loss in genetic diversity and suggested central/southern Africa as the origin. Similarly, the number of phonemes used in a global sample of 504 languages was also well explained by a serial founder model of expansion from an inferred origin in southern/western Africa. These strong patterns of human genetic, morphological and linguistic variation support a single African origin for most of human diversity, but do not preclude low levels of admixture with archaic humans [13, 35] - a 'leaky replacement' model where most but not all archaic diversity was replaced by an expansion from Africa 50 to 70 KYA.
A genome sequence derived from a 100-year-old Australian Aboriginal hair sample has provided new insights into the early migration patterns. Using a variant of the D statistic (designated D4P), which assesses allele sharing between four individual genomes, the authors investigated the sharing patterns between African, European, East Asian (Han Chinese) and Australian Aboriginal genomes. They observed more sharing between East Asian and European than between East Asian and Aboriginal genomes, and so proposed an initial split after the exit from Africa between the Aboriginal ancestors on the one hand, and the ancestors of modern Europeans and Chinese on the other, a conclusion supported by an independent study based on genotyping a much larger number of individuals.
Population movements have, of course, continued since the initial expansion out of Africa, and one, initiated by the destruction of the First Temple in Jerusalem by Nebuchadnezzar II of Babylon (c. 586 BCE) and known as the Jewish Diaspora, has been studied intensively using both historical and genetic data. The Diaspora led to the dispersal of Jewish people from the Levant to many parts of the world, and traditionally three main groups were recognized: Ashkenazim (Eastern European Jewish populations), Sephardim (Southern European Jewish populations) and Mizrahim (Middle Easterners). Other Jewish groups outside Israel include Ethiopian and Indian Jews. 'Jewishness' is inherited from the mother, so a shared Middle Eastern origin for mtDNA contrasted with a more diverse origin for Y chromosomes might be expected, at least in exogamous communities, and has indeed been reported [39, 40].
However, a major exception to this scenario was provided by the Ethiopian and Indian Jews. These two groups show a much greater extent of host genetic introgression, effectively clustering together with the local non-Jewish populations in both PCA and STRUCTURE-like analyses (Figure 3c). The genomic clustering explains observations from the mtDNA data [39, 40] in which a high proportion of local maternal lineages were shared with the Ethiopian and Indian populations (Figure 3a), potentially as a result of more relaxed criteria of self-identification in these Jewish communities. This pictures the migration of these people as a star-shaped pattern centered on the Levant, with founder effects and genetic drift acting differently on the individual branches.
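The serial founder model lends itself to a simple simulation. The sketch below is illustrative only and makes several assumptions that are not taken from the cited studies: each colonization step is reduced to a single round of binomial sampling of `2 * founders` gene copies, the step index stands in for migration distance from the origin, and the function name and parameter values are hypothetical. Under these assumptions expected heterozygosity shrinks by a factor of about 1 - 1/(2 x founders) per step, which produces the negative diversity-distance correlation described above.

```python
import numpy as np


def serial_founder_heterozygosity(n_steps=10, founders=20,
                                  n_loci=5000, seed=1):
    """Mean expected heterozygosity after each successive founder event.

    Returns an array of length n_steps + 1; element 0 is the source
    population and element k is the population k steps from the origin.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical starting allele frequencies in the source population.
    freqs = rng.uniform(0.05, 0.95, n_loci)
    het = [np.mean(2 * freqs * (1 - freqs))]
    for _ in range(n_steps):
        # A small founding group carries a random sample of gene copies.
        counts = rng.binomial(2 * founders, freqs)
        freqs = counts / (2 * founders)
        het.append(np.mean(2 * freqs * (1 - freqs)))
    return np.array(het)


het = serial_founder_heterozygosity()
steps = np.arange(len(het))  # proxy for distance from the origin
r = np.corrcoef(steps, het)[0, 1]
print("heterozygosity by step:", np.round(het, 3))
print("correlation with step number:", round(r, 3))
```

Real populations expand between founder events and exchange migrants afterwards, so this sketch captures only the core sampling effect of the model, not its full demography.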
To further emphasize the importance of social rules in shaping the genetic landscape of a population, we can make a parallel with the more recent (about 1,000 YA) Roma (Gypsy) expansion from a North Indian source population towards South-Western Europe. Although the available data still rely on a limited set of markers [43–45], the more flexible inheritance of Roma status is reflected in a serial dilution of the source (Indian) gene pool, characterized by stepwise admixture with local 'host' populations for both uniparental and autosomal markers. It is thus striking that the different migration pattern and relaxation in self-identification criteria allowed a greater extent of mixture of Roma with hosts in half of the time of the Jewish Diaspora.
The hair dated to around 4,000 YA and provided both a high-coverage mtDNA sequence and 20× coverage genomic sequence. Analyses of these data suggested that this individual was not a likely ancestor of the current Greenland Inuit, confirming the theory of different waves of migration into Greenland, the last of which eventually replaced the previous populations. Indeed, the Paleo-Eskimo Saqqaq mtDNA was attributed to the D2a1 haplogroup, which has not been found in the Inuit. The closest lineage to D2a1 is currently found in the Aleut and Sireniky-Yuit populations from the Aleut Islands and extreme East Siberia, respectively, suggesting an Old World Beringian origin for the maternal ancestry of the Saqqaq (Figure 4b). The Y lineage was Q1a*, now known from the Koryaks and Yukaghir of East Siberia. This connection with Siberian populations was confirmed by the autosomal data. Indeed, clustering analysis of polymorphisms from the Saqqaq genome (in comparison with 35 populations from Europe, Asia and America) showed shared ancestry with three Siberian populations (Figure 4b) and excluded detectable admixture or shared origin with Western European and Native American populations. In addition to this ancestry information, analyses of functional variants in genomic DNA provided some insight into phenotypic traits of the Saqqaq individual, suggesting blood type A+, brown eyes, dry earwax and dark hair, with, ironically in view of the source material, a predisposition to baldness.
The study of human DNA variation is not the only way to use genetics to decipher our history. Insights can also come from the study of organisms that have established long-term relationships with us (such as parasites and pathogens), providing an independent perspective on their host's evolutionary history. Broadly speaking, samples are collected in different geographical areas and their phylogeographic relationship is investigated to locate their probable origin (usually the place where the parasite's genetic diversity is greatest) and characterize their spread by describing and dating the branching pattern and/or analyzing population-genetic parameters.
As expected, in many cases a strong correlation has been found between parasite origins and spread and both ancient human migrations, such as the out-of-Africa movement, and more recent migrations, such as the Austronesian expansion or along the Silk Road trade route. In addition, louse phylogenetic history has also helped to reconstruct aspects of our appearance that do not fossilize, a challenging task using genotype information, and even more so for ancestors without genotype information. Chimpanzees have only head lice and gorillas only pubic lice, whereas humans have both. Head lice seem to have co-evolved with their respective hosts since the chimpanzee-human split 6 to 7 MYA. By contrast, the divergence time between gorilla and human pubic lice is substantially lower than the gorilla-human split (about 8 MYA), suggesting that pubic lice originated in gorillas and switched to the human lineage around 3.3 MYA. This may indicate that, by that time, the human body was partially free from hair and a new niche was available. Despite the ancient loss of body hair, according to this interpretation, it took a long time for humans to introduce clothes, according to another study of lice. Pediculus head and clothing lice split around 70 to 80 KYA [55, 56]. If the split occurred as soon as the new clothing niche became available, this date provides us with an estimate of the time of the introduction of clothes.
What can we learn from these examples about the information contributed by different genetic loci, and the comparison between genetic and non-genetic sources? It is useful to consider this question by time period. Over approximately the last 50 KY, mtDNA, the Y chromosome and whole-genome data are all informative, and the agreements between them are more striking than the disagreements. For example, in both the Diaspora and Saqqaq genome studies, the genome-wide data extended, rather than overturned, the conclusions from the uniparental loci. This does not have to be the case, and is not always found: a family from England, for example, unexpectedly carried a Y chromosome typical of sub-Saharan Africa, but the rarity of such examples - the strong correlation between the histories of different genomic regions - provides evidence for marked geographical structure prevailing over this period. Nevertheless, some unresolved issues remain: genome-wide data suggest extensive gene flow between sub-Saharan Africans and non-Africans, whereas uniparental data do not [1, 2].
Further back into the past, the uniparental loci provide less and less information because fewer lineages have survived to the present: all mtDNA and Y lineages outside Africa trace back to a single ancestor about 50 to 70 KYA, for example. So for understanding the early expansions, the main insights have come from genome-wide data, and these are our only source of genetic data for the period > 200 KYA.
In general, non-genetic data provide far more detailed information for the last few millennia, and the genetic inferences are largely consistent with them (such as the Diaspora). Genetics can add information, such as about the origins of the Paleo-Eskimos, that has not been available from other sources. Again, there are unresolved issues - for example, the genetic contact inferred between populations 20 to 40 KYA contrasts with the distinct regional archaeological records for this period. And genetics can generate hypotheses, such as the adoption of clothing 70 to 80 KYA, that can now be tested by archaeologists.
As whole-genome sequencing becomes cheaper and more routine, it can more readily be used to address questions of evolutionary interest: the complex demography in Africa and the peopling of the Americas and the Pacific, for example. When did the ancestral populations split, how much subsequent gene flow was there, what patterns of migration can be identified? There may be admixed populations in which mtDNA and the Y chromosome from one ancestral source have been entirely lost by drift, but ancestral information can be reconstructed from the autosomal regions: the admixed American populations in the 1000 Genomes Project are allowing this possibility to be tested. Perhaps we can study the genomics of pre-contact Native American or Tasmanian populations, as proposed by the Taíno Genome Project investigating the legacy of a pre-Columbian Puerto Rican population using 1000 Genomes data. Our history is encoded in our genomes, each much smaller than a grain of sand, and we are only now collecting the data and developing the tools to decipher this rich source of information.
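Divergence dates of this kind are typically obtained by converting a genetic distance into time. The following sketch shows only the arithmetic of a strict molecular clock, which is an assumption of this example rather than a description of how the cited louse studies were dated; the function name and the numerical inputs are hypothetical and are not taken from those studies.

```python
def divergence_time(pairwise_distance, rate_per_lineage):
    """Time since two lineages split under a strict molecular clock.

    pairwise_distance: substitutions per site between the two lineages.
    rate_per_lineage: substitutions per site per unit time on one lineage.
    Differences accumulate along both lineages, hence the factor of two.
    """
    if rate_per_lineage <= 0:
        raise ValueError("rate must be positive")
    return pairwise_distance / (2 * rate_per_lineage)


# Hypothetical inputs: 0.12 substitutions per site between two louse
# lineages and a rate of 0.02 substitutions per site per million years.
t = divergence_time(0.12, 0.02)
print(f"estimated split: about {t:.1f} million years ago")
```

In practice the rate itself must be calibrated, for example against a host split of known age, and uncertainty in that calibration carries through to the estimated date.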
Our work is supported by The Wellcome Trust (grant number 098051), and additionally an EMBO Short-term Fellowship (ASTF 324-2010) to VC, and the Cambridge European Trust, a Domestic Research Scholarship and Emmanuel College, Cambridge, UK (LP). We thank Jean McEwen for information about the 1000 Genomes samples.