Whole-genome sequence analysis of a Pan African set of samples reveals archaic gene flow from an extinct basal population of modern humans into sub-Saharan populations

Background Population demography and gene flow among African groups, as well as the putative archaic introgression of ancient hominins, have been poorly explored at the genome level. Results Here, we examine 15 African populations covering all major continental linguistic groups, ecosystems, and lifestyles within Africa through analysis of whole-genome sequence data of 21 individuals sequenced at deep coverage. We observe a remarkable correlation among genetic diversity and geographic distance, with the hunter-gatherer groups being more genetically differentiated and having larger effective population sizes throughout most modern-human history. Admixture signals are found between neighbor populations from both hunter-gatherer and agriculturalists groups, whereas North African individuals are closely related to Eurasian populations. Regarding archaic gene flow, we test six complex demographic models that consider recent admixture as well as archaic introgression. We identify the fingerprint of an archaic introgression event in the sub-Saharan populations included in the models (~ 4.0% in Khoisan, ~ 4.3% in Mbuti Pygmies, and ~ 5.8% in Mandenka) from an early divergent and currently extinct ghost modern human lineage. Conclusion The present study represents an in-depth genomic analysis of a Pan African set of individuals, which emphasizes their complex relationships and demographic history at population level. Electronic supplementary material The online version of this article (10.1186/s13059-019-1684-5) contains supplementary material, which is available to authorized users.


S1) Sample collection and sequencing
Twenty-five human DNA samples were collected: 21 from African volunteers and 4 of Eurasian origin. Nine of the samples are newly sequenced in this project with their DNA obtained from blood. We supplemented our dataset with sequencing data of 16 individuals whose wholegenome shotgun read data was already published. We downloaded this data from the Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/sra; accession numbers are SRX015734, SRX016231, and SRX103808) and from cdna.eva.mpg.de/neandertal/altai/ModernHumans/bam. The DNA extraction of the published genomes was carried out from lymphoblastoid cell cultures except for three samples, Dinka DNK02 and DNK07 (from mouthwash sample), and TuuSan KB1 (from blood) (S1.1 Table). S1.1 which in this case were incorporated in the first read. We allowed up to one mismatch.
Subsequently, we removed the barcodes from the first read, and as a result, we got paired reads of 94/100 or 95/101 bps (read1/read2). The remaining samples, for which a self-library was prepared, have pairs of 100 or 101 bps each, except for the TuuSan KB1, which has shorter reads of 76 bps, and the Yoruba individual NA18507, which has paired reads of 100/102 bps. All samples were sequenced at deep coverage (21-47X) (S1.2 Table). S1.2

S2) SNP calling
For each sample we mapped the paired-end reads against the human assembly GRCh37 using the BWA aligner (version 0.6.1, parameters aln -n 6 -q 15 and sampe) [3]. As some samples were extracted from lymphoblastoid cell cultures (S1.1 Table), we included the Epstein Barr virus into the reference genome. We removed the PCR duplicates using the program MarkDuplicates.jar from Picard tools (version 1.70) (http://picard.sourceforge.net/). Afterwards, we applied a local improvement of the alignments around indels and a base quality recalibration, called genotypes and filtered variants by quality using GATK version 2.5-2 [4].
Commands used in GATK are shown below; required files were downloaded from GATK server. We kept SNPs that passed the VQSR filters and are not multiallelic, focusing from now on chromosomes 1 to 22, X and Y.
To identify not only confident variants but also non-variable sites, we determined the portion of the genome that is callable applying two criteria to each genomic position: i) at least 5 reads should map with no poor mapping quality and ii) the minimum calling confidence threshold used for previously identified variants was used for the rest of the genome. Finally, we imposed a conservative criterion: a position is callable if it is callable in all samples. We used GATK (version 2.5-2) to get the callable regions for each sample with the following commands.

Mitochodrial reconstruction
The complete mitochondrial sequences of the 25 samples of this project were reconstructed from the whole-genome shotgun paired-end sequencing. For each sample, we recovered a subset of reads by mapping the whole set of paired-end reads against a high-quality human mitochondrial reference genome [5]. To maximize the number of reads captured at the extremes of the reference assembly, we applied a second round of mapping in which we took advantage of the mitochondrial circularity. We aligned the reads to the reference sequence but, this time, the origin was modified to be at the middle of the assembly (8 kbp from the start). Alignments were performed using the BWA mapper with parameters -n 6 -q 15 [3]. We retained only high-quality paired-end reads by imposing that both pairs should be mapped and properly paired (with samtools -f 2 [6]) and by requiring that both pairs have a median Phred quality score greater than 32. In S3.1 Number of captured high-quality reads and mitochondrial coverage per sample a Assembly with the start position at the middle of the reference mitochondrial assembly b Coverage relative to the length of the mitochondrial reference (16,571 bps) We constructed contigs from the retrieved reads with Hapsembler v.1.1, a haplotype-specific genome assembly toolkit (parameters -p Illumina -t 4 -d no --PHRED_OFFSET 33 --MIN_CONTIG_SIZE 1000 -EPSILON 0.05) [7]. Because the efficiency of this assembler decreases with lofty coverage, such as the one obtained for the mitochondrial reconstruction (S3.1 Table), we randomly reduced the number of reads per sample to around 350X of mitochondrial coverage (except for the TuuSan sample KB1 for which the resampling was done at 300X). This reduction with its subsequent construction of contigs was applied 20 times per reference assembly (the standard and the one with the modified origin). The rationale behind this iterative process is to compensate the random representation of reads that entails potential problems such as the assembling of existing numts into the mitochondrial sequence.
For each of the 40 times we applied the read reduction and posterior de novo assembling, we oriented the resulting contigs via local alignments to the human mitochondrial reference [5] with BLAST [8]. We then joined the oriented contigs using mafft [9] and incorporating Ns in both existing gaps and sites that remain unresolved due to differences in overlapping contigs. In such a way, we have reconstructed 40 mitochondrial assemblies per sample. Finally, we got the consensus sequence, which is the reported mitochondrial sequence per individual.

Y Chromosome reconstruction
SNP calling was performed as described in S2 section. A total of 14,488 SNPs were identified in the whole Y chromosome once multiallelic positions were excluded, with 4,088 SNPs remaining after considering callable positions only (i.e., 4.24 Mbp of the Y chromosome, see also S2 section). As the Y chromosome is especially rich in repeats, duplications, and lowquality regions, we restricted our haplogroup reconstruction to nine high-quality regions determined by Wei et al. [10]. These nine segments, that span 8.97 Mbp, are the result of excluding the pseudoautosomal, heterochromatic, X-transposed, and ampliconic segments from the male specific region of the Y chromosome (MSY, 95% of the chromosome's length) [10,11].
When considering the callable loci within these high-quality regions, we got a final set of 3,259 SNPs in 3.44 Mbp of genomic sequence.

S4) Genetic diversity and runs of homozygosity Pairwise differences
For each sample, we created haploid autosomal chromosomes by randomly choosing one of the two possible genotypes at any given locus. Then, we compared any two distinct individuals by counting the number of different alleles between their sampled genomes (Figure 1b). This is a measure of genetic diversity within and between populations, taken as an estimate of the heterozygosity within a population if both samples belong to same population, or between populations otherwise. Similarly, we calculated the heterozygosity on each sample (Figure 1b).
Moreover, we calculated jackknife estimates and standard errors by dividing the genome into 100 blocks of equal number of SNPs (S4.1 Fig.)

Runs of homozygosity (ROH)
For each sample, we considered windows of 1 kbp of callable genome and counted the number of heterozygous alleles included in each window. To determine long homozygous regions, or runs of homozygosity, we identified consecutive windows having less than a given number of heterozygous sites by sliding over 100 windows. We imposed three different minimum lengths for calling ROH: 0.5, 1 and 1.5 Mbp of callable genome, which corresponds to imposing 500, 1,000 and 1,500 consecutive windows. We allowed a maximum of 50, 100, and 150 heterozygous loci, respectively. Assuming an average heterozygosity of 1 per kbp, we are calling a ROH when there is less than approximately 10% of the expected heterozygosity. To ensure that a long run of homozygosity is not due to regions of lower quality when calling SNPs, we imposed that at least 67% of a ROH should be callable. This threshold is equivalent to the proportion of the callable genome in the whole genome (Figure 1b and S4.2 Fig.).

S5) Spatial analyses
To investigate the relationship between genetic variation and geographic sampling locations in the African samples, we assessed three different estimates: i) the correlation between geographic and genetic based maps, ii) the genetic differentiation relative to geographic distances, and, iii) the direction in space of this genetic differentiation.
We first conducted a multidimensional scaling (MDS) analysis with an Identical By State (IBS) distance matrix between all the African individuals using SNPs with a MAF > 0.05 and in linkage equilibrium. LD pruning was performed with Plink [14] using the option --indep with default parameters (50 5 2). We then quantified the correlation between the geographic map and the genetic relationship between individuals using the first two MDS dimensions by means of a Procrustes analysis [15]. Secondly, we computed a Mantel correlogram [16] in PASSAGE 2.0 Given the observed dependence between geography and genetics, we aimed to identify whether the genetic variability observed among African individuals followed a particular spatial pattern. Mantel correlogram analysis (S5.2 Fig.) suggests that genetic differentiation tends to increase monotonically with geographic distance, a pattern consistent with a main genetic gradient among African populations. Therefore, the Bearing procedure on our data (S5.3 Fig.) supports that the direction in space of maximum genetic differentiation in the African continent is in the North-West to South-East axis (~160 degrees).

S5.3 Fig. Bearing plot using the IBS genetic distance between pair of individuals.
Red dots indicate statistically significant correlation values after Bonferroni correction.

D-statistics and F4-ratio estimation
We formally evaluated whether admixture occurred between hunter-gatherer groups and their geographical close populations, as well as from Eurasian to African populations, with the Dstatistics test (also called ABBA-BABA test) [23]. We performed the analyses using the weighted block jackknife considering a block size of 5 cM [24]; this is, the D-statistics is computed for 5 Mbp non-overlapping sliding windows across the genome, and the resulting values are used to compute the weighted mean, variance, and standard error of the statistic over the entire genome. By dividing the average D value by the standard error, a Z-score is obtained. Results are considered significant when |Z| > 3.
Additionally, we determined the proportion of west European ancestry into African populations by constructing a F4-ratio estimate (see below). This method estimates the ancestry proportion of a mixing population under certain correct phylogeny (see Patterson et al. [23] or Pickrell et al. [25] for a longer introduction). Similar to the D-statistics tests, Z-scores were calculated following the same weighted block jackknife strategy. All analyses were also performed using ADMIXTOOLS 4.1 [23].  Table. S6.1 To infer the mixing proportion of west European ancestry present in the African populations, we constructed a F4-ratio statistic [23] adapted to our data but following similar rationale to the one used in Pickrell et al. [25] when calculating west European ancestry in African populations.

Admixture between hunter-gatherers and their surrounding populations
When testing hunter-gatherers, we used f4(Han, Yoruba; X, Chimp) / f4(Han, Yoruba; French, Chimp) whereas f4(Sardinian, Han; X, Yoruba) / f4(Sardinian, Han; French, Yoruba) was used for the rest of African groups. In both cases X is the target African population. This method requires a correct phylogeny (S6.7 Fig.) and an unadmixed population. Since west Eurasian genetic signal in west African populations, particularly in Yoruba, has been claimed to be low [25] or even undetectable [26], we chose this population as a proxy for an unadmixed pool. For interrogating agriculturalist populations, the Sardinian sample is the only one in our dataset that allows us to construct a proper phylogeny for these tests. In theory, Sardinians are not the ideal candidate population since they are also west Eurasian. Nevertheless, we expect the test to take advantage of the fact that Sardinians are highly similar to Neolithic samples [27], which opposes to the west Eurasian ancestry expected to be present in African populations. Results for the F4-ratio are shown in S6.5 Table. S6.7 Fig. Population phylogeny used for F4-ratio estimation. It is applied to two different cases: when the admixed X population is a hunter-gatherer (whose split from the global tree took place before the split of Yoruba), A represents the Han, B represents the French, C represents the Yoruba and D represents the Chimpanzee; whereas when the admixed X population is any other African non-hunter-gatherer population, A represents the Sardinian, B represents the French, C represents the Han and D represents the Yoruba. To calculate the West Eurasian ancestry proportion, let Q be the length of the red branch, then f4(A,C;X,D)/f4(A,C;B,D) = Q·α/Q = α.

S7) PSMC
We applied the Pairwise Sequentially Markovian Coalescent (PSMC) model to our genomes in order to infer estimates of the effective population size through time for each population, on the basis of the genomic regions found with similar local density of heterozygous sites in each sample [28]. In order to facilitate visualization, we selected the following individuals as representative of their populations: Yoruba_HGDP00936, Mandenka_HGDP01284 and Dinka_DNK07, Moreover, Eurasian populations are illustrated by French_HGDP00521 and Han_HGDP00778. We prepared the autosomal sequences for being used on the PSMC as follows: we obtained the consensus diploid sequence using samtools (samtools mpileup -C50); we excluded positions with less than one third or more than twice the average coverage independently in each sample; and we masked all non-callable positions (bcftools view -c -| vcfutils.pl vcf2fq -d $min -D $max). We run PSMC with following parameters: psmc -N25 -t15 -r5 -p "4+25*2+4+6". We used two different mutation rates per generation, of 2.5x10 -8 and 1.2x10 -8 , with a generation time of 25 years to scale time and Ne parameters (Figure 5, S7.1 and S7.2 Figs.). We also used a different set of time parameters to resolve particular periods at a finer time scale (-p "4+50*1+4+6") (S7.3 Fig.). Because we have detected signals of European ancestry in several sub-Saharan individuals (see S6 section), we aimed to explore if our observations from the PSMC analysis are affected by this confounding factor. To do that, we run PSMC with same parameters but excluding in each genome those regions with an inferred European ancestry (S7.4 Fig.). These fragments were identified through local ancestry analysis, using SHAPEIT2 [29] to phase the data and RFMix v1.5.4 [30] to detect the tracts using AFR (GWD, YRI, ESN, LWK) and EUR (CEU, GBR, IBS, TSI, FIN) superpopulations from the 1,000 Genome Project phase III (http://www.internationalgenome.org/data-portal/population) as references.

S8) Neanderthal and Denisova introgression
We tested introgression of archaic hominins, particularly of Neanderthals and Denisovans, into African modern populations using the D-statistics test (see S6 section for further general description). We used the chimpanzee genome as the outgroup, a high-coverage Neanderthal or a Denisovan sequence as the donor and our 25 samples as the recipients. A Papuan individual was also included as a control recipient sample for the Denisovan ancestry. significantly higher in Papuans than in the rest of populations as previously described.
Additionally, the Denisovan signal is also higher in East Asians, Europeans, and North Africans compared to sub-Saharans, potentially due to Denisovans carrying Neanderthal introgression themselves [32].  Table). Therefore, this analysis corroborates the D-statistics results, where no sub-Saharan samples were found to carry any Neanderthal ancestry signal. S8.1

S9) Demographic model
Approximate Bayesian Computation (ABC) is a statistical framework that is used for i) inferring posterior distributions of parameters of a given model, and/or ii) comparing models when data is generable from the models of interest by means of a simulator if a closest form of the likelihood of the data given the model does not exist [35]. ABC has been shown to be extremely helpful for distinguishing among competing models and for parameter estimation in the field of population genetics, where the likelihood of the observed genetic variation given an evolutionary model is not usually computable, but it is computationally affordable to produce simulated DNA sequences from the model [36].
The basic ABC rejection algorithm proposes comparing real data with simulated data by means of a set of summary statistics (SS), and ascertaining the parameters of the simulations whose ||SSobs -SSsim|| = 0 to generate the posterior distribution of the parameters/models. In practice, this requirement implies that the basic ABC rejection algorithm is prohibitively computationally expensive and several modifications have been proposed in order to reduce the number of required simulations. Among them, we highlight generalizing the classical ABC rejection algorithm by introducing an error threshold (||SSobs -SSsim|| < e) and weight the accepted simulations depending on e [37].
There are three requirements to conduct ABC: i) prior distributions of each parameter considered in the model, ii) a simulator suitable for producing data similar to the observed one, and iii) a set of SS obtained from the data that is informative of the parameters of the model.
Identifying the proper set of SS for the specified model and parameters of interest is an active subject of analysis in the ABC community [38] since an inefficient specification of the SS can negatively affect the output of ABC analysis. Several approaches have been proposed for minimizing the number of required SS as well as for identifying the most informative SS for a given parameter or model [39]. In the present study we used the approach developed by Mondal et al [40] based on the algorithm proposed by Jiang et al [41] to recover informative SS by means of applying a Deep Learning (DL) approach. The basic idea is that DL can identify extremely complex patterns in a raw representation of the data that fits the parameter/model that was used to generate the data thanks to its non-linear nature. By definition, this prediction must be the most informative SS of the parameter/model we want to estimate. Thus, the predicted DL value can be used in the classical ABC framework as SS (SS-DL).
A simplified representation of the ABC-DL algorithm is shown in S9.1 Fig.  S9.1 Fig. Approximate Bayesian Computation coupled to Deep Learning. As a first step, we generated simulations from the prior distributions of parameters and estimated raw summary statistics. These simulations are then used for training a DL, whose output is the expected value of the parameter/model that generated the simulated data. Once the DL is trained, we generated new simulations from the prior distributions, predict the output in the trained DL and compare it with the predicted output using the observed data. The parameters/model that generated the simulation are retained given the e threshold.
A raw summary statistic that recapitulates many aspects of old and recent demographic events is the joint multidimensional site frequency spectrum distribution among populations (jSFS) [42].
jSFS quantifies the number of SNPs that follow a particular distribution of derived alleles among the considered populations. jSFS has been previously used for estimating demographic parameters of interest in composite likelihood scheme approaches [42,43]. However, given the A potential caveat of the DL approach for extracting informative SS is the fact that the DL is trained with simulated data from proposed (known) models and the obtained value is compared with the output from observed data that is likely produced by a much more complex model than the considered ones. In this context, one of the proposed models is at best a nested model of the real one that generated the observed data. However, this difference between the data used for training and the final observed data that we want to classify can produce unpredictable results of the DL when applied to the real data. Mondal et al [40] overcome this problem by means of implementing a version of the noise injection algorithm. Applying the same approach, we split the observed dataset in two, if possible, one of them comprising samples that will be used to add "observed data-like" noise to the simulated jSFS, and a second one that will be

Demographic model comparison
We considered six demographic topologies that could explain the actual genetic diversity observed in African and Eurasian populations (Figure 4). The currently assumed topology (model A), which includes Neanderthal introgression into Eurasian populations, was tested versus five other models (models B-F) that incorporate additional introgressions into African populations from a maximum of two unknown archaic hominins.
The jSFS was computed considering 11,642 genomic regions that correspond to 393.5 Mbp of the genome with a mean fragment length of 33.8 kbp (S9.2 Fig.). These fragments are ascertained after i) excluding CpG islands (using http://epigraph.mpiinf.mpg.de/download/CpG_islands_revisited/ as described in [44]) and genes (coordinates for GRCh38.p7 downloaded from Ensembl) from the callable genome, ii) concatenating the outcome fragments that are less than 5 kbp apart, and iii) imposing a minimum distance of 20 kbp between these concatenated genomic regions in order to be considered for the analysis. It must be taken into account that, even if the introgressed fragments could be larger or smaller than these 20 kbp, it is not relevant for our analyses, since we ultimately collapse the jSFS information of each fragment into a whole-genome version, so we are only considering the average amount of introgression over fragments in the genome.

S9.2 Fig. Length distribution of the genomic fragments used in the analysis.
Each simulation comprised 11,642 genomic regions that follows same length distribution than the one observed in our real data (S9.2 Fig.). We used fastsimcoal2 software, which is fast for simulating large genomic regions under neutrality [43,45]. Mutation rate per base and generation at each genomic region was ascertained from a normal distribution with mean 1.61e-8 and standard deviation = 0.13e-8 [46]. Since each fragment had a different proportion of callable positions, we scaled the mutation rate at each fragment by the length of callable positions. Recombination rate was set constant at 1cM/Mb.
The prior distributions used for each model are shown in S9.1 Table. S9.1 Table. Parameters and prior distributions of the six considered models in Figure 4 Parameters

Distribution Model A B C D E F
MigrationFrench_WestAfrica U(0.0,5.0E-5) X X X X X X MigrationFrench_Mbuti U(0.0,5.0E-5) X X X X X X MigrationFrench_Kho U(0.0,5.0E-5) X X X X X X MigrationWestAfrica_Mbuti U(0.0,5.0E-5) X X X X X X MigrationWestAfrica_Kho U(0.0,5.0E-5) X X X X X X MigrationMbuti_WestAfrica U(0.0,5.0E-5) X X X X X X We then generated 100,000 additional simulations per model as replication dataset for conducting the ABC analysis, comprising a total of 900,000 simulations. For each of these simulations we computed the output from the 10 previously trained DLs and combined the predictions to produce a unique output by bagging [48] a predicted model of the four under evaluation. The combined prediction was used as SS in the ABC analysis. We considered an error threshold of 1,000 out of 900,000 simulations and applied a multinomial logistic regression on the accepted simulations; all these analyses were conducted with the R package abc [49], using the script postpr with option mnlogistic for model comparison.
We first evaluated the power of ABC-DL for distinguishing between the proposed independent models using the output of the DL as summary statistic. To do that, we run the cv4postpr command from abc package [49] using 100 simulations per model and the same ABC parameters we used when analyzing the observed data. The cv4postpr runs the ABC using simulated data as observed data and counting the number of times that the model with the highest posterior probability is, in fact the model that generated the simulated data. The confusion matrix suggests that after applying the ABC-DL approach we have high accuracy for identifying the model that produced the simulation in all the models except F, which is the most general one.

S9.2
Finally, we conducted the ABC-DL analysis with the real data using Altai Neanderthal sample, Han_HGDP00778, French_HGDP00521, Mandenka_HGDP01286, MbutiPygmy_HGDP00982, and JuhoansiSan_HGDP01036. In order to test the performance of the ABC-DL approach for parameter estimation, we sampled 1,000 simulations at random for each parameter, conducted the ABC-DL approach using the abc script of abc package with the same parameters as the ones used with the real data, and computed the factor2 statistic [50]. This statistic accounts for the number of times that the centrality statistic of the estimated posterior distribution (in our case, the mean) falls within the range of 50% to 200% of the real value. A value close to one indicates that the mean of the posterior distribution is close to the value used in the simulations. A value close to 0.5 indicates that the value of the mean of the estimated posterior distribution is randomly distributed.

S9.3 Table
shows the factor2 value estimated over 1,000 simulated datasets for each parameter. We observed that for the migration parameters we obtain a factor2 that is similar to the one that would be estimated if a value taken at random from the prior distribution is used as the mean of the posterior distribution. This suggests that we cannot use the mean of the posterior distribution of these parameters as indicative of the true value.