We used 803 barley landrace accessions from the 2,446 landrace and cultivated lines in the National Small Grains Collection (NSGC) Core Collection from the USDA. These 803 individuals include all landraces collected in Europe, Asia, and North Africa constituting the range of dissemination of cultivated barley in human pre-history (Fig. 1, Additional file 1).
We also make use of the 284 wild barley accessions from the Wild Barley Diversity Collection (WBDC) [39] analyzed in [14]. Accessions represent the entire geographic range of wild barley including the Fertile Crescent, Central Asian, and adjacent North African regions.
Genotypic data
A collection of 2,446 landraces and cultivated accessions from the NSGC were genotyped with 7,864 SNPs using the Illumina Infinium SNP genotyping platform (hereafter referred to as the 9K). The 9K chip contains 5,010 SNPs discovered in a panel of 10 barley varieties, composed primarily of European two row cultivars. In addition, a set of 2,832 SNPs used for the existing BOPA (Barley Oligo Pooled Assay 1 and 2) on the Illumina Golden Gate genotyping platform [30] was included. Additionally, 22 SNPs from resequencing studies were added, giving a total of 7,864 SNP assays on the chip [40]. The BOPA SNPs derived principally from one wild barley accession and eight malting barley cultivars, from Europe, the United States, and Japan [30].
We used automated genotype calling implemented in the software ALCHEMY [41]. ALCHEMY uses a Bayesian model of the raw intensity data files. This approach does not assume Hardy-Weinberg Equilibrium; and each single nucleotide polymorphism (SNP) call is independent of other genotype calls at the SNP. SNP calls with posterior probability >0.95 were recorded; calls below this threshold were marked as missing data. The accuracy of calls was verified following the method explained previously [14].
SNP quality control procedures consisted of the removal of SNPs that were monomorphic, had more than 10 % missingness, or had more than 10 % heterozygosity [see reference 14]. We retained 6,152 SNP for all 2,446 landraces and cultivated lines after quality control. The curated SNP dataset was used to identify potential duplicate individuals in the NSGC barley core. The details of the procedure used to identify duplicate accessions are explained in Muñoz-Amatriaín et al. [42]. We retain 803 landrace accessions after quality control.
The 284 accessions from the WBDC were genotyped with 3,072 SNPs [30], a subset of the 9K platform. After quality control this dataset consisted of 2,624 SNPs for each of the 284 accessions [see reference 14] for specific information about these populations and SNP quality control steps.
We used the consensus genetic map described in [42] which is the result of merging the 11 genetic maps of the 2011 consensus map developed by Muñoz-Amatriaín et al. [43] with the iSelect SNP platform map based on the Morex x Barke mapping population [40]. This map, referred to here as the ‘iSelect map’, identifies genetic position for 4,527 of the SNPs used to genotype the NSGC accessions.
We infer the phase of heterozygous sites (approximately 0.1 % of sites) using PHASE v.2.1.1 [44, 45] for all 1,896 SNPs which were shared between landraces and wild barley, and had genetic map positions (Additional file 21). The runs are set to the default values for number of iterations = 100, thinning intervals = 1, and burn-in = 100. We consider only phased calls with probabilities of at least 90 %. All imputed sites for missing data are re-set to missing values using a customized R script (R Project for Statistical Computing, http://www.r-project.org/). Experimentally phased haplotypes are used in two analyses where they are critical to inference, that is, the estimation of admixture proportions and assessment of identity by state between wild and landrace accessions.
Linkage disequilibrium (LD) as measured by r
2 [46] is calculated for all possible pairwise comparisons on each linkage group based on the 4,527 SNPs included in the iSelect genetic map. We considered SNPs with minor allele frequency (MAF) >5 %. The LDheatmap package in R [47] was used to generate plots of LD relative to genetic distance (Additional file 20).
Genetic assignment
To determine the geographic population structure among the 803 landraces in our dataset, we used a Bayesian clustering algorithm implemented in STRUCTURE [22, 48]. We explored the numbers of clusters (referred to as K) ranging from 1 to 7 (Additional file 2). For each value of K we used 10 replicated runs, with a burn-in length and run length of 100,000 iterations. We used an admixture model because archeological and genetic evidence suggest extensive movement of barley and thus likely admixture [4, 5, 15, 20, 49]. We used the uncorrelated allele frequency model, which is more conservative. STRUCTURE analysis was run based on the 6,152 SNPs for the 803 landraces. Considering the high selfing rate of barley >98.2 % [36] we used a haploid model (option PLOIDY=1). To summarize the assignment results for all replications we used CLUMPP [50]. CLUMPP deals with label switching (that is, when cluster names change between replicates); and multimodality (that is, when individual samples change clusters in each replication).
We used two ad hoc approaches, ΔK [51] and Clusterdness [52], to determine the number of clusters that best explain the population structure among the landraces. ΔK is based on the second order rate of change of the log probability of data between successive K values [51], and Clusterdness [52] is the extent to which individuals are estimated to belong to a single cluster rather than to a combination of clusters (Additional file 22).
The primary population structure identified here (K = 2, Additional file 2) agrees with previous observations of population differentiation of landrace and wild barley accessions east and west of the Zagros Mountains [15, 20]. The large sample considered here permits greater resolution of the geographic differentiation among barley landraces (Additional file 2).
We estimated the degree of differentiation among individuals by PCA. For this analysis we use all the 4,527 SNPs with known genetic position for the 803 landrace accessions. The PCA was performed in the SmartPCA program from the EIGENSOFT package [53]. SmartPCA permits PCA analysis with SNP loci that include missing data, thus our analysis is based on the full SNP genotyping dataset. Procrustes analysis [54] implemented in the vegan package in R [55] was used to identify the optimal rotation that maximizes the similarity between genetic variation on PCA plot and geographic maps of sample locations (Additional file 3).
We used SharedPoly and compute from the libsequence library [56] to calculate summary statistics, including number of segregating sites, number of private alleles in each cluster, and the percent pairwise diversity scaled by number of segregating sites (Table 1).
To further analyze the degree of differentiation between these populations we calculated F-statistics [23, 57] for individual SNPs (6,152 SNPs) genome-wide implemented in the R package HierFstat [58]. To detect genetic differentiation in individual groups of landraces we used focal comparisons of each population to the overall dataset (Additional file 4).
Admixture inference
Using a Maximum Likelihood approach implemented in TreeMix [59], we infer the patterns of population split and mixtures between the six wild barley populations identified in Fang et al. [14]. Populations were identified as Caspian Sea (seven accessions), Central Asian (53 accessions), Northern Levant (42 accessions), Northern Mesopotamia (41 accessions), Southern Levant (107 accessions), and Syrian Desert (34 accessions) (see Table S1 in Fang et al. [14], for geographic location of WBDC accessions). The TreeMix analysis included all wild populations and the four landrace populations identified here. We ran 25 replications of the tree without bootstrapping, and 25 replications with bootstrapping including five SNPs and 25 SNPs at a time. From this we determine that the wild population from the Caspian Sea is more closely related to landrace populations than to other wild populations (Additional file 23) thus suggesting the possibility of a more recent introgression with the landraces [60]. Including the Caspian Sea wild population in the ancestry analysis of the landraces results in greater contribution from the Caspian Sea wild population than expected based on historical human migration information (data not shown). Although, the Caspian Sea wild individuals resemble wild barley morphologically (with a shattering inflorescence and extensive branching) other traits such as seed size and erect tillers could suggest either convergent evolution of phenotypes or that the Caspian Sea wild population has been in more recent contact with domesticated material, a result which could potentially bias our inferences of ancestry. Based on this observation the seven individuals from the Caspian Sea wild population are excluded from analysis of population ancestry, retaining 277 wild barley accessions. We note here that the original WBDC encompasses 318 accessions. Fang et al. [14] identified 30 accessions that appear to be duplicated within the sample or have genotypic composition suggestive of recent introgression. These along with four other samples were removed from the study due to missing latitude and longitude information, resulting in 284 wild barley accessions in our sample.
We utilized a machine learning approach implemented in the software SupportMix [24] to identify the contribution of each of the wild barley populations to individual genomic segments in the landraces. SupportMix can perform admixture analysis by simultaneously analyzing a large number of possible source populations, regardless of their relationship to the focal population and without making assumptions regarding population demographic history or specific population genetic parameters [24]. SupportMix is a two-level method. First, it uses a support vector machine for the classification of the ancestral populations at each genomic segment independent of each other. Once the model is trained to distinguish the source populations it takes one sample at a time (for a specific genomic segment) and assigns it to a putative source population. Second, after all genomic segments are classified for each accession it continues with a smoothing step using a Hidden Markov Model to detect transitions between the different ancestral groups, this approach considers correlations between genetic blocks to limit the effect of regions with poor information content. The wild population with the highest genetic similarity is assigned as the source for that genomic segment and given a probability of assignment to that source population.
The five wild barley populations identified as clearly distinct groups from the landraces are used as potential source populations for the landraces. For this analysis we used 1,896 SNPs found on the iSelect genetic map that were common between the wild barley (SNPs are polymorphic in all 277 wild accessions) and the collection of landraces (803 individuals) (Additional file 21). We ran SupportMix on genomic segments comprised of 50, 75, and 100 SNPs. The wild population with highest similarity is assigned as the source population for that segment. Individual assignment probabilities below 95 % are treated as missing. Inference of admixture using 50 SNP windows results in a large proportion (45 %) of genomic segments with probabilities of assignment below our threshold of 0.95. Thus, SNP windows shorter than 50 SNPs are not used. Increasing the window size to 75 and 100 SNPs results in a higher confidence of ancestry assigned for each genomic region. In these two analyses there were 17.5 % and 19.5 % of the genomic segments across the seven linkage groups in our sample of landraces with probability of assignment below 0.95, respectively. These segments coincide primarily with the boundaries of linkage groups and are treated as missing data. Therefore we use windows of 75 SNPs. The proportion of ancestry genome-wide is estimated as the percentage of contribution of each wild population to the complete landraces dataset.
The predictive accuracy of SupportMix for genetic assignment of individual genomic segments was evaluated by cross-validation using a subset of wild barley individuals as testing samples, maintaining the remaining wild accessions as the validation sample. The test was run 50 times, sampling four accessions (eight haplotypes) from each wild population per iteration, without replacement. As in the landrace assignment, we used windows of 75 SNPs. An average of 16.4 % of genomic segments per individual could not be assigned with confidence (probability of assignment <0.95) to any population of origin. Window sizes smaller than 75 SNPs resulted in > 80 % of the genome being unassigned (data not shown). In summary, among genomic segments that are assigned with high confidence, 69 % are correctly assigned to the population of origin. A notable exception to assignment of genomic segments of wild individuals back to population of origin occurred in the Northern Levant population, where proportional assignment to the Northern Levant wild population averaged 43 %, with 31 % of segments assigning to the geographically proximate Southern Levant (Additional files 10 and 11).
Genetic contribution from proximate wild populations into landraces
We determined the genetic contribution of wild populations in landraces for those growing in the same geographic range as the natural range of wild barley (Additional files 13 and 14). East African landraces are outside this range; therefore they were not considered in this analysis. We calculated the great circle distance between each landrace and the nearest wild individual from each wild population using the R package pracma [61]. We then calculated the correlation between distance and the proportion of ancestry assigned in SupportMix.
Private/Shared alleles analysis
Using the 1,896 SNPs in common between landraces and wild barley, we identified alleles private to each of the five wild barley populations using the software SharedPoly from the libsequence library [56]. We found 115, 20, 20, 17, and nine private alleles corresponding to Southern Levant, Northern Levant, Central Asian, Northern Mesopotamia, and Syrian Desert wild barley populations, respectively (Additional file 16). We search for the presence of these SNPs that are private to individual wild populations in each of the landrace populations; this class of variants is referred to as shared alleles and their observed frequency in each landrace population is shown in Additional file 17 (see also Additional file 15). The estimation of frequency is based on diploid sample size, thus at a given SNP, heterozygous individuals contribute one allele private to the wild population analyzed, and homozygous sites are counted either as zero or two.
Identity by State
An Identity by State (IBS) analysis between the wild and landrace barley lines is used to test for shared genomic segments between populations, consistent with recent introgression. The IBS analysis used PLINK v.1.90 [62] with window sizes of 30 SNPs. Larger window sizes resulted in no shared segments between these two datasets. Therefore, we report the results for windows of 30 SNPs. Only segments with 100 % match for the 30 SNPs were considered as significant. There are 37 non-overlapping IBS segments between landraces and wild, with 18 % of wild individuals sharing segments with 36 % of the landraces within each landrace population (Additional file 17). On average the IBS segments composed of 30 SNPs represent 10.48 cM genomic regions (Additional files 18 and 19).
All code used for analysis and for figures can be found in the GitHub repository, https://github.com/AnaPoets/BarleyLandraces. The raw genotyping data for the 2,446 accessions in the NSGC are available in Figshare, http://figshare.com/articles/Raw_Genotyping_Data_Barley_landraces_are_characterized_by_geographically_heterogeneous_genomic_origins/1468432.