hzAnalyzer: detection, quantification, and visualization of contiguous homozygosity in high-density genotyping datasets
© Johnson et al.; licensee BioMed Central Ltd. 2011
Received: 22 October 2010
Accepted: 11 March 2011
Published: 11 March 2011
The analysis of contiguous homozygosity (runs of homozygous loci) in human genotyping datasets is critical in the search for causal disease variants in monogenic disorders, studies of population history and the identification of targets of natural selection. Here, we report methods for extracting homozygous segments from high-density genotyping datasets, quantifying their local genomic structure, identifying outstanding regions within the genome and visualizing results for comparative analysis between population samples.
Homozygosity represents a simple but important concept for exploring human population history, the structure of human genetic variation, and their intersection with human disease. At its most basic level, homozygosity means that, for a particular locus, the two copies that are inherited from an individual's parents both have the same allelic value and are identical-by-state. However, if the two homologues originate from the same ancestor in their genealogic histories, then the two copies can be described as being identical-by-descent and the locus referred to as autozygous . While autozygosity stems from recent relatedness between an individual's parents, shared ancestry from the much more distant past can nevertheless result in portions of any two homologous chromosomes being homozygous by descent, reflecting background relatedness within a population . Researchers need to integrate information across multiple contiguous homozygous SNPs in an individual's genome to detect such homozygous segments, which, by their very nature, represent known haplotypes within otherwise phase-unknown datasets. As such, they potentially represent a higher-level abstraction of information than that which can be obtained from analysis of just single SNPs. Since this has potential for identifying shared haplotypes that harbor disease variants that escape current single-marker statistical tests, the field would benefit from additional software tools and methodologies for strengthening our understanding of the distribution and variation of homozygous segments/contiguous homozygosity within human population samples.
Early attempts to understand the contribution of contiguous homozygosity to the structure of genetic variation in modern human populations identified regions of increased homozygous genotypes in individuals that likely represented autozygosity . However, due to technological limitations at the time, their micro-satellite-based scan limited resolution of segments to those of an appreciably large size: generally, much greater than one centimorgan (1 cM). Since then, the International HapMap Project, which was initiated in 2002, provided researchers with a high-density SNP dataset [4, 5] consisting of genome-wide genotypes from 270 individuals in four world-wide human populations (YRI, Yoruba in Ibadan, Nigeria; CEU, Utah residents with ancestry from northern and western Europe; CHB, Han Chinese in Beijing, China; JPT, Japanese in Tokyo, Japan).
Using the HapMap Phase I dataset, Gibson et al.  searched for tracts of contiguous homozygous loci greater than 1 Mb in length and found 1,393 such tracts among the 209 unrelated HapMap individuals. Their analysis also showed that regions of high linkage disequilibrium (LD) harbored significantly more homozygous tracts and that local tract coverage was often correlated between the four populations. Our own analysis of the HapMap Phase 2 dataset further quantified the relative total levels of contiguous homozygosity between the four HapMap population samples and showed that average total length of homozygosity was highest and almost equal between JPT and CHB, lowest in YRI, and of an intermediate level in CEU (mean total megabase length of homozygous segments ≥106 kb: JPT = 520, CHB = 510, CEU = 410, YRI = 160) . A number of groups have also examined extended homozygosity (that is, regions of contiguous homozygosity that appear longer than expected) using non-HapMap population samples with commercially available whole-genome genotyping platforms. Among these studies, a non-trivial percentage of several presumably outbred population samples were observed to possess long homozygous segments [7–12]. In addition, high frequency contiguous homozygosity was noted to reflect the underlying frequency of inferred haplotypes [9, 13], and the total extent of contiguous homozygosity (segments greater than 1 Mb in length) was recently used to assist in the analysis of the population structure of Finnish subgroups . Other recent reports have described methods for finding recessive disease variants by detecting regions of excess homozygosity in unrelated case/control samples in diseases such as schizophrenia, Alzheimer's disease, and Parkinson's disease [15–17]. As for available homozygous segment detection methods and computer programs, several studies have utilized their own in-house programs [7, 9, 10, 13, 15] while the genetic analysis application PLINK  has been used in several other reports [11, 12, 14] for detecting runs of homozygosity (ROH).
Here, we introduce hzAnalyzer, a new R package  that we have developed for detection, quantification, and visualization of homozygous segments/ROH in high-density SNP datasets. hzAnalyzer provides a comprehensive set of functions for analysis of contiguous homozygosity, including a robust algorithm for homozygous segment/ROH detection, a novel measure (termed extAUC (extent-area under the curve)) for quantifying the local genomic extent of contiguous homozygosity, routines for peak detection and processing, and methods for comparing population differentiation (Fst/θ). Using the HapMap Phase 2 dataset, we compare hzAnalyzer with PLINK's ROH output and describe the advantages of using hzAnalyzer for performing homozyous segment detection. We then extend our previous analysis  by examining the relative contribution of different sized homozygous segments to chromosomal coverage, followed by mapping extAUC and its associated statistics to the human genome. We examine the consistency of these analyses with the structure and frequency of phased haplotype data, their relationship with recombination rate estimates, and show how one can use extAUC peak definition in combination with Fst/θ to extract genomic regions harboring long multi-locus haplotypes with large inter-population frequency differences. We additionally describe detection of candidate regions of fixation and highlight genes in these regions that appear to have been important during human evolutionary history. To show how these methods can be used for practical real-world applications, we introduce a method for searching for regions of excess homozygosity that could be used to compare case-control samples for genome-wide association studies.
In this report, we describe the methodology behind hzAnalyzer by examining variation in the local extent of contiguous homozygosity across the human genome using approximately 3 million SNPs from the 269 fully genotyped samples of the HapMap Phase 2 dataset [5, 20]. For hzAnalyzer methods and implementation details, we refer readers to the Materials and methods section of this report as well as to the hzAnalyzer homepage , from which the R package, tutorials, and example datasets can be downloaded.
Homozygous segment detection, validation, and annotation
After processing the HapMap release 24 SNPs for certain quality control parameters (see Materials and methods), we built a dataset of homozygous segments' coordinates and characteristics using hzAnalyzer's Java-based detection function to extract runs of contiguous homozygous loci (see Materials and methods). To remove the many short segments that were due simply to background random variation, we filtered this dataset prior to downstream analyses using a new cross-population version of the previously described homozygosity probability score (HPSex; < = 0.01; see Materials and methods) .
Comparison of segment overlap counts between hzAnalyzer and PLINK homozygous segment/runs-of-homozygosity detection routines
Number of intersecting segments in other dataset
Sample variation in the amount of the genome and individual chromosomes covered by autozygous segments is of potential interest for both population geneticists as well as those interested in disease research. Figure 1c identifies several individuals in each population sample (YRI = 5/90, CEU = 4/90, CHB = 1/45, JPT = 2/44) who possessed markedly higher genome-wide coverage by autozygous segments. Of those samples, several extreme outliers were detected that were previously reported (YRI, NA19201; CEU, NA12874; JPT, NA18992, NA18987) [5, 6]. In Additional file 3, chromosome profiles of autozygous coverage show that each of the two extreme JPT NA18987 and NA18992 samples possessed multiple chromosomes with coverage ranging from 6.0 to 43.7%, while YRI NA19201 and CEU NA12874 had only high coverage levels on single chromosomes, with 12.9% coverage on chromosome 5 and 41.3% coverage on chromosome 1, respectively. Scanning through the chromosome profiles shows that the majority of HapMap 2 samples possess one or more chromosomes containing some small proportion of autozygosity. These profiles may be evidence of a continuum of relatedness between individuals within the sample populations, with one end represented by a small group of individuals whose parents share ancestry from just several generations in the past, and the other by individuals with parents who have little or no measurable shared ancestry. Although short autozygous segments stemming from the distant past are, by their nature, random, their presence in a majority of the population could have a cumulative impact on disease when taken across large enough sample sizes.
Extent of chromosome-specific coverage by homozygous segments
Quantifying the local extent of contiguous homozygosity
extAUCpeak detection for delineation of local haplotype structure
Earlier reports showed that homozygous segments with intermediate or high-frequencies correlate with LD statistics and co-locate with haplotype blocks [6, 9]. Based on those past results, we considered that the peak/valley patterns that we observed in plotting extAUC values could be used to delineate regions of the genome with locally similar structure for contiguous homozygosity; analogous to haplotype block  definition but using the information contained within overlapping, co-localized homozygous segments rather than statistical pairwise comparisons between loci.
Genome-wide peak counts at different stages of peak processing
Peak dataset type
Outlier peak regions
Examples of top outlier peak regions for each population
Top 5 gene(s)
BC142657, C11orf49, AMBRA1, PTPRJ, CKAP5
AR, MTMR8, ARHGEF9, HEPH, MSN
ZNF431, ZNF714, ZNF708, ZNF430, ZNF429
FRMD5, TTBK2, UBR1, CASC4, TP53BP1
RABGAP1L, SLC9A11, TNN, KLHL20, RC3H1
DOCK3, MAP4, SMARCC1, CACNA2D2, RBM6
GPHN, MPP5, FAM71D, C14orf83, EIF2S1
LOC51057, EHBP1, VPS54, UGP2, MDH1
No overlapping gene symbols
FAAH2, WNK3, FAM120C, PFKFB1, PHF8
No overlapping gene symbols
ADK, CBARA1, MYST4, CCDC109A, VCL
BCAS3, USP32, TMEM49, APPBP2, CLTC
BC142657, C11orf49, AMBRA1, PTPRJ, CKAP5
ZRANB3, TMEM163, R3HDM1, RAB3GAP1, DARS
CPNE8, KIF21A, SLC2A13, PKP2, C12orf40
AK309286, GABBR1, ZNF322A, ZNF184, TRIM38
ZMYM4, EIF2C3, KIAA0319L, THRAP3, EIF2C4
FRMD5, TTBK2, UBR1, CASC4, TP53BP1
OPHN1, AR, MTMR8, ARHGEF9, HEPH
ITFG1, PHKB, LONP2, FLJ43980, N4BP1
DOCK3, MAP4, SMARCC1, CACNA2D2, RBM6
PHF20, ITCH, PIGU, NCOA6, UQCC
BCAS3, USP32, TMEM49, APPBP2, CLTC
AGBL4, FAF1, ZFYVE9, OSBPL9, EPS15
EXOC6B, SFXN5, RAB11FIP5, EMX1, CYP26B1
HERC1, CSNK1G1, ZNF609, DAPK2, USP3
HCN1, GHR, OXCT1, NNT, MGC42105
No overlapping gene symbols
OPHN1, AR, MTMR8, ARHGEF9, HEPH
DOCK3, MAP4, SMARCC1, CACNA2D2, RBM6
ITFG1, PHKB, LONP2, FLJ43980, N4BP1
BC142657, C11orf49, AMBRA1, PHF21A, PTPRJ
AGBL4, FAF1, ZFYVE9, OSBPL9, EPS15
PHF20, ITCH, PIGU, NCOA6, UQCC
EXOC6B, SFXN5, EMX1, CYP26B1, SPR
HERC1, CSNK1G1, ZNF609, DAPK2, USP3
GPHN, MPP5, C14orf83, FAM71D, PLEKHH1
BCAS3, PPM1E, USP32, TEX14, TMEM49
Population differentiation in genomic regions with high-ranking extAUCvalues
We then examined how regions of high-ranking extAUC values overlap between the four populations. Within the coordinates of each peak in the dataset, we calculated the maximum value rank for each of the four population's extAUC values. Considering that outlier peaks represent extAUC values with ranks above approximately 0.85 to 0.90, then Additional file 10 shows that the majority (>50%) of outlier peaks in one population intersect with similarly high-ranking extAUC values in the other groups, with more than 75% of outlier peaks in JPT and CHB intersecting with high-ranking extAUC values in the other East Asian population (see Materials and methods).
Based on those observations, we then used this method to search for a set of the longest genomic regions possessing high haplotype frequency differences between the two East Asian populations, which have often been considered similar enough to combine for analytical purposes. We selected peaks that had both high-ranking extAUC values in the two groups as well as extreme Fst/θ values. Across the set of 70 JPT and 70 CHB peaks shown in Table S5a,b in Additional file 2, there was an average haplotype frequency difference of 15 ± 5% (mean ± SD%) between CHB and JPT. Additional file 12, which shows the top five peaks (after sorting by the proportion of extreme Fst/θ value loci) for CHB and JPT, indicates that the observed structure of phased haplotypes tends to agree with the estimated haplotype frequency differences in Table S5a,b in Additional file 2. For example, the first plot for Chr 1:187.22-187.85 Mb spans 377 loci (minor allele frequency (MAF) >0.01 in JPT or CHB) and shows two distinct haplotypes with an estimated 0.20 frequency difference that extends across the whole 600-kb window. The top JPT region at Chr 22:39.04-39.11 Mb is a region with only 32 SNPs (MAF >0.01 in JPT or CHB), but 83% of them have extreme Fst/θ values. One background haplotype can be seen to increase in frequency from 0.21 in CHB to 0.45 in JPT, representing an estimated 0.23 frequency difference for this region. Such frequency differences may reflect natural selection but also may represent the effects of random genetic drift in allele frequencies since ancestral Japanese populations migrated from the Asian continent.
These results show that detection of outlier peaks/peak regions using hzAnalyzer in conjunction with measures of population differentiation can be used for extracting genomic regions with substantially similar or dissimilar haplotype structure between sample populations from high-density genotyping datasets.
Genomic regions containing areas at or approaching fixation
Figure 4 (for example, left panel of CEU, right panels of CHB and JPT) and Additional files 5, 6, 7 and 8 display some peaks that extend across all or almost all percentiles (from the 100th down to the 0th) in PEmat, representing genomic regions that are homozygous across all or almost all samples within a population and for which a single haplotype may be at or near fixation. Such regions may represent the impact of past natural selection in the human population, by which selected mutations have been driven to high frequency and the surrounding polymorphims have 'hitch-hiked'  as a haplotype to higher frequency. Eventually, other forces such as genetic drift may finally reduce other variation, leaving just a single haplotype to predominate.
Summary of RCL0 in candidate fixed areas
We compared the overlap of our candidate fixed areas and regions with data from five earlier reports that searched for genomic regions that had evidence for positive selection resulting in a selective sweep or that appeared differentially fixed in one population versus another [5, 27–30] (Tables S6 and S7 in Additional file 2). To ease examination of the correspondence and overlap between our results and those of other studies, we created a combined set of fixation candidate regions by resolving any overlap between the three sample populations' candidate fixed regions and present those in Table S8 in Additional file 2. This set represents much larger genomic regions compared to the RCL0 candidate fixed areas, with genome-wide coverage of 84 Mb for this set versus approximately 11 Mb of RCL0 that are embedded within these larger regions. Over half of the candidate fixed areas did not intersect directly with known coding regions, which is in keeping with previous reports that searched for regions experiencing positive selection [28, 31]. However, summarization across Table S8 in Additional file 2 shows that over 50% of the combined regions had RCL0 that intersected at least one coding region.
Of the combined fixation candidate regions, 87.5% overlapped with regions from at least one of the other studies. Overall, 39.3% of the combined regions overlap Sabeti et al.'s  results (Sabeti, n = 255), while 85.1% overlap with the Kimura et al.  report (Kimura, n = 1,379; no chromosome X data). Similar to the Sabeti et al. comparison, the regions from Tang et al.  intersected 44.7% of the combined regions (Tang, n = 604; no chromosome X data), but only 5.4% of our regions intersected those in O'Reilly et al.  (O'Reilly n = 60).
Extending hzAnalyzer analyses to case-control association studies
In addition to the population genetic analyses described above, we have recently developed a methodology that we term agglomerative haplotype analysis (AHA) for detecting genomic regions that possess increased or decreased levels of contiguous homozygosity in a case-control association study approach . For the current version of hzAnalyzer, we provide a supplementary R script on the hzAnalyzer web-site along with a tutorial for performing a comparison between two groups of samples . For researchers wishing to implement these methods, we recommend that population stratification needs to be controlled either through appropriate statistical methods  or by genetically matching case and control samples . For the following example, we will use the JPT and CHB population samples as proxies for a real case-control dataset.
The similarity and dissimilarity of genetic variation between different human populations has been an important ongoing question of genetic analysis [38, 39], with recent research addressing how specific populations that have historically been considered relatively homogeneous might still harbor important hidden heterogeneity [14, 40–43]. Such questions also tie directly into the topic of relatedness and to what extent haplotypes are shared between individuals within a population compared to between populations. As we have shown here, analysis of contiguous homozygosity can provide a means for estimating the extent and approximate frequency of such shared haplotypes that are present within human population samples. hzAnalyzer provides researchers with a new set of computational tools for performing comprehensive analyses of contiguous homozygosity in high-density genotyping datasets.
We conclude that hzAnalyzer's detection method proved more appropriate than PLINK's for detecting complete autozygous segments and that our algorithm may be less likely to call regions with low but real heterozygosity as homozygous segments. However, we should note that more time spent optimizing PLINK's settings might produce better overlap between the two datasets, although default settings are those that are most likely to be used by normal users. We also note that hzAnalyzer provides the user with finer control over their choices for calling segments across regions with heterogeneous levels of heterozygosity and inter-SNP gaps (see Materials and methods and hzAnalyzer tutorials).
In addition, a number of ROH analyses [6–13, 15] have been reported that did not include freely available software for use by outside researchers. Of those, that of Curtis et al.  is the only one that we know of that included genome-wide graphical output upon which we could base any comments for comparing underlying ROH detection methodologies. Because of their choice not to define any SNP density or inter-SNP gap thresholds, the figures in Curtis et al. impart the impression that all centromeres and large gaps possess very high-frequency contiguous homozygosity extending across large genomic distances. However, we do not feel that the amount of information present at those gap's edges supports such conclusions. Curtis et al. mentioned that setting a criterion for marker density would have had the effect of excluding the centromeres from analysis. However, our hzAnalyzer results show that it is possible for a detection algorithm to allow segments with a higher likelihood of being truly homozygous to span such large gaps while reducing false positive 'extended' homozygous segments caused by small numbers of homozygous SNPs at a gap's edges.
While we identified a general agreement between detected autozygous segments and low-coverage sequencing data from 1000 Genomes Project (Figure 7), subsequent analyses comparing larger numbers of samples between microarray and next-generation sequencing data would help add to our understanding of how to perform more accurate analyses in structurally heterogeneous regions of the genome. We posit that comparisons of ROH and autozygous segments with genotype data from both low-coverage and high-coverage sequencing data may provide a better means for adapting hzAnalyzer parameters to local, rather than global, error rates (that is, heterozygosity thresholds, SNP density).
In examining and quantifying contiguous homozygosity, chromosome X stood out as possessing much longer homozygous segments compared to similarly sized autosomes, representing higher frequency extended haplotypes and genomic regions with decreased diversity. The different biology and evolutionary history of chromosome X has previously been noted , with specific differences reported for mutation rate [45–48], LD [4, 5, 49], and natural selection . Such differences between autosomes and chromosome X likely relate to the latter's haploid/hemizygous status in human males, wherein natural selection has had a greater chance to work on both beneficial and deleterious recessive mutations [48, 50]. On chromosome X, we feel that discussion of positive selection flows naturally into the topic of haplotype fixation, which is perhaps the most extreme result of positive selection, in which a selected variant and the region (haplotype) around it rise in frequency in the population  to the point that one single haplotype accounts for all or almost all of that region's genetic variation.
Of all chromosomes, chromosome X was the largest single contributor to the list of combined fixation candidate regions, with 9 (16%) out of the 56 unique non-overlapping regions detailed in Table S8 in Additional file 2. As such, these regions all had high extAUC values and very long extent in at least one of CEU, CHB, or JPT. Six of these regions overlapped with those detected in previous studies: five in Sabeti and colleagues' dataset [5, 28] and one in O'Reilly et al. . The three regions that appear novel for this report reside at 104.2 to 105.4 Mb, 113.7 to 114.4 Mb, and 126.1 to 127.7 Mb; phased haplotype plots can be found in Additonal file 13. The last listed region appears especially interesting, as in Additional file 13 one can see that it contains a broad region (approximately 1 Mb wide) of loci near or at fixation, but this region contains only a single small candidate gene, which, due to its function, makes an interesting potential target of natural selection. The actin-related protein T1 gene (ACTRT1), is only 1,437 bp long, consists of a single exon and is a major component of the calyx, which is a key structure in the perinuclear theca of the mammalian sperm head . Interestingly, one report showed that while many sperm-expressed proteins appear to be evolving more rapidly on chromosome X versus those on autosomes, ACTRT1 appeared in the lower end of the spectrum of amino acid substitution rates , which suggests that the gene is under strong selective constraint due to the key structural role that this protein plays in sperm function.
One of the combined regions that overlapped the most with the earlier reports contains the exocyst complex component 6B gene (EXOC6B; Chr 2:71.9-73.1 Mb), the region of which was reported as a top target of positive selection by Sabeti et al.  (although the gene was not listed) and included by Kimura et al.  and Tang et al.  in supplementary tables (Kimura et al., gene then known as SEC15L2; Tang et al., gene listed was the neighboring small CYP26B1 gene) [27–29]. EXOC6B, a component of the mammalian exocyst complex, participates in active exocytosis , and the original discovery report suggested that it may play a role specific to exocyst complexes in nerve terminals . EXOC6B has one of the highest peak extAUC values across autosomes in JPT, the highest base-pair coverage by RCL0 for any autosomal fixed area, and phased haplotype plots show that it is also nearly fixed in CHB (Additional file 13). Interestingly, that plot shows that the same extended haplotype is at high frequency in CEU, but that YRI possesses the remnant of another very different extended haplotype in this gene. This observation suggests that this gene may have experienced differential selective events during human history.
In comparing our combined fixation candidate regions with those in earlier reports, we observed the greatest overlap with that of Kimura et al. This greater overlap was likely due to the similarity between the two methodologies, since they both searched for areas of the genome that are essentially monomorphic in each target population. However, while their method searched for regions that were fixed in a single target population by using another population as reference, the current report's methods made no such explicit comparisons. Nevertheless, our use of HPSex to filter segments prior to calculating extAUC and creating PEmat placed an implicit target/reference structure on downstream analyses. In order for any particular segment to have been considered informative enough to include in an analysis such as candidate fixed area detection, then at least one population of those examined must have had homozygosity frequency (FreqHOM) values that were informative across enough loci that HPSex could satisfy the 0.01 threshold. The use of HPSex along with our use of JPT and CHB as separate groups (while Kimura et al.  had combined them) allowed us to detect candidate regions of population-specific fixation between the two East Asian populations. One can search for other candidate regions by filtering Table S8 in Additional file 2 for 'Pops' that include only one of the two East Asian groups. We should also note that while we detected 366 candidate fixed areas, the Kimura et al. report  contained 1,379 hits, of which only 120 overlapped with our candidate fixed areas and only 47 with our combined candidate fixation regions. Since the average SNP count of Kimura et al. hits (mean SNP count = 12.3) was below the minimum used for extracting all of our candidate fixation areas, the overlap of our set and theirs likely represents the longest and highest confidence hits from their dataset. Resolving the small details of the intersection between our fixation candidates and the external datasets lies beyond the scope of this report.
The recombination rate analysis in Figure 7 showed that outlier peaks possess much lower recombination rates compared to regions that are haplotypically more heterogeneous. This is of potential concern, as regions that lack recombination hotspot motifs [56, 57] might mimic signatures of natural selection. Since methods that have been used to detect extended haplotype homozygosity may make assumptions about uniform recombination rates across the genome [58, 59], this might lead to higher false positive rates than expected. This supports the necessity of using multiple lines of evidence for identifying signals of positive selection, as has been done in a number of major recent studies [22, 28]. Conversely, another recent study  showed, using simulation, the profound effect of increasing selection coefficients on the ability of the LDhat  recombination inference program to detect recombination. This could impact our perception of local recombination rate estimates and hotspot locations. The potential confounding between the two phenomena may be alleviated as sample populations in addition to YRI, CEU, CHB, and JPT are used for recombination rate and hotspot inference.
While population genetic analyses provide us with knowledge and insight that helps guide our development of practical applications such as genome-wide association studies, it is the latter that will provide a measurable benefit to humankind. The predominant model for such studies has been the common disease/common variant hypothesis , but other hypotheses have been proposed, such as the multiple rare variant hypothesis [18, 62]. To develop methods for detecting variants under these different assumptions, several researchers have proposed methods for combining information across multiple loci that may be able to distinguish shared haplotype segments or clusters of loci representative of underlying rare variants [18, 63–65]. Also, several groups have proposed a version of population-based autozygosity mapping, which considers that rare variants generally are younger and reside on haplotypes that are longer than control haplotypes within the same genomic region [15, 17]. Disease susceptibility variants that have a recessive mode of inheritance then may be detectable as regions of increased homozygosity/autozygosity in cases versus controls. In this report, we introduced our AHA methodology for applying contiguous homozygosity analysis to case-control association studies. The next version of hzAnalyzer will incorporate additional functions for AHA-based case-control association analysis, such as genetic sample matching , to account for population stratification, and additional permutation-based methods to determine genome-wide significance thresholds.
We provide a comprehensive analysis of the extent of contiguous homozygosity in the four human population samples of the HapMap Phase 2 dataset using hzAnalyzer, our newly described R package. As we show, the local extent of homozygosity varies greatly between both different regions of the genome and different sampled populations, yet shows similarities between populations as well. hzAnalyzer should prove useful for interrogating the local genomic structure of contiguous homozygosity in comparisons of other worldwide populations, large population samples such as the Japan BioBank , and for detection of genomic regions harboring recessive disease variants.
Materials and methods
hzAnalyzer is an R  package that uses Java classes for detecting runs of contiguous homozygous genotypes and R functions for quantifying the frequency and extent of contiguous homozygosity across the genome and visualizing the results. For genotype processing and analysis, we utilized the R package snpMatrix . hzAnalyzer package version 0.1, which this report describes, can be downloaded from the hzAnalyzer homepage  and installed as a local source package into R. For filtering potential erroneous genotypes, we included freely available R code that implements an exact test of Hardy-Weinberg equilbrium [68, 69]. hzAnalyzer includes a small built-in example dataset for a single 10-Mb region along with example code for detecting and annotating homozygous segments as well as calculating and plotting extAUC values for one population. In addition, the package provides built-in user-guides that can be accessed using R's help facilities. Those are also available from the hzAnalyzer homepage, which also hosts the R scripts to accompany the user-guides, a single chromosome hzAnalyzer example dataset (hzAnalyzer_example_data.tgz), and a number of supplementary data files. The following sections will briefly describe how we contructed the genotyping dataset that we used to illustrate this program's functions and capabilities, followed by a description of our program, methodology, and analytical workflow. hzAnalyzer's scripts and tutorials provide greater detail about most of the methods that are described in the following subsections.
We constructed a consensus set of genotypes from HapMap Phase 2 release 24 non-redundant autosomal and chromosome X data downloaded from the International HapMap Project . The analyzed data included 30 YRI trios, 30 CEU trios, 45 unrelated CHB, and 44 unrelated JPT samples (one of the original 45 JPT samples had incomplete genotyping data). Our consensus release 24 dataset contained 2,816,866 SNPs after applying both HapMap's and our own set of quality control filters, and selecting SNPs with MAF >0.05. Only genotypes for females were used for chromosome X segment detection. Complete details on contructing this dataset are built into the hzAnalyzer documentation.
Chromosomal abnormalities, immunoglobulin regions, and copy-number variable regions
We defined chromosomal abnormalities and copy-number variations (CNVs) based on the results of an in-house algorithm for detecting gross chromosomal abnormalities and copy-number segments, and on gain or loss genotypes from the HapMap Phase III CNV genotype dataset that was downloaded from the HapMap website (file: hm3_cnv_submission.txt, 28-May-2010). Documentation built into hzAnalyzer describes the construction of this dataset.
SNPs were excluded from the consensus dataset if they intersected a gain-loss region that had ≥5% frequency or intersected any of the immunoglobulin variable region (IgV) light and heavy chain regions, which were defined as IGL(Chr 22:20,715,572-21,595,082), IGK(Chr 2:88,937,989-89,411,302), and IGHV(Chr 14:105,065,301-106,352,275). Homozygous segments were masked (set as missing) from analysis if greater than 50% of the included loci intersected chromosomal abnormalities and/or CNVs.
Detection of homozygous segments
The heuristic algorithm used by hzAnalyzer is a modified version of the one developed for the HapMap Phase 2 paper  and was designed to detect homozygous segments while intelligently accounting for inter-SNP gaps as well as a low level of heterozygote 'error' [5, 6]. hzAnalyzer's multi-step process for detecting and defining homozygous segments consists of: 1) basic detection of runs of homozygous and heterozygous genotypes; 2) joining of neighbouring homozygous segments across regions of low SNP density; 3) modeling of detected homozygous and heterozygous segments to allow for a low level of heterozygous 'error'; and 4) scan-ahead method to examine neighboring segments with heterogeneous gap and/or heterozygosity structure. User-defined parameter arrays allow control of program behavior, with the ability to vary the minimum segment SNP density and minimum segment length as a proportion of inter-segment gaps based upon gap size ranges. This allows the user to define that very long autozygous segments be allowed to span large gaps (for example, centromeres), but that shorter segments be truncated at the gap edges. hzAnalyzer's default parameter array values were used for this report, except for the lowest values of the gapWidthScanThresholds and gapWidthJoinThresholds. Those gap-width threshold values were determined from the distribution of inter-SNP gaps for the particular chromosome and dataset being used. Inter-SNP gap values were log-transformed, a standard R boxplot summary calculated, and the threshold values calculated as the exponential function of the upper whisker (Third quartile + 1.5 × Inter-quartile-range) value.
In addition, the program is controlled by thresholds for the maximum allowed proportion of heterozygotes within segments and a heterozygosity threshold to control the point at which the scan-ahead routine stops execution. For this report, the former threshold was set to 1% and the latter to 2%, which were based on analysis of the proportion of heterozygotes in different bin sizes across known autozygous segments.
Comparison of hzAnalyzer and PLINK ROH output
We compared the output of our hzAnalyzer ROH detection method with output from PLINK v.1.0.7. We downloaded the PLINK format '.bed' files (hapmap_rel23a.zip) from their website , used the list of consensus rsids from our dataset for extracting loci, and detected ROH using PLINK's default settings. The default settings return ROH >1 Mb long, so for comparison, we extracted segments from our dataset using that length threshold. We used hzAnalyzer's get.generic.intersection function to intersect the PLINK (>1 Mb) set with our complete segments dataset filtered for segments with more than 50 SNPs. We then intersected our hzAnalyzer (>1 Mb) set with the PLINK (>1 Mb) set for the reverse comparison.
Homozygosity probability score
To filter out spurious segments that were more likely due to chance, we calculated the homozygosity probability score (HPS) for each detected segment. We introduced HPS in a previous report , but modified it for this report's purposes to facilitate comparison between populations at local positions across the genome. We previously calculated HPS as the product of a segment's constituent loci's observed homozygosity frequencies (FreqHOM) from within the population to which a respective segment's sample belonged. However, that version of HPS is strongly influenced by allele frequencies in regions of the genome that have experienced population-specific fixation. Since we were interested in detecting and analyzing such fixed regions, we decided to use FreqHOM values calculated across the four populations' samples to calculate a population external version of the HPS statistic. We denote the modified version as HPSex and the previous version as HPSin.
Minimum inclusive segment length
To extract sets of higher confidence segments, we calculated what we here term a minimum inclusive segment length (MISL), which we had previously introduced  to more accurately compare sample populations on a genome-wide basis. Here we calculated MISL specific to each chromosome (MISLchr) by finding the longest segment in each individual for each chromosome and then choosing for each chromosome the lowest observed value. The genome-wide MISL (MISLgw) was simply the smallest value among the MISLchr. The MISLchr values for this current dataset are shown in Table S1 in Additional file 2. For analysis of chromosomes 7, 8, and X, we calculated the MISLchr7,8,X specific to this group of chromosomes from the lowest MISLchr value among them.
Chromosome-specific coverage by homozygous segments
To calculate chromosome-specific coverage, we selected homozygous segments with SNP density >0.2 SNP/kb (1 SNP every 5 kb) and HPSex ≤0.01 and then calculated the cumulative sum of segment length across each chromosome for each individual, interpolated values for set length cutoffs, and then used the median values across each population sample at those cutoffs for plotting. We then calculated chromosomal coverage as the proportion of mappable chromosome length by dividing those median values by the mappable chromosome length, which we defined for each chromosome as Positionlast SNP - Positionfirst SNP - Sumlength of gaps >500 kb. An individual's data were excluded if the total SNP count contained in segments intersecting chromosomal abnormalities or CNVs (as defined above) was greater than 10% of a chromosome's total SNP count. Homozygous segments intersecting chromosomal abnormalities for individual samples that fell below this threshold were individually excluded from analysis.
Intersecting segment length matrix
hzAnalyzer's segment detection process also outputs two matrices that we have termed intersecting segment length matrices (ISLMs). The ISLM matrix rows and columns correspond to individual samples and chromosomal SNP positions, respectively. Each ISLM cell contains the length of any segment in an individual that intersects a particular SNP position. The two matrices, ISLMbp and ISLMcm, are provided in base pair or centimorgan units, respectively, with centimorgan units calculated using chromosomal arm-specific average recombination rates based on the Rutger's second generation genetic map , which we downloaded from the project's website . We additionally abbreviate each column in an ISLM as an ISLV. After segment detection and ISLM creation, we create copies that are masked (values replaced with NA = 'missing') for intersecting chromosomal abnormalities and CNVs.
Identifying putative autozygous segments
The median of a segment's ISLVs' MAD scores was assigned as that segment's MAD score. Putative autozygous segments for masking purposes were defined as those with segment MAD scores >10, while for autozygosity analysis they were defined as those with MAD scores >10 and founder haplotype frequency equal to zero.
Founder haplotype frequency estimation
To estimate the frequency of founder haplotypes underlying each homozygous segment, we first created a consensus phased haplotype dataset for UCSC hg18 genome build coordinates using the release 22 (2007-08_rel22) phased haplotypes for autosomes and the release 21 (2006-07 phase II files) phased haplotypes for chromosome X from the HapMap website ; the release 21 data were converted to hg18 coordinates using liftOver [74, 75] with the hg17Tohg18.over.chain.files. We determined hg18 phased haplotypes for chromosome X for CEU and YRI trios using functions that we built into hzAnalyzer to perform deterministic phasing (see hzAnalyzer documentation). We used R's dist function to calculate the manhattan distance between each segment's founder haplotypes and the other phased haplotypes in the population sample and estimated the frequency of a segment's founder haplotypes as the proportion that were nearly matching (≤1% dissimilarity or ≤1 SNP difference, whichever was greatest).
Calculating a measure of variation in the local extent of contiguous homozygosity
Masking segments and ISLM for putative autozygous segments
To reduce the effects of extremely long segments on the homozygous extent measures that we describe in this section, we extracted putative autozygous segments (segment length MAD score >10) and masked the corresponding samples and coordinates in the ISLMbp and ISLMcm. In contrast to how we masked chromosomal abnormalities and CNVs (in which corresponding cells were set to missing), we adjusted an extreme segment's value in a particular ISLV to that of the highest non-autozygous length in that vector.
For each sample population and chromosome, we calculated the percentile values across each ISLVbp or ISLVcm using R's quantile function evaluated for probabilities 0 to 1 in 0.01 increments. In this report, we refer to the combined matrix of loci (rows) versus percentiles (columns) as the percentile-extent matrix (PEmat).
extAUC: a variable derived by integrating homozygous extent
We reversed the sign of each ISLV's masked values in ISLMcm and then split each vector based on population sample. Each population's subvector was passed to R's ecdf function to calculate an ECDF object for each population's values within the ISLV. The sign reversal at the beginning of this process was meant to orient the values such that the less frequent longer segments represented the starting point for calculating the ECDF. We then integrated (using R's integrate function) the area under the curve of each ECDF. The lower and upper boundaries of the interval for integration were determined from the reversed sign ISLV values before splitting; the lower boundary was the lowest value after reversing sign (longest masked extent value) and the upper boundary was the closest value below zero. We used the closest value to zero as opposed to zero itself, since the area between the two had no measured values and appeared to add noise to the final extAUC values. This process is diagrammed in Figure 3a-c.
Peak detection for delineation of extAUCvalue extrema
To delineate the extremes of local variation in extAUC values, we divided each population's values by determining local peaks and valleys in the data. To determine inflection points in the values, we first smoothed the data using R's smooth.spline function and calculated points at which the first derivative of the smooth spline function changed sign. The number of knots used by the smooth.spline function was allowed to vary as 3% of the loci that required smoothing. This parameter value allowed definition of peaks at an optimal resolution while reducing overfitting of the data. We then merged together neighboring peaks that possessed similar peak characteristics using an iterative process whereby two peaks were merged if the peak heights and intervening valley differed by less than 10%. We refer to the sets of peaks before and after merging as the complete peaks set and merged peaks set, respectively. Reference to processing 'peaks' in the next section or in the main text refers to the merged peaks set.
Definition of outlier peaks and peak regions
To extract peaks that appeared extreme for each of the populations, we determined outliers at the high end of the distribution of extAUC peak heights using standard R boxplot statistics (>upper whisker; >Third quartile + 1.5 × Inter-quartile range). This process was performed for each population with outlier peaks for each chromosome calculated separately. We performed an additional round of peak merging to merge together any directly adjoining outlier peaks that lacked clear separation (Intervening valley >0.5 × Peaks' heights) into a set of outlier peak regions.
Comparison of the extent and frequency of homozygous segments with haplotypes underlying extAUCpeaks
We then defined a peak's optimal parameters for Extent and Pr(X >Extent) as those that maximized the value of Extent × Pr(X > Extent), as illustrated by the red rectangle in the right-center panel of Figure 5. We label those optimized values of Pr(X > Extent) and Extent as p max and Extent min . If a single haplotype is responsible for a peak's observed contiguous homozygosity with length greater than Extent min , then its frequency should be close to the square-root of p max , which we use as the expected haplotype frequency (Freq hap-exp ). As a haplotype frequency estimate for a particular peak (Freq hap-max ), we used the maximum of the estimated haplotype frequencies among segments with length greater than Extent min .
Intersection of peaks with canonical coding regions
To determine genes underlying each peak or peak region, we intersected their coordinates with a set of genes and coding regions. We downloaded the hg18 11-May-2009 versions of the kgXref, kgTxInfo, and knownCanonical tables  from the sequence and annotation page of the UCSC Genome Browser website . We joined these tables together using the kgID, name, or transcript fields, respectively, and filtered the resulting table to include only data with the categories of 'coding' or 'antibodyParts'. Entries with the same geneSymbol that had overlapping coordinates were merged together.
extAUCrank value calculation
In order to quantify how peak delineated extAUC values in one population compared with extAUC values in the others, we transformed extAUC values into a rank value using the empirical cumulative distribution function (with each population and chromosome performed separately). Using R's ecdf function, this simply converted each of the extAUC values into its cumulative probability; for an extAUC value x, F(x) = P(X ≤ x). We then assigned each peak the maximum rank value across the loci between its lower and upper valleys and also annotated each peak with the highest rank value for each of the other populations' extAUC values within the peak's coordinates.
Definition of genomic regions containing areas of contiguous loci at or near fixation
To detect candidate fixed areas, we examined PEmat for runs of contiguous loci that had non-zero extent values in the 0th percentile (RCL0). As such, these should represent areas of the genome that possess overlapping homozygous segments in all of a population's individuals. To remove the shortest runs with small numbers of loci, we defined candidate fixed areas as RCL0 that had centimorgan extent values and SNP counts greater than the first quartile within each population. We intersected each candidate fixed area with the canonical coding region gene set described above as well as with the set of outlier peak regions. We merged the CEU, CHB, and JPT candidate region tables and resolved overlaps between the regions to create a set of combined candidate fixed regions. Candidate fixed areas, regions, and the combined regions were intersected with data from several earlier reports that had investigated selective sweeps or fixation using HapMap data [5, 27–30].
Calculation of Fst/θ combined across loci underlying outlier peaks
For each peak, we quantified differentiation between the peak's sample population and each of the other populations using a standard Fst measure. For this comparison, we used Weir and Hill's genotype frequency-based moment estimator of θ that includes terms to adjust for differences in population sample size [24, 25].
We extracted peaks for either CHB or JPT that had maximum extAUC values in the top 10% of both populations' extAUC value distribution and that had Fst/θ values exceeding 0.0360 for autosomes and 0.0538 for chromosome X. For each of those peaks, we extracted high Fst/θ SNPs underlying each outlier peak and estimated the haplotype frequency as the average allele frequency across those highly differentiated loci (after orienting the allele frequencies so that the minor allele was considered the A allele).
Concordance of putative autozygosity with 1000 Genomes Project genotypes
The 1000 Genomes Project November release genotype data (file All.2of4intersection.20100804.genotypes.vcf.gz)  were downloaded and then processed using VCFtools . Of 231 HapMap Phase 2 samples with autozygous segments, 140 were also present in 1000G, and 413 of 636 putative autozygous segments were available for comparison. Within each segment's coordinates, we calculated the proportion of heterozygous genotypes in each 1000G sample as well as a z-score and rank comparing the segment's sample heterozygosity with that in its corresponding sample population.
Analysis of extAUCpeak height and 1000 Genomes Project recombination rate
We downloaded recombination rate/genetic map data (file 1000G_LC_Pilot_genetic_map_b36_genotypes_10_2010.tar.gz) for the 1000 Genomes Project pilot data  and created population-specific genetic maps using the 1000G population-specific recombination rate maps (CHB and JPT are combined). The available genetic map data only included autosomes, so we could not analyze chromosome X peaks in our dataset. We calculated the genetic map distance (centimorgans) and recombination rate across both the full-width as well as central half-width for each peak in the merged peak dataset. Peak central half-width was defined as the midpoint of the upper 25% of the peak's extAUC values ± one-half of the peak's width. Recent reports showing that only a small fraction of the genome accounts for most recombination [5, 21] suggest that the distribution of recombination rate values varies depending on the size of the region examined. To account for this, we transformed the peaks' recombination rate values into cumulative probabilities (that is, percentile ranks), with each peak width/half-width recombination rate matched with values averaged across bins of similar size (bin widths: 5 kb, every 10 kb from 10 kb to 100 kb, every 100 kb from 200 kb to 500 kb, 750 kb, 1,000 kb, 5,000 kb, 10,000 kb) for the same chromosome and population.
Extending hzAnalyzer to case-control association analysis
To compare the distribution of segment length for a particular genomic position between two population samples, we calculated a version of the two-sample CVM test statistic ω2 [35–37], the integral of the squared differences between the two samples' ECDFs evaluated at each of the x values across the combined set of samples. For the example in this report, we used the JPT and CHB data from the ISLMcm for chromosome 20, split the data for the two samples, and created two function objects using R's ecdf function at each SNP position. The functions were then evaluated using the unique length values across the combined set of JPT and CHB values at a particular position. To determine the significance of a particular position's ω2 value, we performed a permutation test  by randomly rearranging the two samples' labels and recalculating ω2 for enough replications that an accurate approximate achieved significance level could be calculated.
hzAnalyzer includes functions to assist in constructing several types of complex figures for comparing populations as well as different regions of the genome. Phased haplotype plots used the phased haplotype data described under the section 'Founder haplotype frequency estimation'. All samples' haplotype data for a plotted region was ordered using R's agnes function (cluster package), which performs agglomerative hierarchical clustering, and each group's haplotypes plotted separately. Documentation on the hzAnalyzer website  can assist in producing some of the figures described in this report.
1000 Genomes Project dataset
agglomerative haplotype analysis
Utah residents with ancestry from northern and western Europe
Han Chinese in Beijing, China
Cramer-von Mises statistic
empirical cumulative distribution function
extent-area under the curve
homozygosity probability score
population external homozygosity probability score
intersecting segment length matrix
intersecting segment length vector
Japanese in Tokyo, Japan
median absolute deviation
minor allele frequency
minimum inclusive segment length
run of consecutive loci in the 0th percentile of PEmat
runs of homozygosity
single nucleotide polymorphism
Yoruba from Ibadan, Nigeria.
This work owes many thanks to the efforts of the International HapMap Project and its funding bodies, as without the data they have produced, this work would not have been possible. We would also like to thank Junko Ohata, Akihiro Fujimoto, Yumi Yamaguchi-Kabata, and Keith Boroevich for their careful proofreading of this manuscript.
- Hartl DL, Clark AG: Population substructure. Principles of Pgenetics. 1997, Sunderland, MA: Sinauer Associates, 111-162. 3Google Scholar
- Weir BS, Anderson AD, Hepler AB: Genetic relatedness analysis: modern data and new challenges. Nat Rev Genet. 2006, 7: 771-780. 10.1038/nrg1960.PubMedView ArticleGoogle Scholar
- Broman KW, Weber JL: Long homozygous chromosomal segments in reference families from the centre d'Etude du polymorphisme humain. Am J Hum Genet. 1999, 65: 1493-1500. 10.1086/302661.PubMedPubMed CentralView ArticleGoogle Scholar
- Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.View ArticleGoogle Scholar
- Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y, Hu H, Hu W, Li C, Lin W, Liu S, Pan H, Tang X, Wang J, Wang W, Yu J, Zhang B, Zhang Q, Zhao H, et al: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449: 851-861. 10.1038/nature06258.PubMedView ArticleGoogle Scholar
- Gibson J, Morton NE, Collins A: Extended tracts of homozygosity in outbred human populations. Hum Mol Genet. 2006, 15: 789-795. 10.1093/hmg/ddi493.PubMedView ArticleGoogle Scholar
- Li LH, Ho SF, Chen CH, Wei CY, Wong WC, Li LY, Hung SI, Chung WH, Pan WH, Lee MT, Tsai FJ, Chang CF, Wu JY, Chen YT: Long contiguous stretches of homozygosity in the human genome. Hum Mutat. 2006, 27: 1115-1121. 10.1002/humu.20399.PubMedView ArticleGoogle Scholar
- Simon-Sanchez J, Scholz S, Fung HC, Matarin M, Hernandez D, Gibbs JR, Britton A, de Vrieze FW, Peckham E, Gwinn-Hardy K, Crawley A, Keen JC, Nash J, Borgaonkar D, Hardy J, Singleton A: Genome-wide SNP assay reveals structural genomic variation, extended homozygosity and cell-line induced alterations in normal individuals. Hum Mol Genet. 2007, 16: 1-14. 10.1093/hmg/ddl436.PubMedView ArticleGoogle Scholar
- Curtis D, Vine AE, Knight J: Study of regions of extended homozygosity provides a powerful method to explore haplotype structure of human populations. Ann Hum Genet. 2008, 72: 261-278. 10.1111/j.1469-1809.2007.00411.x.PubMedPubMed CentralView ArticleGoogle Scholar
- Auton A, Bryc K, Boyko AR, Lohmueller KE, Novembre J, Reynolds A, Indap A, Wright MH, Degenhardt JD, Gutenkunst RN, King KS, Nelson MR, Bustamante CD: Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Res. 2009, 19: 795-803. 10.1101/gr.088898.108.PubMedPubMed CentralView ArticleGoogle Scholar
- McQuillan R, Leutenegger AL, Abdel-Rahman R, Franklin CS, Pericic M, Barac-Lauc L, Smolej-Narancic N, Janicijevic B, Polasek O, Tenesa A, Macleod AK, Farrington SM, Rudan P, Hayward C, Vitart V, Rudan I, Wild SH, Dunlop MG, Wright AF, Campbell H, Wilson JF: Runs of homozygosity in European populations. Am J Hum Genet. 2008, 83: 359-372. 10.1016/j.ajhg.2008.08.007.PubMedPubMed CentralView ArticleGoogle Scholar
- Nothnagel M, Lu TT, Kayser M, Krawczak M: Genomic and geographic distribution of SNP-defined runs of homozygosity in Europeans. Hum Mol Genet. 2010, 19: 2927-2935. 10.1093/hmg/ddq198.PubMedView ArticleGoogle Scholar
- Wang H, Lin CH, Service S, Chen Y, Freimer N, Sabatti C: Linkage disequilibrium and haplotype homozygosity in population samples genotyped at a high marker density. Hum Hered. 2006, 62: 175-189. 10.1159/000096599.PubMedView ArticleGoogle Scholar
- Jakkula E, Rehnstrom K, Varilo T, Pietilainen OP, Paunio T, Pedersen NL, Defaire U, Jarvelin MR, Saharinen J, Freimer N, Ripatti S, Purcell S, Collins A, Daly MJ, Palotie A, Peltonen L: The genome-wide patterns of variation expose significant substructure in a founder population. Am J Hum Genet. 2008, 83: 787-794. 10.1016/j.ajhg.2008.11.005.PubMedPubMed CentralView ArticleGoogle Scholar
- Lencz T, Lambert C, DeRosse P, Burdick KE, Morgan TV, Kane JM, Kucherlapati R, Malhotra AK: Runs of homozygosity reveal highly penetrant recessive loci in schizophrenia. Proc Natl Acad Sci USA. 2007, 104: 19942-19947. 10.1073/pnas.0710021104.PubMedPubMed CentralView ArticleGoogle Scholar
- Nalls MA, Guerreiro RJ, Simon-Sanchez J, Bras JT, Traynor BJ, Gibbs JR, Launer L, Hardy J, Singleton AB: Extended tracts of homozygosity identify novel candidate genes associated with late-onset Alzheimer's disease. Neurogenetics. 2009, 10: 183-190. 10.1007/s10048-009-0182-4.PubMedPubMed CentralView ArticleGoogle Scholar
- Wang S, Haynes C, Barany F, Ott J: Genome-wide autozygosity mapping in human populations. Genet Epidemiol. 2009, 33: 172-180. 10.1002/gepi.20344.PubMedPubMed CentralView ArticleGoogle Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.PubMedPubMed CentralView ArticleGoogle Scholar
- R: A Language and Environment for Statistical Computing. [http://www.R-project.org]
- International HapMap Project. [http://www.hapmap.org]
- hzAnalyzer. [http://emu.src.riken.jp/hzAnalyzer]
- Durbin RM, Abecasis GR, Altshuler DL, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA: A map of human genome variation from population-scale sequencing. Nature. 2010, 467: 1061-1073. 10.1038/nature09534.PubMedView ArticleGoogle Scholar
- Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science. 2002, 296: 2225-2229. 10.1126/science.1069424.PubMedView ArticleGoogle Scholar
- Cockerham CC, Weir BS: Covariances of relatives stemming from a population undergoing mixed self and random mating. Biometrics. 1984, 40: 157-164. 10.2307/2530754.PubMedView ArticleGoogle Scholar
- Weir BS, Hill WG: Estimating F-statistics. Annu Rev Genet. 2002, 36: 721-750. 10.1146/annurev.genet.36.050802.093940.PubMedView ArticleGoogle Scholar
- Smith JM, Haigh J: The hitch-hiking effect of a favourable gene. Genet Res. 1974, 23: 23-35. 10.1017/S0016672300014634.PubMedView ArticleGoogle Scholar
- Kimura R, Fujimoto A, Tokunaga K, Ohashi J: A practical genome scan for population-specific strong selective sweeps that have reached fixation. PLoS ONE. 2007, 2: e286-10.1371/journal.pone.0000286.PubMedPubMed CentralView ArticleGoogle Scholar
- Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, Xie X, Byrne EH, McCarroll SA, Gaudet R, Schaffner SF, Lander ES, Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y, Hu H, et al: Genome-wide detection and characterization of positive selection in human populations. Nature. 2007, 449: 913-918. 10.1038/nature06250.PubMedPubMed CentralView ArticleGoogle Scholar
- Tang K, Thornton KR, Stoneking M: A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol. 2007, 5: e171-10.1371/journal.pbio.0050171.PubMedPubMed CentralView ArticleGoogle Scholar
- O'Reilly PF, Birney E, Balding DJ: Confounding between recombination and selection, and the Ped/Pop method for detecting selection. Genome Res. 2008, 18: 1304-1313.PubMedPubMed CentralView ArticleGoogle Scholar
- Voight BF, Kudaravalli S, Wen X, Pritchard JK: A map of recent positive selection in the human genome. PLoS Biol. 2006, 4: e72-10.1371/journal.pbio.0040072.PubMedPubMed CentralView ArticleGoogle Scholar
- Johnson T, Tanaka T, Kubo M, Nakamura Y, Tsunoda T: Analyzing contiguous homozygosity to quantify haplotype structure differences between case and control samples [abstract]. 60th Annual Meeting of The American Society of Human Genetics: 3 November 2010; Washington, DC. 2010, American Society of Human Genetics, program no. 2812Google Scholar
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38: 904-909. 10.1038/ng1847.PubMedView ArticleGoogle Scholar
- Guan W, Liang L, Boehnke M, Abecasis GR: Genotype-based matching to correct for population stratification in large-scale case-control genetic association studies. Genet Epidemiol. 2009, 33: 508-517. 10.1002/gepi.20403.PubMedPubMed CentralView ArticleGoogle Scholar
- Kiefer J: K-sample analogues of the Kolmogorov-Smirnov and Cramer-v. Mises tests. Ann Math Statist. 1959, 30: 420-447. 10.1214/aoms/1177706261.View ArticleGoogle Scholar
- Darling DA: The Kolmogorov-Smirnov, Cramer-von Mises tests. Ann Math Statist. 1957, 28: 823-838. 10.1214/aoms/1177706788.View ArticleGoogle Scholar
- Anderson TW: On the distribution of the two-sample Cramer-von Mises criterion. Ann Math Statist. 1962, 33: 1148-1159. 10.1214/aoms/1177704477.View ArticleGoogle Scholar
- Nelson MR, Bryc K, King KS, Indap A, Boyko AR, Novembre J, Briley LP, Maruyama Y, Waterworth DM, Waeber G, Vollenweider P, Oksenberg JR, Hauser SL, Stirnadel HA, Kooner JS, Chambers JC, Jones B, Mooser V, Bustamante CD, Roses AD, Burns DK, Ehm MG, Lai EH: The Population Reference Sample, POPRES: a resource for population, disease, and pharmacological genetics research. Am J Hum Genet. 2008, 83: 347-358. 10.1016/j.ajhg.2008.08.005.PubMedPubMed CentralView ArticleGoogle Scholar
- Tian C, Kosoy R, Lee A, Ransom M, Belmont JW, Gregersen PK, Seldin MF: Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS ONE. 2008, 3: e3862-10.1371/journal.pone.0003862.PubMedPubMed CentralView ArticleGoogle Scholar
- Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, Selmi C, Klareskog L, Pulver AE, Qi L, Gregersen PK, Seldin MF: Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 2008, 4: e4-10.1371/journal.pgen.0040004.PubMedPubMed CentralView ArticleGoogle Scholar
- Yamaguchi-Kabata Y, Nakazono K, Takahashi A, Saito S, Hosono N, Kubo M, Nakamura Y, Kamatani N: Japanese population structure, based on SNP genotypes from 7003 individuals compared to other ethnic groups: effects on population-based association studies. Am J Hum Genet. 2008, 83: 445-456. 10.1016/j.ajhg.2008.08.019.PubMedPubMed CentralView ArticleGoogle Scholar
- Genome-wide association study of 14, 000 cases of seven common diseases and 3, 000 shared controls. Nature. 2007, 447: 661-678. 10.1038/nature05911.
- Price AL, Butler J, Patterson N, Capelli C, Pascali VL, Scarnicci F, Ruiz-Linares A, Groop L, Saetta AA, Korkolopoulou P, Seligsohn U, Waliszewska A, Schirmer C, Ardlie K, Ramos A, Nemesh J, Arbeitman L, Goldstein DB, Reich D, Hirschhorn JN: Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 2008, 4: e236-10.1371/journal.pgen.0030236.PubMedPubMed CentralView ArticleGoogle Scholar
- Ross MT, Grafham DV, Coffey AJ, Scherer S, McLay K, Muzny D, Platzer M, Howell GR, Burrows C, Bird CP, Frankish A, Lovell FL, Howe KL, Ashurst JL, Fulton RS, Sudbrak R, Wen G, Jones MC, Hurles ME, Andrews TD, Scott CE, Searle S, Ramser J, Whittaker A, Deadman R, Carter NP, Hunt SE, Chen R, Cree A, Gunaratne P, et al: The DNA sequence of the human X chromosome. Nature. 2005, 434: 325-337. 10.1038/nature03440.PubMedPubMed CentralView ArticleGoogle Scholar
- McVean GT, Hurst LD: Evidence for a selectively favourable reduction in the mutation rate of the X chromosome. Nature. 1997, 386: 388-392. 10.1038/386388a0.PubMedView ArticleGoogle Scholar
- Li WH, Yi S, Makova K: Male-driven evolution. Curr Opin Genet Dev. 2002, 12: 650-656. 10.1016/S0959-437X(02)00354-4.PubMedView ArticleGoogle Scholar
- Miller W, Makova KD, Nekrutenko A, Hardison RC: Comparative genomics. Annu Rev Genomics Hum Genet. 2004, 5: 15-56. 10.1146/annurev.genom.5.061903.180057.PubMedView ArticleGoogle Scholar
- Hammer MF, Woerner AE, Mendez FL, Watkins JC, Cox MP, Wall JD: The ratio of human X chromosome to autosome diversity is positively correlated with genetic distance from genes. Nat Genet. 2010, 42: 830-831. 10.1038/ng.651.PubMedView ArticleGoogle Scholar
- Tapper W, Collins A, Gibson J, Maniatis N, Ennis S, Morton NE: A map of the human genome in linkage disequilibrium units. Proc Natl Acad Sci USA. 2005, 102: 11835-11839. 10.1073/pnas.0505262102.PubMedPubMed CentralView ArticleGoogle Scholar
- Lu J, Wu CI: Weak selection revealed by the whole-genome comparison of the X chromosome and autosomes of human and chimpanzee. Proc Natl Acad Sci USA. 2005, 102: 4063-4067. 10.1073/pnas.0500436102.PubMedPubMed CentralView ArticleGoogle Scholar
- Sabeti PC, Schaffner SF, Fry B, Lohmueller J, Varilly P, Shamovsky O, Palma A, Mikkelsen TS, Altshuler D, Lander ES: Positive natural selection in the human lineage. Science. 2006, 312: 1614-1620. 10.1126/science.1124309.PubMedView ArticleGoogle Scholar
- Heid H, Figge U, Winter S, Kuhn C, Zimbelmann R, Franke W: Novel actin-related proteins Arp-T1 and Arp-T2 as components of the cytoskeletal calyx of the mammalian sperm head. Exp Cell Res. 2002, 279: 177-187. 10.1006/excr.2002.5603.PubMedView ArticleGoogle Scholar
- Torgerson DG, Kulathinal RJ, Singh RS: Mammalian sperm proteins are rapidly evolving: evidence of positive selection in functionally diverse genes. Mol Biol Evol. 2002, 19: 1973-1980.PubMedView ArticleGoogle Scholar
- Wang S, Liu Y, Adamson CL, Valdez G, Guo W, Hsu SC: The mammalian exocyst, a complex required for exocytosis, inhibits tubulin polymerization. J Biol Chem. 2004, 279: 35958-35966. 10.1074/jbc.M313778200.PubMedView ArticleGoogle Scholar
- Brymora A, Valova VA, Larsen MR, Roufogalis BD, Robinson PJ: The brain exocyst complex interacts with RalA in a GTP-dependent manner: identification of a novel mammalian Sec3 gene and a second Sec15 gene. J Biol Chem. 2001, 276: 29792-29797. 10.1074/jbc.C100320200.PubMedView ArticleGoogle Scholar
- Baudat F, Buard J, Grey C, Fledel-Alon A, Ober C, Przeworski M, Coop G, de Massy B: PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science. 2010, 327: 836-840. 10.1126/science.1183439.PubMedPubMed CentralView ArticleGoogle Scholar
- Parvanov ED, Petkov PM, Paigen K: Prdm9 controls activation of mammalian recombination hotspots. Science. 2010, 327: 835-10.1126/science.1181495.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhong M, Lange K, Papp JC, Fan R: A powerful score test to detect positive selection in genome-wide scans. Eur J Hum Genet. 2010, 18: 1148-1159. 10.1038/ejhg.2010.60.PubMedPubMed CentralView ArticleGoogle Scholar
- Hanchard NA, Rockett KA, Spencer C, Coop G, Pinder M, Jallow M, Kimber M, McVean G, Mott R, Kwiatkowski DP: Screening for recently selected alleles by analysis of human haplotype similarity. Am J Hum Genet. 2006, 78: 153-159. 10.1086/499252.PubMedPubMed CentralView ArticleGoogle Scholar
- McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P: The fine-scale structure of recombination rate variation in the human genome. Science. 2004, 304: 581-584. 10.1126/science.1092500.PubMedView ArticleGoogle Scholar
- Reich DE, Lander ES: On the allelic spectrum of human disease. Trends Genet. 2001, 17: 502-510. 10.1016/S0168-9525(01)02410-6.PubMedView ArticleGoogle Scholar
- Pritchard JK: Are rare variants responsible for susceptibility to complex diseases?. Am J Hum Genet. 2001, 69: 124-137. 10.1086/321272.PubMedPubMed CentralView ArticleGoogle Scholar
- Browning BL, Browning SR: Haplotypic analysis of Wellcome Trust Case Control Consortium data. Hum Genet. 2008, 123: 273-280. 10.1007/s00439-008-0472-1.PubMedPubMed CentralView ArticleGoogle Scholar
- Browning SR, Browning BL: Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet. 2007, 81: 1084-1097. 10.1086/521987.PubMedPubMed CentralView ArticleGoogle Scholar
- Browning SR: Multilocus association mapping using variable-length Markov chains. Am J Hum Genet. 2006, 78: 903-913. 10.1086/503876.PubMedPubMed CentralView ArticleGoogle Scholar
- Nakamura Y: The BioBank Japan Project. Clin Adv Hematol Oncol. 2007, 5: 696-697.PubMedGoogle Scholar
- snpMatrix: The snp.matrix and X.snp.matrix classes. R package version 1.14.6. [http://www-gene.cimr.cam.ac.uk/clayton/software/]
- Wigginton JE, Cutler DJ, Abecasis GR: A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet. 2005, 76: 887-893. 10.1086/429864.PubMedPubMed CentralView ArticleGoogle Scholar
- SNP-HWE Download. [http://www.sph.umich.edu/csg/abecasis/Exact]
- PLINK. [http://pngu.mgh.harvard.edu/~purcell/plink/]
- Matise TC, Chen F, Chen W, De La Vega FM, Hansen M, He C, Hyland FC, Kennedy GC, Kong X, Murray SS, Ziegle JS, Stewart WC, Buyske S: A second-generation combined linkage physical map of the human genome. Genome Res. 2007, 17: 1783-1786. 10.1101/gr.7156307.PubMedPubMed CentralView ArticleGoogle Scholar
- The Rutgers Combined Linkage-Physical Map of The Human Genome. [http://compgen.rutgers.edu/RutgersMap/DownloadMap.aspx]
- The International HapMap Project. Nature. 2003, 426: 789-796. 10.1038/nature02168.
- Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, Kober KM, Miller W, Pedersen JS, Pohl A, Raney BJ, Rhead B, Rosenbloom KR, Smith KE, Stanke M, Thakkapallayil A, Trumbower H, Wang T, Zweig AS, Haussler D, Kent WJ: The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res. 2008, 36: D773-779. 10.1093/nar/gkm966.PubMedPubMed CentralView ArticleGoogle Scholar
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12: 996-1006.PubMedPubMed CentralView ArticleGoogle Scholar
- UCSC Genome Bioinformatics Sequence and Annotation Downloads. [http://hgdownload.cse.ucsc.edu/downloads.html]
- 1000 Genomes November 2010 Data Release ftp site. [ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/2010_11/]
- VCFtools. [http://vcftools.sourceforge.net/]
- 1000 Genomes Project pilot data paper ftp site. [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/paper_data_sets/]
- Efron B, Tibshirani R: An Introduction to the Bootstrap. 1993, Chapman and HallView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.