The first Irish genome and ways of improving sequence accuracy
© BioMed Central Ltd 2010
Published: 7 September 2010
Whole-genome sequencing of an Irish person reveals hundreds of thousands of novel genomic variants. Imputation using previously known information improves the accuracy of low-read-depth sequencing.
See research article: http://genomebiology.com/2010/11/9/R91
In the past 10 years, numerous human genomic variants have been discovered and catalogued, mostly through the efforts of the International HapMap Project and personal genome studies. Information on human genomic variants may serve as a valuable resource for developing personalized medicine because some of these variants could potentially predispose humans to complex diseases. The 2009 version (version 130) of the dbSNP database included approximately 13.9 million (13.9 M) single nucleotide polymorphisms (SNPs) and 4.5 M small insertions and deletions (indels). However, many issues need to be addressed before personalized medicine becomes a reality. These include an understanding of: the kind and number of variants that exist in the entire human genome; the number of populations and individuals needed to detect most, if not all, human genomic variants with efficiency and accuracy; the frequency of common and rare variants in an individual genome; and finally the number of variants that influence human diseases.
Tong and colleagues generated 440 M short reads from the Irish genome and obtained 11X sequencing coverage genome-wide. Despite the lower read-depth compared with other personal genomes (Figure 1), they discovered more than 3 M SNPs. Approximately 13% of these SNPs (0.4 M, approximately 3% of the total number of SNPs catalogued in dbSNP version 130) may be designated as new variants, as they were not previously deposited in the SNP database. They also found more than 20,000 potentially disease-related new SNPs. For example, they identified a new non-synonymous SNP in the macrophage-stimulating 1 (MST1) gene, which may have a functional role in inflammatory bowel disease. In addition, the authors detected about 200,000 short indel polymorphisms, half of which have not been reported before. Their results clearly suggest that the human genome still harbors a tremendous number of undetected and often population-specific variants, and they provide justification for more personal genome sequencing studies from worldwide populations.
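The proportions quoted above can be checked with a few lines of arithmetic. The sketch below assumes exactly 3.0 M SNPs, although the text says only "more than 3 M", so the results are approximate by construction. The variable names are illustrative.

```python
# Rough consistency check of the SNP counts quoted in the text.
# Assumption: "more than 3 M SNPs" is treated as exactly 3.0 M.
total_snps = 3.0e6        # SNPs found in the Irish genome (approximate)
novel_fraction = 0.13     # share not previously deposited in dbSNP
dbsnp_130_snps = 13.9e6   # SNPs catalogued in dbSNP version 130

novel_snps = total_snps * novel_fraction
share_of_dbsnp = novel_snps / dbsnp_130_snps

print(f"novel SNPs: about {novel_snps / 1e6:.1f} M")
print(f"relative to dbSNP 130: about {share_of_dbsnp:.0%}")
```

Under this assumption the figures round to the 0.4 M and 3% given in the text.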
Despite these interesting results from the Irish genome analysis, its low read-depth of sequencing coverage (11X) must be examined in some detail. With the exception of the first two personal genomes, which were sequenced using relatively longer reads, most of the other human whole-genome analyses were carried out using more than 20X sequencing coverage. Low coverage may dramatically reduce the accuracy of genome sequencing because it risks misclassification of heterozygous variants as wild type (missing the variant; this is called undercalling) or misclassification of heterozygous variants as homozygous ones (missing the wild-type allele; overcalling). Consequently, in low-depth sequencing, both the sensitivity of detection and the positive predictive value of genomic variants are compromised.
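To illustrate why depth matters for heterozygous sites, the following is a minimal sketch of a simple binomial model. It is not the variant-calling method used in the study. It assumes a fixed read depth at the site, no sequencing errors, equal sampling of the two alleles, and a hypothetical caller that requires at least `min_reads` reads supporting an allele before reporting it. The function name and the threshold are illustrative assumptions.

```python
from math import comb

def het_miscall_probs(depth, min_reads=2):
    """Return (undercall, overcall) probabilities at a true heterozygous site.

    Toy model (assumptions): each read shows the variant allele with
    probability 0.5, there are no sequencing errors, and an allele is
    reported only if at least `min_reads` reads support it.
    """
    undercall = 0.0  # variant allele missed: site called wild type
    overcall = 0.0   # wild-type allele missed: site called homozygous variant
    for alt in range(depth + 1):
        ref = depth - alt
        prob = comb(depth, alt) * 0.5 ** depth  # binomial probability
        if alt < min_reads:
            undercall += prob
        elif ref < min_reads:
            overcall += prob
    return undercall, overcall

for depth in (5, 11, 20, 30):
    under, over = het_miscall_probs(depth)
    print(f"{depth:>2}X  undercall={under:.2e}  overcall={over:.2e}")
```

In practice the depth at a site varies around the genome-wide mean and reads contain errors, so real miscall rates differ from this idealized model. The sketch only shows how quickly both error types shrink as depth increases.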
We could also consider the relative merits of personal genome sequencing from another perspective. Do we want all personal genome sequencing to exceed 99.9% accuracy? If personal genome data are not used for diagnostic purposes, why should we invest a lot of resources, time and effort in doing additional 10X to 20X sequencing to boost the accuracy from 99% to 99.9%? With limited resources, precise estimation of an individual's genetic variation is in direct conflict with analyzing as many individual genomes as possible to obtain a broader picture of the genomic architecture of a given population. For instance, if one is not interested in understanding the detailed genomic architecture of a specific person, but only in gaining a broader understanding of the genomic characteristics of an ethnic group, then it would be more prudent to sequence many individuals at lower depth than a limited number at high depth. One of the attractive features of the study by Tong and colleagues is that they have suggested ways to improve the precision of low-coverage sequencing without investing additional resources. The authors have demonstrated that the accuracy of genotype calls at known SNPs in low-depth sequencing can be dramatically improved by integrating the previously known genotype or haplotype data assembled for European populations by the HapMap and 1000 Genomes projects into low-coverage sequencing projects. They have shown that, with imputation based on these reference datasets, over 99% accuracy can be achieved with only 5X sequencing coverage. What is even more interesting is that just 2X sequencing can provide genotype calls with over 95% accuracy.
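The trade-off between depth and number of individuals can be made concrete with a small calculation. The sketch below assumes a hypothetical fixed budget expressed as total genome coverage (`TOTAL_BUDGET_X`, an illustrative number that does not come from the study) and uses only the lower-bound accuracies quoted above for imputation-assisted calls at known SNPs. Depths without a reported figure are left blank rather than guessed.

```python
# Hypothetical budget: total sequencing expressed in "X" of genome coverage.
# This number is an assumption chosen for illustration only.
TOTAL_BUDGET_X = 300

# Lower bounds on genotype accuracy with imputation, as quoted in the text.
REPORTED_MIN_ACCURACY = {2: 0.95, 5: 0.99}

def individuals_for_budget(total_x, depth_x):
    """Number of genomes that can be sequenced at depth_x within total_x."""
    return total_x // depth_x

for depth in (2, 5, 11, 30):
    n = individuals_for_budget(TOTAL_BUDGET_X, depth)
    acc = REPORTED_MIN_ACCURACY.get(depth)
    note = f"accuracy > {acc:.0%} with imputation" if acc else "no figure quoted"
    print(f"{depth:>2}X per genome -> {n:>3} individuals ({note})")
```

Under these assumptions, the same budget covers six times as many individuals at 5X as at 30X, which is the population-level argument made in the paragraph above.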
These tantalizing observations suggest that even low-depth sequencing can be effective when detailed prior information on related genomes is available. In addition, with accurate genomic data on Irish genomes, the power of imputation methods could be even greater, and the same would hold for other populations. These predictions further emphasize the need for additional personal whole-genome sequencing of a large number of individuals from diverse ethnic groups.
In the past, many investigators have reported signatures of selection in the human genome. Tong and colleagues have used an interesting approach to study positive selection in the human genome using the Irish genome and the available sequence data on nine personal genomes from previous studies. Despite the small sample size and the varied sequencing methods used in previous studies, this attempt can be considered an initial step toward developing an 'official' whole-genome population genetics study. Thus, this study may give a taste of future insights into population genetics research and of some of the challenges specific to whole human genome data. This study has shown evidence for balancing selection at sites related to olfactory and taste receptors, mostly confirming previous results from genome-wide SNP studies. Also, their analysis of ten genomes reveals elevated positive selection in fairly recently duplicated genes. Taken together, these results clearly show that whole-genome analysis can shed new light on the field of human evolutionary genetics.
Some population statistics based on haplotype patterns would benefit from complete sequence data, since relatively accurate haplotype phase can be inferred. Recently, Higasa and colleagues showed that errors in population-based haplotype inference affected the results of some statistics for positive selection more than others. Currently available haplotype inference software may have limitations depending on the data size and the availability of external information. Development of accurate haplotyping and haplotype inference methods suitable for genome sequence data will be key to successful population genetics studies using haplotype information.
Ten years have passed since the first drafts of human genome sequences were published, and we now have at least 15 individual whole-genome sequences, thanks to the dramatic progress in sequencing technology. However, there remain many unsolved questions on human genome diversity. To expand our understanding, we need more personal genome data from worldwide populations. With limited resources, quality (accuracy) and quantity (number of individuals) are always difficult to balance. The study by Tong and colleagues is the first attempt to tackle this question. Their approach will likely be improved on in the near future as other researchers see its potential. Nevertheless, the method and the findings will help to open new prospects in the field of human genome research.
We thank DR Govindaraju for his valuable advice.