Whole genome sequencing of a single Bos taurusanimal for single nucleotide polymorphism discovery
- Sebastian H Eck†1,
- Anna Benet-Pagès†1,
- Krzysztof Flisikowski2,
- Thomas Meitinger1, 3,
- Ruedi Fries2 and
- Tim M Strom1, 3Email author
© Eck et al.; licensee BioMed Central Ltd. 2009
Received: 21 April 2009
Accepted: 6 August 2009
Published: 6 August 2009
The majority of the 2 million bovine single nucleotide polymorphisms (SNPs) currently available in dbSNP have been identified in a single breed, Hereford cattle, during the bovine genome project. In an attempt to evaluate the variance of a second breed, we have produced a whole genome sequence at low coverage of a single Fleckvieh bull.
We generated 24 gigabases of sequence, mainly using 36-bp paired-end reads, resulting in an average 7.4-fold sequence depth. This coverage was sufficient to identify 2.44 million SNPs, 82% of which were previously unknown, and 115,000 small indels. A comparison with the genotypes of the same animal, generated on a 50 k oligonucleotide chip, revealed a detection rate of 74% and 30% for homozygous and heterozygous SNPs, respectively. The false positive rate, as determined by comparison with genotypes determined for 196 randomly selected SNPs, was approximately 1.1%. We further determined the allele frequencies of the 196 SNPs in 48 Fleckvieh and 48 Braunvieh bulls. 95% of the SNPs were polymorphic with an average minor allele frequency of 24.5% and with 83% of the SNPs having a minor allele frequency larger than 5%.
This work provides the first single cattle genome by next-generation sequencing. The chosen approach - low to medium coverage re-sequencing - added more than 2 million novel SNPs to the currently publicly available SNP resource, providing a valuable resource for the construction of high density oligonucleotide arrays in the context of genome-wide association studies.
The bovine reference genome sequence assembly resulted from the combination of shotgun and bacterial artificial chromosome sequencing of an inbred Hereford cow and her sire using capillary sequencing. Most of the more than 2 million bovine SNPs deposited in dbSNP represent polymorphisms detected in these two Hereford animals . Recently, Van Tassell et al.  contributed more than 23,000 SNPs to the bovine SNP collection by next-generation sequencing of reduced representation libraries. The study involved 66 cattle representing different lines of a dairy breed (Holstein) and the 7 most common beef breeds (Angus, Red Angus, Charolais, Gelbvieh, Hereford, Limousin and Simmental). These SNPs together with SNPs deposited in dbSNP were used to compile arrays with up to 50,000 SNPs. The arrays have been used to implement a new approach to animal breeding, termed genomic selection [3, 4]. Although this approach has been applied successfully to predict breeding values in dairy cattle, the underlying SNP resource is far from complete. SNP selection for the Illumina BovineSNP50 array, for instance, has been optimized to provide high minor allele frequencies (MAFs) for the Holstein breed. The full extent of common SNP variation in Holstein and other breeds is still unexplored. Although the average r2 between adjacent markers of the BovineSNP50 array is greater than 0.2 - the minimal linkage disequilibrium required for genomic prediction to be sufficiently accurate - there is a considerable number of marker pairs with an r2 of zero . Since preliminary data indicate that the extent of linkage disequilibrium in cattle breeds is only slightly larger than in humans, it has been estimated that up to 300,000 SNPs will be necessary to achieve optimal marker coverage throughout the cattle genome [5–8].
Circumventing any pooling or enrichment protocols, we sequenced just a single Fleckvieh animal to identify a large number of candidate SNPs. We demonstrate that this approach represents an effective strategy towards a comprehensive resource for common SNPs.
Results and Discussion
Sequencing and alignment
SNP and indel detection
Of these SNPs, 1,694,546 (69.4%) were homozygous and 749,091 (30.6%) were heterozygous. The low proportion of heterozygous SNPs is mainly due to the relatively low sequence depth and our stringent SNP calling requirements. The rate of heterozygous SNP detection is expected to rise with increasing coverage (Additional data file 1). It has been estimated that at least 20- to 30-fold coverage is needed to detect 99% of the heterozygous variants .
Identified SNPs and small indels
We used the RefSeq (9,518 genes) and Ensembl (28,045 genes) gene sets to functionally annotate the detected variants (Table 1). Using the RefSeq genes as reference, we found 7,619 coding SNPs (3,139 leading to non-synonymous amino acid substitutions), 40 SNPs at canonical splice sites and 6,292 SNPs in untranslated regions. Additionally, 203 indels were located in coding regions, with almost all of them (201) causing a frame-shift in the corresponding gene. The remaining two indels comprise single amino acid deletions.
The Ensembl gene set is larger and includes also gene predictions. Thus, more variants are detected using this set. We identified 22,070 coding SNPs (9360 non-synonymous substitutions), 148 SNPs at donor or acceptor splice sites and 8114 SNPs in untranslated regions. Furthermore, we identified 425 indels in Ensembl annotated coding regions. Most of them (414) cause a frame-shift in the reading frame of the associated gene, 9 indels lead to single amino acid deletions and 2 were single amino acid insertions.
Comparison of sequence and array results
We assessed the accuracy and completeness of the sequence-based SNP calls by comparing them with the genotypes of the same animal generated with an Illumina BovineSNP50 array. This chip contains 54,001 SNPs, of which 48,188 map to the current assembly (bosTau4). Of those, 48,025 SNPs were successfully genotyped; 22,299 homozygous calls exhibited the reference allele, leaving 12,043 homozygous and 13,683 heterozygous SNPs that were different with respect to the reference sequence assembly. We used these 25,726 positions together with 16 positions where only the MAQ call differed from the reference sequence to examine the accuracy and sensitivity of SNP calling in more detail.
We also estimated the autosomal nucleotide diversity π taking into account that we identified only 30% of the heterozygous SNPs correctly. This led to an autosomal nucleotide diversity of approximately 9.4 × 10-4 or 1 SNP per 1,060 bp ((749,091 - 3,553)/0.30/(2.73e9 - 88,000,000) [(Heterozygous_SNPs - X_chromosomal_SNPs)/Detection_rate/(Genome_length - X_chromosome_length)]). This value is higher than the nucleotide diversity observed in humans [9, 13] but in accordance with previous estimates in Fleckvieh [14, 15]. To assess the nucleotide diversity in coding regions, we constructed a non-redundant gene set based on the Ensembl genes by merging all transcripts from the same gene into a single 'maximum coding sequence', resulting in 22,796 non-redundant genes. According to this set, the total coding sequence length for cattle is 33,235,846 bp, or 1.21% of the genome. This coding region contained 8,438 heterozygous SNPs, resulting in a nucleotide diversity of 8.5 × 10-4 or 1 SNP per 1,181 bp (8,438/0.30/(33,235,846)).
SNPs called by MAQ compared with calls by MALDI-TOF genotyping
MAQ heterozygote under-call
MALDI-TOF homozygous, MAQ heterozygous
Error rate (without heterozygote under-calls)
In an attempt to select SNPs specifically from coding regions, we selected 75 SNPs only from regions with high sequence depth (≥16) under the assumption that sensitivity and specificity should gain from higher coverage. Because only 5.8% of coding SNPs had a sequence depth of 16 or more, several SNPs were located in close proximity. Contrary to our expectation, comparison with MALDI-TOF genotypes resulted in a false-positive rate as high as 24% (18 of 75). All these SNPs were called as heterozygotes by MAQ. Of these SNPs, 11 were called as homozygotes by MALDI-TOF genotyping in all 96 investigated animals. The remaining 7 were counted as false-positives because they were called as heterozygotes by MALDI-TOF genotyping in all 96 investigated animals. These sites were also ambiguous when checked by capillary sequencing in 12 selected animals (Additional data file 4). We therefore suspected that the selection from the extreme of coverage has introduced a strong bias. The false-positive calls were most likely caused by reads that were misassembled because these regions are duplicated but only one copy is contained in the reference sequence. Checking the read depth around the false-positive SNPs, we found 3 SNPs (chr4_117247234, chr4_117247581, chr13_16920248) that were obviously located in regions of 30 and 300 kb with high average read depth, indicating a duplication of that region (Additional data file 5). In the other regions, the high read depth extended only across a short distance so that we can not exclude random noise. It was further noticeable that several of the false-positive SNPs were located near gaps or in regions with several gaps, suggesting assembly difficulties. Although we can not provide an unequivocal explanation for the high false-positive rate of SNPs in regions with high read depth, we want to point out that these errors do not compromise the overall false-positive detection rate of 1.1%. Rather, it reveals that a significant proportion of heterozygous false-positives are not caused by sequencing errors but, most likely, by erroneous alignment and that the risk for this type of error is negatively correlated with the quality and completeness of the reference sequence. This information can be used to further filter the SNP set. Discarding all SNPs with a read depth ≥16 would reduce the set by 53,259 SNPs (2.2%).
By sequencing a single diploid genome to a depth of 7.4-fold, we were able to generate more than 2 million SNPs, thereby almost doubling the existing SNP resource in cattle. We evaluated the error rates of SNP detection in detail, point out possible sources of errors and propose means for filtering error-prone SNPs. We deduced an overall false-positive detection rate of 1.1% from genotyping 196 randomly selected SNPs by an alternative technique. This value compares well with the reported false-positive detection rate of 2.5% estimated by genotyping 1,206 SNPs by a similar approach . Despite a false-negative detection rate of 49%, which is largely explained by missing heterozygous SNPs at low sequencing coverage, SNP identification was very effective. In contrast to the detection of SNPs and small indels, the identification of structural variations at a size that exceeds the individual read length was ineffective at low sequence depth. In addition to SNP discovery, this sequence of a single animal constitutes a first step towards a haplotype reconstruction of the Fleckvieh breed. The animal selected for this approach was a prominent Bavarian Fleckvieh bull. With more than 50,000 inseminations in 2008 alone, the selected animal is founder of a very large pedigree. Fleckvieh is a dual purpose breed (dairy and beef) originating from the Swiss Simmental breed. Fleckvieh cows contribute about 8% of all recorded lactations worldwide, which makes them the second largest dairy breed after Holstein. Fleckvieh, together with the Brown breed, are so called Alpine breeds that are phylogenetically distant from Holstein . The distribution of genotypes found for 196 SNPs in 48 Brown and 48 Fleckvieh animals proved our chosen strategy to be successful. We provide a comprehensive SNP list for the two main Alpine breeds Brown and Fleckvieh. For a future dense array with up to 1 million SNPs, the experiment provides SNPs that can be translated into genome-wide oligonucleotide arrays in a single-step procedure with a conversion rate of more than 80%. The chosen strategy is predicted to be applicable to complement the SNP resource in other farm animals such as swine and chicken, especially with sequencing outputs from a single experiment predicted to cross the 100 Gb threshold before the end of 2009.
Materials and methods
DNA library construction and sequencing
EDTA-blood was obtained from Fleckvieh bull Vanstein 191658 and genomic DNA was extracted according to standard protocols. DNA was sheared by nebulization with compressed nitrogen gas. We constructed 3 different paired-end libraries with median insert sizes of 75, 80 and 170 nucleotides. The libraries were sequenced on a GAII (Illumina, San Diego, Californica, USA). Sample preparation, cluster generation and sequencing were performed according to the manufacture's protocols with minor modifications (Illumina paired-end cluster generation kit GA II v1, 36-cycle sequencing kit v1).
We used the bosTau4.0 assembly as reference sequence including the scaffolds that were not anchored onto specific chromosomes. Image analysis and ELAND alignment was performed with the Pipeline software version 1.0 as provided by Illumina. Subsequently, short read alignment, consensus assembly and variant calling were performed using the re-sequencing software MAQ version 0.6.8 . For the alignment part, we used the following parameters: number of maximum mismatches that can always be found = 2; mutation rate between the reference sequence and the reads = 0.001; threshold on the sum of mismatching base qualities = 70. For the 'snpfilter' part of the MAQ software, we used the following parameters: minimum read depth = 3; maximum read depth = 256; minimum mapping quality = 40; minimum neighboring quality = 20; minimum consensus quality = 20; window size around potential indels = 3; window size for filtering dense SNPs = 10; maximum number of SNPs in a window = 2.
After SNP calling by MAQ, we applied additional filters. We required each putative SNP to have a median quality value of the variant base of at least 20 and that at least 20% of the reads covering this position must come from opposite strands. Functional analysis of the SNPs was performed with custom Perl scripts using datasets from Ensembl , the Santa Cruz Genome Browser  and the Baylor College Bovine Genome Project web pages . Ensembl and RefSeq gene annotations were used as provided by the Santa Cruz Genome Browser (October 2008). SNP locations were downloaded form the Baylor College Bovine Genome Project ftp site .
For genotyping, we selected bulls that did not have both sires and maternal grandsires in common. Genotypes were determined on a BovineSNP50 chip (Illumina). Genotyping of selected SNPs was performed with the MassARRAY system (Sequenom, San Diego, California, USA) using the iPLEX Gold chemistry. For random selection of SNPs we used a random number generator as implemented in the Perl function 'rand'. Assays were designed using AssayDesign 220.127.116.11 with iPLEX Gold default parameters and up to 25 assays were multiplexed. Genotype calling was done with SpectroTYPER 3.4 software.
Sequence data are available from the European Read Archive (ERA) [ERA:ERA000089]. SNPs have been submitted to dbSNP ([dbSNP:ss140006985] to [dbSNP:ss142339932]).
Additional data files
The following additional data are available with the online version of this paper: a table showing the number of homo- and heterozygous SNPs depending on different read depth (Additional data file 1); a figure showing empirical cumulative distribution of read depth of the SNPs selected for MALDI-TOF genotyping in comparison to the entire SNP set (Additional data file 2); a table showing genotypes, MAF and test for Hardy-Weinberg equilibrium of 196 SNPs determined with MALDI-TOF spectroscopy in 48 Fleckvieh and 48 Braunvieh bulls (Additional data file 3); a table showing the false-positive SNP calls in 75 coding SNPs with high read depth (≥16) (Additional data file 4); a figure showing the sequencing depth around false-positive MAQ calls (Additional data file 5).
small insertion/deletion event
minor allele frequency
matrix-assisted laser desorption/ionization time-of-flight
single nucleotide polymorphism.
This work was supported by grants of the German Ministry for Education and Research (BMBF 01GR0804 and 0315131E) and the Dr. Dr. h. c. Karl Eibl-Stiftung. We thank Dr J Aumann (Besamungsverein Neustadt a. d. Aisch e. V.) and Dr W Lampeter (Meggle Besamungsstation Rottmoos GmbH) for providing us with a blood sample of Vanstein 191658 and for allowing us to publish the sequence and polymorphism data. The research was conducted within the Synbreed Consortium.
- Gibbs RA, Taylor JF, Van Tassell CP, Barendse W, Eversole KA, Gill CA, Green RD, Hamernik DL, Kappes SM, Lien S, Matukumalli LK, McEwan JC, Nazareth LV, Schnabel RD, Weinstock GM, Wheeler DA, Ajmone-Marsan P, Boettcher PJ, Caetano AR, Garcia JF, Hanotte O, Mariani P, Skow LC, Sonstegard TS, Williams JL, Diallo B, Hailemariam L, Martinez ML, Morris CA, Silva LO, et al: Genome-wide survey of SNP variation uncovers the genetic structure of cattle breeds. Science. 2009, 324: 528-532. 10.1126/science.1167936.PubMedView ArticleGoogle Scholar
- Van Tassell CP, Smith TP, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC, Sonstegard TS: SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods. 2008, 5: 247-252. 10.1038/nmeth.1185.PubMedView ArticleGoogle Scholar
- Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME: Invited review: Genomic selection in dairy cattle: progress and challenges. J Dairy Sci. 2009, 92: 433-443. 10.3168/jds.2008-1646.PubMedView ArticleGoogle Scholar
- VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, Schenkel FS: Invited review: reliability of genomic predictions for North American Holstein bulls. J Dairy Sci. 2009, 92: 16-24. 10.3168/jds.2008-1514.PubMedView ArticleGoogle Scholar
- Goddard ME, Hayes BJ: Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat Rev Genet. 2009, 10: 381-391. 10.1038/nrg2575.PubMedView ArticleGoogle Scholar
- Kim ES, Kirkpatrick BW: Linkage disequilibrium in the North American Holstein population. Anim Genet. 2009, 40: 279-288. 10.1111/j.1365-2052.2008.01831.x.PubMedView ArticleGoogle Scholar
- Khatkar MS, Nicholas FW, Collins AR, Zenger KR, Cavanagh JA, Barris W, Schnabel RD, Taylor JF, Raadsma HW: Extent of genome-wide linkage disequilibrium in Australian Holstein-Friesian cattle based on a high-density SNP panel. BMC Genomics. 2008, 9: 187-10.1186/1471-2164-9-187.PubMedPubMed CentralView ArticleGoogle Scholar
- Gautier M, Faraut T, Moazami-Goudarzi K, Navratil V, Foglio M, Grohs C, Boland A, Garnier JG, Boichard D, Lathrop GM, Gut IG, Eggen A: Genetic and haplotypic structure in 14 European and African cattle breeds. Genetics. 2007, 177: 1059-1070. 10.1534/genetics.107.075804.PubMedPubMed CentralView ArticleGoogle Scholar
- Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456: 53-59. 10.1038/nature07517.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008, 18: 1851-1858. 10.1101/gr.078212.108.PubMedPubMed CentralView ArticleGoogle Scholar
- Elsik CG, Tellam RL, Worley KC, Gibbs RA, Muzny DM, Weinstock GM, Adelson DL, Eichler EE, Elnitski L, Guigo R, Hamernik DL, Kappes SM, Lewin HA, Lynn DJ, Nicholas FW, Reymond A, Rijnkels M, Skow LC, Zdobnov EM, Schook L, Womack J, Alioto T, Antonarakis SE, Astashyn A, Chapple CE, Chen HC, Chrast J, Camara F, Ermolaeva O, Henrichsen CN, et al: The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science. 2009, 324: 522-528. 10.1126/science.1169588.PubMedPubMed CentralView ArticleGoogle Scholar
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008, 36: e105-10.1093/nar/gkn425.PubMedPubMed CentralView ArticleGoogle Scholar
- Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry S, Mullikin JC, Mortimore BJ, Willey DL, Hunt SE, Cole CG, Coggill PC, Rice CM, Ning Z, Rogers J, Bentley DR, Kwok PY, Mardis ER, Yeh RT, Schultz B, Cook L, Davenport R, Dante M, Fulton L, Hillier L, Waterston RH, McPherson JD, Gilman B, Schaffner S, et al: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001, 409: 928-933. 10.1038/35057149.PubMedView ArticleGoogle Scholar
- Werner FA, Durstewitz G, Habermann FA, Thaller G, Kramer W, Kollers S, Buitkamp J, Georges M, Brem G, Mosner J, Fries R: Detection and characterization of SNPs useful for identity control and parentage testing in major European dairy breeds. Anim Genet. 2004, 35: 44-49. 10.1046/j.1365-2052.2003.01071.x.PubMedView ArticleGoogle Scholar
- Heaton MP, Grosse WM, Kappes SM, Keele JW, Chitko-McKown CG, Cundiff LV, Braun A, Little DP, Laegreid WW: Estimation of DNA sequence diversity in bovine cytokine genes. Mamm Genome. 2001, 12: 32-37. 10.1007/s003350010223.PubMedView ArticleGoogle Scholar
- Kruglyak L, Nickerson DA: Variation is the spice of life. Nat Genet. 2001, 27: 234-236. 10.1038/85776.PubMedView ArticleGoogle Scholar
- Medjugorac I, Kustermann W, Lazar P, Russ I, Pirchner F: Marker-derived phylogeny of European cattle supports demic expansion of agriculture. Anim Genet. 1994, 25 (Suppl 1): 19-27.PubMedView ArticleGoogle Scholar
- Ensembl Genome Browser. [http://www.ensembl.org]
- UCSC Genome Browser. [http://genome.ucsc.edu]
- Baylor College Bovine Genome Project. [http://www.hgsc.bcm.tmc.edu/projects/bovine]
- Baylor College Bovine Genome Project ftp site. [ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Btaurus/snp/Btau20070913/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.