Genome-wide patterns of genetic variation in sweet and grain sorghum (Sorghum bicolor)

Background Sorghum (Sorghum bicolor) is globally produced as a source of food, feed, fiber and fuel. Grain and sweet sorghums differ in a number of important traits, including stem sugar and juice accumulation, plant height as well as grain and biomass production. The first whole genome sequence of a grain sorghum is available, but additional genome sequences are required to study genome-wide and intraspecific variation for dissecting the genetic basis of these important traits and for tailor-designed breeding of this important C4 crop. Results We resequenced two sweet and one grain sorghum inbred lines, and identified a set of nearly 1,500 genes differentiating sweet and grain sorghum. These genes fall into ten major metabolic pathways involved in sugar and starch metabolisms, lignin and coumarin biosynthesis, nucleic acid metabolism, stress responses and DNA damage repair. In addition, we uncovered 1,057,018 SNPs, 99,948 indels of 1 to 10 bp in length and 16,487 presence/absence variations as well as 17,111 copy number variations. The majority of the large-effect SNPs, indels and presence/absence variations resided in the genes containing leucine rich repeats, PPR repeats and disease resistance R genes possessing diverse biological functions or under diversifying selection, but were absent in genes that are essential for life. Conclusions This is a first report of the identification of genome-wide patterns of genetic variation in sorghum. High-density SNP and indel markers reported here will be a valuable resource for future gene-phenotype studies and the molecular breeding of this important crop and related species.


Background
Sorghum (Sorghum bicolor) originated from Africa and is a pro-poor multipurpose crop providing food, feed, fiber and fuel across a range of agro-ecosystems, especially in those with fragile conditions. Food and Agriculture Organization data show that sorghum is currently the number five most important grain crop and, in the past decade, its yearly production has been stabilized at 60 million tonnes with a harvesting area of 44 million hectares. Sorghum is known as 'the camel amongst crops' and requires much less water than many other cereals and has a remarkable ability to produce a crop under low levels of inputs and adverse stress conditions. Sweet sorghum is a natural variant of common grain sorghum with high stem sugar content and often considered a smart crop because it can produce both food and fuel. As a C 4 crop with a high level of directly fermentable stem sugars and the ability to produce high biomass under adverse conditions, sweet sorghum is considered an ideal biofuel crop for the first and second generation bioethanol production, particularly having the advantages of exploitation of marginal land and avoiding competing for land for food crops [1][2][3]. However, the genetic basis for these remarkable traits of sweet sorghum is poorly understood.
Genetic variation consists of sequence variation and structure alteration. Sequence variation normally is manifested by SNPs, short sequence insertions and deletions (indels), microsatellites or simple sequence repeats, and transposable elements. The importance of SNPs and indels was initially realized by the occurrence of human sickle-cell anemia and cystic fibrosis diseases, the dramatic consequences caused by a nucleotide change in the hemoglobin beta gene [4] and a three-base deletion in the gene encoding a cystic fibrosis transmembrane conductance regulator [5][6][7], respectively. Structural alteration is generally described as presence/absence variations (PAVs) and copy number variations (CNVs), which include large scale deletions, insertions, duplications, inversions and translocations. An effect of CNV on phenotypic variation was documented 75 years ago in Drosophila melanogaster, with the Bar eye phenotype being caused by the Bar gene duplication [8,9]. In plants, sequence polymorphisms have gained much interest in the academic and breeding communities [10][11][12]. In several model and crop plants, including Arabidopsis, rice and maize, whole genome SNPs and indels have been developed [13][14][15][16] for a broad range of functional and evolutionary studies, including association mapping [17,18], genetic diversity [19,20], domestication, and genome evolution [21][22][23]. The effects of CNVs in plant genomes have only been reported in a few cases. In Arabidopsis and rice, array-comparative genome hybridization has been used to examine singlefeature polymorphisms [24,25], genomic lesions caused by mutagenesis [26], as well as natural variation [27]. In maize, structural alterations have been reported to violate the intraspecific genome co-linearity [28,29] and contribute to the diversity of a range of important traits, such as heterosis and disease responses [22,30,31]. CNVs also shape the genome diversity of progeny of the immediate next generations in Arabidopsis [32]. Nonetheless, CNVs and their importance in plant genome and phenotypic variation are still far from well explored.
S. bicolor has three subspecies, namely arundinaceum, bicolor and drummondii, and the cultivated sorghums are all from bicolor, which has five local races, bicolor, caudatum, durra, guinea and kafir [33]. Although sweet sorghum differs phenotypically from grain sorghum and tends to have a sugar-rich juicy stem, taller plant, higher biomass but less grain production [1,34], how sweet sorghum differs genetically from grain sorghum is not well defined [35,36]. Sweet sorghum was found in several local races of bicolor subspecies [37], which raises questions about the origin, selection and genetic and genomic basis of sweet sorghum. To address these questions, knowledge about the genome-wide genetic variation between sweet and grain sorghum is required. Such knowledge will also be useful for genetic improvement and tailor-designed breeding of this important crop [38,39]. The availability of the first whole genome sequences for a grain sorghum, BTx623 [40], has provided a template for genome-wide analysis of genetic variation. However, without additional genomes in the same species it is difficult to access hidden genome variation information. We took a next generation sequencing technology and resequenced two sweet and one grain sorghum genomes to identify patterns of sequence polymorphism and structural variation in comparison with the published BTx623 genome. This effort identified a large quantity of SNPs, indels, PAVs and CNVs in sorghum. Comparison of these variation data defined potential genome regions and metabolic pathways associated with sweet-and biofuel-associated traits. The large genome resources provided here are useful for comparative genomics and crop breeding in sorghum and related species.

Results
The morphological and physiological characteristics of sorghum lines used for resequencing Sorghum (S. bicolor) accessions Keller, E-Tian, Ji2731 and the reference accession BTx623 were used for this work. Keller is an American-bred elite sweet sorghum line and has been shown to have good performance across a range of environmental conditions [41]. E-Tian (literally meaning Russian Sweet in Chinese) was a sweet sorghum line introduced to China in the early 1970s, while Ji2731 is a representative Chinese kaoliang grain sorghum well adapted to the northeast part of China with good seedling establishment and a short growth period.
These sorghum lines differ in a number of agronomic and biofuel-associated traits ( Table 1). As expected, the two sweet sorghum lines (Keller and E-Tian) had taller  [42]. In total, we achieved an effective depth of ×36.51 coverage, with an average of ×12.17 for each line (Table S1 in Additional file 1).
With these reads and the information from the reference genome BTx623, including physical sequence alignment and gene models, we identified large quantities of SNPs, indels and PAVs ( Figure 1). In total, 1,057,018 SNPs among these sorghum genomes, of which 83,262 SNPs were located in the coding regions, were identified (Table S2 in Additional file 1; Additional file 2). SOAPsnp [43] allows the detection of heterozygosity of SNPs and the results showed that the number of heterozygous SNP sites is less than 25% of all SNP sites over the whole genome or in coding regions of the sorghum genome (Table S3 in Additional file 1). The proportions of genic SNPs identified as coding, intronic, or UTR were 42.3%, 50.2%, and 7.5%, respectively. We also identified 99,948 indels ranging from 1 10 bp in length, of which 2,230 were in coding regions (Table S4 in Additional file 1). The proportions of genic indels identified as coding, intronic, or UTR were 9.7%, 75.7%, and 14.6%, respectively. Moreover, 16,487 PAVs with an average length of 2,394 bp were identified (Table S5 in Additional file 1). Coding regions of 1,416 genes in sorghum genomes were included in these PAVs (Table S6 in Additional file 1). CNV was detected by using read depth of coverage [44]. A total of 17,111 CNVs, including 13,427 gains and 3,684 losses ranging from 2 kb to 48 Mb, were detected (5,994 for Ji2713, 3,603 for Keller and 7,514 for E-Tian).
Sanger sequencing technology was used for targeted gene verification. Primer sequences spanning genomic regions predicted to contain genetic variation were used to amplify genomic DNA templates from the three sorghum lines. In this manner, 215 SNPs in 30 genes were selected and 213 were verified using this method, suggesting a prediction accuracy of over 99% (Additional file 3). Similar accuracy was obtained with 48 indels and 9 CNVs.
Because some newly identified genes might exist beyond the currently assembled BTx623 sorghum genome, we assembled unmapped reads with SOAPdenovo and obtained contigs with a total sequence length of 7.2 Mb. Annotation of these contigs showed 73 putative absent genes with an average length of 409 bp (only coding regions were considered; Table S8 in Additional file 1). A Blast search against Arabidopsis, rice and maize genome databases revealed that 33 of these genes showed homology with known proteins (E-value < 1e -6 ).

SNP annotations and large-effect SNPs
SNPs are small differences but with great impact on the variation of genomes and the biological traits. We therefore looked into the SNP annotations in detail and paid special attention to those in genic regions. For this purpose, the newly sequenced grain sorghum BTx623 genome was used as a reference [40]. Bearing in mind that all genome annotations, including that of sorghum, are imperfect and many factors affect the analysis of effects of SNPs -especially the presence of abundant transposon elements in the sorghum genome, which can be difficult to detect when they are present in low copy numbers, are even expressed, or contain fragments of 'real' genes -we analyzed the effects of SNPs using four different gene categories: bona fide genes, low-confidence genes, pseudogenes and transposons. We retrieved gene models of the Btx623 genome from the Phytozome database [45], and verified the gene identities using EST information from PlantGDB and maize gene orthologues from MaizeGDB [46]. In the end, the gene sets included 27,640 bona fide genes, 5,197 low-confidence genes, 932 transposons and 727 pseudogenes, respectively. These genes fall into 2,637 Pfam families (Additional file 4).
As shown in Table 2 the non-synonymous-to-synonymous ratios in the bona fide gene categories were the smallest, increasing from pseudogenes to transposons to low-confidence genes. It was also found that the bona fide Pfam-containing genes had a smaller ratio than those of the low-confidence genes and transposons (Table 3). Clearly, the presence of genes involved in transposon functions and transposases have strong effects on increasing the frequencies of SNPs in the genome and hence increasing the diversity of the genomes. Figure 1 Genome-wide landscape of genetic variation in Sorghum bicolor. Gene density of chromosomes is visualized by line darkness; the more genes on a chromosome region, the darker the color. The purple and blue colors in the CNV ring represent gain and loss of copy number variation, respectively. For PAVs, the green color stands for the absence of variation, whereas pink for the presence of variation. When the non-synonymous-to-synonymous ratios in the corresponding gene categories were compared, it was found that, overall, three out of the four gene categories displayed higher ratios in the coding regions of the genome, except for transposons (Tables 2 and 3), indicating that the Pfam domains possibly have fewer amino acid substitutions. A similar drop in the ratios was also reported for the rice genome [20], but the decrease in our sorghum genomes was smaller, which might be related to the imperfection of the genome annotations and the abundant presence of transposon elements. We further analyzed the distribution of the SNPs in Pfam-containing genes in detail Figure 2. shows the number of non-synonymous and synonymous SNPs in individual Pfam gene families. The number of genes in the four gene categories is also indicated. Nearly half of the SNPs were found in leucine-rich repeats and genes encoding pentatricopeptide repeats (PPRs; Additional file 5), and another 20% of the total SNPs were from 10 Pfam family genes, including genes encoding protein kinases, protein tyrosine kinases, tetratricopeptide repeats, WD domain repeats, zinc knuckles, NB-ARC domains, ankyrin repeats, HEAT repeats, F-boxes, and armadillo/beta-catenin-like repeats. These Pfam family genes often have non-synonymous-to-synonymous ratios higher than 1, making the overall genome ratios high, and when these genes and the transposons were removed for the calculation, the ratio was reduced to close to 1 (data not shown). The finding that sequences encoding leucine-rich repeats and NB-ARC domains had higher ratios of non-synonymous to synonymous SNPs was consistent with findings in Arabidopsis, rice and maize, indicative of the diversification of plant disease-resistance proteins caused by pathogen pressure [19,20,22], whereas genes coding for ubiquitin, elongation factor Tu domain 2 and GTP binding domain proteins, all of which have important biological functions essential for life, had the lowest non-synonymous to synonymous ratios. Furthermore, we found that X8 domain and glycosyl hydrolase family 17 containing genes were amongst the families with the highest nonsynonymous to synonymous ratios. As controls, the low-confidence genes, transposons and pseudogenes containing various Pfam domains were also included in the analysis. These genes tended to have more non-synonymous SNPs than synonymous SNPs, as represented by genes encoding BED zinc fingers, hAT family dimerization domains, and DUF domains, and MuDR family transposases. As these genes are not normally functional, caution should be exercised when interpreting the gene families with exceptionally high non-synonymous to synonymous ratios, which might not be bona fide genes, although evaluation of sorghum Pfam genes using the PlantGDB database showed that 71% of them had EST support.
We went further to analyze the distribution of socalled large-effect SNPs, which are predicted to have a potentially disabling effect on gene function. It was found that 1,664 SNPs were expected to induce premature stop codons, 65 to alter initiation methionine residues, 512 to disrupt splicing donor or acceptor sites, and an additional 16 to remove the annotated stop codons, resulting in longer open reading frames ( Figure  3b). These large-effect SNPs are statistically significantly (P-value < 0.01) enriched in 14 Pfam families and depleted in 9 Pfam families ( Figure 3a). However, largeeffect SNPs were mostly enriched in transposase genes, such as those encoding MuDR family transposases, Transposase family tnp2 and retrotransposon gag proteins, or genes affected by transposon elements, such as hAT family dimerization domain and DUF domain genes. As these genes do not appear to be functional, we need to experimentally verify the importance of such enrichment. In contrast, those families devoid of largeeffect SNPs, including ABC transporter and methyltransferase domain genes, are important for organism survival.
Taken together, we have identified large sets of SNPs, some of which are useful for further downstream functional genomics analyses if the SNPs reside in bona fide genes (SNPs residing in non-functional genes or transposon elements can also be useful as molecular markers). However, these results should be viewed with caution at this stage. As the identification of large-effect SNPs depends on the annotation of gene models, the exact number and spectrum of such SNPs will probably be modified when the genome annotation is updated. Furthermore, the sorghum lines used for genome resequencing differ in many traits, and the results may be biased towards having more large-effect SNPs. It would be interesting to include sorghum lines that closely phenotypically resemble each other for comparison. Finally, we analyzed only a limited number of sorghum lines; a representative collection of sorghum lines is required to justify the results obtained.

Effects of indels and presence/absence variations
We examined the genome-wide patterns of the 1-to 10bp indels. With increasing indel size, the number of indels decreased. However, our result show that indels that are not multiples of 3 bp and produce frameshift mutations are particularly uncommon in coding regions but relatively common in non-coding regions ( Figure  4a). We also found that genes with multiples of 3-bp indels were more commonly present in the genome than those with indels of other lengths (Figure 4b).
We further analyzed the Pfam domains affected by indels. Similar to the situation of non-synonymous SNPs, indels were statistically significantly (P-value < 0.001) enriched in NB-ARC and leucine rich repeat domain genes (Table S11 in Additional file 1). Indels were also found enriched in F-box, protein kinase and tyrosine kinase gene families, which are known to possess diverse functions and are suspected to follow a rapid birth-death cycle [47][48][49]. Although in theory the effects of frame-shifting (1-, 2-, 4-, 5-, 7-, 8-and 10-bp) indels are different from those of non-frame-shift (3-, 6and 9-bp) ones, we found that, in sorghum, the gene families affected by them were very similar (Additional file 6). PAVs were highly enriched (P-value < 0.001) in nine gene families, some of which, such as NB-ARC  domain genes, are the same as those enriched in SNPs and indels (Additional file 6).

Genes with copy number variations
The annotation showed that 2,600 genes had 3,234 CNVs, and 32 genes had CNVs in all three sorghum lines. Some of these 32 genes encode proteins involved in basic biological functions, such as RNA polymerase beta subunit (Sb04g009441, Sb04g009491, Sb02g017833), NADH dehydrogenase subunit 6 (Sb10g008595), ribosomal protein S7 (Sb05g020390) and ribosomal protein S18 (Sb02g032062). One gene, Sb04g035450, was found to be lost in Keller and E-Tian, but gained extra copies in Ji2731. A blast revealed that it is a homologue of the glutamate-gated kainate-type ion channel receptor subunit gene GluR5. Gene family enrichment analysis showed that CNVs were statistically significantly (Pvalue < 0.001) enriched in genes encoding cellulose synthases, pectinesterases, GRAS transcription factors, and BTB/POZ and auxin responsive proteins, in addition to the DUF, leucine-rich repeat and the zinc knuckle proteins (Additional file 2).

Genetic variation between sweet and grain sorghum
We speculated that some of the identified genetic variation might contribute to the phenotypic differentiation of sweet-and biofuel-associated traits and focused our analysis on SNPs, indels and PAVs in genic regions. For this, we used the gene set of the reference BTx623 genome as the control and identified all the shared variation within the two sweet sorghum lines and the variation between the reference genome and the Chinese local grain sorghum, respectively. Subsequently, the two sets of data were compared to remove those genes that have large-effect SNPs in the Chinese local grain sorghum, and the remaining gene set considered as those differentiating sweet and grain sorghum. We selected SNPs that are non-synonymous in the two sweet sorghum lines Keller and E-Tian but synonymous or even not present in the grain sorghum line Ji2731 (the reference BTx623 is a grain sorghum line). Similarly, for indels and PAVs, we selected those that were identified in both the Keller and E-Tian lines but not in Ji2731 and a gene was considered the putative differentiating gene when one indel was mapped to the coding region of the gene or the coding region of the gene was affected by a PAV. Figure 5 shows the chromosomal locations of these genes, most of which are scattered in the vicinity of subtelomeric regions, and obvious clusters of sweet-associated genes were found in genomic regions on chromosomes 4, 6 and 9. The selected SNP positions were mapped to 1,266 genes in this manner. A pathway enrichment analysis was performed for these genes, and as a result, ten Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, including starch and sucrose metabolism pathways and the lignin-and coumarinebiosynthesis associated phenylpropanoid biosynthesis pathways, were identified as statistically significantly (Pvalue < 0.1) enriched in the sweet-related gene sets (Table 4). In addition, 123 genes selected by indels and 53 genes selected by PAVs were identified. Among them, four genes were found to be in important gene families in previous studies. Sb10g024663 and Sb03g032210 were reported to be members of the P450 gene family, Sb09g017540 is a DREB transcription factor, and Sb01g034050 is in the expansin gene family. Owing to the limited grain and sweet sorghum genomes analyzed and the lack of a de novo-assembled complete sweet sorghum genome, the gene set reported here may not represent the whole spectrum of the genes differentiating sweet and grain sorghum and further in-depth study is required.

Discussion
Rapid development of sequencing technologies and bioinformatic tools makes the complete genome sequencing of many species possible, which provides a starting point to unravel the tremendous genetic variation and diversity at the genome scale. Amongst several model organisms examined to date, such as human, mouse, Arabidopsis, rice, and maize, genome-wide patterns of genetic variation are able to be captured by sampling a relatively small number of genomes [14,20,[50][51][52]. By resequencing two sweet and one grain sorghum inbred lines, we uncovered nearly two million SNPs and indels, along with large numbers of PAVs and CNVs. This is a first report on the genome-wide patterns of genetic variation in sorghum, which will be valuable for further genotype-phenotype studies and for molecular breeding of this important C 4 model crop.
Our study shows that the proportions of genic SNPs identified as in coding regions, intronic regions, or UTRs are 42.3%, 50.2%, and 7.5%, respectively. Compared to Arabidopsis [19] and rice [14,20], the intronic Figure 5 Chromosomal locations of genes differentiating sweet and grain sorghum. Genes with large-effect SNPs, indels and PAVs in the two sweet sorghum lines but devoid of these genetic variations in the grain sorghum were identified, considered as the sweet-associated genes, and mapped to the sorghum genome (see text for details). A 1-Mbp sliding window was used to define sweet-related regions on individual chromosomes, and only those windows containing more than three sweet-associated genes are shown. The overall gene distribution in the sorghum genome is shown by the grey bars as the background of every chromosome. results also demonstrate that, in sorghum, the proportions of large-effect SNPs resulting in premature stop codons, alteration of initiation methionine residues and disruption of splicing donor or acceptor sites are remarkably similar to what have been reported so far in Arabidopsis [19] and maize [22], but different from rice [20]. Furthermore, we found that 16 SNPs removed annotated stop codons and resulted in longer open reading frames, which is substantially smaller than the number (1,087) in maize.
It is known that transposon elements are abundant in sorghum as well as other cereal genomes [40,53]. As the genome annotation is not perfect, caution should be exercised with regard to the analysis of the effects of SNPs. Indeed, we found that the transposase genes, pseudogenes and low-confidence genes tended to have high non-synonymous-to-synonymous ratios in comparison with bona fide genes. This was reflected in the Pfam SNP annotations as well as in the analysis of socalled large-effect SNPs, which are predicted to disable gene functions. Most of the SNPs resided in receptorlike kinases, PPR repeats, disease resistant NB-ARC genes and other genes with multiple effects on stress responses. These genes also exhibited high non-synonymous-to-synonymous ratios, further supporting the notion from studies in other species that an arms race between plant-pathogen interactions results in diversification of the pathogen-or microbe-associated molecular pattern recognition receptors in plant genomes [54,55]. Significantly, the highest non-synonymous substitution ratios were found in X8 domain and glycoside hydrolase family 17 (glucan endo-1,3-beta-glucosidase) genes, which has not been reported in Arabidopsis [19], rice [20] or maize [22]. Current annotations show that limited low-confidence genes were included in these two Pfam gene families, although we cannot rule out the possibility that these genes are pseudogenes, or truncated because of the transposon elements. Further studies are required to validate whether they are related to specific biological processes in sorghum. However, the function of these genes in carbohydrate binding as well as in cell wall biosynthesis certainly provides clues to manipulating genes of interest for biofuel production.
In sorghum, the 14 gene families enriched with largeeffect SNPs comprise genes encoding DUF proteins with unknown functions or include transponsons, which appear to be nonfunctional but may affect genetic variation at the genome scale. Furthermore, gene families involved in biotic and abiotic stress tolerance, which do not contain transposons, also harbor enriched largeeffect SNPs. For instance, over-expression of lecithin: cholesterol acyltransferase can increase lipid metabolism and the fluidity of membranes and hence the resistance to heat and/or cold shock (United States Patent Application 20050150007), whereas chalcone synthase in flavonoid biosynthesis and stilbene synthases for phytoalexin biosynthesis play important roles in sorghum disease resistance [56,57]. None of these gene families were reported to be enriched with large-effect SNPs in Arabidopsis, rice or maize. This could be due to genome/species-specific diversity, or result from the prediction algorithms used. Alternatively, this may also be related to the limited sorghum lines used, which have diverse relationships.
This effort also uncovered substantial numbers of indels and PAVs in the sorghum genomes. Indels that are not multiples of 3 bp were particularly uncommon in coding regions but relatively common in non-coding regions. This implies that most frameshift mutations are harmful to sorghum survival. The spectrum of gene families affected by indels and PAVs was similar to that of large-effect SNPs. This implies that although the origins and scales of affected genome segments may differ, SNPs, indels and PAVs may share similar survival and distribution patterns, at least in terms of gene families affected. CNV studies in plants lag behind those in animal and human models. Recent studies in maize showed its potential contribution to the heterosis of this crop during domestication and disease responses [22,30,31]. CNVs also shaped the genome diversity of progeny of the immediate next generations in Arabidopsis [32]. In the sorghum genomes, CNVs were present in several thousand genes, and some of the commonly involved genes are involved in basic biological functions as well as sugar-and bioenergy-associated traits. How this variation is associated with phenotypic variation is a new direction of future research.
The resequenced sorghum lines contained two elite sweet sorghum lines and one local elite Chinese grain sorghum line. We were able to identify genetic variation in 1,442 genes differentiating sweet and grain sorghum. Some of these genes are involved in the starch and sucrose metabolism pathway and the lignin-and coumarine-biosynthesis-associated phenylpropanoid biosynthesis pathway, which are obvious candidates for sugar and biofuel production and deserve further study. Five genes in the starch and sucrose metabolism pathway were identified and are located on chromosomes 2, 6 and 9. In the phenylpropanoid biosynthesis pathway the cinnamyl-alcohol dehydrogenase gene (Sb06g028240, encoding EC 1.1.1.195) on chromosome 6 plays a central role in lignin biosynthesis. Previous genetic analyses have identified several quantitative trait loci controlling stem Brix content, grain yields, plant height and biomass on the same chromosomes [35,36,41,58]. However, due to the lack of the links between the genome physical map and the genetic linkage maps, it is hard to judge whether these genes and quantitative trait loci co-localize and further genetics and functional genomics studies are required to characterize the functions of these genes. Some of these gene families and pathways, may not be directly associated with sugar and biofuel traits, but rather reflect variation inherited from their different origins and/or caused by breeding selection. It is known that sweet sorghums are of polyphyletic origin, spreading from the kafir, caudatum, bicolor and other grain sorghum types [37,41]. Furthermore, using the BTx623 genome as a reference, the Chinese kaoliang line Ji2731 was found to harbor a lot more genetic variation than the other two lines (Additional file 6). Further genomewide analysis with a panel of sweet and grain sorghum lines, close relatives of sorghum, as well as Chinese kaoliang is required to illustrate the complex relationships.

Conclusions
We report here a whole genome map of SNPs, indels, PAVs and CNVs amongst elite sorghum lines, which can be used as a framework for future comparative and functional genomics. Sorghum is an important global crop, used for food, fodder, the production of alcoholic beverages, as well as biofuels. Genome-wide comparison studies with trait data of elite sorghum lines using the SNPs, indels, PAVs and CNVs discovered here will provide additional clues to the molecular basis of the remarkable traits of sweet sorghum and will provide a powerful source for association genetics and discovery of alleles, which can be combined to achieve crop improvement in the future.

Plant materials and sequence data sets
Four sorghum (S. bicolor) accessions were used in this study. Btx623 is an elite grain sorghum line used for whole-genome sequencing by the Joint Genome Institute and for making several mapping populations [38]. Seeds were imbibed and germinated at 25°C in darkness under standard glasshouse conditions for 4 days and 10-day-old etiolated seedlings were harvested for DNA isolation using the CTAB (Hexadecyl trimethylammonium bromide) buffer method. Following quality assessment, genomic DNA was randomly fragmented using sonification and size-fractionated through electrophoresis and DNA fragments of the desired length were gel purified. Adapter ligation and DNA cluster preparation were performed and subjected to Solexa sequencing according to the supplier's protocol. The BTx623 reference genome sequences were downloaded from the Joint Genome Institute Phytozome website [46].

Bioinformatics pipeline SNP detection
We used a three-step procedure to detect high-quality SNPs. First, we calculated the likelihood of each accession's genotype using SOAPsnp [43]. Based on the alignment results, with consideration and analysis of data characteristics, sequencing quality and other experimental influences, the Bayesian model was applied to the actual data to calculate the probability of genotypes. The genotype with the highest probability was selected as the genotype of the sequencing individual at the specific locus and a quality value was designated accordingly to reveal the accuracy of the genotype. Second, using the consensus sequence, a polymorphic locus against the reference sequence was selected. Third, on the basis of the resequencing data of three accessions, sites with sufficient quality, called effective sites, were used for SNP determination. Sufficient quality was based on the following criteria: 3 ≤ depth ≤ 50, with depth calculated using data from each individual, and average mappable sites < 1.5. Candidate SNPs were those with sequencing depth of 3 to 50 for each accession and an average quality for the novel allele > 20. To exclude SNP calling errors caused by incorrect mapping or indels, we did not call two adjacent SNPs that were separated by < 5 bp. The remaining SNPs were defined as high quality SNPs. We performed SNP calling for each of the three accessions. These SNPs were used to calculate the whole genome SNP number and for further analysis. In SOAPsnp, a sum rank test was used to check the heterozygous sites of the called consensus. The read depth was used to filter the candidate SNPs and to obtain high accuracy heterozygous SNP sites. For a diploid genome, a site is considered heterozygous if each allele is supported by at least three reads.

SNP annotation
The localization of SNPs in coding regions, non-coding regions, start codons, stop codons and splice sites was based on annotation of gene models as provided by the Sorghum bicolor Genome Database [59]. Gene family annotation data of genes were also retrieved from this database.

Short indel detection
For short indel identification, mapped reads that met the pair-end requirements and contain alignment gaps in one end were necessary. We first mapped the paired-end reads to the reference sequence by allowing up to 10-bp gaps, and then merged these redundant pairs prior to looking for indels. Subsequently, gaps that were supported by at least three non-redundant paired-end reads were extracted. A potential indel was identified when the number of the un-gapped reads that crossed a potential indel was no more than twice that of the gapped reads. For quality control, the final list of indels included only those identified on both strands by paired-end reads.

Presence/absence variation detection
According to the principal of paired-end sequencing, one of the paired-end reads should normally be aligned onto the forward sequence, while the other should be aligned onto the reverse sequence. The distance between the two aligned positions at the reference should be in accordance with the insert size. Thus, two paired-end reads aligned to the genome should have a normal orientation and appropriate span. If the orientation or span of the two paired-end reads is different from expectation for the alignments results, the region might then have structure variation. The abnormal paired-end alignments are analyzed by clustering and compared with the types of structure variation previously defined. In this manner, PAVs could be detected, with support from at least three abnormal paired-end reads. In this study, PAVs that were supported by at least six pairedend reads were thought to be of high quality and selected as the final PAVs.