Genome sequencing of the important oilseed crop Sesamum indicum L.
© BioMed Central Ltd 2013
Published: 31 January 2013
The Sesame Genome Working Group (SGWG) has been formed to sequence and assemble the sesame (Sesamum indicum L.) genome. The status of this project and our planned analyses are described.
Sesame (Sesamum indicum L., 2n = 26), which belongs to the Sesamum genus of the Pedaliaceae family, is one of the oldest oilseed crops and is cultivated in tropical and subtropical regions of Asia, Africa and South America [1, 2]. Its cultivation history can be traced back to between 5,000 and 5,500 years ago in the Harappa Valley of the Indian subcontinent. The total area of sesame harvested in the world is currently 7.8 million hectares, and annual production is 3.84 million tons (2010, UN Food and Agriculture Organization data). As one of the four main sesame-producing countries, China has contributed 15.2 to 32.5% of total world sesame production over the past 10 years (2001 to 2010, UN Food and Agriculture Organization data). Sesame has one of the highest oil contents: decorticated seeds contain 45 to 63% oil. The seed is also rich in protein, vitamins (including niacin), minerals and lignans (such as sesamolin and sesamin) [4–7], and it is a popular food and medicine [8–13]. Sequencing and analysis of the sesame genome is essential if we are to elucidate the evolutionary origins and characteristics of the sesame species.
Sesamum is the main genus in the family Pedaliaceae, which contains 17 genera and 80 species of annual and perennial herbs that are distributed in the Old World tropics and subtropics. The taxonomy and cytogenetics of the Sesamum genus have been reviewed and debated for a long time [1, 14–17], and many heterogeneous landraces present in various growing areas still need to be distinguished [1, 18]. S. indicum is the sole cultivated species in the Sesamum genus and evolved from wild populations [14, 19]. However, the origin and evolution of cultivated sesame are still unclear and require more detailed investigation [1, 15]. Evidence suggests that sesame may have originated in either India or Africa [3, 20–26]. Bedigian reported that sesame was derived from the Indian subcontinent (the western Indian peninsula and parts of Pakistan) thousands of years ago, and believed that the progenitor of sesame is a taxon named S. orientale var. malabaricum Nar. [22, 23], although most species of Sesamum and genera of the Pedaliaceae are native to Africa [27–29]. We hope to clarify the origin and phylogeny of S. indicum by applying comparative genomics and morphological and cytological analyses.
Sesame seed is commonly known as the 'Queen of the oil seeds', perhaps for its resistance to oxidation and rancidity. As it contains lignans, sesame oil also exhibits anti-cancer properties both in vitro and in animal bioassays [30–34]. Compared with peanut (Arachis hypogaea), soybean (Glycine max), oilseed rape (Brassica napus), sunflower (Helianthus annuus L.) and other oilseed crops, sesame seed oil has an ideal, nearly equal content of oleic acid (18:1) (39.6%) and linoleic acid (18:2) (46.0%), and has desirable physiological effects, including antioxidant activity, and blood pressure- and serum lipid-lowering potential [2, 35, 36]. Studies of the genome and functional genome of sesame are essential for elucidating the regulatory mechanisms underlying fatty acid and storage protein composition and content, and the secondary metabolism of antioxidant lignans [37–40].
Sesame grows well and gives good yields in both tropical and temperate climates. Its tolerance of drought and high temperatures makes sesame well suited to land where few other crops can survive. However, compared with other oilseed crops, sesame seed production is not consistent, as it is susceptible to pathogens, waterlogging and low temperature conditions. Sesame breeding objectives, like those for other seed-producing crops, especially oil crops, are to create new varieties with high quality and yield potential, and resistance to pathogens (including Fusarium wilt and charcoal rot diseases), insect pests, waterlogging, drought and low temperature stress [37, 42–45]. However, identification of genes or gene families and marker loci associated with yield, quality, and resistance to disease and abiotic stresses has been hampered by a lack of information on the sesame genome. Only a few functional genes, mainly involved in the formation and regulation of fatty acids, seed storage proteins and secondary metabolites, and salt stress response, have been investigated [46–54]. With the exception of a sole amplified fragment length polymorphism (AFLP) marker associated with the indehiscent-capsule trait reported in 2003, no quantitative trait loci have been found in the linkage map of sesame, let alone used for molecular-assisted selection (MAS) in sesame breeding programs. Integrating desirable qualities from the few available excellent germplasm resources, including wild species, will not be achievable rapidly unless considerably more genomic and functional genomic information is available. In addition, sequencing of the sesame genome will facilitate studies of other genera of the Pedaliaceae family by providing a closely related reference genome.
We therefore plan to implement a Sesame Genome Project and sequence the S. indicum genome using the Chinese domestic cultivar, Yuzhi 11, which represents S. indicum cultivars with a simple stem, three flowers per axilla, oblong-quadrangular capsules, and white flower and seed-coat color. Yuzhi 11 is one of the most important Chinese cultivars due to its high oil content (56.66%), resistance to fungal pathogens such as Fusarium wilt, charcoal rot and Alternaria leaf spot, and waterlogging stress. It is cultivated in the main production regions of China [56, 57].
The Sesame Genome Working Group (SGWG) comprises six major sesame research teams in China involved in investigating genetic diversity of germplasm resources, functional genomics, and biotic and abiotic resistance, in addition to sesame genome sequencing. All members of the SGWG work under the Toronto Statement for prepublication data release. The main goal of the Sesame Genome Project is to provide a fine map of S. indicum and facilitate global genomic and functional genomic studies. We have already released a preliminary draft assembly of the sesame genome that can be used according to the conditions outlined in this letter. A detailed plan for the Sesame Genome Project has been made available on our website.
Natural sesame species can be divided into three types based on chromosome number: 2n = 26 (for example, S. indicum, S. alatum), 2n = 32 (for example, S. prostratum, S. angolense) and 2n = 64 (for example, S. radiatum, S. schinzianum) [14, 37]. The basic chromosome numbers in the Sesamum genus are X = 8 and 13, with X = 13 probably resulting from ancient polyploidy. The size of a haploid genome of S. indicum (2n = 26) was reported to be about 0.95 Gb, with a mass of 0.97 pg, which is out of proportion with the 0.51 Gb and 0.97 Gb of Ceratotheca sesamoides (2n = 32) and S. radiatum (2n = 64), respectively. Before beginning this genome project, we examined the characteristics of sesame chromosomes using cv. Yuzhi 11. Results showed that its karyotype formula is 2n = 2x = 26 = 6m + 16sm + 4st, and chromosome length ranges from 1.21 to 2.48 μm (H Zhang, unpublished data). We distinguished and numbered the chromosomes with 45S rRNA, simple sequence repeat (SSR) and bacterial artificial chromosome (BAC) sequence probes using fluorescent in situ hybridization (FISH) and BAC-FISH techniques to facilitate super-scaffold assembly in the sesame genome (H Zhang, unpublished data). Using Arabidopsis thaliana, soybean (cv. Williams 82) and rice (cv. Nipponbare) as references, the genome size of S. indicum cv. Yuzhi 11 was estimated by flow cytometry to be about 369 Mb (H Zhang, unpublished data). From our preliminary sequencing data, we estimate the genome size to be approximately 354 Mb, close to this result (see below).
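The relationship between the DNA mass and genome size figures quoted above can be checked with a short calculation. The sketch below is illustrative only: it assumes the commonly used conversion of roughly 978 Mb per picogram of DNA, and the function and variable names are ours, not part of the project's tooling.

```python
# Assumption: 1 pg of double-stranded DNA corresponds to about 978 Mb.
MB_PER_PG = 978.0


def pg_to_mb(mass_pg: float) -> float:
    """Convert a DNA mass in picograms to an approximate size in Mb."""
    return mass_pg * MB_PER_PG


def relative_difference(a_mb: float, b_mb: float) -> float:
    """Relative difference between two genome size estimates, using a as base."""
    return abs(a_mb - b_mb) / a_mb


# Earlier report quoted in the text: 0.97 pg per haploid genome.
earlier_report_mb = pg_to_mb(0.97)

# Estimates for cv. Yuzhi 11 quoted in the text.
flow_cytometry_mb = 369.0
sequencing_mb = 354.0

print(f"0.97 pg is about {earlier_report_mb / 1000:.2f} Gb")
print(
    "Flow cytometry vs sequencing estimate differ by "
    f"{100 * relative_difference(flow_cytometry_mb, sequencing_mb):.1f}%"
)
```

Under this conversion, 0.97 pg agrees with the earlier figure of about 0.95 Gb, whereas the two estimates for Yuzhi 11 differ from each other by only a few percent.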
The sesame chloroplast genome was published recently. Sequencing of the chloroplast genome of S. indicum cv. Yuzhi 11 has also been performed (H Zhang, unpublished data), and will be used for raw read filtering and genome assembly in our Sesame Genome Project. A total of 86,222 unigenes with an average length of 629 bp are available and 46,584 (54.03%) unigenes have a significant similarity with proteins in the NCBI nonredundant protein database and Swiss-Prot database (E-value < 10⁻⁵). Before the beginning of this project, we sequenced sesame transcriptomes from 24 groups of S. indicum materials and treatments using Illumina paired-end sequencing technology to greatly enrich available information on the functional genome [40, 68], obtaining a 40 Gb dataset containing 42,566 unitranscript sequences. We also constructed a BIBAC (pCLD 04541) library of 80,000 clones with an insert size of 120 kb and a BAC (CopyControl™ pCC1BAC™) library of 57,600 clones with an insert size of 85 kb. The genome coverage of the two libraries was 27- and 13-fold, respectively (H Zhang, unpublished data). There are 45,093 S. indicum expressed sequence tags (ESTs) available in the NCBI EST database. Prior to our work, only two other S. indicum seed-specific cDNA libraries, including one full-length cDNA library, had been constructed, some clones of which were chosen at random and sequenced [38, 69]. In order to explore more genes involved in sesame growth and development, we constructed a full-length cDNA library of S. indicum cv. Yuzhi 11 containing 300,000 clones, 1,200 clones of which were selected randomly and sequenced (H Zhang, unpublished data). The genomic and transcriptomic data from these studies should facilitate genome assembly and analysis. The first sesame linkage map, which contains 284 microsatellite polymorphic loci, was set up in 2009 and has been used as a landmark frame for assembly of the whole genome.
We recently updated this high-density linkage map with 653 SSR, SNP, AFLP and random selective amplification of microsatellite polymorphic loci (RSAMPL) markers falling into 14 linkage groups to facilitate sesame genome assembly and anchoring of trait loci (H Zhang, unpublished data).
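The fold-coverage figures quoted for the two large-insert libraries follow from clone number, insert size and genome size. The sketch below reproduces that arithmetic. It is a minimal illustration: the text does not state which genome size estimate was used, so both the 354 Mb and 369 Mb estimates are tried, and the Clarke-Carbon formula is included as a standard companion calculation rather than something taken from the project itself. All names are hypothetical.

```python
def library_coverage(n_clones: int, insert_kb: float, genome_mb: float) -> float:
    """Fold coverage = total cloned DNA / haploid genome size."""
    return n_clones * insert_kb / 1000.0 / genome_mb


def prob_locus_represented(n_clones: int, insert_kb: float, genome_mb: float) -> float:
    """Clarke-Carbon estimate: P = 1 - (1 - f)^N, with f = insert / genome."""
    f = insert_kb / 1000.0 / genome_mb
    return 1.0 - (1.0 - f) ** n_clones


# Library figures quoted in the text: (clones, mean insert size in kb).
libraries = {
    "BIBAC (pCLD 04541)": (80_000, 120.0),
    "BAC (pCC1BAC)": (57_600, 85.0),
}

for genome_mb in (354.0, 369.0):  # the two genome size estimates in the text
    for name, (clones, insert_kb) in libraries.items():
        cov = library_coverage(clones, insert_kb, genome_mb)
        p = prob_locus_represented(clones, insert_kb, genome_mb)
        print(f"{name}, genome {genome_mb:.0f} Mb: {cov:.1f}-fold, P(locus) = {p:.6f}")
```

With either genome size the results land close to the reported 27- and 13-fold values, so the choice of estimate changes the figures only slightly.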
Table: Summary of Illumina data for the S. indicum genome. Columns: Library type (n); Insert size (bp); Usable bases (Gb). Platform: Illumina Genome Analyzer (Solexa).
Table: Overview of the current draft assembly of S. indicum. Rows: Estimated genome size (Mb); Genome assembly length (Mb).
The second phase will involve Roche 454 pyrosequencing, BAC sequencing and fine map construction. We have constructed Roche 454 paired-end libraries with an insert size of 20 kb and will generate 3.5 Gb of data, giving approximately 10× coverage of the estimated 354 Mb genome. We also plan to end-sequence 40,000 sesame BAC clones using conventional Sanger sequencing, giving a 12× coverage of the estimated genome. To ensure hybrid de novo assembly of the best possible quality, we will use a modified Celera Assembler pipeline. Roche 454 paired-end reads and BAC-end reads are better for spanning longer repetitive elements and joining scaffolds into superscaffolds. We will use BAC-end information to retrieve and select 1,000 specific BAC clones, one end of which aligns well to the scaffold while the other end is located in a gap region, for full-length sequencing using the Illumina BAC pooling method. The full-length BAC sequences will fill in the gaps within superscaffolds and greatly improve genome integrity. At this stage, we expect to obtain a fine map of Yuzhi 11 with 800 to 1,000 superscaffolds of a putative N50 length of 1 Mb and N90 length of 250 kb.
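For readers unfamiliar with the N50 and N90 targets mentioned above, the following sketch shows how these statistics are usually computed from a list of scaffold lengths. It is a generic illustration under the standard definition (the length L such that scaffolds of length L or longer contain at least the given percentage of the total assembly). The toy lengths are invented for the example and are not project data.

```python
from typing import Sequence


def nxx(lengths: Sequence[int], percent: int) -> int:
    """Return the Nxx statistic (e.g. percent=50 for N50) of scaffold lengths."""
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= total * percent:
            return length
    return ordered[-1]  # unreachable for valid input; kept for completeness


# Hypothetical scaffold lengths (arbitrary units), total = 300.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nxx(toy_lengths, 50))
print("N90:", nxx(toy_lengths, 90))
```

Because N90 must account for 90% of the assembly, it is always less than or equal to N50, which is why the expected N90 (250 kb) is smaller than the expected N50 (1 Mb).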
In the final phase, the superscaffolds will be anchored to chromosomes. We will first anchor the BACs containing mapped SSR markers on the updated linkage map (H Zhang, unpublished data). Physical distances between landmarks will then be determined. Furthermore, we will construct a physical chromosome map based on at least 1,000 BAC clones using information obtained from BAC-FISH and BAC-end sequences. At least one BAC per superscaffold will be anchored on the chromosomes to ensure all superscaffolds are anchored onto the 13 chromosomes. In order to validate the accuracy and integrity of the sesame genome assembly, several quality control parameters, such as read depth of coverage, average quality values per contig, discordant read pairs and gene footprint coverage, will be examined. To check the accuracy of the assembly of scaffolds, we will also complete full-length sequencing of 15 BAC clones using conventional Sanger sequencing and align them to the scaffolds.
The blueprint for the Sesame Genome Project was conceived and designed by the SGWG in 2009. We completed the goals of the first phase in March 2012. In the second phase, Roche 454 paired-end reads will be sequenced by December 2012, and the double-ended sequencing of the 40,000 BAC clones and full-length sequencing of 1,000 BAC clones will be completed by June 2013. The final phase of scaffold anchoring will proceed in parallel with bioinformatics analysis. We expect to complete all the goals of the Sesame Genome Project and submit a paper by December 2013. To make our data broadly available prior to publication, the completion of each goal of these phases will be publicly communicated via our website. Updated versions of assembly data will be made available to any independent research groups performing non-genome-scale analyses. Sequence data and the preliminary assembly produced in the first phase are already available on the website.
Table: Repeats derived from de novo and homology-based predictions in S. indicum. Columns: Length occupied (bp); Percentage of sequences. Rows include: Total bases masked; Total interspersed repeats.
To control the quality of the raw data, the SolexaQA package was used to verify the sequence data generated from each of the 17 Illumina-Solexa libraries. The raw reads were trimmed with DynamicTrim (quality threshold Q ≈ 20) and then filtered with LengthSort (length cutoff of 25 bp). Reads left unpaired at this step were screened out and discarded. Meanwhile, Roche 454 read data, which are kept in Standard Flowgram Format (SFF), were converted into FASTQ format and evaluated using the traditional quality metrics. As Sanger reads may contain vector sequences, the Lucy package was used to detect and trim vector sequence contamination. Low-quality bases and chimeric reads will be handled by the trim modules of the Celera Assembler.
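The trimming and length-filtering logic described above can be sketched in a few lines. This is not SolexaQA itself: it is a simplified re-implementation of the idea (keep the longest contiguous stretch of bases at or above a quality threshold, then drop any pair in which either mate falls below a length cutoff). It assumes Phred+33 quality encoding, a threshold of 20 and a cutoff of 25 bases, and all function names are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[str, List[int]]  # (sequence, per-base Phred scores)


def phred33(quality_string: str) -> List[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def dynamic_trim(seq: str, qual: List[int], threshold: int = 20) -> Read:
    """Crop a read to its longest contiguous run with quality >= threshold."""
    best_start, best_len = 0, 0
    run_start = None
    # A sentinel score of -1 closes a run that extends to the end of the read.
    for i, q in enumerate(list(qual) + [-1]):
        if q >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    end = best_start + best_len
    return seq[best_start:end], list(qual[best_start:end])


def filter_pairs(pairs: List[Tuple[Read, Read]], min_len: int = 25,
                 threshold: int = 20) -> List[Tuple[Read, Read]]:
    """Trim both mates and keep the pair only if both remain >= min_len."""
    kept = []
    for (seq1, qual1), (seq2, qual2) in pairs:
        mate1 = dynamic_trim(seq1, qual1, threshold)
        mate2 = dynamic_trim(seq2, qual2, threshold)
        if len(mate1[0]) >= min_len and len(mate2[0]) >= min_len:
            kept.append((mate1, mate2))
    return kept


trimmed_seq, trimmed_qual = dynamic_trim(
    "ACGTACGTAC", [30, 30, 10, 30, 30, 30, 30, 5, 30, 30]
)
print(trimmed_seq, trimmed_qual)
```

Discarding the whole pair when one mate fails keeps the two read files synchronized, which paired-end assemblers generally expect.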
We validated the coding region coverage of the draft assembly using two different gene footprint coverage methods. Using the Core Eukaryotic Genes Mapping Approach (CEGMA), 444 (96.9%) of the 458 core eukaryotic genes (CEGs) mapped against the draft assembly were identified. An RNA-Seq-based method employing Velvet and OASES allowed us to assemble 3.5 Gb of RNA-Seq reads (NCBI accession SRX061117) into 99,589 putative transcripts. Putative transcripts were then translated into 82,549 peptides using ESTScan (version 2.1). These peptides were aligned against the SWISS-PROT database using BLAST (E-value 10⁻⁵) to obtain high-confidence peptides. Redundant peptides (such as those from alternatively spliced transcripts) were filtered according to BLAST scores and the names of the hits. More than 99.5% of the 3,584 peptides obtained could be aligned to the draft assembly using GMAP. The above results indicate that the draft assembly has a high coverage of the coding region.
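The peptide filtering step can be illustrated with a small sketch. It is an assumption-laden illustration rather than the pipeline actually used: it presumes BLAST results in the standard 12-column tabular format, treats the E-value cutoff as an upper bound of 1e-5, keeps the best-scoring hit per peptide, and then collapses peptides that share the same database hit, keeping the highest-scoring one. The function names and example identifiers are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Tuple


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float]]:
    """Map each query peptide to its best (subject, bit score) passing the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def collapse_by_subject(best: Dict[str, Tuple[str, float]]) -> Dict[str, Tuple[str, float]]:
    """Keep one representative peptide per subject: the one with the top score."""
    representative: Dict[str, Tuple[str, float]] = {}
    for query, (subject, bitscore) in best.items():
        if subject not in representative or bitscore > representative[subject][1]:
            representative[subject] = (query, bitscore)
    return representative


# Invented example rows: qseqid, sseqid, pident, length, mismatch, gapopen,
# qstart, qend, sstart, send, evalue, bitscore.
example = "\n".join([
    "pepA\tP1\t90\t100\t0\t0\t1\t100\t1\t100\t1e-30\t200",
    "pepB\tP1\t85\t100\t0\t0\t1\t100\t1\t100\t1e-20\t150",
    "pepC\tP2\t40\t50\t0\t0\t1\t50\t1\t50\t0.01\t30",
    "pepA\tP3\t60\t80\t0\t0\t1\t80\t1\t80\t1e-10\t90",
])
hits = best_hits(io.StringIO(example))
print(collapse_by_subject(hits))
```

In this toy input, pepC is dropped by the E-value cutoff, and pepB is treated as redundant with pepA because both hit the same database entry with a lower score.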
Table: Predicted genes in S. indicum. Rows: Average gene length (kb); Average number of introns per gene; CDS GC (%); Average length of introns (bp); Average length of exons; Average length of CDS.
We plan to address several key biological questions specific to sesame using these new genome and transcriptome data. We will compare the sesame genome with the genomes of monocotyledonous and other dicotyledonous plants to elucidate the phylogeny of the Sesamum genus and the origin of S. indicum. We will also perform more detailed investigations on the formation and regulation of fatty acids, storage proteins and secondary metabolites (including sesamin) in sesame. We will apply the genomic information obtained in this project in sesame breeding programs, paying particular attention to the induction and regulation of resistance to the main sesame diseases, including Fusarium wilt and charcoal rot diseases, and the environmental stress of waterlogging. Other possible uses of the genomics dataset, such as determining the regulatory mechanisms of biological characteristics in Sesamum, including simple stem or branch, leaf shape, indeterminate growth habit, flower number per axilla, capsule carpel number, flower color and other species-specific traits, will not form part of our analysis. We believe that the main achievement of this project will be to markedly accelerate sesame genetic research and breeding. Members of the SGWG also hope to address additional questions about the relationship between sesame growth and environmental conditions, such as identifying which genes regulate low temperature responses and drought sensitivity.
This project is being conducted by the SGWG. We invite other research groups to access and use the draft assembly and raw read data, which have already been released. Any group performing non-genome-scale analyses, or investigating biological questions other than those stated above, is welcome to use our data without restriction. As a matter of courtesy and to avoid duplication of effort, we request that competing genome-scale projects or studies that overlap with the above stated research areas disclose their status to the SGWG consortium. Formal inquiries and requests to join the working group should be made to HZ. Updated versions of the genome assembly, further project descriptions and a complete list of current SGWG members dedicated to this project can be accessed on our website.
AFLP: amplified fragment length polymorphism
BAC: bacterial artificial chromosome
EST: expressed sequence tag
FISH: fluorescent in situ hybridization
RSAMPL: random selective amplification of microsatellite polymorphic loci
SGWG: Sesame Genome Working Group
SNP: single nucleotide polymorphism
SSR: simple sequence repeat
This work was supported by the earmarked fund for the China Agriculture Research System (CARS-15), China National '973' Project (2011CB109304), and Henan Zhongyuan Scholar Fund (092101211100) to HZ. HM was supported by a grant from the China National Key Technology R & D program (2009BADA8B04-03) and the earmarked fund for China Agriculture Research System (CARS-15). HL, QW and MY were individually supported by the earmarked fund for China Agriculture Research System (CARS-15). Special thanks to Dr Joy Fleming for helpful discussions and suggestions in the manuscript revision process.