Whole genome de novo assemblies of three divergent strains of rice, Oryza sativa, document novel gene space of aus and indica

Background The use of high throughput genome-sequencing technologies has uncovered a large extent of structural variation in eukaryotic genomes that makes important contributions to genomic diversity and phenotypic variation. When the genomes of different strains of a given organism are compared, whole genome resequencing data are typically aligned to an established reference sequence. However, when the reference differs in significant structural ways from the individuals under study, the analysis is often incomplete or inaccurate. Results Here, we use rice as a model to demonstrate how improvements in sequencing and assembly technology allow rapid and inexpensive de novo assembly of next generation sequence data into high-quality assemblies that can be directly compared using whole genome alignment to provide an unbiased assessment. Using this approach, we are able to accurately assess the ‘pan-genome’ of three divergent rice varieties and document several megabases of each genome absent in the other two. Conclusions Many of the genome-specific loci are annotated to contain genes, reflecting the potential for new biological properties that would be missed by standard reference-mapping approaches. We further provide a detailed analysis of several loci associated with agriculturally important traits, including the S5 hybrid sterility locus, the Sub1 submergence tolerance locus, the LRK gene cluster associated with improved yield, and the Pup1 cluster associated with phosphorus deficiency, illustrating the utility of our approach for biological discovery. All of the data and software are openly available to support further breeding and functional studies of rice and other species. Electronic supplementary material The online version of this article (doi:10.1186/s13059-014-0506-z) contains supplementary material, which is available to authorized users.

Conclusions: Many of the genome-specific loci are annotated to contain genes, reflecting the potential for new biological properties that would be missed by standard reference-mapping approaches. We further provide a detailed analysis of several loci associated with agriculturally important traits, including the S5 hybrid sterility locus, the Sub1 submergence tolerance locus, the LRK gene cluster associated with improved yield, and the Pup1 cluster associated with phosphorus deficiency, illustrating the utility of our approach for biological discovery. All of the data and software are openly available to support further breeding and functional studies of rice and other species.

Background
Rice (Oryza sativa) provides 20% of the world's dietary energy supply and is the predominant staple food for 17 countries in Asia, 9 countries in North and South America and 8 countries in Africa. Within O. sativa, there are two major varietal groups, Indica and Japonica, that can be further subdivided into five major subpopulations: indica and aus share ancestry within the Indica varietal group, and tropical japonica, temperate japonica and aromatic (Group V) share ancestry within the Japonica varietal group ( Figure 1) [1][2][3]. The subpopulation structure of O. sativa is deep and ancient, with estimates of divergence showing average pairwise Fst values of 0.375 to 0.45 [1][2][3], compared with Fst values of 0.25 for dogs [4], around 0.10 to 0.12 across human populations [5], or 0.08 to 0.09 for heterotic groups in maize [6].
The time since divergence of the ancestral Indica and Japonica gene pools is estimated at 0.44 million years, based on sequence comparisons between cv Nipponbare (Japonica) and cv 93-11 (Indica) [8]. This time estimate pre-dates the domestication of O. sativa by several hundred thousand years, suggesting that rice cultivation proceeded from multiple, pre-differentiated ancestral pools [1,[9][10][11][12][13]. This is consistent with genome-wide estimates of divergence based on gene content [14], transcript levels [15], single nucleotide polymorphisms (SNPs) [3,16], and transposable elements [17]. This is also consistent with evidence from the cloning of dozens of genes underlying diverse quantitative trait loci (QTLs) [2,10,[18][19][20][21]. Despite ongoing debate about the precise moment and location of the first domestication 'event' in rice, these studies all demonstrate that natural variation in the rice genome is deeply partitioned and that divergent haplotypes can be readily associated with major varietal groups and subpopulations. The course of domestication, as rice transitioned from its ancestral state as a tropical, outcrossing, aquatic, perennial species to a predominantly inbreeding, annual species adapted to a wide range of ecologies, was punctuated by persistent episodes of intermating among the different subpopulations. This resulted in both natural and human-directed gene flow between the different gene pools, but the essential differentiation that distinguishes the Indica and Japonica genomes was maintained and reinforced over time as a result of numerous partial sterility barriers scattered throughout the genome [22][23][24][25].
A better understanding of the nature and extent of genome variation within the Oryza clade is critical for both practical and scientific reasons. While the OMAP project [26] is focused on documenting structural variation across 21 wild species of Oryza, relatively little effort has been made to explore the nature of structural variation within and between subpopulations of O. sativa. The high quality, bacterial artificial chromosome (BAC)-by-BAC sequence of the temperate japonica rice variety Nipponbare, generated by the International Rice Genome Sequencing Program (IRGSP) [27], and the shotgun assembly of an indica rice genome, cv 93-11, by Chinese scientists in 2005 [28,29] have served as 'reference genomes' for the rice research community. The availability of these reference genomes helped catalyze and unify rice research efforts for over a decade, and continue to serve as the backbone for re-sequencing efforts today [2,[30][31][32][33].
Recently, the resequencing of hundreds of wild and cultivated rice genomes using next generation sequencing (NGS) and various complexity-reduction and genotypeby-sequencing strategies have enriched the pool of sequence information available for rice [30,34,35]. However, the vast majority of resequenced genomes are aligned to and compared with the Nipponbare reference rather than being assembled de novo, including in our own previous work [35] and in the current 3,000 rice genomes project [36]. This introduces a potential bias due to significant differences in genome size [37,38] and structure [14,17,29,39] that characterize the different subpopulations and varieties of rice. Alignment to a single reference is particularly problematic when NGS data from indica, aus or divergent wild species genomes from the center of diversity of Oryza are aligned to the genetically and geographically divergent Nipponbare (temperate japonica) reference because of the potential for misalignment, and for elimination of critical sequences that cannot be aligned with confidence.
The type and distribution of structural variation that distinguishes one rice genome from another, both within and between the five subpopulations of O. sativa, remain largely unknown. Yet it is essential to understanding the genetic basis of heterosis, as well as to identify genes underlying many of the most significant phenotypic differences that are critical to global food security, including a plant's ability to grow in stressful environments afflicted by drought, submergence, low phosphorus and/ or disease. The only practical way to fully understand the genomic diversity of rice is to carry out whole genome shotgun sequencing and de novo assembly. This has been problematic until recently due to the difficulties in assembling the short reads initially provided by NGS. However, recent advances in NGS chemistry and in computational approaches to sequence assembly have significantly improved the power and reliability of de novo assembly of NGS data.
In this study we use these advances to de novo assemble three divergent rice genomes representing the indica (IR64), aus (DJ123) and temperate japonica (Nipponbare) subpopulations and to determine the extent and distribution of structural variation among them. These varieties were chosen for both biological interest and to facilitate evaluation of assemblies. On the biological side, different subpopulations of rice are adapted to different ecologies and geographies, and harbor different alleles and traits of interest for plant improvement [3,19,20,[40][41][42][43]. The aus subpopulation is of particular interest because it is the source of important alleles conferring disease resistance [44], tolerance to submergence [33], deep water [45], lowphosphorus soils [41], and drought [46]. Indica rice harbors the greatest amount of genetic variation [1,30] and accounts for the largest contribution to rice production globally. Our choice to sequence Nipponbare was due to the fact that it provided a high quality BAC-by-BAC sequence assembly [27] that served as a solid benchmark for assessing the quality of our three NGS assemblies and provided a context for understanding the impact of varying data sets and parameters used in the assemblies.

Results and discussion
De novo genome assemblies and functional annotation The three rice varieties were assembled using the ALLPATHS-LG whole genome assembler [47] using approximately 50× coverage of a 180 bp fragment library, approximately 30× coverage of a 2 kbp jumping library, and approximately 30× coverage of a 5 kbp jumping library (see Materials and methods). We selected this assembler based on its performance with these data compared with other assemblers and its high ranking in the Assemblathon I and II and GAGE evaluations [48][49][50]. The three assemblies were named Os-Nipponbare-Draft-CSHL-1.0, Os-IR64-Draft-CSHL-1.0, and Os-DJ123-Draft-CSHL-1.0, following nomenclature proposed by [51].
All three of our assemblies had excellent results: approximately 90% of each of the genomes were assembled into scaffolds at least 1 kbp long, with scaffold N50 sizes ranging from 213 kbp to 323 kbp, and contig N50 sizes ranging from 21.9 kbp to 25.5 kbp (Table 1). It is notable that an earlier assembly of the Nipponbare genome prior to sequencing the 5 kbp jumping library achieved a similar contig N50 size (21.2 kbp versus 21.9 kbp), but a substantially smaller scaffold N50 size (99 kbp versus 213 kbp) (also see Materials and methods). Improved scaffold sizes from including the larger library were expected, although the magnitude depends on the specific genome characteristics. Since the scaffolds were more than twice as large for Nipponbare with the larger library, this prompted us to sequence the 5 kbp jumping library for all three genomes to maximize our ability to identify genes and other features, as well as to structurally compare the genomes.
The assemblies were repeat-masked and annotated for protein-coding genes using the MAKER-P automated pipeline [52], combining both evidence-based and ab initio methods (Table S1 in Additional file 1). In addition to EST and full-length cDNA, we included as evidence the two published annotations of Nipponbare [51], and the published annotations of strains 93-11 and PA64s [28], thereby maximizing consistency and reducing bias of annotation across the three assemblies. Putative transposon-encoded genes were screened following analysis of InterPro domains (see Materials and methods), which flagged approximately 1% of initial gene calls in each of the three assemblies. Summary statistics for remaining genes are provided in Table 1 and in Table S2 in Additional file 1. Gene counts ranged from 37,758 (IR64) to 39,083 (Nipponbare), similar to the numbers reported by the Michigan State University (MSU) Rice Genome Annotation Project and Rice Annotation Project for the Os-Nipponbare-Reference-IRGSP-1.0 (39,102 and 35,681 respectively) [51]. Overall statistics for structural features, such as exons, introns, and coding regions were highly consistent between the three assemblies and with published annotations. For instance, average translated protein lengths compared across MSU, Rice Annotation Project, and the three de novo assemblies ranged from 280 to 288 amino acids (median values: 268 to 291 amino acids), suggesting that contiguity of the de novo assemblies did not limit ability to identify protein-coding genes. For each assembly, 61 to 62% of annotated loci possessed one or more InterPro domains and 77% showed homology to plant NCBI RefSeq genes.

Whole genome comparison to Nipponbare reference genomes
We evaluated the agreement between our de novo assemblies to the Nipponbare reference sequences using the GAGE assembly evaluation algorithm [50]. As expected, the de novo Nipponbare assembly very closely matches the reference Nipponbare sequence, with a 99.94% average identity and only 0.31% of the assembly not aligning to the reference (Tables 2, 3 and 4). Even at this very high agreement, there are several tens of thousands of small variations, and several hundred larger variations. These variations are a combination of true variations from our sample relative to the reference genome, of which we expect there to be few, and errors from ALLPATHS-LG when used with these libraries and coverage levels. Consequently, considering that the assembly has a 99.94% overall similarity, the upper-bound on the error rate of sequencing and assembling with ALLPATHS-LG is at most 0.06%. The portions of the reference genome without any alignments from our Nipponbare assembly are scattered throughout the genome in 57,821 segments averaging 203 bp long. However, of this, only 301,525 bp are annotated to be within the coding sequence (CDS; 0.72% of the total CDS), and another 12,344 bp are annotated to be within non-coding exons. We further evaluated the unaligned regions by computing their read k-mer coverage from a sample of 400 million unassembled Nipponbare reads, and found the mean k-mer coverage of these regions exceeds 12,000×, while the mode k-mer coverage of the set is less than 100× ( Figure S1 in Additional file 1). A full two-thirds (38,373/57,821) of these regions exceed 1,000× k-mer coverage, more than 10 times higher than unique segments of the genome. This implies the unassembled/unaligned regions are highly enriched for high copy repeats too complex to be assembled. In contrast, the genic regions are very well represented, suggesting it would be possible for a detailed analysis of the 'gene space' of the accessions from these assemblies. Tables 2, 3 and 4 summarize the alignments of the three de novo assemblies relative to the reference IRGSP-1.0 Nipponbare assembly. As expected, the IR64 and DJ123 assemblies show noticeably lower overall identity, and have considerably more unaligned bases. The average k-mer coverage of the unaligned bases indicates most regions are unassembled high copy repeats, although there are 11.8 Mbp and 12.3 Mbp unaligned reference bases in IR64 and DJ123, respectively, that are not repetitive based on the k-mer analysis. This suggests there may be megabases of sequence specific to each of the three genomes.

Whole genome comparison to indica reference genomes
Using the same methods used for comparing to the reference Nipponbare genome, we also evaluated the three genomes relative to the reference indica genome (cv 93-11) [27] (Tables 5, 6 and 7). The agreement between the de novo IR64 assembly and the reference indica sequence is appreciably less than the Nipponbare-Nipponbare alignment; 4.31% of the IR64 assembly does not align to the 93-11 reference and the aligned regions have only 99.52% identity between these two indica varieties. Since the assemblies and alignments were computed with the same sample preparation and analysis algorithms, this suggests there are more true biological variations between IR64 and 93-11 (as would be expected from two different varieties), and/or that the 93-11 reference assembly is not as complete nor as accurate as the reference Nipponbare assembly. The later explanation is quite likely to be a contributing factor, given the fact that the 93-11 genome represents a whole genome shotgun assembly, while the Nipponbare genome utilized a combination of BACs and whole genome shotgun sequencing. For example, the 93-11 assembly has 14.1 million unresolved ('N') bases, while the Nipponbare reference has only 118,200. As seen with Nipponbare, most of the unassembled/unaligned bases between the 93-11 reference and our assemblies are repetitive with mean k-mer coverage over 14,000×. A quarter of the unaligned references bases (7.75 Mbp/31 Mbp) are non-repetitive from the k-mer analysis, while less than 900 kbp of the unaligned reference Nipponbare genome are not repetitive. This underscores the fact that there are substantially more true biological differences between IR64 and the reference 93-11 indica assembly than our Nipponbare sample and reference.
Finally, the comparison between the IR64 and DJ123 assemblies shows that they differ from each other nearly as much as either one differs from the Nipponbare reference sequence. These results suggest that the aus genome harbors a greater amount of novel variation than previously recognized. It also highlights the value of taking an unbiased, de novo assembly approach when evaluating  Mean k-mer coverage was evaluated by counting the k-mers in a sample of 400 M unassembled reads in each of the three genomes, and evaluating those counts along the reference sequence.
genomic variation among varieties and subpopulations to capture genome-specific variations.

Pan-genome analysis
We next evaluated the 'pan-genome' of the three de novo assemblies to identify sequences that were conserved across the genomes as well as sequences specific to just one genome (see Materials and methods). Using the whole genome alignment information, we classified each base of each genome as being specific to that genome (unaligned to either other genome), or shared by one or both genomes. The majority of the assembled sequences (approximately 302 Mbp per genome) and exonic sequences (approximately 55.5 Mbp per genome), were shared among the three genomes, although 4.8 Mbp to 8.2 Mbp (423 kbp to 930 kbp exonic) were found to be genome-specific (Figure 2A). Since a gene sequence may be partially shared or partially genome-specific, we assigned each gene to the sector on the Venn diagram for which the majority of the exonic bases were assigned over all transcripts associated with each gene. For example, if 90% of a gene is shared among all three genomes, but 10% is genome-specific, we would assign it to the center (fully shared) sector under the majority rule. This will not necessarily characterize changes in gene function if critical protein domains are shared/unshared, but highlights the major trends between the lineages and discovers 297 to 786 genome-specific loci.
Using the same k-mer analysis techniques we applied for the reference analysis, we further classified the genomespecific bases as being unique or repetitive, using a threshold of 100× average k-mer coverage to classify unique sequences. From this, we identified only 1.2 Mbp to 1.5 Mbp of non-repetitive sequence specific to each genome, meaning that most of the genome-specific bases were actually repetitive (Table 8). Since repetitive sequences are also the most likely to be unassembled, as observed in our comparison to the reference genomes, we further examined the genome-specific exonic bases and refined our initial estimates to 555 kbp to 760 kbp of nonrepetitive, genome-specific sequences intersecting annotated genes by at least 100 bp (Table 9). Note these segments may include flanking promoter and other regulatory regions in addition to the exons themselves. From this catalog, we selected 10 of the largest regions in each of the genomes for PCR validation, and were able to confirm the computational analysis with 100% success (Figure 3; Table S4a-c in Additional file 1).
For Nipponbare and IR64, we determined the positions of the non-repetitive segments along the different reference chromosomes, and found the segments were broadly distributed. For Nipponbare, we could localize 2,208 of the genome-specific regions, and found that one region occurred, on average, every 162 ± 362 kbp, following an approximately exponential distribution (data not shown). For IR64, we could localize 1,074 of the genome-specific regions, and found one region occurred, on average, every 338 ± 752 kbp, also from an approximately exponential distribution. The distributions suggest that the genome-specific bases are not highly localized, as an exponential distribution in spacing can occur if there is a uniform probability distribution of a site occurring at any position at random.
Genome-specific loci, as well as those shared between two genomes but not the third, exhibited shorter CDSs and greater novelty compared with genes shared among all three genomes ( Figure 2B). For example, loci common to all genomes had a median coding length of 888 bp compared with median values ranging from 483 to 654 bp for the genome-specific gene sets. Likewise, the core, fully shared set of genes averaged 4.9 exons per transcript compared with a range of 2.9 to 3.0 genes per transcript amongst genome-specific genes. A smaller fraction of genome-specific loci contained InterPro domains compared with the core set (40% versus 63%), and fewer showed homology to plant RefSeq proteins (57% Repeats were evaluated by analyzing their k-mer coverage using the same method as above. versus 79%). However, artifacts of inaccurate annotation may contribute to this trend [53], so we investigated if these differences were negatively influenced by assembly quality, especially if genome-specific genes tended to terminate in scaffold gaps more frequently than core genes. We observe a modest effect, and genes shared by all three strains have a median distance of 12 to 14 kbp (5′ and 3′ flanking distances), whereas genome-specific genes have a median distance of 8 to 10 kbp (Table S7 in Additional file 1). Only 7 to 12% of the genes have a flanking distance of less than 1 kbp, suggesting this is not a major factor in our analysis. Indeed, our results are similar to studies in yeast, Drosophila, and vertebrates that have found that novel and recently evolved genes tend to encode smaller proteins than conserved or ancient genes [53][54][55].
To characterize potential function of genome-specific genes we further examined genes with annotated Inter-Pro domains. Notably, genes with domains related to disease resistance were the most prevalent type among genome-specific genes. For example, 12% of genes specific to IR64 possessed the NB-ARC motif (IPR002182), the central nucleotide-binding domain of plant R-genes. This domain, and others associated with R-genes, also prevailed among the DJ123-specific and Nipponbarespecific gene sets, accounting for 9% and 5% of genes, respectively. In contrast, only 0.35% of genes shared among all three genomes encode the NB-ARC domain. Genes shared between just two genomes showed intermediate frequencies of disease resistance genes (1.5 to 2.5%). Similar distributions were seen for genes classified with the Gene Ontology (GO) term 'defense response' (GO:0006952). These results are consistent with Ding et al. [14], who showed high levels of 'genome asymmetry' among R genes when comparing the Nipponbare and 93-11 reference assemblies. A large diversity of other protein domain classes, such as those associated with receptor and non-receptor protein kinases, transcription factors, metabolic enzymes, proteases, and transporters, were also found in the genome-specific gene sets. A complete listing of putative strain-specific genes, their InterPro domains, GO terms, and summary of homology search results are provided in Additional file 2. We anticipate these findings will greatly enhance the ongoing 3,000 rice genomes project [36] and other resequencing projects that   had previously focused on single nucleotide variations relative to the Nipponbare reference.

Detailed regions
We chose four agronomically relevant regions of the rice genome that were previously reported to harbor differences among the three varieties or subpopulations to illustrate the utility of these high quality whole genome assemblies for understanding the variation in genome structure underlying salient phenotypic variants.

S5 hybrid sterility locus
S5 is a major locus for hybrid sterility in rice that affects embryo sac fertility. Genetic analysis of the S5 locus documented three alleles: an indica (S5-i), a japonica (S5-j), and a neutral allele (S5-n) [23,56]. Hybrids of genotype S5-i/S5-j are mostly sterile, whereas hybrids of genotypes consisting of S5-n with either S5-i or S5-j are mostly fertile. The S5 locus contains three tightly linked genes that work together in a 'killer-protector'-type system [57,58]. During female sporogenesis, ORF5+ (killer) and ORF4+ (partner) cause endoplasmic reticulum stress. ORF3+ prevents endoplasmic reticulum stress and allows the production of normal gametes, whereas the ORF3-allele cannot prevent it, resulting in embryo sac abortion. The ORF3-allele has a 13-bp deletion; the ORF4-allele carries an 11-bp deletion that causes a premature stop codon [58]. The ORF5 indica (ORF5+) and japonica (ORF5-) alleles differ by only two nucleotides, whereas the wide compatibility allele S5-n (ORF5n) has a large deletion in the amino terminus of the predicted protein, rendering it presumably non-functional [57]. The typical indica haplotype is ORF3+/ORF4-/ORF5+, while the typical japonica haplotype is ORF3-/ORF4+/ORF5-.
In each of the three de novo assemblies reported here the S5 locus containing the three genes lies within a single scaffold and haplotypes can be easily identified (Table S5 and Figure S2 in Additional file 1). The identity of the ORF5 alleles in Nipponbare, IR64 and DJ123 were also confirmed by Sanger sequencing from genomic DNA, and perfectly confirm the assembly results ( Figures S3 and S4 in Additional file 1). The Nipponbare assembly is in agreement with the Nipponbare IRGSP-1.0 reference sequence for the region and shows that it carries the typical japonica haplotype ORF3-/ORF4+/ ORF5-. The IR64 assembly shows that this accession carries the typical indica haplotype ORF3+/ORF4-/ ORF5+. In the case of DJ123, our de novo assembly revealed that this aus accession carries the 136-bp deletion characteristic of the neutral allele, ORF5n. However, the DJ123 ORF5n allele is novel, as it differs from the reported ORF5n allele by two SNPs and one 10-bp deletion within the coding region of the gene (also confirmed by Sanger sequencing). The DJ123 haplotype for the locus is ORF3-/ORF4-/ORF5n, a haplotype previously identified by Yang et al. [58] in four accessions from Bangladesh. Although the accessions bearing this haplotype were referred to as indica in this study, they almost certainly belonged to the aus subpopulation.

Sub1 locus
The Submergence 1 (Sub1) locus on chromosome 9 is a major QTL determining submergence tolerance in rice [33]. The Sub1 locus is a cluster of three genes encoding putative ethylene response factors. Sub1B and Sub1C are present in all rice accessions tested to date, while Sub1A may be present or absent. Originally identified in the aus accession FR13A, Sub1A appears to be found only within the Indica varietal group [33]. Sub1A has two alleles: Sub1A-1 is found in submergence-tolerant varieties, while Sub1A-2 is found in intolerant varieties. A haplotype survey in O. sativa varieties also identified nine Sub1B and seven Sub1C alleles [33].
In the IR64 and DJ123 de novo assemblies reported here the Sub1 locus lies within a single scaffold and haplotypes can be easily identified (Table S5 in Additional file 1). In the IR64 assembly the Sub1A gene is present as the Sub1A-2 allele, previously identified in submergenceintolerant accessions including IR64 [33]. For the Sub1B and C genes, IR64 carries the alleles Sub1B-1 and Sub1C-3, as reported [33]. Sub1A is absent from the DJ123 assembly, suggesting that this aus variety is not submergence tolerant. DJ123 carries a novel Sub1B allele (Sub1B-10), and the previously identified Sub1C-6 allele. In the Nipponbare assembly, Sub1B and Sub1C lie within a single scaffold, and the alleles identified are in agreement with published results [33]. Nipponbare is not submergence tolerant and the Sub1A gene is absent in Nipponbare Identified sequences must be at least 100 bp long with no alignments to the other genomes, average between 10× and 100× k-mer coverage in that genome, and average below 10× k-mer coverage using the reads from the other two genomes. Regions identified to be specific to a given accession using the criterion used in Table 8 and that intersect an annotated gene region by at least 100 bp.
according to previous reports. Our de novo assembly is unresolved in the region that corresponds to the Sub1A gene, but a k-mer analysis using the methods and data applied above clearly shows a lack of coverage in the DJ123 and Nipponbare sequencing reads across the locus except for high copy repeats dispersed in the sequence (Figure 4, top and bottom). Conversely, the k-mer coverage of the IR64 assembly is uniformly at the single-copy coverage level (approximately 100×), except for a small number of localized gaps in coverage, corresponding to SNPs in IR64 relative to the reference Sub1A sequence, and the high frequency repeats (Figure 4, middle). In contrast, the k-mer coverage across Sub1B and Sub1C is consistently approximately 100×, except for isolated sharp gaps corresponding to variations relative to the reference sequences ( Figures S5 and S6 in Additional file 1).

LRK gene cluster
Fine-mapping of a yield-improving QTL on rice chromosome 2 identified a cluster of leucine-rich repeat receptor kinase genes [59], consisting of seven or eight intronless gene copies contained within a 40 to 50 kb genomic region. The QTL, originally introgressed from a wild rice accession (Dongxiang), was shown to increase grain yield of the recurrent parent Guichao2 (indica) by about 25%. The LRK locus in Dongxiang carries an extra gene, LRK1, absent from Guichao2. A survey of haplotype divergence in 13 rice accessions showed that LRK1 is absent in only three indica accessions, suggesting that these haplotypes may have originated via gene loss. In each of the three de novo assemblies reported here the LRK locus lies within a single scaffold and haplotypes can be easily identified (Table S5 in Additional file 1). The Nipponbare assembly is in agreement with the reference sequence, with the exception of regions that the de novo assembly was not able to resolve because of high copy repeats ( Figure S7 in Additional file 1). LRK1 is absent in the IR64 assembly as evident in the k-mer plot ( Figure S8 in Additional file 1), indicating that IR64 carries the sevengene haplotype identified in other indica accessions [59]. According to our assembly and the corresponding k-mer analysis, the aus accession DJ123 carries LRK1. Based on sequence variation on the 5′ upstream region of LRK4 and LRK6, we can predict that the DJ123 haplotype for the LRK gene cluster is closest to the haplotypes identified in indica accessions in which LRK1 is present (haplotypes A, B and C in Figure 3 of [59]).

Pup1 region
Phosphorus uptake1 (Pup1) is a major rice QTL associated with tolerance to phosphorus deficiency in soils [60,61]. The Pup1 locus is a large, 90 kb region originally identified in Kasalath, an aus variety that is tolerant to phosphorus deficiency, but is absent in phosphorus starvation-intolerant varieties, including Nipponbare [62]. A gene encoding a protein kinase, Pstol1, located within the 90 kb indel, is responsible for the P-uptake efficiency phenotype [41]. Of the three de novo assemblies reported here, the 90 kb indel is absent from both Nipponbare and IR64, but a large portion of it, including the Pstol1 gene, is present in the aus variety DJ123 ( Figure 5; Table S5 in Additional file 1). Although it is at least partially present, the region of the 90 kb indel described in Kasalath could not be fully resolved in our DJ123 assembly. This suggests that the 90 kb indel may be truncated and/or rearranged in some aus varieties. Interestingly, as shown in Figure 5, the Kasalath reference sequence contains unresolved gaps flanking regions of very high k-mer coverage; therefore, longer reads may be necessary to assemble this region with confidence. The Pstol1 gene sequence is complete in DJ123, and shows six SNPs relative to the Kasalath sequence (also apparent as abrupt drops in coverage in the k-mer coverage plot; Figures S9 and S10 in Additional file 1). These SNPs were confirmed via Sanger sequencing on genomic DNA ( Figure S11 in Additional file 1). One of these SNPs introduces a premature stop codon, resulting in a protein that is only 136 amino acids long (the intact PSTOL1 protein is 324 amino acids) and, therefore, presumably non-functional.

Conclusions
In this study we wanted to overcome the limitation on sequencing and comparison to a reference genome by instead analyzing high quality de novo assemblies of multiple rice genomes to observe biologically significant changes between them. The rice accessions sequenced were selected to represent the indica (cv IR64), aus (DJ123) and temperate japonica (Nipponbare) subpopulations ( Figure 1). The inclusion of the high quality, BAC-by-BAC assembly of Nipponbare and the shotgun assembly of 93-11 provided a control that allowed us to assess the quality of the different datasets and de novo assembly strategies. It is apparent from comparing different assembly software that ALLPATHS-LG gave the best results in our hands (Table S3 in Additional file 1). It is also apparent that the use of k-mer frequencies is a robust technique for characterizing repetitive regions, and enabled us to correctly characterize and validate genome-specific regions.
The three-way comparison among the different genomes was informative in identifying major shared and structurally variable regions of the rice genome. We were particularly interested in regions that were structurally unique to either the indica and/or the aus genome because they would likely have been discarded in previous re-sequencing efforts due to difficulties aligning their sequencing reads to the Nipponbare reference genome. This would be particularly true for longer genomespecific sequences, which would be completely absent in the alignments to the reference. We anticipate the ongoing 3,000 rice genomes project [36] will benefit greatly from having our assemblies available, especially so that they can map variations within regions not present in the Nipponbare reference as they have currently done. We also anticipate future studies will systematically perform follow-up functional studies of genome-specific gene loci as being likely candidates for phenotypic differences observed between the genomes.   Figure S9 in Additional file 1 for a detailed view of the Pstol1 coverage, and Figure S10 in Additional file 1 for a plot of the entire Kasalath sequence. Unresolved gaps in the reference sequence are indicated with black vertical bars. Only DJ123 has consistent coverage across this region, especially upstream of the gene, while the other two genomes show complete gaps in coverage.
Our analysis clearly demonstrates that the indica and the aus genomes are more distantly related than previously known. Because the aus subpopulation is phenotypically so similar to indica, the degree of genetic differentiation has been underappreciated by breeders and geneticists alike [43,63,64]. The unusual characteristics of the aus subpopulation, combined with evidence of unique aus alleles at loci such as Rc, conferring white versus colored pericarp [19], the Snorkel locus conferring deep water ability [45], the Pstol1 locus conferring phosphorus-update efficiency [41], or the Sub1 locus conferring submergence tolerance [33], all support the hypothesis that aus may have a unique domestication history compared to japonica and indica. These findings underscore the importance of recognizing genetic subpopulation structure to guide plant breeders in identifying novel sources of variation for traits of interest. In recent years, many key biotic and abiotic stress tolerance genes have been discovered in aus varieties [33,41,[44][45][46]. It is interesting to note that in several cases, the donor aus germplasm is referred to as indica, underscoring how indica and aus are often confused, as noted for the DJ123 haplotype of the S5 hybrid sterility locus (see above).
The overall annotation of our Nipponbare assembly is quite close to that of the reference Nipponbare genome. This illustrates that the approach we describe here provides a genome sequence of considerable vitality for further research. However, our contig N50 sizes (as opposed to scaffold N50) are still fragmented by the presence of repeats too long and too complex to be fully resolved in the short read assemblies. This somewhat limits application when studying large structural rearrangements, as exemplified by the Pup1 region that remains partially unassembled in the aus variety DJ123 (Figure 5), and the modest differences we observed in annotation quality between core genes and genome-specific genes. We anticipate that some combination of short-read NGS sequencing and newly emerging long read sequences, such as Pacific Biosciences Single Molecule Real Time Sequence [65], which can now produce reads approaching 100 kbp long, will soon overcome this limitation and provide assemblies approaching, or perhaps even surpassing, those provided by the vastly more expensive and time consuming BAC-by-BAC approach. Once this occurs it should spark an outburst of genomics studies of agronomically important plant genomes, greatly enriching our potential to understand their many unique qualities and characteristics and paving the way for enhanced utilization of natural variation in plant improvement.

Plant material
Three rice (Oryza sativa) accessions (Nipponbare, IR64, DJ123) were used in the study. Accession information (that is, Genetic Stocks Oryza (GSOR) identifier, accession name, country of origin, subpopulation) is summarized in Table 10 [63]. The plants were grown in the Guterman greenhouse facility at Cornell University, leaf tissue was harvested from one-month-old seedlings, ground in a mortar and pestle, and DNA was extracted using the Qiagen Plant DNeasy kit (Qiagen, Valencia, CA, USA).

DNA sequencing
The DNA sequencing was performed in the Cold Spring Harbor Laboratory Genome Center using Illumina HiSeq 2000 instruments. For each of the three varieties, three libraries were sequenced following the requirements and recommendations of the ALLPATHS-LG whole genome assembler: (1) a 180 bp fragment library sequenced as 2 × 100 bp reads; (2) an approximately 2 kbp jumping library sequenced as 2 × 50 bp reads; and (3) an approximately 5 kbp jumping library sequenced as 2 × 50 bp reads.
For the 180-bp overlap library the sample was mechanically fragmented by using the Covaris S2 System and then prepared based on the New England Biolabs NEB-Next Illumina library protocol and ligated to standard Illumina paired-end adapters. To maximize sample throughput the samples were size-selected in 50-bp windows between 290 and 310 bp using the Caliper XT instrument. Each library was PCR enriched for 12 cycles and quantified using the Bioanalyzer.
For the jumping libraries, the Illumina mate-pair library protocol was used. The DNA was fragmented into 2 kb and 5 kb segments. We again used the Covaris S2 System using programs that we developed in the lab. The fragmented DNA was then end-repaired with biotin-labeled dNTPs. The labeled fragments are circularized and fragmented again into 400 bp pieces. Fragments with the biotin labels are enriched, end-repaired, and ligated with adapters used for downstream processes. Each library was PCR enriched for 18 cycles and size-selected for 350 to 650 bp fragments. The final library consists of fragments made up of two DNA segments that were originally separated by approximately 2 kbp or approximately 5 kb. Each of the libraries was sequenced to 30× to 80× sequence coverage, as recommended by the assembler. Libraries were sequenced on one or more lanes of an Illumina HiSeq 2000 using paired-end 50-or 100-bp runs. Image processing and base calling were performed as the runs progressed with Illumina's Real Time Analysis (RTA) software. The binary base call files were streamed to a shared Linux server for further processing. The Illumina Casava pipeline (v1.8) was used to process the binary files to fastq files containing the base-called reads and per base quality scores. Only reads passing the standard Illumina quality filter were included in the output files.

Genome assembly
The ALLPATHS-LG version R41348 assembly algorithm was used for the assemblies. It consists of five major phases: (1) pre-assembly error correction, (2) merging of the overlapping fragment reads into extended reads, (3) constructing the unipath graph from the k-mers present in the reads, (4) scaffolding the unipaths with the jumping libraries, and (5) gap closing. To complete the five phases, the algorithm requires an overlapping pair fragment library and at least one jumping library, although the authors recommend at least two jumping libraries of approximately 2 kbp and approximately 5 kbp or larger. We assembled each of the genomes using approximately 50× coverage of the fragment library and approximately 30× coverage of each of the two jumping libraries using the recommended parameters, except we lowered the MIN_CONTIG size to 300 bp from the default 1,000 bp. This parameter controls the minimum contig size to be used for scaffolding, and our previous testing determined this change leads to (modestly) improved contig and scaffold statistics.
We also evaluated using SOAPdenovo2 [66] and SGA [67] for the assemblies (Table S3 in Additional file 1), using the same fragment, 2 kbp, and 5 kbp libraries but both assemblers had substantially worse contiguity statistics under a variety of parameter settings. For SOAP-denovo2, we corrected the reads using the Quake error correction algorithm [68], and then ran seven assemblies with the de Bruijn graph k-mer size set to k = 31 through k = 45 (odd values only, as required). In every attempt the scaffold N50 size was below 10 kbp compared with >200 kbp for our best ALLPATHS-LG assembly. For SGA, we evaluated four assemblies with the string graph minimum overlap length of k = 71 through k = 77 (odd values only, as required), but the scaffold N50 size was below 15 kbp in every attempt. We hypothesize that ALLPATHS-LG achieved superior results because the algorithm automatically measures many of the properties of the sequencing data, and could therefore self-adjust the various cutoffs used by the algorithm for error correction, contigging, and scaffolding.

Genome annotation
Repeat elements were masked using RepeatMasker [69] with a rice repeat library available from the Arizona Genome Institute. Protein-coding genes were annotated using MAKER-P version 2.30, installed on the Texas Advanced Computer Center Lonestar cluster and provisioned through an iPlant Collaborative allocation [52,[70][71][72]. Sequence evidence used as input for MAKER-P included Oryza expressed sequences (EST, cDNA, and mRNA) downloaded from the National Center for Biotechnology Information (NCBI), and annotated coding and protein sequences available for Nipponbare (IRGSP1.0 and MSU release 7) [51], 93-11 [28], and PA64s (Table S1 in Additional file 1). Ab initio gene predictions made using FGENESH [73] were incorporated exogenously into the MAKER-P pipeline using the pred_gff parameter. The SNAP [74] ab initio predictor was run within MAKER-P using the O.sativa.hmm parameter provided with SNAP. To annotate protein domain structure and assign GO terms we used InterProScan 5 software [75], available within the iPlant Discovery Environment [76]. Among resulting InterPro domains we curated 21 as being associated with transposon-encoded genes and screened out MAKER-P annotations with these domains (IPR000477,  IPR001207, IPR001584, IPR002559, IPR004242, IPR004252,  IPR004264, IPR004330, IPR004332, IPR005063, IPR005162,  IPR006912, IPR007321, IPR013103, IPR013242, IPR014736,  IPR015401, IPR018289, IPR026103, IPR026960, IPR027806). To identify homologies we conducted BLASTP alignment to the plants subsection of NCBI RefSeq (release 63), using an e-value threshold of 1e-10.

Whole genome comparisons
We used the MUMmer [77] whole genome alignment package and the GAGE assembly comparison scripts to compare the de novo assemblies to the reference Nipponbare and Indica genomes. Briefly, we aligned the assemblies to the genomes using nucmer using sensitive alignment settings (-c 65 -l 30 -banded -D 5). For base level accuracy evaluations, we used the GAGE assembly comparison script, which further refines the alignments by computing the best set of one-to-one alignments between the two genomes using the dynamic programming algorithm delta-filter. This algorithm weighs the length of the alignments and their percentage identity to select one-to-one non-redundant alignments. This effectively discards spurious repetitive alignments from consideration, allowing us to focus on the meaningful differences between the genomes. Finally, the evaluation algorithm uses dnadiff to scan the remaining, non-repetitive alignments to summarize the agreement between the sequences, including characterizing the nature of any non-aligning bases as substitutions, small indels, or other larger structural variations. To characterize the unaligned regions of the reference genome, we converted the whole genome alignments into BED format. For this we did not exclude repetitive alignments, so that we could focus on novel sequence instead of copy number differences. We used BEDTools [78] to intersect the unaligned segments with the reference annotation, and summarized the size distributions of the unaligned segments using AMOS [79].

K-mer analysis
To evaluate the repeat composition, we selected a random sample of 400 million unassembled reads from each of the three genomes and used Jellyfish [80] to count the number of occurrences of all length 21 k-mers in each read set. Length 21 was selected to be sufficiently long so that the expected number of occurrences of a random k-mer was below 1, but short enough to be robust to sequencing errors. The modes of the 3 k-mer frequency distributions, excluding erroneous k-mers that occurred less than 10 times, were 60× (Nipponbare), 64× (DJ123), and 73× (IR64) drawn from an approximately negative binomial distribution ( Figure S1 in Additional file 1). These values correspond to the average k-mer coverage for single copy, non-repetitive regions of the genome. See Kelly et al. [68] for a discussion of k-mer frequencies. We then used the AMOS program kmer-covplot [79] to report the kmer coverage along the two reference genomes using the three databases of read k-mer frequencies. Unlike read alignments, which may be sensitive to repeats and variations, evaluating k-mer coverage is very robust to determine repetitive content [81,82]. Single nucleotide variants are also readily apparent in these plots as abrupt gaps in coverage kilobase pairs long, while indels will be present as longer gaps in coverage [83].

Pan-genome analysis
The pan-genome analysis followed the reference-based analysis above, using nucmer to align the genomes to each other, BEDTools to find the genome-specific and shared regions of the genomes, and the jellyfish/AMOS k-mer analysis as described above to classify unique and repetitive sequences. We also used BEDTools to intersect the genome-specific/shared regions against their respective annotations to determine how the exonic bases were shared across the genomes. We summarized the genomespecific/shared exonic bases into gene counts by counting the total number of shared or specific exonic bases across all possible transcripts for a gene, and assigned the gene to the sector of the Venn diagram with the most bases associated with it. For the purposes of the Venn diagram (Figure 2A), wherever possible, the Nipponbare base or gene counts were used, followed by the values from IR64, and then followed by the DJ123 specific values, although the values were all largely consistent.

PCR and sequencing validation of specific regions
The same algorithms and parameters as the pan-genome analysis were also used to characterize the specific regions identified in the paper. PCR and/or sequencing validation were performed on genomic DNA extracted from tissue collected from independently grown plants obtained from the same seed source used for Illumina sequencing. Genomic DNA was extracted from young leaf tissue using the Qiagen Plant DNeasy Mini kit. Primers used for validation of 10 of the longest genomespecific sequences from each rice line, and of the S5 and Pup1 loci, are listed in Table S4a-c in Additional file 1. Sanger sequencing was performed at the Biotechnology Resource Center at Cornell University.

Data access
The read data, assemblies, annotations, and pan-genome alignments are posted on the CSHL website at [84]. The NCBI Sequence Read Archive (SRA) accession numbers for the short read data used in this study are listed in Table 11. Analysis software packages are available open source from the websites for ALLPATHS-LG [85], MUMmer [86], AMOS [87], Jellyfish [88], and BEDTools [89].

Additional files
Additional file 1: Contains the supplementary tables and figures described in the manuscript.
Additional file 2: Is an excel spreadsheet listing the putative strain-specific genes. Competing interests WRM has participated in Illumina sponsored meetings over the past four years and received travel reimbursement and an honorarium for presenting at these events. Illumina had no role in decisions relating to the study/work to be published, data collection and analysis of data and the decision to publish. WRM has participated in Pacific Biosciences sponsored meetings over the past three years and received travel reimbursement for presenting at these events. WRM is a founder and share holder of Orion Genomics, which focuses on plant genomics and cancer genetics.