Construction, alignment and analysis of twelve framework physical maps that represent the ten genome types of the genus Oryza

Bacterial artificial chromosome (BAC) fingerprint and end-sequenced physical maps representing the ten genome types of Oryza are presented


Background
Comparative genomics is a powerful tool for unraveling the evolutionary history and gene functionality of related species. The availability of high resolution genetic and physical maps and genome assemblies has established comparative genomics platforms in organisms ranging from bacteria and fungi to animals and plants [1][2][3][4][5][6][7]. The majority of sequence-based eukaryotic platforms have focused on comparisons between genera [8][9][10][11]. Although highly informative, the divergence times between genera are often too distant to be useful for the identification of conserved noncoding sequences such as transcription factor binding sites, enhancers and matrix attachment regions (for example, see [12,13]). Sequence com-parisons between species within a single genus have focused primarily on orthologous loci or genomic regions [14,15]. Whole genome comparative platforms within a genus are still in their infancy, with the most notable exception being the generation of whole genome shotgun assemblies of 12 Drosophila species that span an evolutionary time frame of approximately 60 million years [16].
In plants, rice (Oryza sativa) and thale cress (Arabidopsis thaliana) are important model systems for functional and evolutionary biology. Both genomes have been completely sequenced [17,18] and serve as the reference sequences for the major monocot and dicot plant lineages. Rice has added significance by virtue of being the world's most important crop, directly feeding half the human population. The population that depends on rice is expected to double in 25-50 years and there is thus an intense effort by breeders to double current rice yields using less land and water and on poorer soils [19]. To achieve this goal, the plant biology community is engaged in a concerted effort to functionally characterize all plant genes in both model plants using approaches including genetics, transgenetics, and comparative genomics. These efforts have significance not only to world food and bioenergy security issues but also to fundamental eukaryotic biology.
Comparative genomics across the cereals and within the genus Oryza will play a major role in these efforts. Oryza is composed of 24 species, including 2 domesticated (O. sativa, and O. glaberrima) and 22 wild species [20,21]. These species are classified into ten genome types (six diploids and four allotetraploids) based on crossing barriers [22,23], chromosome pairing [24,25], morphology [26,27], and molecular phylogenetics [28]. Compared to the domesticates, the wild relatives of rice are phenotypically inferior grass-like plants, but are a virtually untapped reservoir of agriculturally important genes and allelic variants that can be used to improve cultivated rice [21,29]. Therefore, comparative genomics of wild and domesticated Oryza species may not only provide useful materials for breeding, but will also shed light on the evolution and domestication of cultivated rice.
Our long-term objective is to develop a genome-level comparative experimental system for the genus Oryza (the Oryza Map Alignment Project (OMAP)). Our strategy is to develop bacterial artificial chromosome (BAC)-end sequence (BES) physical frameworks of the ten genome types of Oryza and align them to the reference sequence. These frameworks can then be populated with an array of phenotypic, genetic, biochemical, and physiological data to address fundamental questions in biology and agriculture. Previously, we described the construction and characterization of 12 BAC libraries from representative species that encompass the 10 genome types of Oryza [30]. Here we report the construction of 12 BAC/BES framework physical maps derived from these libraries and an analysis of the BESs in terms of transposable element, simple sequence repeat (SSR), microRNA (miRNA), and single nucleotide variation (SNV) content.

Results
Fingerprinting, BAC end sequencing and contig assembly to generate phase I physical maps for the ten genome types of Oryza BAC-based physical maps for the ten genome types of Oryza were generated using a two-step procedure. First, clones from 12 Oryza BAC libraries [30] were fingerprinted [31] and BAC end sequenced. Fingerprints from each library were then assembled into contigs using FPC [32]. Fingerprint, BES and FPC assembly statistics are summarized in Tables 1 and 2.  Briefly, we generated a total of 710,536 fingerprints [BBCC], which roughly correlated with the size of each genome. The 12 unedited FPC assemblies are defined as phase I physical maps ( Table 2) and are available interactively using WebFPC [33] or CMap, or as downloadable FPC files (see Materials and methods).
To determine the approximate genome coverage of each of the 12 phase I physical maps, we estimated the size of the ordered consensus bands for each map in kilobases based on the average insert size of each BAC library and the average number of fingerprint bands observed for each species. Using this estimation, the genome coverage of the phase I physical maps ranged from 136% in O. granulata [GG]  Coverage discrepancies above or below 100% could be due to a number of parameters, including: possible overlaps between the contigs that were not merged by the criteria used in this study; inaccurate estimation of genome size; and over-estimation of average insert sizes of the BAC libraries that would lead to an over-estimation of kilobases per consensus band (CB).

Alignment of phase I physical maps to the O. sativa reference sequence
To evaluate the coverage of the phase I physical maps graphically, we aligned the BESs associated with each FPC contig to the O. sativa reference sequence (RefSeq) [18] using SyMAP [34]. SyMAP displays for all 12 physical maps can be viewed at [35] (display details are in Additional data file 1) and are summarized in Table 3 and Additional data file 2. Overall, between 61% and 95% of the contigs from each of the phase I physical maps could be aligned to the O. sativa RefSeq. Contigs not assigned a chromosomal position were generally small (133-289 CB/contig) and had few clones per contig (3-38 clones/contig) (  As expected, the most co-linear map alignments were observed with species that were evolutionarily less divergent (for example, AA, BB, CC). Regions of disrupted colinearity were detected mostly in the centromeric regions of the O. sativa genome. Although SyMAP alignments were also performed for the four polyploid species, we cannot conclude anything conclusive about their genome organization until their physical maps can be manually subdivided into each of their subgenomes.

Sequence analysis of the OMAP BES data set: repeats, SSRs, miRNAs, and SNV
The large BES data set produced by this project offered a unique opportunity to explore sequence content and variation across the Oryza. We next investigated repeat and SSR content across all twelve species and identified miRNA content and SNV amongst the three AA and one BB genomes.

Repeats
RepeatMasker [36] in conjunction with a curated rice (O. sativa) repeat database was used to identify and classify previously identified repetitive elements. RECON [37] was used to identify repeats de novo. Results reported here relate to the number of bases covered by each element class, rather than the total number of members of each element class with the exception of small transposable elements (that is, miniature inverted repeat transposable elements (MITEs) and short interspersed elements (SINEs)), as BES reads are too short to span complete elements. These sequences were compiled into species-specific repeat databases.
The amount of repeats in each genome was directly proportional to the genome size except for O. granulata [GG] (Additional data file 3), suggesting that repetitive sequences play a major role in genome size expansion in the genus Oryza. O. australiensis [EE] had a high amount of repetitive DNA, consistent with the observation of a rapid burst of retrotransposons in this species [38].
We attempted to sub-classify the repetitive elements identified by their class. Only the results from the RepeatMasker search were used for classification ( Figure 2) as the de novo repeats identified by RECON were mostly truncated. Retrotransposons (both long terminal repeats (LTRs) and non-LTRs) comprised the major fraction of the total repeats in all   (Table 5). Among the 16,980 non-redundant SSRs, di-and tri-nucleotide SSRs were the most abundant, representing 48.6% and 26.7% of the total dataset, respectively. The relative frequency of di-nucleotide SSRs in all the Oryza species was 6-7% higher than in the O. sativa reference genomes (Table 5; Additional data file 5). In contrast, the relative frequency of tri-nucleotide SSRs was 4-5% lower than in the O. sativa reference genomes and was inversely related to that of the di-nucleotide repeats in each species (Additional data file 5).
The top ten predominant motifs from the five most frequent SSRs in each species comprised between 68% (O. sativa ssp. japonica) and 80% (O. granulata) of all the SSRs identified ( Figure 3). The TA, GA and CGG motifs were the most abundant and accounted for a total of 51% (30%, 15%, and 6%, respectively) of all the Oryza SSRs. Interestingly, two SSR motifs showed biased compositions in a species-specific manner. The TAA and CAA motifs were significantly higher in O. ridleyi (9.6% for TAA, 13% for CAA) and O. granulata (9% for CAA) compared to the average (4.2% and 3%, respectively) for all Oryza species (Figure 3). Further investigation using the rice repeats revealed that 137 of 235 BESs (58%) contain-    In total, 64 paralogous miRNA precursors of O. sativa belonging to 22 miRNA families were identified among the four Oryza species (Table 7). O. rufipogon showed the highest number of conserved O. sativa miRNAs while O. punctata showed the lowest (22 and 7, respectively). This is consistent with our expectations since, as the evolutionary distance among species increases, the divergence within the miRNA precursor structure also increases [44]. The miRNA families of miR166 and miR171, known to play a key role in apical meristem development, were detected in all four species; the highest number of copies of each was 9 and 7, respectively. Both of these miRNAs are known to be widely shared among plant lineages. In addition, four unknown locus miR-NAs (miR420, miRNA441, miR442, and miR446) appeared to be AA genome specific.

Variation discovery in four wild rice species and O. sativa
Candidate SNVs between wild rice BESs and the O. sativa genome were detected with a multi-tiered alignment strategy [45] using BLASTZ [46] and cross_match followed by a baseby-base analysis of the resulting high quality alignment (see Materials and methods). Putative nucleotide variations were further filtered using sequence quality scores and quality scores of neighboring regions to exclude variations that were due to possible sequencing error.
Our initial data set contained 330,993 high quality BESs (73, [48]. Therefore, the rate of variation between Oryza species is roughly an order of magnitude greater than the polymorphism rate within species, as we would expect. The variation data are available through dbSNP and Gramene [49]. An example view of the variation data in Gramene is shown in Additional data file 9.

Discussion
We describe here the generation and analysis of a comparative genomics resource for the genus Oryza that can be used as a research platform to address grand challenge questions in basic and applied biology. The primary resources developed were 1,452,912 BESs, 710,536 fingerprints, 12 phase I physical maps aligned to the rice RefSeq, and repeat, SSR, miRNA and SNV analyses. All sequences and variations have been deposited in GenBank and dbSNP, and all physical maps can be visualized using SyMAP [34] and CMAP at the OMAP [50] and Gramene [49] websites, respectively. The BAC libraries, BESs, SNaPshot fingerprints, and phase I physical maps have been publicly available since January 2006 and have been extensively accessed by the international community [31,38,[51][52][53][54][55][56].
Estimates of genome coverage of the physical maps relative to flow cytometry data were quite close except for O. granulate.
To make the Oryza physical maps more useful for future research, they must now be manually edited to merge contigs and resolve any conflicts and mis-assemblies, as was done for O. punctata [BB] [31]. Manual editing will be particularly challenging for the four allotetraploid species, where contigs need to be assigned to one of two subgenomes for each species.  [38] and Gran3 in the GG genome [51] account for 50% and 25% of the genome sizes of these species, respec- Total no. of loci 20 22 15 7 64 *Named miRNA loci ID [69] are shown by species (for example, a putative miR156d locus ID is noted using a 'd' in the O. rufipogon column). For miRNA without a locus ID (for example, miR408), an 'x' is shown. ND, not determined; TF, transcription factor.
tively. The RECON data described here for O. alta [CCDD] may likely represent yet a third retrotransposon burst accounting for genome size variation, subsequent to speciation [39].
Further analysis of the Oryza BES data identified 16,980 and 1,619,446 non-redundant SSRs and SNVs, respectively. The SSRs and SNVs can be immediately translated into polymorphic markers for breeding crosses between the two sequenced O. sativa genomes with the AA genome OMAP accessions to accelerate crop improvement. Further, the SSRs will serve as candidates to identify polymorphic genetic makers for intraspecific Oryza mapping populations currently under development (for example, BB × BB; CC × CC, and so on). The availability of molecular genetic maps tied to our physical maps for all 12 accessions will make these resources even stronger for functional and map-based cloning applications.
Paired-end data are also extremely important for studying structural variation, such as expansions, contractions, inversions and translocations [57,58]. Using our paired BES database, we plan to catalogue all such variation across the Oryza in order to address fundamental questions on the roles of such variation on the speciation process.
To our knowledge, the combined resources described herein comprise the most comprehensive within-genus comparative genomics framework available for any higher eukaryote, exceeded only by Drosophila, for which 12 draft genome sequences of wild flies are available [16]. We envisage that the OMAP system will be a model for other within-genus comparative frameworks, including large genome plant genera like Gossypium (cotton), Solanum (tomato), Zea (corn) and Phaseolus (soybean). Having physical map frameworks for all genome types of a particular genus aligned to a reference genome provides immediate access to any orthologous or unique region of a genome(s) for functional and evolutionary analysis. Such frameworks are also highly compatible with next generation sequencing technologies whereby whole genomes can be sequenced in 4-8 Mb chunks, thereby reducing assemblies to the size of a bacterial genome [59].

Conclusion
The development and analysis of genus-wide research platforms is the next frontier in comparative systems biology. The genus Oryza, which includes the world's most important cereal, rice, is composed of 24 species (2 domesticated and 22 wild) with wide geographical distribution and contains 10 distinct genome types (6 diploid (2N = 24) and 4 polyploid (2N = 48)) and a 3.6-fold genome size variation (357-1,283 Mb).
We generated BAC-based physical maps of 12  sativa with more conservation in evolutionarily closer species. In addition, we observed species-wide miRNA families (miR166 and miR171) and AA genome specific miRNAs (miR420, miRNA441, miR442, and miR446).
All biological reagents are available at the AGI BAC/EST Resource Center [60] and all fingerprints and BESs are available at [49,50] and GenBank. We envision that the resources and analysis presented here will serve as a model for the establishment of similar genus-wide frameworks for plants and animals. The genus-wide physical framework also fits in well with next generation sequencing technologies where an entire genome or targeted regions across a genus can be reduced to bacterial genome size chunks that can be easily sequenced and assembled as opposed to a whole genome sequencing approach.

BAC libraries, end sequencing, fingerprinting and FPC assembly
All BAC libraries and methods have been described previously [30,31].

FPC map alignment to the O. sativa RefSeq
SyMAP [34] was used for alignment and display of FPC contigs from each of the 12 phase I physical maps to the O. sativa RefSeq (IRGSP V.4). Briefly, the 12 BES data sets were repeat-masked [36] and then searched for sequence similarity to the rice genome using BLAT [61]. To capture weaker sequence similarities from the more divergent species, BLAT parameters were adjusted as follows (-minIdentity = 70 -tile-Size = 10 -minScore = 30 -qMask = lower -maxIntron = 10,000). BLAT results were filtered, to reduce false-positives, by retaining the top two hits for each query, as measured by match length, and subject to the additional criteria that no retained hit had a match length within 25% of a discarded hit. Approximate linear chains were computed from the retained hits using dynamic programming and then merged to form synteny blocks that were displayed by SyMAP. After the Oryza BESs were positioned on the O. sativa RefSeq, contigs containing the aligned BESs were anchored to the chromosomes and then renumbered in FPC. Contigs that aligned to less than 200 bp of O. sativa RefSeq were ignored in the alignment analysis (Table 3). All 12 SyMAP alignment results are available at [35].

Synteny analysis pipeline and incorporation of data in the CMap, Genome Browser and SyntenyView displays in Gramene
We constructed a pipeline to align clones and contigs from the OMAP species to O. sativa by combining data from the BES alignments to O. sativa and the phase I physical maps. The pipeline consisted of 6 steps: 1) upload phase I physical maps and BES data, 2) align BES to the RefSeq, 3) determine the best alignments for each clone, 4) assemble the clone positions to determine the region where the contig is found to align, 5) create blocks of synteny between selected OMAP species and O. sativa, 6) utilize the data to create visualizations in CMap, the Genome Browser, and SyntenyView in Gramene. Details on each step can be found at [62] Screen shots of the data in Gramene are shown in Additional data file 9.

Repeat analysis
RepeatMasker (V3.1.5), loaded with a custom Oryza repeat database, was used to identify repeats from the Oryza BES data set and the O. sativa RefSeq. The custom database was composed of annotated repeats from: TIGR rice repeat database [63]; Dr Susan Wessler (University of Georgia, USA); and Dr Tom Bureau (McGill University, Canada).
RECON [37] was used to identify de novo repeats from the Oryza BES data set. To increase the speed and efficiency of the program, the BLAST output was parsed to discard self hits as well as hits with an e-value greater than 1e-5. The RECON output, which identified repetitive elements and classified them into distinct families, was parsed for sequences greater than 40 bp in length that were found at least 5 times/family. Overlap between the de novo and the custom library was determined using RepeatMasker. Sequences left unmasked by this process and, thus, were not a part of our custom repeat database, were extracted, assembled using PHRAP [64] and annotated using BLASTN at an e-value = 1e-4 against the NCBI non-redundant nucleotide database [65] and a dataset of 2,050 full length LTR-retrotransposons identified from the whole genome sequence of Nipponbare through LTR_STRUC [66]. Finally, these sequences were compiled into repeat databases specific to each species. The procedure is outlined in Additional data file 10.  [71], against BES of four closely related wild rice species (AA and BB genomes). Alignments were filtered and multiple hits to the same region of a BES were discarded, keeping the hit with the lowest e-value. The best orthologous match typically had more conservation in the region surrounding the precursor and, therefore, had a much lower e-value. If two miRNAs aligned to different regions of the same BES, they were not discarded, since they were likely to be tandemly duplicated miRNA genes. To obtain the exact coordinates and secondary structure of the miRNA precursor within the remaining BES, a pattern matching approach was used to align the mature miRNA precursor to the BES. Once the mature miRNA was mapped to the BES, the extended sequence was then submitted to MFold to determine if it could form a stable hairpin structure [72]. MFold alignments were checked for the following criteria: had conserved orientation; contained fewer than eight mismatches; was not overlapping the loop region; and lacked some types of asymmetric bulges. A precursor was discarded if the bulge had four or more nucleotides that failed to match with any nucleotide from the opposite strand. A precursor was retained if a bulge of four or fewer nucleotides on one stem corresponded, although not necessarily matching, with one or more nucleotides on the opposite strand. If the sequence that aligned to the mature miRNA met these criteria, it was categorized as the opposing stem (miRNA*). Both stems and the intervening loop region were extracted and resubmitted as input to MFold. If this refined sequence formed a stable hairpin structure under the four conditions, it was categorized as a paralogous miRNA precursor.  [46] using the S1-S2 scoring method to select only the best alignment for each read. Each chromosome was then divided into 5 Mb segments and BESs aligned to each region from the previous step were re-aligned to the segment using cross_match [64] (-bandwidth 100 -alignments -discrep_lists) with variation filtering using the neighborhood quality standard (NQS) [73]. Settings for NQS were such that candidate variation in the read had a Phred quality value (Q) of at least 23, its neighboring 7 bases on either side of the candidate variation all had Phred quality values of ≥ 15 and at least 11 of the 14 neighbors matched. If a read aligned to more than one place in the genome, then only the longest alignment with the fewest SNVs was reported. To decrease the number of false positive insertions or deletions, variations were filtered out if they had a single base variation and the quality score was less than 40 or if they were multiple base variations and the quality score was less than 23. Since two assemblies of the O. sativa RefSeq were available (IRGSP V.4 and TIGR V.4), variations were detected using both assemblies (Additional data file 8a-d). All data have been deposited into dbSNP and can be viewed at Gramene [49].

Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a figure describing the SyMAP display details. Additional data file 2 is a