A physical map for the Amborella trichopoda genome sheds light on the evolution of angiosperm genome structure

Background: Recent phylogenetic analyses have identified Amborella trichopoda, an understory tree species endemic to the forests of New Caledonia, as sister to a clade including all other known flowering plant species. The Amborella genome is a unique reference for understanding the evolution of angiosperm genomes because it can serve as an outgroup to root comparative analyses. A physical map, BAC end sequences and sample shotgun sequences provide a first view of the 870 Mbp Amborella genome.
Results: Analysis of Amborella BAC ends sequenced from each contig suggests that the density of long terminal repeat retrotransposons is negatively correlated with that of protein coding genes. Syntenic, presumably ancestral, gene blocks were identified in comparisons of the Amborella BAC contigs and the sequenced Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera and Oryza sativa genomes. Parsimony mapping of the loss of synteny corroborates previous analyses suggesting that the rate of structural change has been more rapid on lineages leading to Arabidopsis and Oryza compared with lineages leading to Populus and Vitis. The gamma paleohexaploidy event identified in the Arabidopsis, Populus and Vitis genomes is shown to have occurred after the divergence of all other known angiosperms from the lineage leading to Amborella.
Conclusions: When placed in the context of a physical map, BAC end sequences representing just 5.4% of the Amborella genome have facilitated reconstruction of gene blocks that existed in the last common ancestor of all flowering plants. The Amborella genome is an invaluable reference for inferences concerning the ancestral angiosperm and subsequent genome evolution.


Background
The origin and rapid diversification of the angiosperms (flowering plants) were pivotal events in the evolutionary history of Earth's biota. Over the past 130 to 150 million years angiosperms have diversified to include approximately 350,000 species occupying nearly all habitable terrestrial and many aquatic environments. Angiosperms generate the vast majority of human food either directly or indirectly as animal feed, and they account for a huge proportion of land-based photosynthesis and carbon sequestration. Comparative analyses of genome sequences and gene function for a growing number of species are shedding light on how gene and genome duplications have contributed to the diversification within major flowering plant lineages (for example, Rosidae, Asteridae, Monocotyledoneae [1]), but elucidation of the genetic and genomic processes underlying the key innovations associated with the origin of flowering plants (for example, typically bisexual flowers, endosperm formation, double fertilization, ovules with two integuments, seed development within the carpel) requires comparisons between lineages that diverged from the last common ancestor of all extant angiosperms [2,3].
Recent phylogenetic analyses have identified Amborella trichopoda, an understory tree or shrub species endemic to the forests of New Caledonia, as the sister species to all other extant angiosperms [4][5][6][7][8]. Amborella is no more 'ancient' or 'primitive' than any other extant flowering plant species, but comparisons between Amborella and other angiosperms are allowing researchers to triangulate on characteristics of their last common ancestor. Using a similar approach, researchers have used the complete genome sequence of platypus, Ornithorhynchus anatinus, representing the sister group of all other extant mammals, to elucidate mammalian genome evolution [9].
Previous comparisons of transcriptome content [10], gene expression patterns [11][12][13], and gene function [14,15] between Amborella and other flowering plant species have suggested that much of the floral development program that has been characterized in Arabidopsis, snapdragon and maize existed in the last common ancestor of extant angiosperms. While gene duplications in the MADS-box transcription factor family likely contributed to the earliest floral development regulatory networks [11,12,16-19], it is not clear whether these were single gene duplications or the product of polyploidization. Genome duplications have occurred repeatedly throughout angiosperm history [20][21][22][23] but there is uncertainty in the timing of polyploidy events relative to the origin of the angiosperms and important innovations in flowering plant history [24].
Here we describe a BAC-based draft physical map for A. trichopoda and use BAC end sequences (BESs) to compare the structure of the Amborella genome to representative eudicot (Vitis, Populus and Arabidopsis) and grass (Oryza) genomes. Comparative analyses of sequences for two large contiguous regions (487.3 and 629.7 kb in the Amborella genome) were also performed. In addition we use a large transcriptome assembly to identify BAC ends matching protein-coding sequences [25]. Our aim here is to begin to investigate whether regions of these genomes have remained syntenic throughout angiosperm history, and determine whether ancient genome duplications discovered in eudicot and grass genomes [26][27][28][29] occurred before or after the divergence of these lineages from the Amborella lineage. In addition, the physical map and sequence analyses establish a framework for future studies of all flowering plant genomes, including the Amborella genome itself.

BAC library and physical map
The structure and composition of the 870 Mbp/1C [30] A. trichopoda genome was investigated through physical mapping of clones from a 5.2 × coverage BAC library. The library was constructed after partial digest of high-molecular-weight DNA with HindIII. The library, which comprises 36,684 BAC clones with an estimated average insert size of 123 kb, is available through the Arizona Genomics Institute [31]. The BAC library was double spotted in high density onto Hybond N+ filters. All 36,684 clones were end-sequenced, and a physical map was constructed after high information content fingerprinting (HICF) [32,33]. A total of 32,719 fingerprinted BACs were assembled into 3,106 contigs and 1,356 singletons using the program FPC version 7.2 [34].
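The stated library depth can be checked with a short calculation. The sketch below is illustrative only: the function name is an assumption, and the inputs are the clone count, average insert size and genome size reported above.

```python
def library_coverage(n_clones: int, mean_insert_kb: float, genome_mbp: float) -> float:
    """Return the number of genome equivalents represented by a clone library."""
    total_insert_mbp = n_clones * mean_insert_kb / 1000.0  # kb -> Mbp
    return total_insert_mbp / genome_mbp


# Values reported in the text: 36,684 clones, 123 kb inserts, 870 Mbp genome.
coverage = library_coverage(36_684, 123.0, 870.0)
print(f"{coverage:.1f}x")  # prints "5.2x"
```

The result agrees with the 5.2 × coverage quoted for the library.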
The quality of the physical map was assessed by screening the arrayed library with probes developed for Amborella homologs of eight genes that have been found to be single-copy in sequenced plant genomes [35,36]. Probes derived from Amborella cDNA clones or PCR amplicons were putative homologs of the following single-copy Arabidopsis genes: ASD (At1g14810), DWARF1 (At3g19820), GIGANTEA (At1g22770), LEAFY (At5g61850), a dienelactone hydrolase gene (At2g32520), a cytochrome-c oxidase-related gene (At4g37830), EIF3K (At4g33250) and a hypothetical protein-coding gene with strong similarity to rice gene Os02g0593400 (At5g63135). All verified positive clones mapped to the same FPC contig for six of the eight probes (Figure S1 in Additional file 1). Positive clones for the EIF3K and the hypothetical protein-coding gene probes were each distributed between two FPC contigs and inspection of the HICF bands for these contigs suggests that the genes have been duplicated in the Amborella lineage. In accordance with the expected library coverage, the single-copy nuclear gene probes hybridized to 3 to 13 clones (mean 6.9).
The correlation between HICF bands and the number of BACs included in each FPC contig was 0.655 for all contigs and 0.917 after removing two contigs derived from the chloroplast and mitochondrial genomes and one contig composed largely of repetitive elements (Figure S2 in Additional file 1). We used a calibration of average insert size (123 kb) over the average number of HICF bands per BAC clone (128) to obtain a rough estimate of FPC contig lengths. Of 77 FPC contigs with 39 or more BACs (not including the contigs with the plastome and repetitive elements), estimated lengths ranged from 308 to 1,429 kb.
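The band-based length calibration described above amounts to a single conversion factor. The following is a minimal sketch, assuming that a contig's length is estimated from its total count of distinct HICF bands; the function name and the example band count are hypothetical.

```python
MEAN_INSERT_KB = 123.0      # average insert size reported for the library
MEAN_BANDS_PER_BAC = 128.0  # average number of HICF bands per BAC clone


def contig_length_kb(n_bands: int) -> float:
    """Rough contig length from the number of HICF bands in the contig."""
    kb_per_band = MEAN_INSERT_KB / MEAN_BANDS_PER_BAC  # about 0.96 kb per band
    return n_bands * kb_per_band


# Hypothetical contig with 1,000 distinct bands.
print(round(contig_length_kb(1000)))  # prints 961
```

Under this calibration each band corresponds to slightly less than 1 kb, so the reported 308 to 1,429 kb range would correspond to contigs with roughly 320 to 1,490 bands.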
BAC end sequencing was performed on all fingerprinted BACs producing 69,466 Sanger reads with an average length of 695 bp after quality and vector trimming. This corresponds to 48.25 Mbp, or roughly 5.4% of the Amborella genome. BESs were related to the physical map and used to identify regions of synteny between regions of the Amborella genome and the sequenced Arabidopsis, Populus, Vitis (grape), and Oryza (rice) genomes (see below). In addition, end sequences were used to verify the identity of the three excluded FPC contigs described above. All BESs mapping at least 100 bp apart on the plastid genome [37] were found in the same FPC contig. This contig included just 532 BACs, indicating very low (1.6%) plastid DNA contamination.

Characterization of repeats in BAC end and shotgun sequences
Repeat composition and frequency in the Amborella genome were characterized through analysis of the BAC end and whole genome survey sequences. Reads were first compared with sequences in Repbase (v.15.08) [38] using BLASTN [39]. In order to minimize the effect of divergence between Amborella genes and homologous repeats from other species, we used relaxed BLASTN settings (-q -4 -r 5) to accommodate an estimated 160 million years of sequence divergence since the last common ancestor of extant flowering plants [8,40-42] while maintaining rigorous support for significant hits (E-value threshold was set at 1e-10). All BAC end sequences without significant hits were then compared with the non-redundant protein database in GenBank using BLASTX and an E-value threshold of 1e-5. Finally, the remaining sequences without matches in Repbase or the GenBank nr database were compared with sequences that did have matches in either database using BLASTN with an E-value threshold of 1e-10. We report results both excluding these 'internal' BLAST searches and including them (I). Together these results provide estimates of transposable element (TE) content based on conservative and more comprehensive (and possibly more permissive; I) search strategies.
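The order of the three searches can be summarized in a few lines of code. This is a sketch under stated assumptions, not the pipeline used in the study: the function name, the label strings and the input layout (sets of read identifiers that passed each BLAST threshold) are all hypothetical.

```python
from typing import Dict, Iterable, Set


def classify_reads(read_ids: Iterable[str],
                   repbase_hits: Set[str],
                   nr_hits: Set[str],
                   internal_hits: Dict[str, str]) -> Dict[str, str]:
    """Label reads following the Repbase -> nr -> internal search order.

    repbase_hits:  reads with a significant BLASTN hit to Repbase
    nr_hits:       reads with a significant BLASTX hit to the nr database
    internal_hits: read -> best-matching read from the BLASTN 'internal' search
    """
    read_ids = list(read_ids)
    labels = {}
    for rid in read_ids:
        if rid in repbase_hits:
            labels[rid] = "repbase"
        elif rid in nr_hits:
            labels[rid] = "nr"
        else:
            labels[rid] = "unclassified"
    # 'Internal' step (I): an unmatched read inherits the label of a read
    # that was matched directly in one of the two databases.
    for rid in read_ids:
        if labels[rid] == "unclassified":
            target = internal_hits.get(rid)
            if target is not None and labels.get(target) in ("repbase", "nr"):
                labels[rid] = "internal:" + labels[target]
    return labels


# Hypothetical example with four reads.
labels = classify_reads(
    ["r1", "r2", "r3", "r4"],
    repbase_hits={"r1"},
    nr_hits={"r2"},
    internal_hits={"r3": "r1", "r4": "r3"},
)
print(labels)
```

Reads labeled through the internal step are kept distinct so that totals can be reported both with and without them, mirroring the conservative and comprehensive (I) figures in the text.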
With the more comprehensive strategy (I), slightly more than half of all the Amborella BESs matched known TE sequences. Not surprisingly, the most highly represented TE class was long terminal repeat (LTR) retrotransposons, accounting for 7.65% (I: 30.01%) of all BESs and 57.5% (I: 56.58%) of all those with hits to Repbase. Hits to Ty1-copia-type sequences were slightly more common than matches to Ty3-gypsy-like LTRs under the comprehensive strategy (13.79% versus 12.09%), although not under the conservative one (3.11% versus 3.50%); the remaining LTR retrotransposon matches (1.04%; I: 4.13%) were not classified. LINEs also represented a significant fraction of Amborella BAC ends: 2.70% (I: 11.60%) of the total, 19.98% of all the repeats (I: 22.22%). This is noteworthy because LINEs are usually significantly less numerous than LTR retrotransposons in plant genomes [43][44][45][46][47] with some notable exceptions, such as the element del2 in Lilium speciosum [48]. The complete set of DNA TE-related BESs accounts for just 1.63% (I: 4.51%) of the total, and the most represented classes are those of hAT and MuDR elements: 0.92% (I: 2.41%) and 0.49% (I: 1.04%) of the total BESs, respectively. Results from the same analyses replicated on the set of 2,695 random sheared Sanger sequences (Table 1) and 648,519 454 reads (Table S1 in Additional file 1) are generally in very good agreement with those obtained using BES data.
A de novo search for novel miniature inverted repeat transposable elements (MITEs) overlooked by the similarity search approach was carried out using the pipeline MUST [49]. The most abundant candidates identified by the pipeline were manually inspected to confirm features typical of MITEs, such as small size, terminal inverted repeats, high A+T nucleotide content and target site duplications. Three putative high-copy MITEs were identified. All of these were small elements (174 to 500 bp) with terminal inverted repeats, target site duplications, and A+T content greater than 65% (Figure S3 in Additional file 1). Repeat copy numbers estimated from the BESs and random sheared sequences were extrapolated to obtain genome-wide estimates using the procedure developed by Hawkins et al. [50]. Copy number ranges from 3,300 copies for MITE_2 to 17,000 copies for MITE_1. The estimates inferred from BESs were generally consistent with those calculated for random sheared reads (with the possible exception of MITE_3; Table 2).
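The idea behind extrapolating from a sequence sample to the whole genome can be illustrated with a simple proportional estimate. This sketch is an assumption about the general approach and is not claimed to reproduce the procedure of Hawkins et al. [50]; the function name and all input values in the example are hypothetical.

```python
def extrapolate_copy_number(matched_bp: float, sampled_bp: float,
                            genome_bp: float, element_bp: float) -> float:
    """Proportional genome-wide copy number estimate for one repeat family.

    matched_bp: bases in the sample that align to the element
    sampled_bp: total bases in the sample (for example, all BESs)
    genome_bp:  genome size in bp
    element_bp: length of one full copy of the element
    """
    genome_fraction = matched_bp / sampled_bp   # share of the sample occupied
    return genome_fraction * genome_bp / element_bp


# Hypothetical numbers: 1 kb matched in a 1 Mbp sample, 100 Mbp genome,
# 500 bp element.
estimate = extrapolate_copy_number(1_000, 1_000_000, 100_000_000, 500)
print(round(estimate))  # prints 200
```

Because the estimate scales linearly with the matched fraction, independent samples such as BESs and random sheared reads should give similar values when the element is evenly represented in both.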
The conserved reverse transcriptase domains of LTR retrotransposons and LINEs were collected and used to estimate maximum likelihood trees (Figure 1). In the case of LTR retroelements, the trees indicate substitution rate heterogeneity (that is, variation in root-to-tip distances) and no evidence for recent retrotranspositional bursts of single families (that is, short terminal branches). In the case of LINEs, the phylogenetic tree displays very long branches suggestive of an ancient diversification or very rapid substitution rates. As has been described for other plants [51], Amborella LINEs exhibit high sequence divergence and extreme heterogeneity.
The Amborella BESs were also searched for microsatellites (that is, simple sequence repeats (SSRs)); for comparison, the search was also conducted on the Amborella random sheared reads and on BESs (from other HindIII BAC libraries) from Glycine (soybean) and Oryza rufipogon. In comparison to the other two species, Amborella shows a higher frequency of SSRs, particularly mono- and dinucleotide repeats, with an especially high frequency of 'AG' dinucleotide microsatellites. The results of SSR analysis in BESs were confirmed by those obtained from the randomly sheared Amborella sequences (Table 3).
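A microsatellite scan of the kind described above can be sketched with regular expressions. This is not the Sputnik algorithm used in the study: it finds only perfect tandem repeats, and the function name, the maximum unit length and the minimum repeat length are assumptions chosen for illustration.

```python
import re


def find_ssrs(seq: str, max_unit: int = 3, min_len: int = 12):
    """Return (start, unit, length) for perfect tandem repeats in seq."""
    seq = seq.upper()
    hits = []
    for unit_len in range(1, max_unit + 1):
        pattern = re.compile(r"([ACGT]{%d})\1+" % unit_len)
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are repeats of a shorter motif (for example
            # "AA"), which are already reported at the shorter unit length.
            if any(unit == unit[:d] * (unit_len // d)
                   for d in range(1, unit_len) if unit_len % d == 0):
                continue
            length = len(match.group(0))
            if length >= min_len:
                hits.append((match.start(), unit, length))
    return hits


# Toy sequence with an (AG)8 dinucleotide repeat.
ssrs = find_ssrs("TTT" + "AG" * 8 + "CCC")
print(ssrs)  # prints [(3, 'AG', 16)]
```

Counting the reported units per megabase of sequence would give frequencies comparable across species, which is the kind of comparison summarized in Table 3.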
Repeat profiles in the shotgun sequences were also assessed using Tallymer to characterize K-mer frequencies [52]. The Amborella K-mer frequency profiles were compared with those of Arabidopsis thaliana, Oryza sativa (rice), Sorghum bicolor and Zea mays (maize). While the Amborella genome size is closest to Sorghum's (870 and 740 Mbp/1C, respectively), its K-mer frequency profiles were more similar to those of Arabidopsis and rice, with much smaller genome sizes (157 and 490 Mbp/1C, respectively [53]) (Figure 2).
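A K-mer frequency profile records how many distinct K-mers occur once, twice, and so on in a read set. The sketch below uses a plain in-memory counter to show the idea; it is not how Tallymer is implemented, and the function names and the toy read are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k: int) -> Counter:
    """Count every K-mer of length k that contains only A, C, G or T."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def frequency_profile(counts: Counter) -> Counter:
    """Map an occurrence count to the number of distinct K-mers seen that often."""
    return Counter(counts.values())


# Toy example: one read, K = 4.
counts = kmer_counts(["ACGTACGT"], 4)
profile = frequency_profile(counts)
print(dict(profile))
```

A genome rich in recently amplified repeats shifts the profile toward high occurrence counts, which is why profiles can differ between genomes of similar size.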

Distribution of BESs with matches to protein-coding regions of reference genomes
All BESs and shotgun sequences were compared to the GenBank nr database using BLASTX [39] with an e-value threshold of 1e-5. After the removal of sequences similar to TEs, the overall frequencies of sequences finding matches in the protein database were 11.9% and 8.05% for the BES and Sanger shotgun sequences, respectively. For BESs from FPC contigs with ten or more BACs, we found a negative correlation between the frequencies of BESs matching protein-coding genes and LTR retrotransposons (r = -0.423, P < 0.0001). As has been described for other genomes [54][55][56], gene density seems to be negatively correlated with retrotransposon density in the Amborella genome.
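The per-contig correlation reported above is a Pearson coefficient computed over pairs of frequencies. The sketch below shows the calculation in plain Python; the function name is an assumption and the per-contig values are invented for illustration, not taken from the study.

```python
from math import sqrt


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical per-contig fractions of BESs matching genes and LTR elements.
gene_fraction = [0.20, 0.15, 0.10, 0.05]
ltr_fraction = [0.05, 0.12, 0.20, 0.35]
r = pearson_r(gene_fraction, ltr_fraction)
print(f"r = {r:.3f}")
```

A negative value indicates that contigs with more gene-matching BESs tend to have fewer LTR-matching BESs, which is the pattern described in the text.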

Identification of syntenic blocks between Amborella, Arabidopsis, rice, poplar and grape
Taking advantage of the availability of a phase I physical map assembly, we mapped the Amborella contigs onto the genomes of A. thaliana, Populus trichocarpa, Vitis vinifera, and O. sativa. We focused on the 77 largest contigs with at least 39 clones. BLAST analyses of BESs were done within the context of their linkages within FPC contigs. All of the contig BESs classified as repeats (see above) were discarded. Those remaining were compared against the four reference genomes. Because of the large evolutionary time that separates Amborella from the other four sequenced genomes [41,42,57], the comparisons were carried out at the protein level using tBLASTX; only the best hits were taken into account. Amborella FPC contigs were considered for further analyses if at least two BESs had matches with bit scores greater than 80 (typically a maximum e-value of 1.0E-20 over 100 amino acid residues) to loci separated by less than 500 kb within one of the four genomes being compared. Positive matches were used as anchors to circumscribe 4-Mbp tracts within the reference genomes and a second, more focused tBLASTX search was performed comparing the BESs with these regions. An e-value threshold of 1.0E-4 was used for the second set of tBLASTX searches and all significant hits were used to identify syntenic regions. We considered a contig as anchored if the contig had at least four positive hits (e-value lower than 1.0e-4) to at least three distinct genes. Non-repetitive BESs were also compared to a database of 246,196 Amborella cDNA unigene assemblies with lengths greater than 100 bp. These cDNAs were derived from comprehensive sequencing of nine cDNA libraries (Table 4) [25]. Sixty-six percent of the non-repetitive BESs matched cDNA sequences in BLASTN searches with an e-value cutoff of 1.0e-10.
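The two decision rules used for anchoring (a first-pass candidate test, then the anchoring test) can be written as small predicates. This is a minimal sketch, assuming one record per tBLASTX best hit; the `Hit` fields and function names are hypothetical, and the thresholds are the ones stated above.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List


@dataclass(frozen=True)
class Hit:
    bes_id: str       # BAC end sequence identifier
    chrom: str        # reference chromosome or scaffold
    pos: int          # position of the hit in the reference (bp)
    bit_score: float
    evalue: float
    gene_id: str      # reference gene overlapping the hit


def is_candidate(hits: List[Hit], min_bits: float = 80.0,
                 max_distance: int = 500_000) -> bool:
    """First pass: two different BESs with strong hits < 500 kb apart."""
    strong = [h for h in hits if h.bit_score > min_bits]
    return any(a.bes_id != b.bes_id and a.chrom == b.chrom
               and abs(a.pos - b.pos) < max_distance
               for a, b in combinations(strong, 2))


def is_anchored(hits: List[Hit], max_evalue: float = 1e-4,
                min_hits: int = 4, min_genes: int = 3) -> bool:
    """Second pass: at least four significant hits to at least three genes."""
    good = [h for h in hits if h.evalue < max_evalue]
    return len(good) >= min_hits and len({h.gene_id for h in good}) >= min_genes


# Hypothetical hits for one contig against one reference genome.
example = [
    Hit("bes1", "chr1", 100_000, 95.0, 1e-22, "g1"),
    Hit("bes2", "chr1", 350_000, 88.0, 1e-20, "g2"),
    Hit("bes3", "chr1", 400_000, 60.0, 1e-8, "g3"),
    Hit("bes4", "chr1", 420_000, 55.0, 1e-6, "g3"),
]
print(is_candidate(example), is_anchored(example))  # prints True True
```

Requiring several distinct genes in the second pass guards against a contig being anchored by multiple BESs that all hit the same multi-copy gene.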
Using the search strategy described above, 29 large Amborella BAC contigs (≥39 BAC clones) showed synteny with at least one of the four sequenced genomes, and nine of these showed synteny with at least one region in all four genomes. All BESs mapping to these syntenic regions also exhibited significant matches to the sequences in the Amborella cDNA assembly (Table 4; Table S2 in Additional file 1). Whereas 25 of these Amborella BAC contigs mapped to at least one tract in the Vitis genome, 15, 16, and 24 contigs were found to be syntenic with one or more tracts in the Oryza, Arabidopsis, and Populus genomes, respectively (Table S2 in Additional file 1). These results provide a novel, albeit coarse, first view of the ancestral genome for all flowering plants and the timing of rearrangements and other structural changes (for example, genome duplications, fractionation, chromosomal fissions and fusions) that have reduced synteny between the monocot and eudicot genomes analyzed here (Figure 3). Parsimony mapping of synteny loss onto a phylogeny consisting of Amborella and the other four species indicates variation in rates of change in genome structure. In agreement with previous studies [29,45], Vitis seems to have been the most stable of the sequenced genomes, and the rate of change slowed in the lineage leading to Populus following divergence from the lineage leading to Arabidopsis (Figure 3).

Paleopolyploidy in angiosperm genomes
Paleopolyploidy events have been well characterized in all four sequenced genomes analyzed here [29,45,58-60], and the syntenic Amborella FPC contigs described above often match multiple regions in these genomes. The most ancient of these paleopolyploidy events is the so-called γ triplication that has been inferred to have occurred before the divergence of the Asteridae (represented by tomato, Solanum lycopersicum) and the Rosidae, including Vitis, Populus and Arabidopsis [29]. Given the very incomplete view of the Amborella genome that is available in the BES data, we are not able to assess synteny between Amborella FPC contigs. Nevertheless, comparisons between the Amborella contigs and sets of syntenic blocks in the Vitis genome indicate that the γ triplication most likely occurred sometime after the divergence of all other angiosperms from the lineage leading to Amborella. All BESs were compared to all annotated protein-coding genes in the Vitis genome placed within the context of the pre-triplication ancestral gene blocks and post-triplication syntenic segments identified by Tang et al. [29]. A total of 328 Amborella FPC contigs had between two and eight genes with significant best BLASTX matches (e-values ≤1.0E-6) to Vitis genes corresponding to pre-triplication gene blocks in the ancestral genome. In most of these cases (199 of 328; Additional file 2), best hits were distributed between two or three homeologous (that is, post-triplication) syntenic Vitis genome segments. Of the remaining 129 Amborella FPC contigs with BESs showing significant BLASTX hits to a single Vitis subgenome (that is, single copy of a triplicated ancestral block), most (113) included just 2 genes mapping to the ancestral Vitis gene blocks (14 including 3 genes, and 2 including 4 genes) (Additional file 2). All 21 FPC contigs with best BLASTX matches to five or more genes within the ancestral Vitis blocks were distributed among two or three post-triplication subgenomes. Complete sequences for the Amborella BAC contigs may reveal more even distribution of segments among Vitis subgenomes, but the results described here suggest that triplication, fractionation and divergence of homeologous segments in the Vitis genome postdate the divergence between lineages leading to Vitis and Amborella (that is, the last common ancestor of all extant angiosperms).
Assemblies and raw data can be downloaded from the Ancestral Angiosperm Genome Project website [25]. A BLAST portal for the assembly is also available at the project website.
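The tally of contigs by the number of Vitis subgenomes they hit can be expressed as a simple grouping step. The sketch below assumes one record per best BLASTX hit, given as a (contig, Vitis gene, subgenome) tuple; the layout, the function names and the example records are hypothetical.

```python
from collections import Counter, defaultdict


def subgenomes_per_contig(best_hits):
    """Return {contig: number of distinct post-triplication subgenomes hit}."""
    seen = defaultdict(set)
    for contig, _gene, subgenome in best_hits:
        seen[contig].add(subgenome)
    return {contig: len(subs) for contig, subs in seen.items()}


def tally(per_contig):
    """Count contigs hitting a single subgenome versus two or three."""
    return Counter("single" if n == 1 else "multiple" for n in per_contig.values())


# Hypothetical best hits for three contigs.
hits = [
    ("ctg1", "vv_g1", "A"), ("ctg1", "vv_g2", "B"),
    ("ctg2", "vv_g3", "A"), ("ctg2", "vv_g4", "A"),
    ("ctg3", "vv_g5", "A"), ("ctg3", "vv_g6", "B"), ("ctg3", "vv_g7", "C"),
]
result = tally(subgenomes_per_contig(hits))
print(dict(result))
```

In the study's terms, the "multiple" class corresponds to the 199 contigs whose best hits fall on two or three homeologous segments and the "single" class to the remaining 129, which together make up the 328 contigs analyzed.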

Analysis of complete sequences for two Amborella BAC contigs
Two of the larger (approximately 500 kb) BAC contigs (IDs 431 and 1003) mapping to multiple segments in all four sequenced reference genomes were identified for further investigation. A minimum tiling path was constructed for each contig, and fluorescence in situ hybridizations were performed to verify that the BACs mapped to a single contiguous region in the Amborella genome (Figure 4). Each BAC in the tiling paths was subcloned and sequenced to 8 × coverage on an ABI 3730xl sequencer. Gaps were closed for each scaffold, and contiguous 487,318 and 629,678 bp phase II sequences were assembled for contigs 431 and 1003, respectively.
The DAWGPAWS suite of scripts was used to organize ab initio gene predictions, BLAST results and the output of repeat identification tools [61,62]. Ab initio gene predictions were generated using FGENESH [63], AUGUSTUS [64], SNAP [65], GeneID [66] and GenScan [67]. In addition, Amborella EST sequences produced by the 454 Titanium platform (2,943,273 reads; total read size of approximately 776 Mbp; average read length of 263.60 bp) and Sanger sequencing (38,147 reads; total read size of approximately 21.3 Mbp; average read length of 559.57 bp) were splice-aligned to the contigs using GMAP (Genomic Mapping and Alignment Program) [68] with the PASA (Program to Assemble Spliced Alignments) genome annotation tool [69]. All predictions were manually compared with BLASTX results against gene annotations from Arabidopsis [70], Vitis [45], Z. mays [56], Medicago [71], Oryza [72,73], and Sorghum [55] as well as tBLASTX results against the Amborella transcript assemblies. GBrowse views of gene annotations and BLAST results for each contig are available at the Ancestral Angiosperm Genome Project website [25].
Rigorous assessments of synteny between these Amborella contigs and the aforementioned four angiosperm genomes were performed using LASTZ [74,75]. Dotplots comparing the Amborella contigs and the Vitis genome show that contigs are syntenic with previously triplicated blocks [29]. Regions of contig 1003 match genes on syntenic segments of chromosomes 1, 14 and 17 in the Vitis genome (Figure 5) and contig 431 mapped to syntenic portions of Vitis chromosomes 6, 8 and 13 (Figure 6). These findings support the conclusion from the BES analyses suggesting that the γ triplication occurred after the first branching event in the phylogeny of extant angiosperms.
At least two genome duplications (ρ and σ) have been inferred to have occurred within the monocot lineage leading to rice since divergence of monocots and eudicots [28]. These duplications were evident in comparisons with both Amborella contigs. Regions of contig 1003 were found to be syntenic with portions of rice chromosomes 2 and 4 derived from the ρ duplication and a portion of chromosome 10 (Figure 5) that is related to these two regions through the earlier σ duplication [28]. The LASTZ analysis of contig 431 revealed synteny with seven regions in the rice genome (Figure 6) and one of the 'putative ancestral regions' (PAR 17) characterized by Tang et al. [28]. These PARs were defined as regions of synteny between the rice and Vitis genomes. Phylogenetic analyses of genes in Amborella contig 431 and syntenic regions of the rice and Vitis genomes may elucidate the timing of the γ triplication and genome duplications.

Phylogenetic analyses of gene families represented in sequenced Amborella contigs
While the fractionation process has resulted in the loss of most duplicated genes following the ancient polyploidy events evident in the syntenic Vitis and rice segments shown in Figures 5 and 6, duplicate Vitis genes have been retained for homologs of three Amborella genes located on contig 431 (Figure 6a). These genes were used to search the PlantTribes gene family database [35]. The three gene sets identified in the synteny analysis correspond to three gene families (auxin-independent growth promoter, ceramidase and plant uncoupling mitochondrial protein) circumscribed through OrthoMCL clustering [76] of gene annotations from the available Arabidopsis, Carica (papaya), Populus, Medicago (alfalfa), Glycine, Cucumis (cucumber), Vitis, Mimulus, Oryza, Sorghum, Selaginella (spike moss) and Physcomitrella genomes. Homologous genes sampled from exemplar asterid, ranunculid, non-grass monocot and gymnosperm species were obtained from EST assembly databases [25,77,78] and were added to each gene family set. Sequences in each gene family set were aligned using MUSCLE [79], and RAxML [80] run with the GTRGAMMA substitution model was used to obtain maximum likelihood estimates of gene trees.
Inspection of the resulting gene trees shows support for the inference drawn from the BAC end sequence analysis. The γ triplication (hexaploidy event) clearly occurred after Amborella diverged from other extant angiosperm lineages (Figure 7). The placement of the γ triplication with respect to the divergence of monocots and eudicots or core eudicots and the Ranunculales varies among the three gene trees. This incongruence among gene trees is likely due to artifacts associated with substitution rate variation and insufficient taxon sampling. Analyses of additional gene families with broader taxon sampling will be necessary to obtain better resolution for the timing of the γ triplication with respect to the divergence of monocots, eudicots, Ranunculales (that is, 'basal' eudicots) and core eudicots.

Conclusions
The physical map and BAC end sequences described in this study provide a low-resolution view of the Amborella genome. Nonetheless, these data shed light on genomic features of the last common ancestor of flowering plants. Moreover, the Amborella genome provides a unique reference for understanding genome evolution throughout angiosperm history. When placed in the context of the physical map, BESs representing just 5.4% of the Amborella genome allowed reconstruction of ancestral gene blocks in regions represented by 29 BAC contigs and inference of the timing of structural mutations that disrupted these blocks (Figure 3).
Analyses of BESs and BAC contigs also indicate that the ancient γ polyploidy event inferred from the Arabidopsis [58], Carica [81], Populus [60], and Vitis [45] genomes occurred after the Amborella lineage diverged from the rest of the angiosperms. Therefore, if the origin of angiosperms was associated with a genome duplication as has been hypothesized elsewhere [16,20,23], that polyploidy event predated the γ event.

BAC library construction
Protocols for DNA megabase preparation, library construction, picking and arraying proposed in Luo and Wing [82] were followed.

Fingerprinting
The SNaPshot fingerprinting technique was adopted [32] with the modifications described by Kim et al. [83]. SNaPshot reactions were loaded into ABI 3730xl DNA sequencers. Analysis of data for each contig was carried out using the ABI Data Collection Program.

Physical map construction
Fingerprints were assembled into contigs using the program FPC version 7.2 [34]. The initial assembly was carried out using a Sulston score threshold of 1e-50 followed by three rounds of dequeuing at the same stringency and auto-merging of contigs at 1e-21.
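The Sulston score used as the assembly cutoff is the probability that two clones share a given number of bands by chance. The sketch below uses the binomial form in which this score is commonly presented; it is offered as an illustration under that assumption rather than as FPC's exact implementation, and the band counts, tolerance and gel length in the example are hypothetical.

```python
from math import comb


def sulston_score(n_low: int, n_high: int, shared: int,
                  tolerance: float, gel_length: float) -> float:
    """Probability that at least `shared` bands match by coincidence.

    n_low, n_high: band counts of the two clones (n_low <= n_high)
    tolerance:     maximum size difference for two bands to count as a match
    gel_length:    number of distinguishable band positions
    """
    # Chance that one band of the smaller clone matches none of the
    # n_high bands of the larger clone.
    p_no_match = (1.0 - 2.0 * tolerance / gel_length) ** n_high
    p_match = 1.0 - p_no_match
    return sum(comb(n_low, m) * p_match ** m * p_no_match ** (n_low - m)
               for m in range(shared, n_low + 1))


# Hypothetical clones with 120 and 130 bands sharing 90 bands.
score = sulston_score(120, 130, 90, tolerance=4, gel_length=36_000)
print(f"{score:.3e}")
```

Two clones are joined when their score falls below the cutoff, so a stricter cutoff such as 1e-50 requires more shared bands than the 1e-21 used for auto-merging.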

BAC end extraction and sequencing
BAC DNA was extracted and end sequenced from 36,684 clones using the methods described by Ammiraju et al. [83,84]. Sequence quality assessment and trimming were carried out using the programs Phred [85] and Lucy [86].

Random sheared library
A random sheared library was constructed as previously described [87].

cDNA sequencing and assembly
Additional Sanger ESTs were generated from available male and female flower bud cDNA libraries [10] (Table 4). Libraries for 454 sequencing were constructed from the tissues listed in Table 4 using the Mint cDNA synthesis kit (Evrogen, Moscow, Russia). Total RNAs for cDNA synthesis were isolated using a combination of CTAB extraction and the RNeasy Plant Mini kit (Qiagen, Valencia, CA, USA) as previously described for basal angiosperms [11]. Two rounds of messenger RNA isolation were performed with the Poly(A)Purist™ mRNA Purification Kit (Ambion Inc., Austin, TX, USA) according to the manufacturer's recommendation. Contaminant DNA was removed with DNA-free™ (Ambion Inc.) and mRNA quality was verified using a Bioanalyzer (Agilent Inc., Santa Clara, CA, USA). Vector and adaptor sequences were trimmed from 454 Titanium (2,943,273 reads; total read size of approximately 776 Mbp; average read length of 263.60 bp) and Sanger sequences (38,147 reads; total read size of approximately 21.3 Mbp; average read length of 559.57 bp) using seqclean [88] and assembled using MIRA [89].

Similarity searches, repeat classification and contig anchoring
Similarity searches were carried out using the programs BLASTN and BLASTX [39]. BLASTN was run under relaxed settings (-q -4 -r 5) in order to accommodate the evolutionary distance between Amborella and the species included in the repeat databases used; the significance threshold was set at 1e-10. In the case of BLASTX searches the threshold was set at 1e-5 or 1e-4 for the BES synteny analysis. tBLASTX was used to anchor the contigs to the reference genomes (see Results for details).

Databases
The databases used in similarity searches were RepBase version 15.08 [38], the GenBank non-redundant (nr) database, and the Oryza, Arabidopsis, Vitis and Populus genome sequences.

Validation of repeat searches and MITE identification
The program MUST [49] was used for de novo characterization of highly repeated sequences; results were then inspected for the presence of MITE features. Inverted repeats were identified by manually parsing the results of dot-plot comparisons made using the program 'Dotter' [90].

Simple sequence repeat searches
Microsatellites were identified using the program Sputnik [91]. SSR composition, length and distribution were parsed and analyzed using the tools and the strategy used by Morgante et al. [92].

Fluorescence in situ hybridization
FPC contigs were validated by hybridizing BAC DNAs to Amborella chromosome squashes. DNA was prepared from BACs mapping to the middle and both ends of BAC contigs 431 and 1003 and used to prepare fluorescently labeled BAC-FISH probes. Chromosome squashes were prepared from root tips and labeled BAC-FISH probes were prepared as described by Xiong et al. [93].

Contig sequencing and annotation
Minimum tiling paths of seven and six BACs were identified for contigs 1003 and 431, respectively, by the visual inspection of the FPC assemblies. Adjacent clones were chosen based on their reciprocal position and probability value associated to their overlapping fingerprinted bands as shown by FPC. Sequencing of selected minimum tiling path BACs was done to phase II quality as previously described [73]. Phase II BAC sequences were then assembled into 1003 and 431 contig sequences based on dot plot comparisons and overlap similarity between adjacent clones.
Perl scripts available from the DAWGPAWS package [61,62] were used to convert computational annotation results from multiple sources into a single GFF3 file for combined evidence annotation in Apollo [94] and publication in GBrowse [95]. Ab initio gene annotation programs used in this process included FGENESH [63], AUGUSTUS [64], SNAP [65], GeneID [66] and GenScan [67]. Because Amborella-specific gene model parameterizations were not available for these programs, multiple plant models were used for each ab initio program. The sequence of the entire contig was searched with BLASTX (e < 1e-5) against gene annotations from Arabidopsis [70], Vitis [45], Z. mays [56], Medicago [71], Oryza [72], and Sorghum [55] and with tBLASTX (e < 1e-5) against a database of comprehensive Amborella transcript assemblies [25]. In addition, Amborella EST sequences (reads and assemblies; Table 4) were splice-aligned to the contigs using GMAP (Genomic Mapping and Alignment Program) [68] with the PASA (Program to Assemble Spliced Alignments) genome annotation tool [69]. The ab initio predictions and BLAST search results were manually combined into gene models using the Apollo genome annotation curation tool [94].

Synteny analysis of sequenced BAC contigs with Vitis and Oryza genomes
Sequenced Amborella BAC contigs 431 (487,318 bp) and 1003 (629,678 bp) were compared to the International Rice Genome Sequencing Project (IRGSP) rice genome assembly (version 5) and the Genoscope 12 × Vitis genome assembly using LASTZ and default parameters. Prior to LASTZ comparisons, all genomic sequences were masked using NCBI's WindowMasker to remove simple repeats. Significant matches after repeat masking were visualized as dot plots. Gene annotations for the rice and Vitis genomes were obtained from the Rice Annotation Project [96] and Genoscope [97], respectively, and plotted on the vertical axes of the dot plots (Figures 5 and 6). FGENESH [63] annotations for the Amborella contigs were included on the horizontal axes of the dot plots. LASTZ scores were summed for all aligned Amborella-rice or Amborella-Vitis blocks within 100 kb of each other in sequenced genomes. All regions with summed scores >100,000 were considered as syntenic and included in Figures 5 and 6.
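The score-summing rule for calling syntenic regions can be sketched as a single pass over sorted alignment blocks. This is one possible reading of the rule, in which blocks are chained when the gap to the previous block on the same reference chromosome is at most 100 kb; the tuple layout, the function name and the example blocks are hypothetical.

```python
def syntenic_regions(blocks, max_gap: int = 100_000, min_score: int = 100_000):
    """Merge nearby alignment blocks and keep regions with high summed score.

    blocks: iterable of (chrom, start, end, score) in reference coordinates.
    Returns a list of (chrom, start, end, summed_score) tuples.
    """
    regions = []
    current = None
    for chrom, start, end, score in sorted(blocks):
        if current and chrom == current[0] and start - current[2] <= max_gap:
            # Extend the open region and add this block's score.
            current[2] = max(current[2], end)
            current[3] += score
        else:
            if current:
                regions.append(tuple(current))
            current = [chrom, start, end, score]
    if current:
        regions.append(tuple(current))
    return [r for r in regions if r[3] > min_score]


# Hypothetical LASTZ blocks on one reference chromosome.
example_blocks = [
    ("chr1", 0, 1_000, 60_000),
    ("chr1", 50_000, 51_000, 50_000),
    ("chr1", 500_000, 501_000, 90_000),
]
print(syntenic_regions(example_blocks))  # prints [('chr1', 0, 51000, 110000)]
```

Summing over neighboring blocks lets several moderate alignments, such as hits to adjacent exons or genes, jointly support a region that no single alignment would.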

Phylogenetic analysis
All alignments were carried out using the program 'MUSCLE' [79] run under default settings. Maximum likelihood analyses were run on aligned DNA and amino acid sequences using RAxML [80] and the GTRGAMMA nucleotide substitution model.

Additional material
Additional file 1: Supplemental tables and figures cited with additional details for the physical map and shotgun sequences.
Additional file 2: Synteny analysis of Amborella BAC ends and Vitis genes.