Open access to tree genomes: the path to a better forest

An open-access culture and a well-developed comparative-genomics infrastructure must be developed in forest trees to derive the full potential of genome sequencing in this diverse group of plants that are the dominant species in much of the earth's terrestrial ecosystems.

Opportunities and challenges in forest tree genomics are seemingly as diverse and as large as the trees themselves; here, however, we have chosen to focus on the potentially significant impact on all of tree biology research if an open-access culture and comparative-genomics infrastructure were developed. In earlier articles [1,2], we argued that the great diversity of forest trees found in both the undomesticated and domesticated state provides an excellent opportunity to understand the molecular basis of adaptation in plants, and furthermore that comparative-genomic approaches will greatly facilitate discovery and understanding. We identified several priority research areas towards realizing these goals (Box 1), such as establishing reference genome sequences for important tree species, determining how to apply sequencing technologies to understand adaptation, and developing resources for storing and accessing forestry data. Significant progress has been made in many of these priorities, with the exception of investments in database resources and understanding ecological functions. Here, we briefly summarize the rapid progress in developing genomic resources in a small number of species and then offer our view on what we believe it will take to realize the final two priorities.

The great diversity found in forest trees
There are an estimated 60,000 tree species on earth, and approximately 30 of the 49 plant orders contain tree species. Clearly, the tree phenotype has evolved many times in plants. The diversity of plant structures, development, life history, environments occupied and so on in trees is nearly as broad as in higher plants in general, but trees share the common characteristic that all are perennial and many are very long lived. Because of the sessile nature of plants, each tree must survive and reproduce in a specific environment over the seasonal cycles of its lifetime. This tight association between individual genotypes and their environment provides a powerful research setting, just as it has driven the evolution of a plethora of uniquely arboreal adaptations. Understanding these evolutionary strategies is a longstanding area of study of tree biologists, with many broader biological implications.
Completed and current genome-sequencing projects in forest trees are limited to about 25 species from just 4 of more than 100 families: Pinaceae (pines, spruces and firs), Salicaceae (poplars and willows), Myrtaceae (eucalyptus) and Fagaceae (oaks, chestnuts and beeches). Large-scale sequencing projects such as the 1000 Human Genomes [3], 1000 Plant Genomes (1KP) [4] or the 5000 Insect Genome (i5k) [5] projects have not yet been proposed for forest trees.
The genus Populus has 30+ species (aspens and cottonwoods) with genome sizes of approximately 500 Mb. Several species are being sequenced by DOE/JGI and other groups around the world, and it seems likely that all members of the genus will soon have a genome sequence (Table 1). The next forest tree to be sequenced was the flooded gum (Eucalyptus grandis BRASUZ1, which is a member of the Myrtaceae family), again by DOE/JGI. Eucalyptus species and their hybrids are important commercial species grown in their native Australia and many regions throughout the southern hemisphere. Several more eucalyptus species are being sequenced (Table 1), each with relatively small genomes (500 Mb), but it will probably take many years before all 700+ members of this genus are completed. Several members of the Fagaceae family are now being sequenced (Table 1). Members of this group include the oaks, beeches and chestnuts, with genome sizes less than 1 Gb.
The gymnosperm forest trees (such as the conifers) were the last to enter the world of genome sequencing. This delay was entirely due to their very large genomes (10 Gb and greater), not to any lack of importance: they are extremely important economically and ecologically, and phylogenetically they represent the ancient sister lineage to the angiosperms. Genome resources needed to support a sequencing project were reasonably well developed, but it was not until the introduction of next-generation sequencing (NGS) technologies that sequencing conifer genomes became tractable. Currently, there are at least ten conifer (Pinaceae) genome-sequencing projects under way (Table 1).
Aside from reference genome sequencing in forest trees, there is significant activity in transcriptome sequencing and resequencing for polymorphism discovery (Tables 2 and 3). We have only listed the transcriptome and resequencing projects in Table 1 that are associated with a species that has an active genome-sequencing project.

The opportunity for comparative-genomic approaches in forest trees
The power of comparative-genomic approaches for understanding function in an evolutionary framework is well established [7-13]. Comparative genomics can be applied to sequence data (nucleotide and protein) at the level of individual genes or genome-wide. Genome-wide approaches provide insight into both chromosome evolution and the diversification of biological functions and interactions.
Understanding of gene function in forest tree species is challenged by the lack of standard reverse-genetic tools routinely used in other systems (for example, standard marker stocks and facile transformation and regeneration) and by the long generation times. Thus, comparative genomics becomes the more powerful approach to understanding gene function in trees.
Comparative genomics requires not only data availability but also cyber-infrastructure to support exchange and analysis. The TreeGenes database is the most comprehensive resource for comparative-genomic analyses in forest trees [14]. Several smaller databases have been created to facilitate collaborations, including: Fagaceae genomics web, hardwoodgenomics.org, Quercus portal, PineDB, ConiferGDB, EuroPineDB, PopulusDB, PoplarDB, EucalyptusDB and Eucanext (Tables 1, 2, and 3). These resources vary greatly in their scope, relevance and integration. Some are static and archival, whereas others focus on current sequence content for a specific species or a small number of related species. This results in overlapping and conflicting data among repositories. In addition, each database uses its own custom interfaces and back-end database technology to serve sequence to the user. The US National Science Foundation funding for large-scale infrastructure projects, such as iPlant, is leading efforts aimed towards centralizing resources for research communities [15]. Without centralized resources, researchers are forced to employ inefficient data-mining methods through queries of independently maintained databases or inconsistently formatted supplemental files on journal websites. Specific areas of interest for the forest tree genomic community include the ability to connect sequence, genotype and phenotype to individual, geo-referenced trees. This type of integration can only be achieved through web services that allow disparate resources to communicate in ways that are transparent to the user [16]. With the recent increase of genome sequences available for many of these species, there is a

The path to success avoids delays
Careful inspection of Table 1 reveals that forest tree genome projects are very slow to release sequence data into the public domain. Once a project is finished and submitted for publication, a draft genome becomes available (for example, the poplar genome was released and published in 2006). However, pre-publication releases are infrequent, exceptions being the PineRefSeq project, which has made three releases, and the SMarTForest project, which has made one (Table 1). This is unfortunate because good-quality sequence contigs and scaffolds could be made available years before publication, delivering an extremely important resource to the community. Such a delay would be understandable for privately financed projects seeking commercial advantage, but nearly all the projects listed in Table 1 are financed by public funds whose stated mission is advancing science and developing community resources. Publication rights are easily protected by data-use policy statements such as the Fort Lauderdale [17] and Toronto [18] agreements, but unfortunately these conventions are not often used and data access is restricted by password-protected websites (Tables 1, 2, and 3). We hope the opinion offered here will lead to a discussion in the forest tree community, to a more open-access culture and thus to a more vibrant and rapidly advancing research area.
[14,19] GoldenGate & Infinium iSelect [95,96]
Pinus lambertiana (sugar pine) [14] GoldenGate array [24]
Pseudotsuga menziesii (Douglas-fir) [14] GoldenGate array [25]
[19] Infinium iSelect [64]
Pinus sylvestris (Scots pine) [14] GoldenGate array
Pinus pinaster (maritime pine) [14] GoldenGate array [27,90]
Pinus radiata (Monterey pine) [14] GoldenGate array [29]
Picea abies (Norway spruce) [14] GoldenGate array [30]
Picea glauca (white spruce) [19] GoldenGate & Infinium iSelect [97,98]
Salicaceae
Populus trichocarpa (black cottonwood) [19,99] Infinium iSelect [100]
Infinium iSelect (restricted) [36]
[14] GoldenGate array [101]
[19,99] SNP assay [102]
Populus nigra (black poplar) [14] GoldenGate array [103]
Myrtaceae
Eucalyptus grandis (rose gum) [104] DArT high-density array [105]
[106] GoldenGate array [106]
Eucalyptus camaldulensis (river red gum) RNA-Seq SNP discovery (restricted) [107]
Details current genotyping projects in forest trees with data access information and relevant publications.