2× genomes - depth does matter
© Milinkovitch et al.; licensee BioMed Central Ltd. 2010
Received: 7 December 2009
Accepted: 9 February 2010
Published: 9 February 2010
Given the availability of full genome sequences, mapping gene gains, duplications, and losses during evolution should theoretically be straightforward. However, this endeavor suffers from overemphasis on detecting conserved genome features, which in turn has led to sequencing multiple eutherian genomes with low coverage rather than fewer genomes with high-coverage and more even distribution in the phylogeny. Although limitations associated with analysis of low coverage genomes are recognized, they have not been quantified.
Here, using recently developed comparative genomic application systems, we evaluate the impact of low-coverage genomes on inferences pertaining to gene gains and losses when analyzing eukaryote genome evolution through gene duplication. We demonstrate that, when performing inference of genome content evolution, low-coverage genomes generate not only a massive number of false gene losses, but also striking artifacts in gene duplication inference, especially at the most recent common ancestor of low-coverage genomes. We show that the artifactual gains are caused by the low coverage of genome sequence per se rather than by the increased taxon sampling in a biased portion of the species tree.
We argue that it will remain difficult to differentiate artifacts from true changes in modes and tempo of genome evolution until there is better homogeneity in both taxon sampling and high-coverage sequencing. This is important for broadening the utility of full genome data to the community of evolutionary biologists, whose interests go well beyond widely conserved physiologies and developmental patterns as they seek to understand the generative mechanisms underlying biological diversity.
In the context of investigating correlations between genome and phenotype evolution, describing the evolution of genome content (in terms of protein-coding genes) should theoretically be straightforward given the increasing number of available sequenced genomes and of large-scale expression studies, accompanied by a constantly growing number of software and databases for better integration and exploitation of this wealth of data. However, this endeavor of mapping gene gains (including duplication events) and losses suffers from the lack of explicit phylogenetic criteria in analytical tools, and the overemphasis, in genome sequencing programs, on detecting conserved genome features.
The first problem relates to the fact that many of the methods and databases available for identifying duplication events and assessing orthology relationships of genetic elements among genomes avoid the heavy computational cost of phylogenetic trees inference and the difficulties associated with their interpretation, even though phylogeny-based orthology/paralogy identification is widely accepted as the most valid approach [1–4]. Recently, however, the problem has been largely recognized and increasingly addressed by the comparative genomics community. For example, ENSEMBL [5, 6] and the 'phylome' approach [7, 8] are automated pipelines in which orthologs and paralogs are systematically identified through the estimation of gene family phylogenetic trees. Furthermore, the recently developed MANTiS relational database  integrates phylogeny-based orthology/paralogy assignments with functional and expression data, allowing users to explore phylogeny-driven (focusing on any set of branches), gene-driven (focusing on any set of genes), function/process-driven, and expression-driven questions in an explicit phylogenetic framework. Such application systems should help in investigating whether the gene duplication phenomenon is generally relevant to adaptive evolution (that is, beyond the classical examples of, for example, globins, olfactory receptors, opsins, and transcription factor diversifications), and might even help in understanding the causal relationships between genome evolution and increasing phenotypic complexity. However, the efficiency of these analytical tools inescapably depends on the amount and quality of the available genome sequence data. This leads us to the second, more pervasive problem of biases in whole genome sequencing program strategies.
One major explicit goal of genome sequencing projects is that comparisons of the human genome with those of other eukaryotes allow detection of coding and non-coding conserved (hence, likely functional) elements in the human genome. Importantly, the statistical power of such comparisons depends on the sum of branch lengths of the phylogenetic tree among the species used . However, it is likely that a significant proportion of these possibly biomedically relevant conserved features are recent and thus specific to relatively shallow branches (for example, mammals, eutheria, primates) rather than common to all eukaryotes. In that case, the only way to increase statistical power is to increase the number of sequenced genomes for species belonging to the monophyletic group defined by the relevant shallow branch. This realization has motivated the development of the 'Mammalian Genome Project'  aiming at sequencing the genome of multiple placental mammals with a low mean coverage of 2×. The sequenced species were chosen to maximize the ratio [Sum of branch lengths within mammals]/[Number of genomes sequenced]. Note that the decision to choose the placental mammal branch is somewhat arbitrary: there is no a priori reason to believe that there are more (or more important) Eutherian-specific than, for example, Therian-specific biomedically relevant conserved features, and sequencing a few well-chosen marsupial species would have generated more cumulative branch length for less species. However, this decision might have been motivated by the facts that using a shallower branch will facilitate annotation of the newly sequenced genomes and that some of the chosen species are laboratory model species.
We think that the emphasis on searching for evolutionary conservation - hence, the decision to prefer 24 low-coverage (2×) genomes to, for example, 6 genomes at 8× coverage, hurts more general endeavors, such as the mapping of gene gains and losses in the evolution of eukaryotic genomes. Although the inherent limitations associated with low coverage genome analyses are recognized , their impact on understanding differences among organisms (rather than similarities) has not been quantified.
Here, we evaluate the impact of low-coverage genomes on inferences pertaining to gene gains and losses when analyzing the mode and tempo of eukaryote genome evolution through gene duplication. Such assessments are important for broadening the utility of full genome data to the community of evolutionary biologists, whose interests go well beyond widely conserved physiologies and developmental processes/patterns as they seek to understand the generative mechanisms underlying biological diversity.
Results and Discussion
On the basis of the 38 metazoan genomes (longest splice-variant of each protein-coding gene) available in version 49 of the ENSEMBL database (that is, six primates, one tree shrew, four rodents, two lagomorphs, two carnivores, one perissodactyl, one cetartiodactyl, one bat, two insectivores, one xenarthran, two afrotherians, one marsupial, one monotreme, one bird, one amphibian, five teleost fishes, two urochordates, one nematode, and three insects), and using the baker's yeast as an outgroup, we used MANTiS version 1.0.15  to generate two datasets including information on the presence/absence of genes. The first dataset ('families only') contains one character for each single (species-specific) gene and for each protein family (that is, only de novo gains are considered), whereas in the second dataset ('with duplications'), a new character was additionally created for each duplication event, such that each protein family is represented by several characters. Additional details are given in . To investigate the influence of low-coverage (2×) genomes on inferred genome evolutionary patterns, we also generated with MANTiS the corresponding datasets using versions 39 to 48 of ENSEMBL (Figure 1) and the human phylome , available at . The ENSEMBL v39 archive database includes 18 metazoan species with 7 placental mammal genomes of coverage >4 (except for the rhesus macaque, Macaca mulatta), whereas subsequent versions include an increasing number of low mean coverage (2×) genomes (v49 includes 38 metazoan species with 24 placental mammal genomes, of which 14 are of 2× mean coverage). The PhylomeDB database uses only high-coverage genomes and an improved phylogenetic pipeline that includes alignment trimming, branch-length optimization, evolutionary model testing, and maximum likelihood and Bayesian phylogeny inference (see Materials and methods for details).
An alternative explanation for the artifactual peak of gene gains in the eutherian branches would be the mirror situation: correct species tree but incorrect gene trees. To test this hypothesis, we first verified whether, in 2× genomes, the mean sequence coverage of genes inferred as duplicated in the three first eutherian branches (version 49) is lower than the mean sequence coverage of genes inferred as duplicated elsewhere in the species tree. As the sequence coverage, nucleotide by nucleotide, or gene per gene, is (to our knowledge) not publicly available, we counted the number of ambiguities in each protein sequence of each species and found that 2× genomes exhibit higher mean proportions of ambiguities, ranging from 9.11% (Ochotona princeps) to 15.46% (Dasypus novemcinctus), compared to 0 to 0.24% in high coverage genomes. However, we did not observe a higher mean proportion of ambiguities (for neither 2× nor high-coverage genomes) in genes inferred as duplicated in the three first eutherian branches than in genes inferred as duplicated elsewhere in the species tree.
Number of dubious duplications at the eutherian node involving various species as 'orphans'
sp(i)versus >5 - no sp(i)
sp(i)versus >10 - no sp(i)
2× coverage genomes
One-tail Mann-Whitney test
We argue that the phylogenetic distribution of species for which so-called 'full genome sequences' are available, as well as the coverage of these genomes, are key parameters that have not been given enough appreciation: it will remain exceedingly difficult to differentiate artifacts from true changes in modes and tempo of genome evolution until better homogeneity in both taxon sampling and high-coverage sequencing is achieved. For example, the groups of Amphibia (frogs, toads, salamanders, newts, and caecilians) or Reptilia (turtles, lizards, crocodiles, and birds) exhibit larger diversities than mammals but have long been represented in major databases such as ENSEMBL by a single species (Xenopus tropicalis, and Gallus Gallus, respectively) at the tip of a very long branch. The recent inclusion (since ENSEMBL v53) of the high-coverage genome sequences from the green anole lizard (Anolis carolinensis) and zebra finch (Taeniopygia guttata) are, in this respect, very important for improved mapping reliability of genome content evolution in the amniote tree. Similarly, including some of the missing major animal lineages (for example, Lophotrochozoans such as annelids, molluscs, and flat worms) is crucial if reliable analysis is to be extended to the whole group of Metazoa. However, major artifacts in gene gains and losses (and possibly others that we did not uncover here) will remain until all low-coverage genomes are promoted to high coverage. Note that very recent (generally species-specific) duplications will remain very difficult to differentiate from parental alleles even in high-coverage genomes.
Obviously, the artifactual gains and losses of duplicates discussed here are problematic only for a subset of comparative genomic analyses. For example, these artifacts are of low relevance for the specific and significant purpose behind the initial production of low-coverage genomes: detecting conserved genome features . Furthermore, these artifacts had little impact on analyses that uncovered historical constraints in gene expression , despite these analyses requiring the determination of the first appearance of genes and duplicates in the species phylogeny. However, artifacts in mapping of genome content evolution will likely mislead many users who access genomic databases, possibly resulting in a wave of unreliable analyses.
Fortunately, the tremendous drop in sequencing costs brought about by next generation sequencing platforms (for example, [24, 25]) allows the comparative genomics community to contemplate the possibility of sequencing, in the coming decade, hundreds or even thousands of complex genomes spanning a wide phylogenetic diversity (for example, ). We, however, urge the community to go for quality rather than for quantity: high-coverage should be a compulsory requirement in these large genome sequencing projects such that genome content evolution, as well as coding and non-coding sequence changes, can be reliably inferred for a vastly improved understanding of genome evolution.
Materials and methods
As an alternative to ENSEMBL trees, we used data from the human phylome  available through the PhylomeDB database . The pipeline used to reconstruct the human phylome is described in more detail elsewhere . In brief, a database containing all proteins encoded in the 39 eukaryotic genomes (all high coverage) included in the phylome is searched for putative homologs of human proteins by a Smith-Waterman algorithm . Significant hits with an e-value lower than 10-3 and that could be aligned over a continuous region longer than 50% of the query sequence were selected and subsequently aligned with MUSCLE 3.6 . Alignments are trimmed using trimAl 1.0  to remove columns with gaps in more than 10% of the sequences, unless such a procedure removes more than one-third of the positions in the alignment. In such cases the percentage of sequences with gaps allowed is automatically increased until at least two-thirds of the initial columns are conserved. Finally, phylogenetic trees are reconstructed by using maximum likelihood as implemented in PhyML v2.4.4 . In all cases a discrete gamma-distribution model is assumed with four rate categories and invariant sites, where the gamma shape parameter and the fraction of invariant sites are estimated from the data. To avoid model-based biases, protein evolutionary models (JTT, Dayhoff, MtREV, VT and BLOSUM62) are tested to then select the one best fitting the data according to the Akaike information criterion (AIC) .
Gene tree-species tree reconciliation
We used a strict tree-reconciliation algorithm  as implemented in ETE . In this case, every gene tree is compared to the topology of a given species tree by comparing the specific sets of species contained by all tree splits. The strict reconciliation algorithm maps the gene tree onto the species tree and explains any incongruence in terms of the minimal set of duplication and gene-loss events necessary to derive the observed gene tree topology from the one proposed in the species tree. These inferred duplication events are marked on the tree, and orthology and paralogy relations are derived accordingly.
Simulation of low coverage sequence data
To evaluate the phylogenetic effects of low quality sequence data, stretches of ambiguous sequences where introduced in the protein sequences of three species (P. troglodytes, M. musculus and B. taurus) of the phylomeDB. Continuous stretches of amino acids were substituted by 'X's according to a normal distribution of lengths with mean μ, and standard deviation δ. These parameters were set for each of the three selected species: P. troglodytes (μ = 9%, and δ = 3% of the length of the sequence); M. musculus (12%, 3%); and B. taurus (15%, 3%); that is, according to the range of values we observed in real low-coverage genomes. After introducing the simulated ambiguities, sequences were re-aligned and trees were reconstructed and analyzed in the same way as the non-perturbed PhylomeDB dataset.
This work was supported by grants from the University of Geneva (Switzerland), the Swiss National Science Foundation (FNSNF, grant 31003A_125060), the Société Académique de Genève (Switzerland), the Georges and Antoine Claraz Foundation (Switzerland), the Ernst and Lucie Schmidheiny Foundation (Switzerland), and the National Fund for Scientific Research Belgium (FNRS). AT is post-doctoral fellow at the FNRS. TG is supported by grants from the Spanish Ministries of Health (FIS06-213) and Science and Innovation (GEN2006-27784-E/PAT). We thank anonymous reviewers for their critical comments on previous versions of this manuscript.
- Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E: EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009, 19: 327-335. 10.1101/gr.073585.107.PubMedPubMed CentralView ArticleGoogle Scholar
- Alexeyenko A, Tamas I, Liu G, Sonnhammer EL: Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics. 2006, 22: e9-15. 10.1093/bioinformatics/btl213.PubMedView ArticleGoogle Scholar
- Li L, Stoeckert CJ, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003, 13: 2178-2189. 10.1101/gr.1224503.PubMedPubMed CentralView ArticleGoogle Scholar
- Gabaldón T: Large-scale assignment of orthology: back to phylogenetics?. Genome Biol. 2008, 9: 235-10.1186/gb-2008-9-10-235.PubMedPubMed CentralView ArticleGoogle Scholar
- Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Graf S, Haider S, Hammond M, Holland R, Howe K, Jenkinson A, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Lawson D, Longden I, et al: Ensembl 2009. Nucleic Acids Res. 2008, 37: D690-D697. 10.1093/nar/gkn828.PubMedPubMed CentralView ArticleGoogle Scholar
- Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Herrero J, Holland R, Howe K, Howe K, Johnson N, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Melsopp C, Megy K, et al: Ensembl 2007. Nucleic Acids Res. 2007, 35: D610-617. 10.1093/nar/gkl996.PubMedPubMed CentralView ArticleGoogle Scholar
- Huerta-Cepas J, Bueno A, Dopazo J, Gabaldón T: PhylomeDB: a database for genome-wide collections of gene phylogenies. Nucleic Acids Res. 2008, 36: D491-496. 10.1093/nar/gkm899.PubMedPubMed CentralView ArticleGoogle Scholar
- Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldón T: The human phylome. Genome Biol. 2007, 8: R109-10.1186/gb-2007-8-8-109.PubMedPubMed CentralView ArticleGoogle Scholar
- Tzika A, Helaers R, Peer Van de Y, Milinkovitch MC: MANTiS: a phylogenetic framework for multi-species genome comparisons. Bioinformatics. 2008, 24: 151-157. 10.1093/bioinformatics/btm567.PubMedView ArticleGoogle Scholar
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.PubMedView ArticleGoogle Scholar
- Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, et al: The sequence of the human genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.PubMedView ArticleGoogle Scholar
- Milinkovitch MC, Tzika A: Escaping the mouse trap: the selection of new Evo-Devo model species. J Exp Zool B Mol Dev Evol. 2007, 308: 337-346. 10.1002/jez.b.21180.PubMedView ArticleGoogle Scholar
- Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides NC: The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res. 2006, 34: D332-334. 10.1093/nar/gkj145.PubMedPubMed CentralView ArticleGoogle Scholar
- Ensembl Genome Browser. [http://www.ensembl.org/index.html]
- Green P: 2× genomes - does depth matter?. Genome Res. 2007, 17: 1547-1549. 10.1101/gr.7050807.PubMedView ArticleGoogle Scholar
- Multiple Mammalian Genomes for Comparative Annotation. [http://www.genome.gov/25521745]
- MANTiS: the missing link between multi-species full genome comparisons and functional analysis. [http://www.mantisdb.org/]
- The PhylomeDB. [http://phylomedb.org/]
- Bashir A, Ye C, Price AL, Bafna V: Orthologous repeats and mammalian phylogenetic inference. Genome Res. 2005, 15: 998-1006. 10.1101/gr.3493405.PubMedPubMed CentralView ArticleGoogle Scholar
- Halanych KM: The new view of animal phylogeny. Annu Rev Ecol Evol Systematics. 2004, 35: 229-256. 10.1146/annurev.ecolsys.35.112202.130124.View ArticleGoogle Scholar
- Springer MS, Stanhope MJ, Madsen O, de Jong WW: Molecules consolidate the placental mammal tree. Trends Ecol Evol. 2004, 19: 430-438. 10.1016/j.tree.2004.05.006.PubMedView ArticleGoogle Scholar
- Blomme T, Vandepoele K, De Bodt S, Simillion C, Maere S, Peer Van de Y: The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol. 2006, 7: R43-10.1186/gb-2006-7-5-r43.PubMedPubMed CentralView ArticleGoogle Scholar
- Milinkovitch MC, Helaers R, Tzika AC: Historical constraints on vertebrate genome evolution. Genome Biol Evol. 2010, 2010: 13-18. 10.1093/gbe/evp052.View ArticleGoogle Scholar
- Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotechnol. 2008, 26: 1135-1145. 10.1038/nbt1486.PubMedView ArticleGoogle Scholar
- Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B, Bibillo A, Bjornson K, Chaudhuri B, Christians F, Cicero R, Clark S, Dalal R, Dewinter A, Dixon J, Foquet M, Gaertner A, Hardenbol P, Heiner C, Hester K, Holden D, Kearns G, Kong X, Kuse R, Lacroix Y, Lin S, et al: Real-time DNA sequencing from single polymerase molecules. Science. 2009, 323: 133-138. 10.1126/science.1162986.PubMedView ArticleGoogle Scholar
- Genome 10K Community of Scientists.: Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. J Hered. 2009, 100: 659-674. 10.1093/jhered/esp086.PubMed CentralView ArticleGoogle Scholar
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147: 195-197. 10.1016/0022-2836(81)90087-5.PubMedView ArticleGoogle Scholar
- Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC bioinformatics. 2004, 5: 113-10.1186/1471-2105-5-113.PubMedPubMed CentralView ArticleGoogle Scholar
- TrimAl, a tool for automated alignment trimming. [http://trimal.cgenomics.org/]
- Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003, 52: 696-704. 10.1080/10635150390235520.PubMedView ArticleGoogle Scholar
- Akaike H: A new look at the statistical model identification. IEEE Trans Automatic Control. 1974, 19: 716-723. 10.1109/TAC.1974.1100705.View ArticleGoogle Scholar
- Zmasek C, Eddy S: A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics. 2001, 17: 821-828. 10.1093/bioinformatics/17.9.821.PubMedView ArticleGoogle Scholar
- Huerta-Cepas J, Dopazo J, Gabaldón T: ETE: a python Environment for Tree Exploration. BMC bioinformatics. 2010, 11: 24-10.1186/1471-2105-11-24.PubMedPubMed CentralView ArticleGoogle Scholar
- Benton MJ, Donoghue PC: Paleontological evidence to date the tree of life. Mol Biol Evol. 2007, 24: 26-53. 10.1093/molbev/msl150.PubMedView ArticleGoogle Scholar
- Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe GD, Sorensen MV, Haddock SH, Schmidt-Rhaesa A, Okusu A, Kristensen RM, Wheeler WC, Martindale MQ, Giribet G: Broad phylogenomic sampling improves resolution of the animal tree of life. Nature. 2008, 452: 745-749. 10.1038/nature06614.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.