Figure 1 | Genome Biology

Figure 1

From: The global landscape of sequence diversity

Sequence discovery rates across various taxonomic groups. (a) Discovery of 'distinct' sequences as a function of sampled bacterial genomes. Distinct sequences are defined as those that do not share significant sequence similarity with any sequence in a previously sampled genome. Each point represents the addition of a new genome, ordered either by number of sequences (largest first) or at random. Two datasets are shown: one that considers all sequences, and one that considers only sequences of more than 100 residues. (b) Discovery of distinct sequences in fully sequenced eukaryotic genomes. Genome addition was ordered by number of sequences (largest first). Selected points are labeled with the species added, to show how the addition of closely related species influences the local gradient of the graph. (c) Rate of distinct sequence discovery within various taxonomic groupings of eukaryotic partial genomes. As before, each point represents the addition of a new partial genome (largest first), and color indicates the taxonomic group sampled. Note that the grouping Protista is historical: it has recently been shown to consist of several paraphyletic taxa, many of which (including the species examined here) are considered basal to the root of Eukarya [29]. The inset graph provides an expanded display. (d) Rate of sequence discovery as a function of genomes sampled for both bacterial genomes and eukaryotic partial genomes. Each point represents the mean and standard deviation of the rate of distinct sequence discovery over a sliding window representing the cumulative addition of 30 complete or partial genomes, obtained from 400 random orderings of genome addition (see Materials and methods for more details).
The six data series comprise sequences from all bacterial genomes, sequences from all partial genomes, bacterial sequences > 100 residues in length, partial genome sequences > 300 bp in length, and two 'restricted' groups of bacterial sequences: those from a collection of genomes with only a single (largest) representative from each species ('strains filtered'), and those from a collection of genomes with only a single (again largest) representative from each genus ('species filtered'). (e) Rate of gene family discovery for partial and bacterial genomes. Gene families include singletons (families with only a single sequence representative) and were obtained with reference to the COGENT database for bacteria, or determined through an equivalent clustering procedure for partial genomes (see Materials and methods). As in (d), each point represents the mean and standard deviation of the rate of gene family discovery over a sliding window representing the cumulative addition of 30 complete or partial genomes, obtained from 400 random orderings of genome addition (see Materials and methods for more details). Also shown are the gene family discovery rates for the two 'restricted' groups of bacterial sequences mentioned above.
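
The quantities plotted in panels (a) to (e) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes each genome has already been reduced to a set of cluster identifiers (a hypothetical stand-in for the sequence-similarity search and gene family clustering described in Materials and methods), and all function names are illustrative. Only the window of 30 genomes and the 400 random orderings are taken from the caption.

```python
import random
from statistics import mean, stdev


def largest_first(genomes):
    """Order genome names by number of entries, largest first (panels a-c)."""
    return sorted(genomes, key=lambda name: len(genomes[name]), reverse=True)


def discovery_curve(genomes, order):
    """Cumulative number of distinct entries after each genome is added."""
    seen = set()
    curve = []
    for name in order:
        seen |= genomes[name]  # entries already seen add nothing new
        curve.append(len(seen))
    return curve


def windowed_rate(curve, window):
    """New distinct entries per genome over each sliding window of additions."""
    return [(curve[i + window] - curve[i]) / window
            for i in range(len(curve) - window)]


def rate_over_random_orderings(genomes, window=30, n_orderings=400, seed=0):
    """Mean and standard deviation of the windowed rate across random
    orderings of genome addition (panels d and e). Needs n_orderings >= 2."""
    rng = random.Random(seed)
    names = sorted(genomes)
    rates = []
    for _ in range(n_orderings):
        order = names[:]
        rng.shuffle(order)
        rates.append(windowed_rate(discovery_curve(genomes, order), window))
    # Transpose so that each column holds one window position across orderings.
    return [(mean(col), stdev(col)) for col in zip(*rates)]


# Toy input: hypothetical genomes mapped to hypothetical cluster identifiers.
toy = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {5}}
print(discovery_curve(toy, largest_first(toy)))
print(rate_over_random_orderings(toy, window=1, n_orderings=20))
```

Representing genomes as sets of precomputed identifiers keeps the sketch short; in a real analysis the "previously seen" test would be a similarity search against all sequences from earlier genomes rather than a set lookup.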
