Genomic DNA k-mer spectra: models and modalities
© Chor et al.; licensee BioMed Central Ltd. 2009
Received: 17 March 2009
Accepted: 8 October 2009
Published: 8 October 2009
The empirical frequencies of DNA k-mers in whole genome sequences provide an interesting perspective on genomic complexity, and the availability of large segments of genomic sequence from many organisms means that analysis of k-mers with non-trivial lengths is now possible.
We have studied the k-mer spectra of more than 100 species from Archea, Bacteria, and Eukaryota, particularly looking at the modalities of the distributions. As expected, most species have a unimodal k-mer spectrum. However, a few species, including all mammals, have multimodal spectra. These species coincide with the tetrapods. Genomic sequences are clearly very complex, and cannot be fully explained by any simple probabilistic model. Yet we sought such an explanation for the observed modalities, and discovered that low-order Markov models capture this property (and some others) fairly well.
Multimodal spectra are characterized by specific ranges of values of C+G content and of CpG dinucleotide suppression, a range that encompasses all tetrapods analyzed. Other genomes, like that of the protozoa Entamoeba histolytica, which also exhibits CpG suppression, do not have multimodal k-mer spectra. Groupings of functional elements of the human genome also have a clear modality, and exhibit either a unimodal or multimodal behaviour, depending on the two above mentioned values.
The distribution of DNA k-mers (DNA 'words' of length k) - namely, the k-mer spectrum - in whole genome sequences provides an interesting perspective on the complexity of the corresponding species. A number of theoretical investigations of genomic k-mer distributions were done prior to the sequencing of large genomes and these works suggested various plausible probabilistic models and parameters for such k-mer distributions. Despite the relative abundance of sequenced genomes to date, the number of works investigating empirical k-mer distributions for values of k exceeding 2 or 3 is not very large. The main emphasis has been on studying words with extreme frequencies, namely, either missing or rare k-mers, or those with very high frequencies.
Several models for the distribution of k-mers in sequences have been proposed. For example, Robin and Schbath  compared several approximate k-mer distributions, and also analyzed the empirical k-mer distributions of the phage Lambda (48,500 bp genome). Reinert et al.  discussed various plausible k-mer distributions, showing that the distribution for the number of occurrences of a particular k-mer in sequences generated from a hidden Markov model has two distinct large-sample regimes: a normal distribution for abundant k-mers, and a Poisson or compound Poisson distribution for extremely rare k-mers.
With the sequencing of more complete genomes, it has become possible to move from theoretical to empirical studies and to examine the properties of these DNA words and how their distributions vary between different species or genome elements. The most basic empirical question that has been investigated is that of missing DNA k-mers. Earlier works have studied non-existent short amino acid k-mers [3, 4], and have attributed them mainly to chemical constraints (such as hydrophobic and hydrophilic amino acids). DNA does not have the complex three-dimensional structure and chemical constraints of proteins, although the nucleotide composition has been reported by el antri et al.  to weakly affect the structure of double-stranded DNA. Intuitively, if k is not too large compared to the genome or chromosome length, we expect that all k-mers will be present. This expectation turns out to be incorrect. Fofanov et al.  studied correlations between present and absent short DNA k-mers in over 1,500 species, and found short k-mers that are missing. A systematic study of missing k-mers, termed 'nullomers', was carried out by Hampikian and Andersen . They reported the complete lists of missing k-mers (8 ≤ k ≤ 13) in 12 species, including human. Several possible uses of these nullomers were suggested, and it was claimed that 'these absent sequences define the maximum set of potentially lethal oligomers, ..., and identify potential targets for therapeutic intervention and suicide markers.' Herold, Kurtz, and Giegerich  developed an efficient algorithm for finding all missing words. Zhou, Olman and Xu  studied genome barcodes using fragment size M (1,000 ≤ M ≤ 10,000) and based on k-mers (1 ≤ k ≤ 6). They report that sequences generated by Markov models of order three are the closest to the barcodes of genomic sequences in terms of their appearance. Mrázek and Karlin  studied different aspects of large viral genomes. They showed that the low order Markov models identify the most frequent k-mers fairly well. In an immunological context, Stacey et al.  studied CpG suppression in 8-mers of mammalian genomes (mouse, human) and bacterial genomes (Escherichia coli). A consequence of their work is that the mammalian spectra are multimodal, while the bacterial spectra are unimodal. In a recent study, Csürös, Noé, and Kucherov  explored the empirical k-mer genomic spectrum, and suggested a rethinking of the significance of word frequencies. Their study focuses on overabundant k-mers, showing that their distribution has a heavy sub-exponential tail. They argue that the frequencies of genomic k-mers across all genomes can be described using a double Pareto log-normal (DPLN) distribution ; this distribution therefore represents a universal genomic feature. In particular, they claim that DPLN provides a much better fit to the empirical k-mer distributions than Bernoulli or first-order Markov models, due to its heavier tail (k-mers with extreme abundance are more common). In addition, Csürös et al. suggest a simple model of evolution by random duplications ('copy and insert') that they report produced distributions whose tails are similar, in simulations, to the DPLN.
Results and discussion
The empirical distributions of k-mers in mammalian single chromosomes and whole genomes that we examined were all multimodal. Additional data file 1 depicts the multimodal distributions for human chromosomes 1 (a long chromosome), 6 (medium) and 20 (short), for both 9-mers and 11-mers. Additional data file 2 depicts these distributions for the complete human and opossum genomes, again for both 9-mers and 11-mers. Like the spectrum for 11-mers in the human genome (Figure 1a), there is typically a high peak close to zero, corresponding to a large number of k-mers that are either missing or very rare (a low number of appearances), with a second, shallower local peak approximately around the average number of occurrences, from where the numbers decrease monotonically. Often there is a third peak between these two. The high peak close to zero flattens when the length of the genome grows larger compared to 4 k , the number of possible k-mers, and conversely gains more mass as k increases. The decay of the tail (over-abundant k-mers) is slower than exponential, as discussed by Csürös et al. .
We analyzed the k-mer distributions for 89 non-mammalian genomes: 33 from Archea, 36 from Bacteria, and 20 from non-mammalian Eukaryota, including 8 vertebrates. These distributions can be divided into two main categories, unimodal and multimodal distributions.
For unimodal distributions, illustrated for zebrafish in Figure 1b, the corresponding k-mer distributions have a single maximum, usually at a relatively low number of k-mers. Additional data file 3 depicts typical unimodal k-mer distributions for nine species. To account for different genome lengths, we used values of k that are ⌈0.7 log4 ℓ⌉ or above, where ℓ is the genome length (see the Materials and methods section).
For multimodal distributions, the corresponding k-mer distributions have two or more maxima. We found that only a small and well characterized group of species exhibits this distribution (Additional data file 4). This group includes chicken (Gallus gallus), green anole lizard (Anolis carolinesis), and tree frog (Xenopus tropicalis), all in the tetrapod clade. Notice that the five bony fish, which are vertebrates but not tetrapods, are not part of this group. Usually one (or more) high maximum exists at a very low number of k-mers, and another, shallower one at a larger number.
Human genomic regions
The word distributions in specific functional categories of the human genome: exons, introns, 3' untranslated regions (UTRs), 5' UTRs, and gene promoter regions do not all share the same modality as the human chromosomal averages. The gene promoter regions were separately analyzed three times, corresponding to varying lengths of the promoter region (600, 1,000, and 5,000 nucleotide bases upstream of the 5' UTR of the gene). The most striking empirical observation is that the exons, the 5' UTRs, and the shorter lengths (600 and 1,000 bases) of gene promoter regions exhibit unimodal k-mer distributions, while the introns, 3' UTRs and the gene promoter regions of length 5,000 bases exhibit multimodal k-mer distributions (Additional data file 6).
In addition to the detailed analysis performed on the mammalian and non-mammalian genomes mentioned above, analysis was also performed on 910 bacterial and archeal genomes (all complete microbial genomes listed at  and available in the EMBL nucleotide database on 10 August 2009). We created a website  that contains the accession number, species name, effective genome length, value 0.7·log4(genome length), C+G content, ρCpG for that genome, the empirical frequencies of k-mers for various values of k, parameters for low order Markov models, and spectra plots for various values of k. In addition, the site contains the k-mer spectra for additional eukaryotic species, including tetrapods (mostly mammals), individual human chromosomes, and different human genomic regions.
Low-order Markov models
The unexpected finding of different modalities motivated us to ask what probabilistic models could describe these genomic distributions. Perhaps the simplest probabilistic models to describe strings like genomes and chromosomes are low-order Markov models , such as those commonly used for a genomic 'background' comparison when searching for regulatory elements (for example, [17, 18]). A zero-order Markov model simply describes the frequencies of each nucleotide (when we consider both strands of genomic DNA, the frequencies of A, T are equal, and so are those of C, G) and the underlying model is a Bernoulli sequence. A first-order Markov model describes the frequencies of individual nucleotides given the nucleotide immediately preceding it, a second-order Markov model describes the frequencies of individual nucleotides given the pair of nucleotides immediately preceding it, and so on.
While the copy/insert process is deliberately designed to be abstract - a pedagogical tool to show how heavy tails can arise from biologically inspired processes - it neglects nucleotide mutation, which is a major cause of genomic change. To account for mutation, we randomly replaced a fixed percentage of nucleotides after each insertion. Figure 3b shows that the heavy tail of the copy/insert process collapses when even a small amount of mutation is present, and the resulting spectra are barely distinguishable from that of a Bernoulli sequence.
CpG suppression and modality
Having established that Markov models can predict modalities, we can try and understand this phenomenon by looking at the di- and tri-nucleotide frequencies that are sufficient to define such models. The most outstanding property for the human genome is the low frequency of the CpG dimer, with P(G | C) = 0.048, and P(CpG) = 0.01. This well known phenomena, termed CpG suppression, occurs for all other mammals as well, but not for most other organisms.
These findings suggest the hypothesis that CpG suppression is what determines modality. It turns out this is incorrect, and CpG suppression by itself does not fully account for multimodal word frequencies. This is because there are several unicellular species that exhibit CpG suppression, yet have unimodal word frequencies. One such organism is the protozoa Tetrahymena thermophila, which has P(CpG) = 0.005 yet has a unimodal distribution. We note that the difference in modalities is not due to the difference in genome lengths. The genome of T. thermophila is 97 Mb long, comparable to the 108 Mb of human chromosome 12, yet all human chromosomes have multimodal spectra.
If the occurrences of C and G were independent, ρCG would be 1; genomes exhibiting suppression will have ρCG much less than 1. For human, ρCG = 0.24, for opossum it is 0.13, for lizard 0.3 and for frog 0.34. Low values are also attained for some protozoa (for example, Entamoeba histolytica, 0.30) and archea (for example, Methanococcus jannaschii, 0.32; Methanosphaera stadtmanae, 0.27). Values of ρCG exceeding 1 are infrequent but do exist; for example, 1.15 for the red flour beetle Tribolium castaneum, and 1.64 for the honey bee Apis mellifera.
It is of interest to examine the modalities of the word frequency distributions from an evolutionary perspective. All archeal and bacterial species exhibit unimodal k-mer distributions whereas there is a difference amongst the eukaryota. All mammals, including human, chimp, mouse, dog, cow, opossum (a non-placental mammal), and platypus (a monotreme, egg-laying mammal) exhibit multimodal k-mer distributions. Non-mammals that exhibit multimodal distributions are chicken, lizard, and frog. All these 'multimodal' species are tetrapods. The next sister clade in the tree of life contains the bony fish but none of the five sequenced bony fish (zebrafish, fugu, tetraodon, stickleback, and Japanese medaka) have multimodal k-mer spectra, in common with all other eukaryota except tetrapods. Following this observation, we predict that the genomes of additional tetrapods currently being sequenced, such as the alligator (Alligator mississippiensis), will also be multimodal.
We find the fact that low order Markov models are capable of producing multimodal spectra, and, furthermore, that their modalities match those of the actual genomes, to be rather unexpected. We note that the modalities of the Markov model distributions were essentially found experimentally (by simulation). It may be possible to find the modalities, and other global properties of the k-mer spectra, analytically (from just the model's parameters).
Which distribution better fits the genome sequences - low-order Markov models, or a DPLN distribution? The DPLN distribution is the ratio of two Pareto distributions times a log-normal  and has a single mode. Consequently, it cannot describe multimodal distributions such as the tri-modal distribution observed for human 11-mers (Figure 1). Csürös et al. use a mixture of four separate DPLN distributions, separating the k-mers into four groups by CpG content (0, 1, 2 and 3+ occurrences, respectively). Such mixtures can indeed fit a tri-modal distribution, but this raises the question of whether a simple generative model cannot capture these phenomena as well.
We remark that our own simulation of the 'copy/insert' process of Csürös et al. does not necessarily produce a heavy tailed k-mer spectrum and, in fact, the resulting k-mer spectrum can have a light 'exponential' tail similar to a Bernoulli sequence. The reason for the difference may lie in the length and k-mer content of the initial sequence used to prime the copy/insert process; too short a sequence and the few k-mers it does contain become abundant before the new k-mers can be generated by splicing. The lack of robustness to mutation of the heavy-tail produced by the copy/insert process suggests that this process does not explain how such tails arise in nature.
Lastly, we note that while the DPLN fits the spectrum well, noticeably so at the tails of the distribution, it is not a generative model: the shape of the distribution fits but it says little about the probability of observing a particular k-mer. Generative models like the Markov models predict which k-mers are likely to be rare or abundant, whereas the DPLN makes no such prediction.
Materials and methods
Genomic sequences were obtained from the standard sequence archives for each organism, listed in Additional data file 7, and masked for repeats. In several cases, the effective sequence length we report differs substantially from the accepted length - platypus being a noticeable outlier - due to the genome being incompletely sequenced or having large sections masked as highly repetitive. Nucleotides that are ambiguous or masked were considered to be breaks in the sequence and so any k-mer containing them was discarded and not enumerated. The k-mers for each genome where enumerated using simple programs written by the authors, from which the spectra and associated statistics are calculated.
When we use the same k for shorter genome length ℓ, we get a larger proportion of words at the low end of the curve. To account for this, smaller values of k are used for shorter genomes. Since the total number of possible k-mers is 4 k , if we take k = log4 ℓ then the mean number of occurrences in an ℓ long genome is 1. This indicates that, to normalize for the differing lengths of each genome, k should be chosen proportional to log4 ℓ. Specifically, we looked at values of k = ⌈0.7 log4ℓ⌉ (the smallest integer greater than 0.7 log4 ℓ). Simple Markov models of the type described can be fitted through counts of short k-mers. The transition frequencies, the proportion of times we see each nucleotide in the genome given the previous k nucleotides in the genome, are the maximum likelihood estimates for the transition probabilities of a k-order Markov model . To fit a zero-order model, only the nucleotide frequencies of each genome are needed; dinucleotide frequencies alone are needed for first-order models, trinucleotide frequencies for second-order, and so on. These frequencies were calculated using the programs written to determine the k-mer spectra of each genome. Again, bases that are ambiguous or masked were considered as breaks in the sequence and so not enumerated. Sequences were then simulated from these models using standard techniques. The copy/insert process was simulated using the authors' own programs. Following Csürös et al. , a (composition biased) Bernoulli sequence was used to generate an initial genome. Chunks of genome 33 bp long were chosen uniformly at random, copied, and inserted into a random place in the genome; this step was repeated until the genome reached a specified length. Unless otherwise stated, the initial genome was 5,000 bp and the final length was about 67 Mb (comparable to repeat masked human chromosome 5) to maintain consistency with . To investigate the effect of mutation, we adapted the copy/insert process so that every insertion was followed by mutation of a fixed proportion of the nucleotides, chosen uniformly at random and replaced with random nucleotides from a (composition biased) Bernoulli distribution.
Additional data files
The following additional data are available with the online version of this paper:
Spectra of individual human chromosomes (Additional data file 1); whole genome spectra of human and opossum (Additional data file 2); whole genome spectra of nine non-tetrapodal species (Additional data file 3); spectra of additional, non-mammalian tetrapod whole genomes (Additional data file 4); the distribution of 8-mers containing the dimer CpG in four eukaryotic species (Additional data file 5); spectra of specific human genomic regions (Additional data file 6); organisms whose genomes were used in this study, along with additional information about each (Additional data file 7).
double Pareto log-normal
The work was conceived and a large part performed while BC was on a sabbatical visit to the European Bioinformatics Institute, funded by Tel-Aviv University and the European Bioinformatics Institute. NG and TM are funded by the European Molecular Biology Laboratory. BC was not supported by the Israel Science Foundation, the German-Israeli Foundation, or the Israel Ministry of Science and Technology.
- Robin S, Schbath S: Numerical comparison of several approximations of the word count distribution in random sequences. J Comput Biol. 2001, 8: 349-359.PubMedView ArticleGoogle Scholar
- Reinert G, Schbath S, Waterman MS: Probabilistic and statistical properties of words: an overview. J Comput Biol. 2000, 7: 1-46.PubMedView ArticleGoogle Scholar
- Otaki JM, Ienaka S, Gotoh T, Yamamoto H: Availability of short amino acid sequences in proteins. Protein Sci. 2005, 14: 617-625.PubMedPubMed CentralView ArticleGoogle Scholar
- Tuller T, Chor B, Nelson N: Forbidden penta-peptides. Protein Sci. 2007, 16: 2251-2259.PubMedPubMed CentralView ArticleGoogle Scholar
- el antri S, Bittoun P, Mauffret O, Monnot M, Convert O, Lescot E, Fermandjian S: Effect of distortions in the phosphate backbone conformation of six related octanucleotide duplexes on CD and 31P NMR spectra. Biochemistry. 1993, 32: 7079-7088.PubMedView ArticleGoogle Scholar
- Fofanov Y, Luo Y, Katili C, Wang J, Belosludtsev Y, Powdrill T, Belapurkar C, Fofanov V, Li T, Chumakov S, Pettitt BM: How independent are the appearances of n-mers in different genomes?. Bioinformatics. 2004, 20: 2421-2428.PubMedView ArticleGoogle Scholar
- Hampikian G, Andersen T: Absent sequences: nullomers and primes. Pac Symp Biocomput. 2007, 355-366.Google Scholar
- Herold J, Kurtz S, Giegerich R: Efficient computation of absent words in genomic sequences. BMC Bioinformatics. 2008, 9: 167-PubMedPubMed CentralView ArticleGoogle Scholar
- Zhou F, Olman V, Xu Y: Barcodes for genomes and applications. BMC Bioinformatics. 2008, 9: 546-PubMedPubMed CentralView ArticleGoogle Scholar
- Mrázek J, Karlin S: Distinctive features of large complex virus genomes and proteomes. Proc Natl Acad Sci USA. 2007, 104: 5127-5132.PubMedPubMed CentralView ArticleGoogle Scholar
- Stacey K, Young R, Clark F, Sester D, Roberts T, Naik S, Sweet M, Hume DA: The molecular basis for the lack of immunostimulatory activity of verterbrate DNA. J Immunol. 2003, 170: 3614-3620.PubMedView ArticleGoogle Scholar
- Csürös M, Noé L, Kucherov G: Reconsidering the significance of genomic word frequencies. Trends Genet. 2007, 23: 543-546.PubMedView ArticleGoogle Scholar
- Reed W, Jorgensen M: The double Pareto-lognormal distribution - A new parametric model for size distributions. Communications Stat Theory Methods. 2004, 33: 1733-53.View ArticleGoogle Scholar
- NCBI: Complete Microbial Genomes. [http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi]
- k- mer analysis of multiple genomes. [http://www.ebi.ac.uk/goldman-srv/ChorEtAlSpectra/]
- Norris JR: Markov Chains. 1998, New York: Cambridge University PressGoogle Scholar
- Liu X, Brutlag DL, Liu JS: BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput. 2001, 127-138.Google Scholar
- Narasimhan C, LoCascio P, Uberbacher E: Background rareness-based iterative multiple sequence alignment algorithm for regulatory element detection. Bioinformatics. 2003, 19: 1952-1963.PubMedView ArticleGoogle Scholar
- Bernardi G, Olofsson B, Filipski J, Zerial M, Salinas J, Cuny G, Meunier-Rotival M, Rodier F: The mosaic genome of warm-blooded vertebrates. Science. 1985, 228: 953-958.PubMedView ArticleGoogle Scholar
- Karlin S, Mrázek J: Compositional differences within and between eukaryotic genomes. Proc Natl Acad Sci USA. 1997, 94: 10227-10232.PubMedPubMed CentralView ArticleGoogle Scholar
- Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 1998, New York: Cambridge University PressView ArticleGoogle Scholar
- Ensemble ftp site. [ftp://ftp.ensembl.org/]
- NCBI ftp site. [ftp://ftp.ncbi.nih.gov/]
- TIGR ftp site. [ftp://ftp.tigr.org/]
- UCSC: Sequence and Annotation Downloads. [http://hgdownload.cse.ucsc.edu/downloads.html]
- HGSC ftp site. [ftp://ftp.hgsc.bcm.tmc.edu/]
- Genoscope. [http://www.genoscope.cns.fr/spip/]
- Human Exons and Introns. [http://www.utoledo.edu/med/depts/bioinfo/database.html]
- Human 3' UTRs and 5' UTRs. [http://harlequin.jax.org/pacdb/data.php]
- EPD Sequence Download Page: Human Gene Promotors. [http://www.epd.isb-sib.ch/seq_download.html]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.