Sequence context affects the rate of short insertions and deletions in flies and primates
© Tanay and Siggia; licensee BioMed Central Ltd. 2008
Received: 14 June 2007
Accepted: 21 February 2008
Published: 21 February 2008
Insertions and deletions (indels) are an important evolutionary force, making the evolutionary process more efficient and flexible by copying and removing genomic fragments of various lengths instead of rediscovering them by point mutations. As a mutational process, indels are known to be more active in specific sequences (like micro-satellites) but not much is known about the more general and mechanistic effect of sequence context on the insertion and deletion susceptibility of genomic loci.
Here we analyze a large collection of high confidence short insertions and deletions in primates and flies, revealing extensive correlations between sequence context and indel rates and building principled models for predicting these rates from sequence. According to our results, the rate of insertion or deletion of specific lengths can vary by more than 100-fold, depending on the surrounding sequence. These mutational biases can strongly influence the composition of the genome and the rate at which particular sequences appear. We exemplify this by showing how degenerate loci in human exons are selected to reduce their frame shifting indel propensity.
Insertions and deletions are strongly affected by sequence context. Consequently, genomes must adapt to significant variation in the mutational input at indel-prone and indel-immune loci.
The evolution of genomes is driven by an influx of mutations that are subject to a stochastic process of neutral fixation and to multiple selective pressures that can change the neutral fixation dynamics. A good understanding of the evolutionary process requires characterization of both the mutational and fixation processes. This is particularly important in applications that try to reveal genomic loci that are evolving under selection by looking for slowly or rapidly evolving sequences. In such studies one has to make sure the mutational input at the genomic regions under study is not abnormally high or low [1–4], or else the inferred selection may be an artifact of the mutational dynamics and not a true indication of a functional constraint on the sequence. Changes are introduced into genomes through point mutations, insertions and deletions. The dynamics of each of these mechanisms may vary according to genomic context and the presence of various factors acting in trans.
Before the availability of numerous fully sequenced genomes, evolutionary studies focused on two extremes: replacements of entire genes and chromosome domains or point mutations. The former can be detected over long evolutionary times and their gain or loss has an immediate functional interpretation. Quantitative molecular evolution has developed around the occurrence of point mutations over limited regions of the genome, where it is feasible to compare intra-species variation with inter-species change, and infer fitness. Intermediate in scale are small (1-50 bp) insertions or deletions (indels). They are less numerous than single base substitutions, but can account for comparable base-pairs of change. For example, 3.2% of the base-pair changes between the fly species and 0.8% of the base-pair changes in the primate species analyzed here are affected by indels, compared with 1.8% and 1.5% affected by point mutations in flies and primates, respectively. Short indels are, therefore, a significant factor in the mutational input that feeds into the evolutionary process, a fact that underlines the importance of characterizing the mechanisms that induce or suppress their activity. Earlier work focused on human insertions and deletions at disease loci [5–8] or on indels detected between relatively distant species [9, 10] suggested that such events are correlated with specific sequence contexts. More recent works [11, 12] characterized extensive collections of indels in the human-chimp lineages, further motivating a comprehensive approach to the description of their sequence contexts.
In this work, we construct an evolutionary model for small indels in fly and primate genomes. We characterize these processes using mechanistic insights (tandem duplication for insertion, replication slippage for deletion). We discover significant sequence contexts that are susceptible to deletion or insertion. Using the new data, we are able to predict the rate of insertions and deletions at each genomic locus given the sequence surrounding it. We show that the indel rate at different loci can vary by more than two orders of magnitude, making specific loci susceptible to rapid insertion or deletion and other loci immune to it. Our results suggest that indels are introduced into the genome by a random process, but that the rate of this random process is highly dependent on, and to a great extent predictable from, the sequence. We demonstrate the significance of this indel rate variability by showing how synonymous codons in human exons are selected for low frame shifting indel potential.
A comprehensive compendium of short insertion/deletion events in primates and flies
Close, fully sequenced species grouped around one species with high quality annotation permit good single indel event statistics to be inferred. Three species are necessary: the two closest 'ingroups' are compared, while the third 'outgroup' defines the ancestor and thus distinguishes insertions from deletions. For the primates we compared human and chimpanzee with Rhesus macaque as the outgroup. Human indels inferred from these three species have recently been studied by Messer and Arndt and, using a non-primate outgroup, by Chen et al. For flies, we compared the Drosophila species D. simulans with D. sechellia, using D. melanogaster as the outgroup, refining an earlier study that compared D. melanogaster with D. yakuba using D. pseudoobscura as the outgroup.
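The outgroup logic described above can be illustrated with a short sketch. It assumes each gap is summarized by whether the segment is present in the two ingroups and in the outgroup, and it takes the outgroup state as ancestral; the function name `classify_gap` and its return labels are hypothetical and are not part of the original analysis.

```python
def classify_gap(ingroup_a: bool, ingroup_b: bool, outgroup: bool) -> str:
    """Polarize a gap by parsimony from presence/absence of the segment.

    Each argument is True if the segment is present in that species.
    The outgroup state is assumed to represent the ancestral state.
    """
    if ingroup_a == ingroup_b:
        # The ingroups agree, so any change lies on the outgroup lineage
        # (or there is no gap at all) and cannot be polarized.
        return "unresolved"
    # Exactly one ingroup differs from the outgroup.
    changed = "a" if ingroup_a != outgroup else "b"
    # Segment absent in the ancestor means it was gained; present means lost.
    kind = "insertion" if not outgroup else "deletion"
    return f"{kind} in ingroup {changed}"


# Segment present only in ingroup a: a gain on the lineage leading to a.
print(classify_gap(True, False, False))
# Segment missing only in ingroup b: a loss on the lineage leading to b.
print(classify_gap(True, False, True))
```

With human and chimpanzee as the ingroups and macaque as the outgroup, a segment present only in human would be labeled an insertion on the human lineage under this scheme.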
We further controlled for possible alignment problems in primates by performing direct searches for putative human and chimp inserts and their flanking 60 bp in the chimp and human genomes, respectively (see Materials and methods). We performed similar searches for the sequences flanking putative deletion points. All putative insertion or deletion sequences (including the flanking regions) that were aligned to the other genome without gaps were defined as questionable and removed from further analysis (Figure 1b). Comparison of two multiple alignment sets (based on the panTro1 and panTro2 assemblies) revealed that many of the questionable indels are inconsistent between versions of the alignment (data not shown). On the other hand, analysis of sequence quality data in the chimp assembly did not support a connection between dubious indels and low sequence quality.
Distribution of inserted and deleted sequence lengths
It is widely assumed (for example, by alignment algorithms) that gap lengths are distributed as geometric variables, but as shown in Figure 1c the length distributions of the events in our set are not geometric, and vary between insertions and deletions. In the primate lineages, the length distributions for both insertions and deletions may reflect two geometric regimes, making short and long events more probable than expected given a simple, single parameter geometric distribution. The shift between the two regimes occurs at length 8-10 bp for deletions and at length 10-12 for insertions. For flies, the deletion lengths are distributed as a simple geometric distribution. The insertions in flies are also distributed with two geometric regimes, one for short events (2-10 bp) and the other for longer ones (10 bp and more). The observed length distributions can indicate that multiple mechanisms are contributing to the insertion or deletion processes. Interestingly, the length distributions of questionable events are markedly different than those of the retained events, supporting our filtering scheme (Figure S1 in Additional data file 1). We were unable to detect specific families of sequences that significantly affect the length distribution in any of the tested lineages. Previous studies that argued for geometric gap length distributions [13–15] were based on smaller numbers of events than present in our set, and were, therefore, limited with respect to inference of the distribution of low frequency (that is, long) events (which are those that seem to break the simple geometric regime).
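One way to examine departures from a single geometric regime is to compare observed length counts with those expected under a maximum-likelihood geometric fit. The sketch below is illustrative only: the function names are hypothetical, the toy lengths are invented for the example, and nothing here reproduces the fitting procedure behind Figure 1c.

```python
from collections import Counter


def geometric_fit(lengths):
    """MLE of p for P(L = k) = (1 - p)**(k - 1) * p with k >= 1."""
    return len(lengths) / sum(lengths)


def observed_vs_expected(lengths, max_len):
    """Return p and rows of (length, observed count, expected count)."""
    p = geometric_fit(lengths)
    n = len(lengths)
    counts = Counter(lengths)
    rows = []
    for k in range(1, max_len + 1):
        expected = n * (1 - p) ** (k - 1) * p
        rows.append((k, counts.get(k, 0), expected))
    return p, rows


# Toy data (not from the compendium): an excess of long events shows up
# as observed counts above the geometric expectation in the tail.
toy_lengths = [1, 1, 1, 1, 2, 2, 3, 5]
p, rows = observed_vs_expected(toy_lengths, max_len=5)
for k, obs, exp in rows:
    print(k, obs, round(exp, 3))
```

An excess of observed over expected counts at both short and long lengths is the pattern that a mixture of two geometric regimes would be expected to produce.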
Most short insertions are accounted for by simple and complex tandem duplications, sometimes involving the reverse strand
The remaining putative insertions lack apparent sequence templates in their immediate neighborhood. To see if sequence templates for such insertions can be found in more remote chromosomal regions, we examined insertion events of length above 30 bp (for which genome-wide searches are specific) that lacked a tandem template. We detected only a few cases where a possible insert template was located outside the immediate locus neighborhood. In no case did we find a possible insert template in a different chromosome. We next computed the average chimp assembly (panTro2) quality around putative insertion events with and without a tandem template. We could not detect a significant difference in the sequence quality around the two groups (Figure S2 in Additional data file 1). In flies we found that many of the non-tandem long inserts were present (though mutated) in D. yakuba. This suggests another instance of incomplete lineage sorting, where structurally polymorphic loci persist throughout the speciation of the Drosophila species we analyzed. Other effects (for example, alignment artifacts) may also be contributing to the increase in non-tandem fly insertion fraction as a function of the insert length. To summarize, for our compendium, the dominant mechanism for insertions is tandem duplication, perhaps in several steps, and in a minority of cases the copy is from the complementary strand. A fraction of the gaps still cannot be rationalized using this model, and although there is some indirect evidence that suggests many of these are in fact alignment errors, other explanations are still possible.
Sequence preferences of short insertion events
Short deletions are marked by short matches at the deletion junction
Additional sequence preferences for short deletions
We next wished to test if higher level interactions between nucleotides are significantly associated with deletion or insertion events. We searched for such interactions systematically (see Materials and methods), identifying all pairs of positions relative to the deletion/insertion sequence in which the joint distribution of nucleotide pairs differs significantly from the genomic distribution of nucleotide pairs at the same distance. The strongest pairs other than the tandem effects we discussed above (compare Figure 4b; correlation between nucleotides spaced by the length of the deletion) indicated interaction between Cs and Gs in the positions adjacent to deletions of even lengths (Figure 5b). It is possible that some sort of G-C pairing at these positions contributes to the generation and stabilization of non-B-DNA structures and, therefore, enhances deletion propensity. In bacteria, palindromic sequence favors deletions since it stabilizes the fold-back configuration by internal base pairing.
The indel propensity model
Over 100-fold change in insertion/deletion probability given preferred sequence contexts
To test the predictive power of the models described above and to ensure the context model does not introduce overfitting, we performed standard cross validation. We divided the human genome into two (odd and even numbered chromosomes). We trained our models using the data from only odd-numbered chromosomes. We then computed the distribution of model scores for each type of event (length, lineage, type) for the background genomic sequence and around indel events occurring on even-numbered chromosomes (Figure 6b). Working with the highly similar primate genomes, we assumed the ancestral sequence is identical to the human genome (that is, we ignored point mutations), with the exception that inserted sequence has to be removed, and deleted sequence has to be retained. Cross validation confirmed the robustness of the indel propensity score, showing increasing indel probability for higher log odds values. The results indicate that the relative rate of insertions and deletions of all tested lengths varies by as much as a factor of 100 as a function of the model score. Loci with very high indel propensity have a very high probability for insertion or deletion (of a specific length) and behave much like micro-satellites. Loci with very low indel propensity may be almost indel free. Between these two extremes we observe a whole array of weaker phenomena, consisting of diverse sequence contexts with variable indel propensity.
Indel constraints in coding regions
The existence of a dense web of fully sequenced metazoan genomes around those of human and fly permits a thorough study of short indel events. Although their rate is below that of base substitution events, they account for a comparable number of base-pairs of sequence change and, thus, are potentially at least as important for evolving new functionality as are single base changes. The indel rate is also very context dependent, as is most immediately evident from our finding that half of the indels present between our ingroup species did not map exactly onto the outgroup. We believe this reflects a large number of multiple events that have occurred even though the primary events are rare (occurring on < 0.5% of the genome) for the genomes we compare.
New inserted sequence can usually be rationalized as a copy of adjacent sequence, but the copying process is sometimes complex or involves the complementary strand in primates. We found no systematic evidence for copying from far away (for example, 10 kb) in the size range of 50 bp or less. In the fly a possible explanation for many of the putative insertions that were not copied is incomplete lineage sorting, but this was not observed in primates. We note that there is a tradeoff to be made when choosing the outgroup: an outgroup that is too distant allows for multiple events and overlaid point mutations, whereas with one that is too close the phylogeny of the locus may not conform to that of the species, as is the case for some of the fly loci.
Deletions have fewer sequence constraints, namely only the propensity for the match of a few base-pairs flanking the junction (Figure 4). The relative rates of insertions to deletions differ between flies and primates for lengths over 10 bp, perhaps because most of the fly sequence is under some functional constraint [10, 23], whereas the primate genomes are much less constrained. The length spectra of both insertions and deletions have two power law regimes, which prima facie contradicts the scoring assumption of standard alignment algorithms.
One of the emerging themes from recent analyses of divergence in closely related genomes is that the mutational process is highly dependent on sequence context. Adjacent nucleotides are known to affect point mutation rates, but understanding of more complex sequence contexts and their possible roles in changing the mutational input and/or selective pressure is only now beginning to emerge (for example, as in the case of CpG dinucleotides [25, 26]). We have shown that a model using only the distribution of nucleotides suitably defined with respect to the indel predicts over 100-fold variation in the rates of the appropriate indel event. Interestingly, the sequence contexts of insertions and deletions of different lengths share only some of their features (for example, AT asymmetry) and differ in others (G-C coupling in deletions of even length), suggesting specific interaction with the replication machinery.
For molecular evolution, our indel propensity model can be used as a refined neutral standard in applications that search for categories of sequence that are evolving slowly due to putative selection. In particular, we have demonstrated how coding regions bias their codon usage to suppress frame shifting mutations. Indels may contribute more to the evolution of regulatory sequences than their frequency would suggest, because their size is comparable to a protein binding site. It would be interesting to see whether our model can explain some of the drift in binding sites that have been mapped on a genome scale [27–29].
Materials and methods
Alignments, indel detection and filtering
Primate sequences were downloaded from the remarkably useful UCSC genome browser site, using the hg18, panTro2 and rheMac2 assemblies. Multiple alignments were generated by extracting the primate sequences from the vertebrate maf files (28 vertebrate species, 2007 version) and concatenating contiguous fragments. To annotate known repeat sequences, we used the RepeatMasker data from the UCSC genome browser site.
We used release 4.3 of the D. melanogaster genome from the Fly Genome Database. The D. simulans sequence was the 'mosaic' assembly from the Genome Sequencing Center, which was generated using their genome assembler to combine the sequences from six D. simulans strains (this assembly is also used on the UCSC browser). The D. sechellia sequence is a contig library from the same source. Three way alignments were done with the TBA codes from the Miller Lab website, and the parameters T = 1, C = 2, and L = 10,000 were modified from their defaults.
We observed for both our fly ingroup species that our alignments predicted that about 6% of all exons had a 1 bp indel when compared with the D. melanogaster annotation. This is an unreasonably high rate of frame shifts, so we downloaded the alignments of the individual D. simulans strains against D. melanogaster from the DPGP website. The six strains together with D. melanogaster were then multi-aligned using TBA and a new consensus was derived using the majority pattern among the D. simulans strains if one existed; otherwise, the strain matching D. melanogaster if one existed; otherwise, an array of 'N's.
With the new consensus the number of length 1 indels in exons fell to 0.4%, with the majority of these in regions of low coverage or homopolymer repeats. However, we continued to use the 'mosaic' alignment since it was no worse than the D. sechellia data, which we had no means to correct, and also the errors for indels larger than length 1 were tolerable (12% of length 2 indels and 0.5% of length 4 and larger could be spurious, estimated from the number of frame shifting indels observed (and assuming all of these are spurious) and multiplying by the fraction of coding sequence in the genome and the total number of indels we found).
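The three-tier consensus rule used for the D. simulans strains (majority pattern, else the pattern matching D. melanogaster, else 'N's) can be sketched as follows. This is a minimal illustration under stated assumptions: "majority" is taken to mean strictly more than half of the strains, each strain contributes one pattern string per locus, and the function name `consensus_pattern` is hypothetical.

```python
from collections import Counter


def consensus_pattern(strain_patterns, reference_pattern):
    """Pick a consensus pattern for one locus.

    strain_patterns: non-empty list of alignment patterns, one per strain.
    reference_pattern: the pattern observed in the reference species.
    """
    counts = Counter(strain_patterns)
    pattern, n = counts.most_common(1)[0]
    # Assumption: a majority means strictly more than half of the strains.
    if 2 * n > len(strain_patterns):
        return pattern
    # No majority: fall back to a strain that matches the reference.
    if reference_pattern in counts:
        return reference_pattern
    # Nothing usable: mask the locus.
    return "N" * len(reference_pattern)


# Four of six strains agree, so their pattern is used.
print(consensus_pattern(["AC"] * 4 + ["A-"] * 2, "A-"))
# A three-way tie with no strain matching the reference is masked.
print(consensus_pattern(["AC", "AC", "AG", "AG", "AT", "AT"], "AA"))
```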
To construct a reliable compendium of short insertions and deletions, we identified all gaps in the multiple alignments that had clearly defined boundaries in all three species (the outgroup aligned exactly with one of the ingroups) and were flanked by at least 20 bps of gapless matches. We annotated each gap as an insertion or deletion based on the known triplet phylogeny of primates and flies. We ignored cases that occurred in the outgroup lineage (and therefore could not be resolved as insertion or deletions). To control for arbitrary gap positions in the multiple alignments, we computed for each gap the set of all possible optimally scoring gap positions by sliding the gap in both the 5' and 3' directions, and computing the number of resulting mismatches at each position. All gap positions with the minimal number of mismatches were considered as candidates, and were assigned with a weight of 1/(Number of optimal gap positions). We used the weights when computing statistics for the insertion/deletion ensembles, unless otherwise noted. To prevent a bias from large families of short repeats, we filtered out all gaps that were within 40 bp of an annotated repeat in all the analyses reported, except for the data in Figure 1a. Including the repeats in the analysis did not affect the results significantly, nor did imposing a minimal percent identity on the ungapped flanking sequence.
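The gap-sliding step can be illustrated with a short sketch. It assumes a pairwise view of a single gap: `full` is the sequence containing the segment and `short` is the sequence lacking it, both trimmed to the same local window. Rather than sliding from the alignment's reported position, this simplified version scores every possible gap start within the window; the function name and the example sequences are hypothetical.

```python
def optimal_gap_positions(full: str, short: str):
    """Return all minimal-mismatch gap starts and the per-position weight.

    The gap has length len(full) - len(short). For a gap starting at
    `start`, full[:start] is aligned to short[:start] and
    full[start + gap_len:] is aligned to short[start:].
    """
    gap_len = len(full) - len(short)
    if gap_len <= 0:
        raise ValueError("full must be longer than short")
    scores = []
    for start in range(len(short) + 1):
        left = sum(a != b for a, b in zip(full[:start], short[:start]))
        right = sum(a != b for a, b in zip(full[start + gap_len:], short[start:]))
        scores.append(left + right)
    best = min(scores)
    positions = [i for i, s in enumerate(scores) if s == best]
    # Each equally good placement gets weight 1 / (number of placements).
    return positions, 1.0 / len(positions)


# A 2 bp gap inside an AC repeat can be placed at several equivalent positions.
positions, weight = optimal_gap_positions("GGACACACTT", "GGACACTT")
print(positions, weight)
```

In this example the gap can sit anywhere inside the repeat without introducing mismatches, so each of the candidate positions contributes a fractional count to the ensemble statistics.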
To minimize alignment errors for the primate data, we further filtered events using direct genomic searches. For each insertion event we used the sequence of the insert flanked by 60 bp on each side. For deletions we used the flanking 60 bp around the deletion point. We then used J Kent's blat program (with standard parameters, see UCSC website) to search for these sequences in the orthologous chromosome (searching the chimp genome for human indels and the human genome for chimp indels). We defined an event as questionable whenever blat returned a hit that spanned the insertion or deletion junction with 20 bp of flanking gapless match. The size distributions of retained and questionable events are shown in Figure S1 in Additional data file 1.
To characterize the sequence quality around different classes of indels, we used data from the chimp panTro2 assembly, extracted through the UCSC database. To generate Figure S2 in Additional data file 1, we computed for each human insertion the minimal sequence quality (scaled from 0 to 97) in the aligned chimp sequence that surrounded the putative insertion point (20 bp on each side). Note that we did not use the sequence quality in the filtering process.
Looking for sources of inserted sequences
To identify possible sequence templates for short inserts, we applied several layers of analysis. First, we directly tested the number of mismatches between the insert and its immediately flanking sequence (on both sides). In cases where more than one alignment configuration was possible for the gap, we tested all possible configurations. Inserts that had at most one mismatch with the sequence on either side were considered as perfect tandems (this was feasible for indels of length more than 5). At the second level, we searched for the longest perfect match between a substring of the insert and all sequence within 'l + 10' nucleotides on either side of the insert (of length l). We compared this number to the longest match in the 1.1 kb upstream and downstream of the insert (excluding the 100 bp immediately flanking it on each side). All cases where the longest match near the sequence was longer than that in the larger surroundings were assumed to be complex tandem duplications, as the expected random fraction of such cases is 1%. We applied the same technique to classify insertions as matching the reverse strand.
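The two layers of template search described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function names are hypothetical, the insert is assumed to be non-empty, and the near and far contexts are assumed to have been extracted beforehand (for example, the 'l + 10' window and the more distant 1.1 kb window).

```python
def mismatches(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_simple_tandem(left_flank, insert, right_flank, max_mismatches=1):
    """True if the insert copies the adjacent sequence on either side."""
    n = len(insert)
    candidates = []
    if len(left_flank) >= n:
        candidates.append(left_flank[-n:])   # sequence just upstream
    if len(right_flank) >= n:
        candidates.append(right_flank[:n])   # sequence just downstream
    return any(mismatches(insert, c) <= max_mismatches for c in candidates)


def longest_shared_substring(insert: str, context: str) -> int:
    """Length of the longest perfect match between insert and context."""
    best = 0
    prev = [0] * (len(context) + 1)
    for a in insert:
        cur = [0] * (len(context) + 1)
        for j, b in enumerate(context, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def is_complex_tandem(insert, near_context, far_context):
    """True if the best nearby match is longer than the best distant match."""
    return (longest_shared_substring(insert, near_context)
            > longest_shared_substring(insert, far_context))


print(is_simple_tandem("TTGACGTA", "GACGTA", "CCCTTT"))
print(longest_shared_substring("GACGTA", "TTACGTCC"))
```

Searching the reverse complement of the contexts with the same functions would give the analogous test for templates on the complementary strand.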
Detecting nucleotide and nucleotide pair preferences
To compute the sequence context preferences of insertions and deletions, we grouped similar events by their type (length, insertion/deletion, and lineage). We also computed the GC content in a window of 400 bp around the event and grouped together events with similar GC content (using bins of 10%). We constructed the nucleotide profile for each event group by simple counting, and transformed the frequency at each position to log odds by comparing it to the background probability in sequences with similar GC content. To detect statistically significant correlations between pairs of positions relative to the insertion or deletion junction, we constructed the joint distributions of nucleotides for each pair of positions in the range -20 to +20 bp relative to insertions or deletions of a given type. We also computed the background joint distribution of nucleotide pairs at each distance. We used chi-square statistics to test if the two contingency tables differ, and a hyper-geometric test to check if particular pairs of nucleotides are correlated in a positive or negative way.
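The counting and log-odds transformation for a single event group can be sketched as follows. The sketch makes several assumptions that are not specified in the text: sequences contain only A, C, G and T and are aligned to the same length, a pseudocount is added to avoid taking the logarithm of zero, and base-2 logarithms are used. The function name `log_odds_profile` is hypothetical.

```python
import math
from collections import Counter


def log_odds_profile(sequences, background, pseudocount=1.0):
    """Per-position log2 odds of each nucleotide versus a background.

    sequences: equal-length context sequences for one event group.
    background: dict of nucleotide -> background probability, for
                sequences in the same GC-content bin.
    pseudocount: assumed smoothing constant (not taken from the paper).
    """
    length = len(sequences[0])
    total = len(sequences) + 4 * pseudocount
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        profile.append({
            nt: math.log2((counts[nt] + pseudocount) / total / background[nt])
            for nt in "ACGT"
        })
    return profile


# Toy contexts (invented): position 0 is enriched for A, position 1 is uniform.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
profile = log_odds_profile(["AC", "AG", "AT", "AA"], uniform)
print(profile[0]["A"], profile[0]["C"], profile[1]["A"])
```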
The indel propensity model
To model the sequence around indels of specific type (length, lineage, insertion/deletion), we constructed a positional Markov model that determines the probability of observing a nucleotide X at position i relative to the indel point by looking up a conditional probability table that is parameterized by the position itself and by the nucleotide at position 'i - 1' (for inserts) and by both the nucleotides at positions 'i - 1' and 'i - l' for deletions (l being the length of the event). The probability tables were inferred directly from the sequences in our compendium. To score a genomic locus for a certain event type, we computed the model likelihood and compared it to the likelihood of a background model that was constructed similarly, but trained using background sequences.
To test possible preferences of human exons against high indel propensity, we generated a set of randomized exons by shuffling synonymous codons while preserving GC content. This was done by first determining for each codon its regional GC content and then selecting a synonymous codon at random such that the expected overall GC content distribution at the synonymous sites before and after the randomization is similar. We then computed the fraction of exon loci with indel propensity score larger than 1, in both the real and randomized sets, and determined the significance of the detected differences using binomial statistics.
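A positional first-order Markov model of this kind, together with the log-likelihood ratio against a background model, can be sketched as follows. The sketch covers only the insertion form of the model (conditioning on the nucleotide at 'i - 1'); the deletion form would add the nucleotide at 'i - l' to the conditioning key. The pseudocount, the uniform fallback for unseen contexts, the natural logarithm, and all function names are assumptions made for the illustration.

```python
import math

ALPHABET = "ACGT"


def train(sequences, pseudocount=1.0):
    """Count tables indexed by position, previous nucleotide, nucleotide."""
    length = len(sequences[0])
    counts = [{} for _ in range(length)]
    for seq in sequences:
        for i, nt in enumerate(seq):
            prev = seq[i - 1] if i > 0 else "^"  # "^" marks the window start
            row = counts[i].setdefault(prev, dict.fromkeys(ALPHABET, pseudocount))
            row[nt] += 1
    return counts


def log_likelihood(model, seq):
    """Log probability of seq under a positional Markov model."""
    total = 0.0
    for i, nt in enumerate(seq):
        prev = seq[i - 1] if i > 0 else "^"
        row = model[i].get(prev)
        if row is None:
            # Context never seen in training: assume a uniform distribution.
            total += math.log(1.0 / len(ALPHABET))
        else:
            total += math.log(row[nt] / sum(row.values()))
    return total


def propensity_score(event_model, background_model, seq):
    """Log-likelihood ratio of the event model to the background model."""
    return log_likelihood(event_model, seq) - log_likelihood(background_model, seq)


# Toy training sets (invented): event contexts are enriched for A runs.
event_model = train(["AAAA"] * 8 + ["ACGT"] * 2)
background_model = train(["ACGT", "CGTA", "GTAC", "TACG"])
print(propensity_score(event_model, background_model, "AAAA"))
print(propensity_score(event_model, background_model, "TTTT"))
```

Positive scores mark windows that resemble the training contexts of the event more than the background, and negative scores mark the reverse.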
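A simplified stand-in for this randomization is sketched below. Instead of drawing synonymous codons to match a regional GC distribution, as described above, it permutes the codons that encode the same amino acid within one coding sequence, which preserves the protein, the codon usage and the GC content exactly. This substitution, the function names and the example sequence are all assumptions made for illustration; input is expected to be an in-frame, uppercase sequence of A, C, G and T.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def split_codons(cds: str):
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds: str) -> str:
    return "".join(CODON_TABLE[c] for c in split_codons(cds))


def shuffle_synonymous(cds: str, rng: random.Random) -> str:
    """Permute codons among positions that encode the same amino acid."""
    codons = split_codons(cds)
    by_aa = {}
    for idx, codon in enumerate(codons):
        by_aa.setdefault(CODON_TABLE[codon], []).append(idx)
    shuffled = list(codons)
    for indices in by_aa.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return "".join(shuffled)


rng = random.Random(0)
original = "CTGTTAGCCGCTCTTGCA"  # encodes L L A A L A
randomized = shuffle_synonymous(original, rng)
print(translate(original), translate(randomized))
```

Scoring the real and randomized sequences with an indel propensity model and comparing the fraction of high-scoring loci would then follow the comparison described above.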
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a pdf file including figures S1-S5.
EDS was supported in part by the National Science Foundation under grant number DMR 0517138. AT holds an Alon fellowship.
- Lunter G, Ponting CP, Hein J: Genome-wide identification of human functional DNA using a neutral indel model. PLoS Comput Biol. 2006, 2: e5. 10.1371/journal.pcbi.0020005.
- Clark AG: The search for meaning in noncoding DNA. Genome Res. 2001, 11: 1319-1320. 10.1101/gr.201601.
- Gaffney DJ, Keightley PD: The scale of mutational variation in the murid genome. Genome Res. 2005, 15: 1086-1094. 10.1101/gr.3895005.
- Lunter G: Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics. 2007, 23: i289-296. 10.1093/bioinformatics/btm185.
- Chuzhanova NA, Anassis EJ, Ball EV, Krawczak M, Cooper DN: Meta-analysis of indels causing human genetic disease: mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum Mutat. 2003, 21: 28-44. 10.1002/humu.10146.
- Kondrashov AS, Rogozin IB: Context of deletions and insertions in human coding sequences. Hum Mutat. 2004, 23: 177-185. 10.1002/humu.10312.
- Ball EV, Stenson PD, Abeysinghe SS, Krawczak M, Cooper DN, Chuzhanova NA: Microdeletions and microinsertions causing human genetic disease: common mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum Mutat. 2005, 26: 205-213. 10.1002/humu.20212.
- Chen JM, Chuzhanova N, Stenson PD, Ferec C, Cooper DN: Complex gene rearrangements caused by serial replication slippage. Hum Mutat. 2005, 26: 125-134. 10.1002/humu.20202.
- Thomas EE, Srebro N, Sebat J, Navin N, Healy J, Mishra B, Wigler M: Distribution of short paired duplications in mammalian genomes. Proc Natl Acad Sci USA. 2004, 101: 10349-10354. 10.1073/pnas.0403727101.
- Sinha S, Siggia ED: Sequence turnover and tandem repeats in cis-regulatory modules in Drosophila. Mol Biol Evol. 2005, 22: 874-885. 10.1093/molbev/msi090.
- Chen FC, Chen CJ, Li WH, Chuang TJ: Human-specific insertions and deletions inferred from mammalian genome sequences. Genome Res. 2007, 17: 16-22. 10.1101/gr.5429606.
- Messer PW, Arndt PF: The majority of recent short DNA insertions in the human genome are tandem duplications. Mol Biol Evol. 2007, 24: 1190-1197. 10.1093/molbev/msm035.
- Chang MS, Benner SA: Empirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignments. J Mol Biol. 2004, 341: 617-631. 10.1016/j.jmb.2004.05.045.
- Gu X, Li WH: The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J Mol Evol. 1995, 40: 464-473. 10.1007/BF00164032.
- Zhang Z, Gerstein M: Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic Acids Res. 2003, 31: 5338-5348. 10.1093/nar/gkg745.
- Maki H: Origins of spontaneous mutations: specificity and directionality of base-substitution, frameshift, and sequence-substitution mutageneses. Annu Rev Genet. 2002, 36: 279-303. 10.1146/annurev.genet.36.042602.094806.
- Lovett ST: Encoded errors: mutations and rearrangements mediated by misalignment at repetitive DNA sequences. Mol Microbiol. 2004, 52: 1243-1253. 10.1111/j.1365-2958.2004.04076.x.
- Sinden RR, Hashem VI, Rosche WA: DNA-directed mutations. Leading and lagging strand specificity. Ann N Y Acad Sci. 1999, 870: 173-189. 10.1111/j.1749-6632.1999.tb08878.x.
- Pollard DA, Iyer VN, Moses AM, Eisen MB: Widespread discordance of gene trees with species tree in Drosophila: evidence for incomplete lineage sorting. PLoS Genet. 2006, 2: e173. 10.1371/journal.pgen.0020173.
- Bacolla A, Wells RD: Non-B DNA conformations, genomic rearrangements, and human disease. J Biol Chem. 2004, 279: 47411-47414. 10.1074/jbc.R400028200.
- dos Reis M, Savva R, Wernisch L: Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res. 2004, 32: 5036-5044. 10.1093/nar/gkh834.
- Parmley JL, Hurst LD: Exonic splicing regulatory elements skew synonymous codon usage near intron-exon boundaries in mammals. Mol Biol Evol. 2007, 24: 1600-1603. 10.1093/molbev/msm104.
- Andolfatto P: Adaptive evolution of non-coding DNA in Drosophila. Nature. 2005, 437: 1149-1152. 10.1038/nature04107.
- Siepel A, Haussler D: Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004, 21: 468-488. 10.1093/molbev/msh039.
- Tanay A, O'Donnell AH, Damelin M, Bestor TH: Hyperconserved CpG domains underlie Polycomb-binding sites. Proc Natl Acad Sci USA. 2007, 104: 5521-5526. 10.1073/pnas.0609746104.
- Taylor J, Tyekucheva S, Zody M, Chiaromonte F, Makova KD: Strong and weak male mutation bias at different sites in the primate genomes: insights from the human-chimpanzee comparison. Mol Biol Evol. 2006, 23: 565-573. 10.1093/molbev/msj060.
- Doniger SW, Fay JC: Frequent gain and loss of functional transcription factor binding sites. PLoS Comput Biol. 2007, 3: e99. 10.1371/journal.pcbi.0030099.
- Moses AM, Pollard DA, Nix DA, Iyer VN, Li XY, Biggin MD, Eisen MB: Large-scale turnover of functional transcription factor binding sites in Drosophila. PLoS Comput Biol. 2006, 2: e130. 10.1371/journal.pcbi.0020130.
- Odom DT, Dowell RD, Jacobsen ES, Gordon W, Danford TW, Macisaac KD, Rolfe PA, Conboy CM, Gifford DK, Fraenkel E: Tissue-specific transcriptional regulation has diverged significantly between human and mouse. Nat Genet. 2007, 39: 730-732. 10.1038/ng2047.
- Kuhn RM, Karolchik D, Zweig AS, Trumbower H, Thomas DJ, Thakkapallayil A, Sugnet CW, Stanke M, Smith KE, Siepel A, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pedersen JS, Hsu F, Hinrichs AS, Harte RA, Diekhans M, Clawson H, Bejerano G, Barber GP, Baertsch R, Haussler D, Kent WJ: The UCSC genome browser database: update 2007. Nucleic Acids Res. 2007, 35: D668-673. 10.1093/nar/gkl928.
- Fly Genome Database. [http://www.fruitfly.org]
- Genome Sequencing Center: D. simulans Mosaic Assembly. [ftp://genome.wustl.edu/pub/]
- Miller Lab Website. [http://www.bx.psu.edu/miller_lab/]
- DPGP Website. [http://www.dpgp.org/syntenic_assembly/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.