- Open Access
Identifying repeat domains in large genomes
© Zhi et al.; licensee BioMed Central Ltd. 2006
- Received: 13 June 2005
- Accepted: 5 January 2006
- Published: 31 January 2006
We present a graph-based method for the analysis of repeat families in a repeat library. We build a repeat domain graph that decomposes a repeat library into repeat domains, short subsequences shared by multiple repeat families, and reveals the mosaic structure of repeat families. Our method recovers documented mosaic repeat structures and suggests additional putative ones. Our method is useful for elucidating the evolutionary history of repeats and annotating de novo generated repeat libraries.
- Additional Data File
- Segmental Duplication
- Pairwise Alignment
- Repeat Domain
- Biological Origin
Repetitive elements form a major fraction of eukaryotic genomes. Though once dismissed as mere junk DNA, they are now recognized as "drivers of genome evolution"  whose evolutionary role can be "symbiotic (rather than parasitic)" . Examples of potentially beneficial evolutionary events in which repetitive elements have been implicated include genome rearrangements , gene-rich segmental duplications , random drift to new biological function [4, 5] and increased rate of evolution during times of stress [6, 7]. For these and other reasons, the study of repeat elements and their evolution is now emerging as a key area in evolutionary biology.
Individual repeat elements can be grouped into repeat families, each defined by the consensus sequence of its diverged copies. Repeat family libraries, such as Repbase Update libraries [8, 9] and RepeatMasker libraries , contain consensus sequences of known repeat families. Repeat families often contain shared subsequences, which we call repeat domains. Repeat domains can occur more than once within the same repeat family; for example, the ubiquitous human Alu family is dimeric . There are a number of cases of repeat families whose repeat domains are known to have different biological origins, for example, from repeat families with different modes of replication or from distinct retrovirus families. These repeat families and the domains they share are worthy of special attention, since they are assumed to result from interesting evolutionary events. We define a repeat family to be a composite repeat if it contains at least two repeat domains of different biological origin. Of course, discerning the biological origin of a repeat domain is a challenging endeavor. Nevertheless, human Repbase Update documents more than 10 repeat families as composite repeats, including the RICKSHA and Harlequin families. Many other composite repeats contain fragments from different retroviruses. Since composite repeats that contain only fragments of retroviral origin are probably products of retroviral recombinations, these are documented in Repbase Update as retroviral recombinations (see  for a review). Composite repeats are likely more than a mere curiosity: one composite repeat, SVA, is the third most active retrotransposon since the human/chimpanzee speciation . An additional example is found in the eel where a composite SINE repeat family borrowed a repeat domain from a different LINE family; this borrowed domain was experimentally shown to greatly enhance the retrotransposition rate of the SINE family .
Shared repeat domains yield important insights into repeat evolution, in the same way that multidomain protein organization yields insights into protein evolution [15, 16]. However, while the study of protein domains is a well-established research area, the study of repeat domains is still in its infancy. Indeed, RepeatGluer  is the only existing algorithm for repeat domain analysis. While RepeatGluer shows promise as a tool for repeat domain analysis, it is computationally intractable for large genomes. For large genomes, we propose that instead of identifying repeat domains de novo from genomic sequence, we identify repeat domains by analyzing repeat family libraries that are obtained via other means.
Finding all repeat domains is a difficult problem, since repeat family consensus sequences can share subsequences with themselves (for example, Alu repeats) or with other repeat families (for example, the composite repeat families described above). Thus, we argue that the ideal method for comparing repeat families and identifying repeat domains should allow for both self-similarities and shared similarities that appear in different orders in different sequences. Such similarities are difficult to capture in traditional multiple alignments that either attempt to align sequences over their entire length (for example, global alignment) or show small conserved regions of similarity (for example, local alignment) with no information about the location of these regions in the original sequence. Recently, several software programs including the Partial Order Alignment (POA) program , Threaded Blockset Aligner (TBA) , and the A-Bruijn Aligner (ABA)  were developed to address these shortcomings. ABA seems particularly well-suited both to the alignment of repeat family consensus sequences and to the decomposition of a library of such consensus sequences into repeat domains since ABA was designed for the alignment of sequences with both repeated and shuffled segments. However, we found that ABA could not automatically generate a repeat domain graph from a repeat library because repeat libraries frequently contain a large number of diverged sequences, including palindromic sequences. Below we describe how to overcome these difficulties.
In addition, since a repeat library typically contains several hundred to several thousand sequences, and the annotation of repeats is typically incomplete, the analysis of a repeat domain graph is a nontrivial task. Below we show several examples illustrating how particular queries in the repeat domain graph can provide powerful systematic analysis of repeat families in a repeat library, how topology of the repeat domain graph can help in elucidating evolutionary history, and how to deal with contaminants, which are common in de novo generated repeat libraries.
Applying the A-Bruijn graph to repeat library analysis: methodology and new algorithms
We represent an alignment of sequences in a repeat library as a directed graph called the repeat domain graph. The repeat domain graph of n sequences contains 2n source vertices and 2n sink vertices. A directed path in the graph from a source to sink vertex represents a sequence or the reverse complement of a sequence in the repeat library. The repeat domain graph typically contains several connected components. Each component corresponds to groups of repeat families with shared repeat domains, and can be analyzed individually. Edges in the repeat domain graph with multiplicity greater than one represent repeat domains that are shared between different repeat families, while single-multiplicity edges correspond to domains unique to a single family.
We construct repeat domain graphs using the framework of A-Bruijn graphs, which were first introduced and applied to the problems of DNA fragment assembly and de novo repeat classification in , and later extended to the alignment of protein sequences and genomic DNA sequences . The A-Bruijn graph is a general framework for handling sequences with repeated or shuffled domains and is constructed from a set of sequences and a set of pairwise alignments between these sequences. In practice, the A-Bruijn graph of a set of pairwise alignments often contains numerous short cycles, due to inconsistencies among the input alignments. These short cycles obfuscate the identification of the shared domains among these sequences and thus a series of graph heuristics is used for removing short cycles due to inconsistent alignments while retaining longer cycles due to shared domains. We discovered that these approaches were not sufficient to handle two complications that arise in repeat library analysis: namely, the need to align a large number of diverged sequences and the existence of palindromic sequences. The shortcomings of the method were not anticipated or addressed in earlier work because these issues did not arise in the problems addressed there: namely fragment assembly , where the input is a large number of very similar (greater than 95%) DNA sequences (reads), and the problems of multiple sequence alignment of a relatively small number of protein sequences or genomic DNA sequences . We developed new algorithms for the construction of the repeat domain graph that are modifications of the methods used to construct and simplify the A-Bruijn graph. Our new algorithms show significant improvement over the existing methods in the handling of inconsistent pairwise alignments and palindromic sequences, both of which are common in repeat libraries constructed de novo from genome sequences. These new algorithms are described and compared to the existing methods in the Materials and methods section.
Analysis of repeat domains in human Repbase
Fifteen repeat families containing domains shared with repeat families of different biological origin
Number of domains
Length of domains
In addition to repeat domains, the repeat domain graph also reveals known composite repeats in Repbase.
We remark that the subtle structure of shared repeat domains in this example are not clearly revealed by traditional row-column multiple alignment programs such as CLUSTALW , which align all sequences over their entire lengths. The repeat domain graph removes the restriction of aligning sequences over their entire length, and strikingly reveals the mosaic structure of these repeat families. We further remark that the correspondence between edges and repeat domains is only approximate. Determining the exact boundaries of repeat domains is a challenging problem, similar to the difficulty in defining the boundaries of protein domains. The ambiguity in boundary definition is manifested by complicated structures of short edges in the repeat domain graph. We ameliorate this ambiguity by contracting very short edges (length less than 20).
Discovering new composite repeats: repeats in C. elegans
We built the repeat domain graph of C. elegans repeat family library generated by Stein et al.  with the RECON program . This library contains 377 sequences of total length 251,168 bp. The resulting repeat domain graph contains 2,725 edges that organized into 464 connected components. Of these, 300 components represent 150 repeat families (and their reverse complements) that have neither self-similarities nor similarities with other repeat families. Another 109 connected components represent self-similarities among 86 repeat families that share no similarities with any other families. The remaining 55 connected components reveal the similarities and complex evolutionary relationship between the remaining 142 repeat families.
The complex evolutionary history of these repeat families is reflected in the mosaic structure of repeat domains. For example, repeat family E6 (path from source E6 to sink E6 in Figure 4) is decomposed into five repeat domains. Among them, the repeat domain with length 41 is shared with three other repeat families: E1, E2, and E4.
The edges with multiplicity greater than one form two paths - plus the two reflections of the paths resulting from the symmetry of repeat domain graph (red in Figure 4). We refer to these paths as the long path (containing five red edges) and the short path (containing three red edges). These paths delineate important parts of the repeats and may correspond to domains important for the propagation of the repeat elements .
These repeat families contain different combinations of edges on two red paths in Figure 4. These structures illustrate how one repeat family may borrow repeat domains from another repeat family. Repeat families E6, E2, and E4 contain only the short path; repeat families E5 and E7 contain only the long path; repeat family E3 contains two (partial) copies of the long path, with opposite strand configurations. Interestingly, repeat family E1 contains both paths: part of the long path followed by the short path.
These observations provide a more detailed description of the relationships among repeat families than the simple annotations that are part of the RECON library (Figure 4b). Specifically, the graph reveals a complicated relationship between these repeat families and suggests putative annotations of still unannotated repeat families in the RECON library (for example, those in Figure 4b).
Comparative repeat domain graph analysis
Comparing repeats across different species is a non-trivial task. Zhang and Wessler  compared the transposable elements (TEs) in Arabidopsis thaliana and Brassica oleracea via TBLASTN searches of the most conserved coding regions for each type of TE. They found nearly all TE lineages are shared between the two plants. Without complete repeat libraries, they were unable to compare repeats on the repeat family level. Stein et al.  performed a similar study, comparing repeat family libraries from C. elegans and C. briggsae, and report that "...despite their general similarities, we were not able to systematically identify ortholog pairs among the C. briggsae and C. elegans repeats ... we found no simple one-to-one mapping between them".
We compare two repeat family libraries by building a comparative repeat domain graph in the following way. Given two libraries X and Y, we first pool the sequences from both libraries into a single union library, then construct the repeat domain graph of the union library, and color the edges in the repeat domain graph according to whether they are from X only, from Y only, or from both X and Y. We call the resulting edge-colored graph the comparative repeat domain graph. Note that alternatively one could construct separate repeat domain graphs for X and Y then compare the two graphs, but this approach would introduce additional complexity in comparing graphs and should give essentially the same results. We further analyze repeat domains shared by both libraries (ancient domains), and repeat domains present in a single sequence (young domains), and study the evolutionary relationship between them.
We formed the comparative repeat domain graph using the C. elegans and C. briggsae repeat family libraries generated by Stein et al.  using the RECON algorithm . Indeed, because C. elegans and C. briggsae diverged roughly 100 million years ago, it is not surprising that only certain repeat domains present in a common ancestor are still present in both species. We are particularly interested in the discovery of these shared ancient repeat domains, whose conservation is suggestive of a role in repeat propagation, or alternatively may be due to horizontal transfer.
Four connected components formed by shared repeat domains (edges shared between C. briggsae and C. elegans)
Number of edges
Number of C. elegans + C. briggsae families
1 + 1
1 + 1
4 + 4
2 + 16
We analyzed each of these four connected components. The two shared edges with lengths 61 and 309 are in the same connected component in the comparative repeat domain graph. A translated sequence search revealed that they match essential parts in the transposase-coding sequence of the Mariner element. The edge of length 309 matches a set of hypothetical proteins in C. elegans, at residues 117 to 219. Those hypothetical proteins are all closely similar to transposases of other organisms, including Adineta vaga, human, and Stylochus zebra. The edge of length 61 (translated into a 20 amino acid sequence) does not have a significant BLAST result by itself. A BLAST search of the entire repeat family consensus sequence of Cb000007, which contains both edges, gave a result similar to what was obtained by searching the edge of length 309 alone.
The comparative repeat domain graph vividly depicts the complex evolutionary history of these repeat families: subtrees split by the green edge (indicated in Figure 5 by dashed boxes) separate repeat families from the two species, and suggest that the repeat domain shared by both species is an ancient repeat domain from a common ancestor, rather than the result of horizontal transfer. Each of these two subtrees induces a phylogeny of the included repeat families. We checked whether these phylogenies were consistent with a phylogeny derived from nucleotide substitutions in the segment of length 34 shared by these sequences (green edge in Figure 5). A phylogenetic tree (Figure 6) of the 10 sequences of length 34 constructed by CLUSTALW gives a phylogenetic tree that is remarkably consistent with the two subtrees in the comparative repeat domain graph. In particular, all three trees group C. elegans and C. briggsae families together. In addition, sequences -B2 and -B3 share few domains in the trees from the comparative repeat graph, consistent with their long separation on the CLUSTALW tree, while sequences E5 and E7 are close on all three trees. The similarity of the three trees validates the use of the comparative repeat domain graph to infer evolutionary history.
The structure of the comparative repeat domain graph raises a number of interesting and still unresolved evolutionary questions. For example, can we distinguish shared repeat domains between two species that arise from common ancestry from those that arise from horizontal transfer? How have such ancient repeat domains evolved in both genomes, and which repeat domains acquired independently in these genomes have contributed to the evolutionary success of some repeats over the past 100 million years? Finally, we remark that the repeat domain graph shown in Figure 5 was generated from the alignments shown in Figure 1. While Figure 1 contains essentially the same information about local similarities between these repeat families, the graph in Figure 5 organizes this information into a much more interpretable structure.
Analysis of de novorepeat family libraries
We now demonstrate how the repeat domain graph overcomes certain imperfections found in automatically constructed repeat family libraries and directly reveals composite repeats. Repeat family libraries have historically been constructed via manual curation. Recently, algorithms such as RepeatFinder , RECON , RepeatGluer , PILER  and RepeatScout  are increasingly automating the process of identifying repeat families from genomic sequence. For example, RECON has aided the construction of a library of chicken repeat families , and RepeatScout has been used to construct human, mouse and rat repeat family libraries that are nearly as thorough as manually curated libraries. However, the resulting de novo libraries (particularly for mammalian genomes) are frequently contaminated by sequences resulting from segmental duplications . We analyzed a human repeat family library that was automatically constructed by RepeatScout, and show how the repeat domain graph helps remove these contaminants and reveals composite repeat families.
We generated a repeat domain graph of a human library generated by RepeatScout containing 1,139 sequences of total length 0.68 M bp. Surprisingly, the resulting graph contains a large connected component that contains more than half of the input sequences. Upon close inspection, we found that this large component is connected by a small number of long edges of single multiplicity. An analysis using BLAT  revealed that the instances of each of these long edges in the genome are localized in a small number of narrow genomic regions. This suggests that these long edges do not represent repeat domains, but rather are tandem duplications, a known contaminant of de novo repeat identification programs like RECON or RepeatScout.
This discovery revealed an extra benefit of the repeat domain graph for repeat domain analysis: it directly reveals contaminants in automatically generated repeat family libraries. Moreover, the graph suggests a procedure for removing these contaminants. Briefly, we select the longest edge along the path of each repeat family whose total length exceeds 100 bp. We BLAT these edge sequences against the genome sequence and select BLAT hits whose length exceeds 80% of the edge length. We combine BLAT hits into clusters if they are less than 5 Mb apart on the genome. We compute the ratio of the number of hits to the number of clusters, and classify sequences whose ratio exceeds 2 as tandem duplications. Using this approach, 107 repeat families in the RepeatScout library were thus classified as tandem duplications and excluded from further analysis. We remark that this method can detect tandem segmental duplications, but not the dispersed segmental duplications. Distinguishing repeats from dispersed segmental duplications is a challenging and unsolved problem. It is possible that the repeat domain graph might be useful for this problem; however, this is beyond the scope of this paper.
We searched the repeat domain graph from RepeatScout for the RICKSHA composite repeat family described above, but were unable to find it. We determined the reason is that the RepeatScout library itself does not contain RICKSHA, probably due to the high sequence divergence of this repeat family. In addition, we conducted a comparative repeat domain graph analysis (Additional data file 4) for the de novo mouse and rat RepeatScout repeat libraries. We found the repeat domain graph helps in purging artifacts in de novo repeat libraries, in annotating the library, and in suggesting possible scenarios for repeat family evolution.
The computational analysis of repeats is becoming increasingly important as additional full genome sequences become available, particularly of repeat rich mammalian and plant genomes. In particular, the problem of identifying shared repeat domains is critical to understanding repeat evolution. This paper describes the first algorithmic advance on automatic identification of repeat domains in large genomes. We have applied our method to single-species analysis of human and C. elegans repeat family libraries and cross-species analyses between C. elegans and C. briggsae libraries and between mouse and rat de novo libraries, illustrating the discovery of their mosaic repeat domain structure and revealing interesting clues about repeat evolution.
We have only begun to explore the uses of the repeat domain graph in understanding the relationships between different repeat sequences. We demonstrated that the repeat domain graph reveals known repeat domains of different biological origin. Additional candidates of such domains can be directly identified by signatures in the graph. Repeat families with shared domains that represent putative composite repeat families can be further analyzed to check if their repeat domains do in fact have different biological origins; the PILER algorithm , which achieves high specificity in distinguishing between different classes of repeat families, may aid this process. The repeat domain graph opens up additional topics for further research. The library of repeat domains obtained using our decomposition procedure removes all of the redundancy in the original repeat family library. One could build a repeat masking program based on the repeat domain graph and network matching. Repeat subfamily classification algorithms (for example, ) can be applied to individual repeat domains to further understand their evolution.
The increasing use of de novo repeat identification tools demands careful analysis of the resulting libraries. Our repeat domain graph overcomes certain imperfections found in automatically constructed repeat family libraries, and might prove useful for comparison of repeat libraries generated by different repeat identification tools. As numerous new genomes with high repeat contents, such as mammals and plants, are sequenced and repeat family libraries will be typically automatically constructed, we expect that the applications of our method will multiply.
To understand the difficulty in applying the A-Bruijn graph to repeat analysis, we first review the basic concepts of A-Bruijn graph construction, first described in . Briefly, given n sequences and a set of pairwise local alignments between them, we first model each sequence S = s 1. . .s k as a directed path on k vertices. A pairwise local alignment between sequence S i and S j gives the instruction to glue together the paths corresponding to S i and S j at every pair of matched positions in the alignment. The gluing procedure is transitive, that is, if vertex x is glued to vertex y and vertex y is glued to vertex z, then vertices x and z are also glued. Thus, the set of glues define single-linkage clusters of vertices. The A-Bruijn graph construction is completed by contracting all single-linkage clusters of vertices into nodes, and stretching each remaining chains of l nodes, each containing m vertices, into an edge of length l and multiplicity m. The resulting A-Bruijn graph can be viewed as an amalgamation of n paths: each path corresponds to an input sequence, and similar regions among multiple sequences are represented as edges of high multiplicity. If the inputs are DNA sequences, we take both the direct and reverse strands of each repeat consensus sequence as input. Thus, the A-Bruijn graph of n DNA sequences contains 2n sources and 2n sinks and is symmetric: for any edge representing the alignment of m segments, its complement edge in the graph represents the reverse complements of the m segments.
In practice, a major obstacle to building an A-Bruijn graph from a set of pairwise alignments is handling inconsistencies in the alignments. These inconsistencies appear as short cycles in the A-Bruijn graph complicating the identification of sequence domains. We define short cycles as cycles of edges with a total length shorter than a predefined parameter, girth. We classify short cycles in an A-Bruijn graph as whirls if all edges of the cycle are oriented the same way, or bulges otherwise. The existing method for A-Bruijn graph construction uses an 'apply-all-glues-then-simplify (AAGTS)' strategy. Basically, all glues, that is, pairs of positions that are aligned in one of the input pairwise alignments, are applied to construct an initial A-Bruijn graph (often full of short cycles), and then a series of graph operations are applied that remove bulges and whirls. In , for example, the bulge and whirl removal procedure gave an approximate solution to the Maximum Subgraph with Large Girth (MSLG) problem.
In order to handle such complex inconsistent glues in repeat libraries, we designed and implement a new strategy for filtration of glues. Instead of applying all glues as in the AAGTS approach, we apply the glues one by one and watch for the creation of potential whirls. Specifically, if a pair of positions (a, b) is about to create a whirl, that is, if position b is in a node n(b), and n(b) contains a position that is on the same sequence as position a and is less than girth away from position a, then the gluing pair (a, b) is discarded. This conservative gluing procedure prevents the formation of whirls during the A-Bruijn graph construction, and thus a later whirl removal procedure is no longer necessary.
We compare our new whirl-filtration procedure to the existing AAGTS procedure used by ABA by computing the fraction of gluing pairs in the input pairwise alignments that are present in the resulting repeat domain and ABA graphs. To measure the differences between the two methods, we compute the ratio of: the number of gluing pairs in the pairwise alignments that are input to each method; and the number of gluing pairs present in the resulting ABA or repeat domain graph. If there are no inconsistencies in the input alignments, this ratio is 1. If there are inconsistencies in the input alignments, the resulting graphs would have fewer alignment positions, since both methods resolve inconsistencies by removing some aligned positions. Thus, the ratio would be less than 1; however, the best possible ratio is not known. We compute the number of gluing pairs in the input and in the resulting graphs in the following way. Gluing pairs in the input may be redundant. For example, for three positions i, j, and k, if pairs (i, j), (j, k) and (i, k) are aligned in the input, then since pair (i, k) can be inferred from pairs (i, j) and (j, k) by transitivity, only pairs (i, j) and (j, k) are sufficient to define same set of gluing operations for constructing the graph. Thus, when we count the number of gluing pairs in the input, we only count nonredundant sets of position pairs; for example, if three positions are aligned transitively, we only count two pairs. In general, if n positions are aligned transitively, we only count n-1 pairs. To count the number of gluing pairs in the resulting graph, we count the number of positions along the edges with a multiplicity higher than 1. For an edge with multiplicity m and length l, we count the number of gluing pairs as l(m - 1). This count is corrected with consideration of over-counting of the positions at common vertices shared by multiple edges.
Comparison of the AAGTS and whirl-filtration strategies
Number of non-redundant input POAP
Fraction of input POAP in graph
Difference in POAP
C. elegans RECON
The second complication in the construction of the repeat domain graph in repeat is the presence of palindromic sequences. The procedure for constructing an A-Bruijn graph of DNA sequences is designed to preserve the intrinsic symmetric structure of the entire graph. Thus, when gluing a pair of positions, the reverse complement pair of positions is also immediately glued. With palindromic sequences, the order of gluing needs to be coordinated carefully. The existing A-Bruijn construction algorithm did not consider palindromic sequences, and consequently we found that the direct application of the existing procedure often results in a repeat domain graph containing broken paths for a single sequence. We solve this problem by changing the A-Bruijn graph construction procedure so that all glues between positions from same strands are applied before those from opposite strands. Furthermore, we ensure the bulge removal procedure can correctly remove bulges contained in palindromic regions (see Additional data file 5 for details).
While the procedure for construction of the A-Bruin graph is independent of the particular local alignment method, in practice, the resulting repeat domain graph will vary with different input alignments. We determined that cross_match (P Green, unpublished) with the default scoring matrix and the gap penalties used by BLASTN is an effective tool for generating pairwise local alignment of repeat consensus sequences. In comparison, repeat domain graphs produced from alignments with BLASTN (using default parameters) were generally similar but typically contain more edges, thus artificially fragmenting repeat domains.
In all experiments, we selected local alignments with minimal length of 40 and minimal score of 30 (corresponding to BLAST E-value 1E-3) and input these into our method using default parameters with the minimal girth (-w) 40. We found that the topology of the repeat domain graph was similar when BLASTN was used to determine the input alignments (data not shown). However, we also observed the importance of filtering out low-complexity alignments, which both cross_match and BLASTN perform under their default options, but using different techniques. When low-complexity filtering was turned off in BLASTN with the '-F F' option, the repeat domain graph generally included larger connected components, reflecting the fact that more low-complexity regions were aligned into repeat domains.
We comment that although our method shares A-Bruijn graph framework with the RepeatGluer program , these programs have different goals. RepeatGluer is a de novo repeat family identification tool that attempts to identify all repeat families in an input genomic sequence. Unfortunately, it is currently not feasible to run RepeatGluer directly on long genomic sequences. Our approach takes an existing repeat family library as input and decomposes it into repeat domains.
We incorporated our new methods for A-Bruijn graph construction and simplification into a modified version of the ABA program, which is available at the ABA website . The program can also be run online at . Perl scripts used for analyzing the repeat domain graphs are available as Additional data file 6.
The following additional data are available with the online version of this paper. Additional data file 1 contains a set of browsable HTML files with a complete list of the connected components in the repeat domain graph of human Repbase. Additional data file 2 provides the full version of Table 1, listing repeat families containing domains shared with repeat families of different biological origins. Additional data file 3 is a figure showing a connected component containing three Y-forks in the human RepeatScout library repeat domain graph. Additional data file 4 provides an analysis of the comparative repeat domain graph from mouse and rat RepeatScout repeat libraries. Additional data file 5 shows an example of how palindromic sequences are handled by our revised algorithm. Additional data file 6 provides the repeat domain graph analysis software package
We thank Arian Smit for sharing his expertise in repeat analysis. We are grateful to Evan Eichler for discussions on the mosaic structure of segmental duplications, and to Neil Jones for technical assistance. B.R. is supported by a Career Award at the Scientific Interface (CASI) from the Burroughs Wellcome Fund.
- Kazazian HH: Mobile elements: drivers of genome evolution. Science. 2004, 303: 1626-1632. 10.1126/science.1089670.PubMedView ArticleGoogle Scholar
- Holmes I: Transcendent elements: whole-genome transposon screens and open evolutionary questions. Genome Res. 2002, 12: 1152-1155. 10.1101/gr.453102.PubMedView ArticleGoogle Scholar
- Bailey JA, Liu G, Eichler EE: An Alu transposition model for the origin and expansion of human segmental duplications. Am J Hum Genet. 2003, 73: 823-834. 10.1086/378594.PubMedPubMed CentralView ArticleGoogle Scholar
- Kidwell MG, Lisch DR: Perspective: transposable elements, parasitic DNA, and genome evolution. Evolution Int J Org Evolution. 2001, 55: 1-24.View ArticleGoogle Scholar
- Brosius J: How significant is 98.5% 'junk' in mammalian genomes. Bioinformatics. 2003, 19 (Suppl 2): II35-View ArticleGoogle Scholar
- Capy P, Gasperi G, Biemont C, Bazin C: Stress and transposable elements: co-evolution or useful parasites?. Heredity. 2000, 85: 101-106. 10.1046/j.1365-2540.2000.00751.x.PubMedView ArticleGoogle Scholar
- Shapiro JA: Transposable elements as the key to a 21st century view of evolution. Genetica. 1999, 107: 171-179. 10.1023/A:1003977827511.PubMedView ArticleGoogle Scholar
- Jurka J: Repeats in genomic DNA: mining and meaning. Curr Opin Struct Biol. 1998, 8: 333-337. 10.1016/S0959-440X(98)80067-5.PubMedView ArticleGoogle Scholar
- Jurka J: Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 2000, 16: 418-420. 10.1016/S0168-9525(00)02093-X.PubMedView ArticleGoogle Scholar
- RepeatMasker. [http://repeatmasker.org]
- Batzer MA, Deininger PL: Alu repeats and human genomic diversity. Nat Rev Genet. 2002, 3: 370-379. 10.1038/nrg798.PubMedView ArticleGoogle Scholar
- Negroni M, Buc H: Mechanisms of retroviral recombination. Annu Rev Genet. 2001, 35: 275-302. 10.1146/annurev.genet.35.102401.090551.PubMedView ArticleGoogle Scholar
- Chimpanzee Sequencing and Analysis Consortium: Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005, 437: 69-87. 10.1038/nature04072.View ArticleGoogle Scholar
- Kajikawa M, Okada N: LINEs mobilize SINEs in the eel through a shared 3' sequence. Cell. 2002, 111: 433-444. 10.1016/S0092-8674(02)01041-3.PubMedView ArticleGoogle Scholar
- Galperin MY, Koonin EV: Frontiers in Computational Genomics. 2002, Norwich: Caister Academic PressGoogle Scholar
- Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, et al: A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 2004, 5: R7-10.1186/gb-2004-5-2-r7.PubMedPubMed CentralView ArticleGoogle Scholar
- Pevzner PA, Tang H, Tesler G: De novo repeat classification and fragment assembly. Genome Res. 2004, 14: 1786-1796. 10.1101/gr.2395204.PubMedPubMed CentralView ArticleGoogle Scholar
- Bailey JA, Yavor AM, Viggiano L, Misceo D, Horvath JE, Archidiacono N, Schwartz S, Rocchi M, Eichler EE: Human-specific duplication and mosaic transcripts: the recent paralogous structure of chromosome 22. Am J Hum Genet. 2002, 70: 83-100. 10.1086/338458.PubMedPubMed CentralView ArticleGoogle Scholar
- Lee C, Grasso C, Sharlow MF: Multiple sequence alignment using partial order graphs. Bioinformatics. 2002, 18: 452-464. 10.1093/bioinformatics/18.3.452.PubMedView ArticleGoogle Scholar
- Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004, 14: 708-715. 10.1101/gr.1933104.PubMedPubMed CentralView ArticleGoogle Scholar
- Raphael B, Zhi D, Tang H, Pevzner P: A novel method for multiple alignment of sequences with repeated and shuffled elements. Genome Res. 2004, 14: 2336-2346. 10.1101/gr.2657504.PubMedPubMed CentralView ArticleGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680.PubMedPubMed CentralView ArticleGoogle Scholar
- Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla A, Clarke L, Clee C, Coghlan A, et al: The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol. 2003, 1: E45-10.1371/journal.pbio.0000045.PubMedPubMed CentralView ArticleGoogle Scholar
- Bao Z, Eddy SR: Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 2002, 12: 1269-1276. 10.1101/gr.88502.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang X, Wessler SR: Genome-wide comparative analysis of the transposable elements in the related species Arabidopsis thaliana and Brassica oleracea. Proc Natl Acad Sci USA. 2004, 101: 5589-5594. 10.1073/pnas.0401243101.PubMedPubMed CentralView ArticleGoogle Scholar
- Volfovsky N, Haas BJ, Salzberg SL: A clustering method for repeat analysis in DNA sequences. Genome Biol. 2001, 2: RESEARCH0027-10.1186/gb-2001-2-8-research0027.PubMedPubMed CentralView ArticleGoogle Scholar
- Edgar RC, Myers EW: PILER: identification and classification of genomic repeats. Bioinformatics. 2005, 21 (Suppl 1): i152-i158. 10.1093/bioinformatics/bti1003.PubMedView ArticleGoogle Scholar
- Price A, Jones N, Pevzner P: De novo identification of repeat families in large genomes. Bioinformatics. 2005, 21 (Suppl 1): i351-i358. 10.1093/bioinformatics/bti1018.PubMedView ArticleGoogle Scholar
- Chicken Genome Sequencing Consortium: Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004, 432: 695-716. 10.1038/nature03154.View ArticleGoogle Scholar
- Kent WJ: BLAT - the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before March 2002.PubMedPubMed CentralView ArticleGoogle Scholar
- Price AL, Eskin E, Pevzner PA: Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res. 2004, 14: 2245-2252. 10.1101/gr.2693004.PubMedPubMed CentralView ArticleGoogle Scholar
- A-Bruijn Aligner Website. [http://nbcr.sdsc.edu/euler/]
- ABA Web Interface. [http://aba.bioprojects.org/]
- Loeb DD, Padgett RW, Hardies SC, Shehee WR, Comer MB, Edgell MH, Hutchinson CA: The sequence of a large L1md element reveals a tandemly repeated 5' end and several features found in retrotransposons. Mol Cell Biol. 1986, 6: 168-182.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.