A clustering method for repeat analysis in DNA sequences
© Volfovsky et al., licensee BioMed Central Ltd 2001
Received: 24 January 2001
Accepted: 8 June 2001
Published: 1 August 2001
A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats.
The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi fasta), that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences.
We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder), should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.
Repetitive sequences present many difficulties for genome sequencing and analysis. The presence of large numbers of repeats often confounds sequence assembly, especially if the repeats are long and highly conserved. The presence of low copy-number repeats can also confound assembly, especially for whole-genome shotgun sequencing projects . Once a genome has been assembled, repeats take on a new and more important role involving their biological function. Certain classes of repeats, such as transposons, perform a function by allowing mobile elements to move around a genome. Other classes belong to less well-defined categories with respect to their role, though they may be even more ubiquitous. Repetitive sequences appear to dominate the centromeres of many eukaryotes , and telomeric and subtelomeric repeats extend for thousands or tens of thousands of nucleotides at the ends of chromosomes. These repeats also appear elsewhere in the genome, for reasons as yet unknown. For these and other reasons, it is critical to both the assembly and analysis of genomic sequences to identify and characterize repetitive sequence elements.
Examining the output of REPuter and RepeatMatch for a complete bacterial genome, it quickly becomes obvious that many exact repeats are non-exact copies of one another. Whether a genome is a few or hundreds of megabases in length, the task of recognizing and describing how repeats resemble one another at this scale is too complicated to accomplish manually.
Here we describe a new system for the recognition of repeat classes in genome sequences. This system, RepeatFinder, is freely available from our website . In contrast to approaches that cluster together the results of BLAST searches (for example, Z.H. Bao and S. Eddy, unpublished data) our algorithm uses a comprehensive set of exact repeats as the basis for constructing repeat classes. It relies on the efficient suffix tree data structure for identification of exact repeats, which permits rapid identification of repeat classes even in sequences containing tens of millions of nucleotides. The algorithm does not make any prior assumptions about the number or structure of the classes. At its core is a merging procedure that produces the actual members of each repeat class using merging criteria described below, and it also builds a repeat map of the genome sequence.
We have applied this system to several complete microbial genomes [15,16,17,18,19,20,21], to the complete Arabidopsis thaliana genome , and to a large collection of rice bacterial artificial chromosome (BAC) end sequences [22,23]. The results of this analysis are described below. The output of the system gives a clear picture of all repeat classes identified in a genome or a sequence collection. It provides straightforward access to the actual repeat sequences as a multi-fasta file, simple statistical analyses of the results, and a procedure for identifying each class's most representative element. We describe here the computational techniques used in the system and demonstrate its use on several different genome sequences.
Results and discussion
In the initial repeat set, different repeats may be very close together (Figure 2b) and may even overlap (Figure 2c). This intricate picture can be simplified by constructing a more general type of repeat: a 'merging repeat' will be defined as a sequence that can be found in the whole genome sequence not less than twice, where occurrences of the merging repeat are permitted to be partial copies. Merging repeats, labeled M1-M4 in Figure 2, are created from initial repeat sequences that are close together or overlapping. Merging repeats maintain pointers to the initial repeats comprising them; for example repeat M1 (Figure 2b) has pointers to initial repeats A1 and B1. We shall also refer to these initial repeats as 'subrepeats' of M1
Using these properties, we can formulate a similarity condition between merging repeats. Two merging repeats M1 and M2 are similar if they have at least one common initial sequence, or there exists a sequence of merging repeats M1, N1, N2, ... N k , M2 such that each pair of merging repeats in the sequence shares at least one common subrepeat. The minimum number k of merging repeats needed to establish the similarity between M1 and M2 can be used as a similarity measure. For example, the merging repeats M2 and M4. in Figure 2b are similar with similarity measure k = 2 based on the sequence M2, M3, M1, M4.
One goal of our clustering algorithm is to distribute merging repeats into classes according to this similarity condition so that two rules are satisfied: first, elements in the same class (homogeneity elements) are highly similar to each other; and second, elements from different classes (separation elements) have no similarity to each other. The maximal similarity k defined on all merging repeats in a class can be used for assessing the overall similarity of the class members. (In this paper the measure of similarity k is used only for the definition of merging repeats).
In this study we use exact forward and reverse complement repeats as initial repeats for clustering. The method does not require the use of exact initial repeats, but can be applied easily to an initial set containing approximate repeats [3,13].
Our algorithm is based on first identifying all exact repeats in the input sequence, and then defining repeat classes by merging and extending these short exact matches. An exact repeat is represented by pair of coordinates (A1, A2) delimiting its location in the genome sequence, and by the repeat length l. We implemented an algorithm that uses either of two suffix tree methods, RepeatMatch  or REPuter  to determine all the exact repeats in a given sequence. (For more on suffix trees see .) The computational time and space requirements for both these systems are linear in the size of the input sequences, an essential requirement for any algorithm attempting to process whole eukaryotic genomes. The subsequent clustering procedure merges neighboring repeats and groups them into classes. The input to the system can be either a single genome sequence or a set of sequences. The clustering procedure consists of the following steps, which are described in more detail below.
Step 1: Selection and pre-processing. The list of coordinates of all exact repeats as output by RepeatMatch or REPuter can be interpreted as a partition of the original genome sequence. (The output of RepeatMatch and REPuter are very similar. We used REPuter in the example and in the subsequent repeat analyses of microbial genomes; for the A. thaliana genome and and the rice BAC end-sequence data we used RepeatMatch.) Each partition point has a reference to the pair coordinates (A1, A2) and the repeat length l. Each repeat corresponds to at least two partition points. Some repeats can be found in the sequence more than twice, and the corresponding partition points can appear with different coordinates and different lengths. To prepare the data for the merging procedure, we sort the list of partition points in increasing order, and in the case of duplicate first coordinates, in increasing order of second coordinates. (The clustering algorithm is order-independent; however, the linear nature of repeat data allows us to use this pre-processing step to simplify the clustering procedure without affecting the final clusters.) In particular cases it is useful to filter the original repeat data to remove certain types of repeats; for example, simple one-base (homopolymeric) or two-base repeats.
Step 2: Merging procedure. In outline, this procedure works by repeatedly merging together two exact repeats that either overlap or that occur within a limited distance (a gap) of each other. Specific values for the overlap and gap distance can be specified for each genome sequence. Whether the algorithm is merging repeats that overlap (Figure 2c), or merging repeats separated by a gap (Figure 2b), the new merging repeats will always have the property that significant subsequences of the repeat appear at least twice in the genome sequence.
At the time of merging procedure, we generate a repeat map of the genome sequence. This map is based on a linked-list data structure, which allows for rapid and simple modifications to the dynamically changing repeat data. Every merging repeat in the map is linked by pointers to all the merging repeats with which it shares exact repeats.
Step 3: Classification. This step defines the repeat classes. Each merging repeat will be assigned to a specific class if its list of references (that is, the repeats that were combined into the merging repeat) contains at least one repeat that already belongs to the class. If a merging repeat has references that belong to multiple distinct classes, then those classes are combined into one. If a merging repeat contains no references to an existing class, then the merging repeat forms a new class.
Step 4: BLAST searches and repeat class updates. The initial classification is based on exact repeats. To merge together similar but non-exact repeats, we use WU-BLAST [9,10] to search all merging repeats against all others. The resulting matches between the classes are used as input to an update procedure which redistributes all merging repeats into new classes. It is possible to skip this step if the initial repeat set contains approximate rather than exact repeats.
Step 1: Pre-processing
Step 2: Merging and repeat map generation
Using the list of partition points, we begin merging exact repeats using the following criteria. Given two partition points p1 = (A1, A2, l A ) and p2 = (B1, B2, l B ) where A1 <B1, we compute the distance between the non-overlapping repeats as
d(p1, p2) = max (0, B1 - A1 - l A + 1).
Next, given a maximum gap size G > 0, the 'merging with gap' protocol uses the rule that sequences corresponding to p1 and p2 are merged if
The 'merging with overlap' protocol only merges sequences that overlap one another; that is they are at least partially identical. We denote the overlap of two sequences as
0(p1, p2) = max (0, A1 + l A - B1 + 1) for A1 <B1
Then the criterion for 'merging with overlap' is as follows: given a minimum overlap proportion op, where 0 ≤ op ≤ 1, repeat points (A1, A2,l A ) and (B1, B2, l B ) are merged if at least one of the four repeats has overlap satisfying
0 (p1, p2) >op min (l A , l B ).
The parameter op is interpreted as a fraction of the shorter of the two repeats. Thus for op = 0.75, we will merge two overlapped sequences if the length of their overlap is at least 75% of the length of the shorter sequence.
Using either merging procedure, if two sequences are merged then the new sequence will be defined as a merging repeat with starting position M = A1 and with length l M = max (A1 + l A , B1+ l B ) - A1. The merging procedure is not permitted to merge pairs of partition points of the form (B1, B2, l B ) and (B2, B1, l B ). This condition avoids merging of tandem repeats and avoids repetitiveness within the merging repeats.
This procedure, by updating and creating new references, leads to the repeat map shown on the right of Figure 4. These references define the correspondences between all merging repeats. Each merging repeat maintains references to the other merging repeats with which it shares exact repeats; each exact repeat is assigned to the first merging repeat in which it appears. In our example, the merging repeat starting in coordinate 77 gets a reference to itself only, because its exact repeats have no previous references. The next repeat, starting in position 116, gets a reference to itself and to its mate the merging repeat 77. A data structure stores with each merging repeat its start coordinate, its length (l M ), the number of exact repeats it includes (n M ), and a list of references to itself and to other repeats (R1, R2, R3).
Step 3: Classification
Step 4: BLAST searches and further merging
Repeat analysis of microbial genomes
Sensitivity of the clustering method to different merging parameters
Minimal exact repeat 25 bp
Minimal exact repeat 50 bp
Number of merging repeats
Number of classes
Number of merging repeats
Number of classes
Repeat structure of microbial genomes
The longest merging
Number of classes
Number of classes
with more than
Defining the prototype for a repeat class
Small microbial genomes have relatively few types of repeats, and relatively few copies of each type. In contrast, our studies of longer eukaryotic genome sequences have uncovered tens of thousands of repeat classes and hundreds of thousands of merging repeats. In order to be able to process this data efficiently - in particular, in order to run the procedure where all classes are compared against each other using BLAST - we developed a procedure to define the most representative element for each class, which we call its prototype.
Referring to the repeat map shown in Figure 5, we use the length of the merging repeat (l M ) and the number of exact repeats (n M ) to defined the desirable properties for the prototype. The different merging protocols affect the properties of the prototype. Thus, in the 'merging with gap' procedure, the merging repeats with the longest lengths and with the greatest number of subrepeats should be the best candidates to represent the class. In this case, many members will consist of simple subsequences of the prototype. When we use the 'merging with overlap' procedure, we also look for the greatest number of subrepeats, but the length of the most representative repeat should be closer to the shortest repeat in the class. In this case the representative element will tend to match across most of its length to every member of the class.
for 'merging with overlaps'.
This non-negative function is a summary of the variance in the length and number of subrepeats from the desirable values for the class prototype. Then we solve the optimization problem of minimizing function F(l,n):
find (l,n) corresponding to an element in the single repeat class: min F(l,n).
If we get several elements that minimize this function, we select the one with the maximal number of subrepeats. Thus in our example (Figures 1, 3,4,5) the prototype for class 1 is the longest repeat starting in position 77, with l = 11 and n = 2. Likewise, the prototype for class 2 is the repeat starting at position 23, and for class 3 it is the repeat starting at position 38. We used this procedure in our studies of the genome sequences of A. thaliana  and rice BAC end sequences [22,23].
Repeat structure of the Arabidopsisgenome
Summary of repeat analysis of Arabidopsis genome and rice BAC end sequences
Number of classes in Arabidopsis
Number of classes in rice data
To find out more about the composition of the Arabidopsis repeats, each sequence was searched against AtRepBase  and the Arabidopsis gene database  (using a maximum BLAST E-value of 0.01 and at least 100% identity for Arabidopsis genes and at least 95% identity to AtRep-Base sequences). Of 105,434 repeat sequences that fall into 27,961 separate repeat classes, 2,124 sequences matched an annotated repeat sequence in AtRepBase, and 25,149 sequences matched a segment of an Arabidopsis gene. Comparing both sets of matches, only 417 of the repeat sequences were found to match both a gene segment and an annotated repeat sequence. The large number of repeats that match gene segments reflects the prevalence of segmental chromosomal duplications and tandem gene duplications in Arabidopsis. Due to the greedy 'merging with gap' method used to build the repeat classes, relatively few of the repeat classes contained an abundance of the repeat sequences; the largest repeat class contained 30,975 sequences of which 6,505 matched gene segments and 1,723 matched annotated repeats.
Prototype repeat sequences (Arabidopsis thaliana genome) that matched genes or annotated AtRepBase repeats
ATR0081 minisatellite 1 from 63767 to 63826
Putative O-methyltransferase 1
Putative reverse transcriptase
Putative receptor kinase
Putative receptor kinase
Putative retroelement pol polyprotein
Putative disease resistance protein
Putative Ser/Thr kinase
Putative NBS/LRR disease resistance protein
Putative phenylalanine ammonia-lyase
Putative reverse transcriptase
Putative disease-resistance protein
Putative disease-resistance protein
ATR0084 repeat 5 from 102144 to 105991
CHP-rich zinc finger protein-like
Putative cytochrome P450
Putative transposon protein
Putative serine carboxypeptidase
Putative CHP-rich zinc finger protein
Rice repeat database
Yuan et al.  recently reported on the construction of a rice repeat database that was generated by searching all available rice sequences for minisatellite sequences, mobile elements, rDNA, centromeric repeat sequences and telomeric repeat sequences. This database includes 215 sequences. We attempted to use the repeat finding system described here to enlarge this set, using as input the large collection of sequences from the Clemson University rice BAC end database .
Unlike either Arabidopsis or the microbial genomes, where a single genome sequence or a few large chromosomes were being processed, in this case we had 101,562 BAC end sequences with an average length of approximately 400–700 bp. We therefore developed a special pre-processing procedure which generates a single sequence (approximately 42 Mb long) from all the BAC ends. Each original sequence is represented by its coordinate in the new sequence. This procedure permits the algorithm to work with hundreds of thousands of different sequences simultaneously. The system found 5,208,206 exact repeat pairs with lengths from 25 bp to 728 bp, where the latter represents an entire BAC end that was repeated exactly. The maximum length of each repeat was bounded by the length of the BAC end sequence in which it was found. This length restriction was added to the merging procedure to avoid artificially long repeats that might mistakenly span more than one BAC end sequence. The pre-clustering procedure also includes filtering of the exact repeats data to remove simple-sequence repeats, which were determined to comprise over 40% of exact repeats. We merged the filtered exact repeats data, requiring an overlap of 95%. This resulted in 48,768 repeat classes, of which only 8,118 include more than two elements. Table 4 contains a summary of these repeat classes. A searchable rice repeat database, based on the prototypes of these classes, is available online at .
To test this new repeat database, we compared it to the set of annotated repeats based on known, expertly curated repeats . There were four general groups in this set: telomere/centromere repeats, transposon/transposon-like repeats, rDNA, and all the rest . We used BLAST to search annotated repeats against the rice repeat database, using an E-value cutoff of 10-8. Classification of the BLAST hits shows that the annotated repeats from the four distinct groups always fall into separate classes in the rice repeat database; in other words, the new database divides the previous repeat classes into a finer-grained set of repeats, but it does not merge any of the four known groups together.
Because of the use of the efficient suffix tree procedures, the system runs very fast, with the all-versus-all BLAST search consuming approximately 80% of the computation time. The running time of the exact repeat finder is about 10–15% of the total, with the other processes - merging, clustering and post-BLAST updating - using a relatively minor proportion of overall computation time. The running time depends on both the sequence length and the number of repeats; for example, small microbial genomes take just 3–15 minutes, whereas the highly repetitive rice repeat database took about two days to process. The memory needed for computation is dominated by the requirements of the suffix tree used for the initial repeats computation [11,12,13]; this can grow to many gigabytes for large eukaryotic chromosomes.
We describe a new system for rapid identification of all repeats in genome sequences and assignment of these repeats to similarity classes. The system has been used to analyze the repeat structure of several complete microbial genomes, and the much larger genome of the model plant A. thaliana. We also used it to create a new rice repeat database, based on an analysis of a large BAC end sequence database. This new computational tool should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.
We thank N. El-Sayed, O. White, J.F. Heidelberg, M.-I. Benito, H.M. Khouri, T.V. Feldblyum, M. Pop, J.R. Buchoff and M.F. Shumway for helpful comments, suggestions and discussion. This work was supported in part by NSF grants KDI-9980088 and IIS-9902923 and by NIH grant R01-LM06845.
- Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al: The genome sequence of Drosophila melanogaster. Science. 2000, 287: 2185-2195. 10.1126/science.287.5461.2185.PubMedView ArticleGoogle Scholar
- The Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.View ArticleGoogle Scholar
- Leung M-Y, Blaisdell BE, Burge C, Karlin S: An efficient algorithm for identifying matches with errors in multiple long molecular sequences. J Mol Biol. 1991, 221: 1367-1378. 10.1016/0022-2836(91)80131-D.PubMedPubMed CentralView ArticleGoogle Scholar
- Agarwal P, States DJ: The Repeat Pattern Toolkit (RPT): analyzing the structure and evolution of the C. elegans genome. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, ISMB 94, 1-9. Menlo Park, CA: AAAI Press,. 1994Google Scholar
- Kannan SK, Myers EW: An algorithm for locating nonoverlapping regions of maximal alignment score. SIAM J Comput. 1996, 25: 648-662.View ArticleGoogle Scholar
- Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999, 27: 573-580. 10.1093/nar/27.2.573.PubMedPubMed CentralView ArticleGoogle Scholar
- RepeatMasker. [http://ftp.genome.washington.edu/RM/RepeatMasker.html]
- Bedell JA, Korf I, Gish W: MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics. 2000, 16: 1040-1041. 10.1093/bioinformatics/16.11.1040.PubMedView ArticleGoogle Scholar
- Gish W, States DJ: Identification of protein coding regions by database similarity search. Nat Genet. 1993, 3: 266-272.PubMedView ArticleGoogle Scholar
- Washington University School of Medicine: Index of /blast/blast. [http://blast.wustl.edu/blast]
- Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL: Alignment of whole genomes. Nucleic Acids Res. 1999, 27: 2369-2376. 10.1093/nar/27.11.2369.PubMedPubMed CentralView ArticleGoogle Scholar
- Kurtz S, Schleiermacher C: REPuter - fast computation of maximal repeats in complete genomes. Bioinformatics. 1999, 15: 426-427. 10.1093/bioinformatics/15.5.426.PubMedView ArticleGoogle Scholar
- Kurtz S, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R: Computation and visualization of degenerate repeats in complete genomes. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology. 2000, 228-238. . Menlo Park, CA: AAAI-Press,Google Scholar
- TIGR software tools. [http://www.tigr.org/softlab/]
- Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, FitzGerald LM, Clayton RA, Gocayne JD, et al: Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science. 1996, 273: 1058-1073.PubMedView ArticleGoogle Scholar
- Fraser CM, Norris SJ, Weinstock GM, White O, Sutton G, Clayton R, Dodson R, Gwinn M, Hickey E, Ketchum KA, et al: Complete genomic sequence of Treponema pallidum, the syphilis spirochete. Science. 1998, 281: 375-388. 10.1007/s004410050434.PubMedView ArticleGoogle Scholar
- Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Nelson WC, Ketchum KA, et al: Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature. 1999, 399: 323-329. 10.1038/20601.PubMedView ArticleGoogle Scholar
- Read TD, Brunham RC, Shen C, Gill SR, Heidelberg JF, White O, Hickey EK, Peterson J, Utterback T, Berry K, et al: Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39. Nucleic Acids Res. 2000, 28: 1397-1406. 10.1093/nar/28.6.1397.PubMedPubMed CentralView ArticleGoogle Scholar
- Tomb JF, White O, Kerlavage AR, Clayton RA, Sutton GG, Fleischmann RD, Ketchum KA, Klenk HP, Gill SR, Dougherty BA, et al: The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature. 1997, 388: 539-547. 10.1038/41483.PubMedView ArticleGoogle Scholar
- Tettelin H, Saunders NJ, Heidelberg J, Jeffries AC, Nelson KE, Eisen JE, Ketchum KA, Hood DW, Peden JF, Dodson RJ, et al: Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science. 2000, 287: 1809-1815. 10.1126/science.287.5459.1809.PubMedView ArticleGoogle Scholar
- Nierman W, Feldblyum TV, Laub MT, Paulsen IT, Nelson KE, Eisen J, Heidelberg JF, Alley MRK, Ohta N, Maddock JR, et al: Complete genome sequence of Caulobacter crescentus. Proc Natl Acad Sci USA. 2001, 98: 4136-4141. 10.1073/pnas.061029298.PubMedPubMed CentralView ArticleGoogle Scholar
- Mao L, Wood TC, Yu Y, Budiman MA, Tomkins J, Woo S, Sasinowski M, Presting G, Frisch D, Goff S, et al: Rice transposable elements: a survey of 73,000 sequence-tagged-connectors. Genome Res. 2000, 10: 982-990. 10.1101/gr.10.7.982.PubMedPubMed CentralView ArticleGoogle Scholar
- Clemson University rice BAC end database. [http://www.genome.clemson.edu/projects/rice/rice_bac_end]
- Gusfield D: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. New York: Cambridge University Press,. 1997View ArticleGoogle Scholar
- AtRepBase. [http://nucleus.cshl.org/protarab/AtRepBase.htm]
- Arabidopsis gene sequence database. [http://www.tigr.org/tdb/e2k1/ath1/ath1.shtml]
- Yuan Q, Liang F, Hsiao J, Zismann V, Benito M-I, Quackenbush J, Wing R, Buell R: Anchoring of rice BAC clones to the rice genetic map in silico. Nucleic Acids Res. 2000, 28: 3636-3641. 10.1093/nar/28.18.3636.PubMedPubMed CentralView ArticleGoogle Scholar
- Oryza sativa repeat database search. [http://www.tigr.org/tdb/rice/blastsearch.shtml]