Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach
© Strong et al.; licensee BioMed Central Ltd. 2003
Received: 21 March 2003
Accepted: 28 July 2003
Published: 29 August 2003
The genome of Mycobacterium tuberculosis was analyzed using recently developed computational approaches to infer protein function and protein linkages. We evaluated and employed a method to infer genes likely to belong to the same operon, as judged by the nucleotide distance between genes in the same genomic orientation, and combined it with the Rosetta Stone, Phylogenetic Profile and conserved Gene Neighbor computational methods for the inference of protein function.
Identification of operon boundaries, however, becomes more challenging when dealing with adjacent genes in the same orientation [3, 4]. If two genes, A and B, are transcribed separately, as seen in Case 3, the minimum genetic elements required upstream from the start codon of gene B would be its corresponding -10 and -35 bp elements, transcription start site, 5' UTR, and any additional gene-specific promoter elements. As for the downstream elements of gene A, we would expect a 3' UTR and a transcription termination site. Although these genetic elements may overlap with both coding and non-coding elements of adjacent genes, the unique nucleic acid sequence requirements of each of these elements make substantial element overlap less likely if the genes are transcribed independently. If instead both genes are part of a multigene operon, the only upstream requirement of gene B is a single ribosome binding site (RBS), which may be in the intergenic region or may overlap the coding region of gene A. Intuition suggests, as shown in Figure 2 Case 4, that the intergenic spacing between genes in a common operon is shorter than the intergenic spacing of genes encoded by separate transcription units.
Previous studies have examined the nucleotide length distribution of the 5' UTRs, 3' UTRs, intergenic regions and space between RBSs and start sites of transcription in the genome of E. coli. Salgado et al. utilized a set of experimentally determined E. coli operons to examine the distance distributions of intergenic regions between genes within operons and those found experimentally to be at transcription boundaries, and used these values to compute a log-likelihood score for predicting transcription units in E. coli. Salgado et al. also demonstrated that short intergenic distances are common between adjacent genes of documented operons in E. coli, and subsequent analyses by Moreno-Hagelsieb et al. have suggested that short intergenic distances may be characteristic of operon members in most other prokaryotic genomes. We have employed a dataset similar to that of Salgado et al., obtained from RegulonDB, to evaluate the accuracy of operon inferences using various distance thresholds in E. coli, as well as to calculate a posterior probability that two adjacent E. coli genes in the same orientation belong to a common operon, given the intergenic distance separating them.
Intergenic distance thresholds have also been used, both as single distance cutoffs [7, 8] and for the construction of probabilistic models, to infer gene function based on predicted operon structure. Building on these studies, we evaluate a prokaryotic genome with minimal experimental evidence regarding operon organization. We provide a method for evaluating operon predictions, based on distance and orientation constraints, as well as a combined computational approach to infer protein function and protein linkages, based on the organization of the prokaryotic genome.
In addition to exploiting operon organization, we have combined the Operon method (OP) with the Rosetta Stone (RS), Phylogenetic Profile (PP), and conserved Gene Neighbor (GN) [7, 12] methods. While the Operon method focuses on the analysis of a single genome, in this case M. tuberculosis, the Rosetta Stone, Phylogenetic Profile, and conserved Gene Neighbor methods focus on the analysis of multiple genomes. Individual proteins that are functionally linked by the Rosetta Stone method occur as a single 'fusion' protein in another organism. The Phylogenetic Profile method functionally links proteins that occur in a correlated manner, for example a group of genes that share a pattern of presence or absence across various genomes. Finally, the conserved Gene Neighbor method links genes that occur as chromosomal neighbors in multiple genomes, often a characteristic of bacterial operons as well as clustered genes of related function.
Although the conserved Gene Neighbor method has been used previously to identify potential operon members, it is distinct from the Operon method. Functional linkages established by the Operon method rely only on a single genome, where genes are functionally linked based on a specified intergenic distance threshold. The conserved Gene Neighbor method, in contrast, compares all available sequenced genomes in order to identify genes that are located in close chromosomal proximity to each other in multiple genomes.
Table: Total number of Mycobacterium tuberculosis genes with linkages based on the Operon method, employing orientation and a nucleotide distance threshold. Columns: Predicted operon groups; Genes with links.
In M. tuberculosis, we find that six of the eight genes of the documented mammalian cell entry (mce1) operon are linked with a distance threshold of 0 bp, while the other two genes are included by slightly increasing this threshold to 5 bp. Also, we observe that all four members of the documented oligopeptide permease operon (oppA-D) are linked with a 0 bp distance threshold, and are flanked by genes in opposite orientations. As we increase the distance threshold from 0 bp to 100 bp we see the inclusion of genes accounting for a substantial percentage of the tuberculosis genome into putative operons. The expanded coverage enables inclusion of operon members such as the groEL1 groES operon pair, which is separated by 95 bp, and enables linkage of the eight members of the likely arginine biosynthesis operon (Rv1652-Rv1659), which includes intergenic separations ranging from -3 to 80 bp.
Evaluation of functional linkages (Operon method)
Table: Assessment, by keyword recovery, of the functional linkages established by the Operon method at various distance thresholds. Columns: Functional links between SWISS-PROT annotated proteins; Functional links with no keywords in common; Correct keywords recovered; Maximum false positive fraction*.
The method of keyword recovery allows us to evaluate a set of linkages based on known functional annotations. By comparing the SWISS-PROT keywords we can quantitatively evaluate different thresholds or different methods for inferring protein function. The maximum false positive fraction is the fraction of functionally-linked proteins that, based on their current annotation, do not share any function in common. We use the term 'maximum' because there are many reasons why two genes may not have any keywords in common, ranging from incomplete biochemical or genetic characterization to the use of different vocabularies to describe similar functions. The quantity 1-maximum false positive fraction indicates the fraction of pairwise links that have one or more keywords in common, and therefore some function in common.
Table: Total number of non-annotated genes that have one or more links to an annotated gene at the various distance thresholds. Columns: Non-annotated genes with links; Non-annotated genes linked to one or more annotated genes.
Combined methods: Operon, Rosetta Stone, Phylogenetic Profiles and conserved Gene Neighbors
Table: Summary of the number of functional linkages between M. tuberculosis genes established by each of the four prediction methods of this paper, alone and in combination. Rows: Operon (≤100 bp); 100 bp Operon and RS; 100 bp Operon and PP; 100 bp Operon and GN; 100 bp Operon, RS and PP; 100 bp Operon, RS and GN; 100 bp Operon, PP and GN; All four methods (100 bp Operon, PP, GN and RS).
The most substantial overlap is between the Operon method and the conserved Gene Neighbor method. We see that 18% of the links identified by the Operon method at the 100 bp threshold are also identified by the conserved Gene Neighbor method. Both of these methods are used to identify potential operons in microbial genomes, but the conserved Gene Neighbor method only identifies operons that are conserved in multiple genomes. The Operon method, in contrast, is based solely on the intergenic distance between genes in the same orientation, so it is able to identify potential operons even in the absence of homologous genes in other organisms. Ermolaeva et al. noted that many bacterial operons are organized in a similar manner in diverse genomes, and employed a combination of the conserved Gene Neighbor method along with a distance threshold to predict operons in microbial genomes.
We also see that a number of functional linkages inferred by the Operon method overlap with linkages inferred by the Phylogenetic Profile and Rosetta Stone methods. Moreno-Hagelsieb et al. previously compared genes within operons to those at transcription boundaries, and demonstrated that genes in known operons are more likely to have similar Phylogenetic Profiles, are more likely to occur as conserved Gene Neighbors and are more likely to occur as fusion proteins than genes at transcription boundaries. Yanai et al. also examined fusion genes, and noted many instances where individual components that constituted a 'fusion' gene were found as separate genes organized into common operons.
Table: Keyword recovery scores for the Operon method alone and in combination with the Rosetta Stone (RS), Phylogenetic Profile (PP), and conserved Gene Neighbor (GN) methods. Columns: Number of links between SWISS-PROT annotated proteins; Keyword recovery %; Maximum false positive fraction; Keyword recovery of random links (100 trials) %; Random links SD %; Keyword recovery signal to noise. Rows: 0 bp Operon; 25 bp Operon; 50 bp Operon; 75 bp Operon; 100 bp Operon; 100 bp Operon and RS; 100 bp Operon and GN; 100 bp Operon and PP; 100 bp Operon, PP and GN; 100 bp Operon, PP and RS; 100 bp Operon, RS and GN; All four methods.
Evaluation of coverage and establishing a distance threshold
The linked population is more heavily weighted at the short distances, with a mean of 27 base pairs, while pairs that were not identified as linked by the three methods have a mean of 94 base pairs. This result corresponds well with the study of Moreno-Hagelsieb et al., who observed short distances between adjacent genes predicted to be functionally linked by an extension of the conserved Gene Neighbor method. If we assume the profile depicted in Figure 5a represents the profile for true operon members in M. tuberculosis, we infer that functional linkages based on the Operon method at a distance of 50 bp may have a coverage of more than 80%, while functional linkages established at the 100 bp cutoff may allow inclusion of more than 90% of true operon pairs. This method may also be useful for confirming short intergenic distances between genes in common operons in various microbial genomes.
A closer examination of Figures 5a and 5b reveals the similarities and differences between the two populations. The large peak at a distance of 0 in Figure 5b shows probable linkages that are detected by the Operon method but are undetected by the Rosetta Stone, Phylogenetic Profile or conserved Gene Neighbor methods. Furthermore, the large number of potential linkages in Figure 5b above 200 bp is excluded by the Operon method, with a threshold of less than or equal to 100 bp.
Moreno-Hagelsieb et al. suggested a method for examining all adjacent gene pairs within the same direction (WD), and proposed a formula to calculate the fraction that exist in common operons. Moreno-Hagelsieb et al. estimated that 50% of E. coli WD pairs may be in shared operons [3, 4], and using their formula we estimate that 57% of M. tuberculosis WD pairs are in common operons. According to Skovgaard et al., E. coli and M. tuberculosis appear to have a similar quality of assigned open reading frames; it is therefore likely that the higher percentage of M. tuberculosis WD pairs in operons is not a result of over-assignment of open reading frames. Since M. tuberculosis has a higher percentage of WD pairs in operons, a higher distance threshold may be tolerated in M. tuberculosis than the 50 bp threshold used previously in E. coli.
Although we have selected the 100 bp threshold for evaluation of the combined computational approach, a lower threshold, such as 50 bp or 25 bp, results in a higher keyword recovery and a lower maximum false positive fraction (Table 2). The best scores for the Operon method, as we might expect, are achieved at a distance threshold of 0 bp. Although the keyword recovery and maximum false positive scores are improved at shorter distance thresholds, the coverage is decreased. Combining the Operon method with the Rosetta Stone, Phylogenetic Profile or conserved Gene Neighbor methods yields a dramatic improvement in both the keyword recovery and the maximum false positive fraction (Figure 4), and enables the use of a larger distance threshold, such as 100 bp. Gene pairs that are linked by two or more methods are very likely to share some function in common, even with a distance threshold of 100 bp (Figure 4).
For comparison, we have also included the distance profile of adjacent genes in experimentally documented E. coli operons (Figure 5c, data obtained from RegulonDB). The distributions of both the E. coli operon data (Figure 5c) and the M. tuberculosis linked data (Figure 5a) tend to have shorter intergenic distances than the set of M. tuberculosis genes that are not functionally linked by the Rosetta Stone, Phylogenetic Profile or conserved Gene Neighbor methods (Figure 5b). Most prokaryotic organisms do not have extensive data regarding experimentally documented operons, so we suggest that this method may serve as an alternate method for identifying intergenic distance distributions of potential operon members in less characterized microbial genomes.
Example of combined approach
Inference of protein function and operon organization
A possible operon involved in RNA degradation may be encoded by the three genes Rv2925c(rnc), Rv2926c and Rv2927c, as shown in Figure 8b. Functional links between Rv2926c and Rv2925c(rnc) were identified by the Operon, Rosetta Stone and conserved Gene Neighbor methods, while Rv2926c was linked to Rv2927c by the Operon and conserved Gene Neighbor methods. Rv2926c and Rv2927c have the Sanger annotation 'hypothetical protein' and 'conserved hypothetical protein', respectively. Since Rv2925c(rnc) has the annotation 'RNAse III', we may assign a putative function to both Rv2926c and Rv2927c as possibly involved in RNA degradation. Notice that the functional link between Rv2926c and Rv2925c(rnc) is supported by three separate forms of evidence (OP, RS and GN), while the only direct link between Rv2927c and Rv2925c(rnc) is by the Operon method. Further support for this link is established indirectly due to the two functional links to Rv2926c.
While the functional linkages established by the Rosetta Stone method in combination with the Operon method seem to be useful at identifying functionally-linked partners, the overlap between the Operon method and the conserved Gene Neighbor methods is more prevalent. Figure 8c shows a region with numerous functional linkages to the penicillin-binding protein, Rv2163c(pbpB). Here we see a string of four genes, three of which have the Sanger annotation of 'hypothetical protein' or 'conserved hypothetical protein'. Based on the functional linkages established by the Operon method, Phylogenetic Profiles and conserved Gene Neighbors methods, we may assign a putative function to Rv2164c, Rv2165c and Rv2166c, similar to that of the pbpB, which is involved in cell wall biosynthesis.
Applications to the identification of possible drug targets
In order to link Rv1503c and Rv1504c to a possible function or pathway we can examine all of the other functional linkages. Both Rv1503c and Rv1504c have a number of functional linkages, and some of these are common to the two genes: both have functional linkages to the genes Rv1302(rfe) and Rv3464(rmlB). Interestingly, both of these genes, rfe and rmlB, are important elements of the arabinogalactan biosynthesis pathway. Arabinogalactan is an essential component of the M. tuberculosis cell wall, and the arabinogalactan biosynthetic pathway is of major medical relevance, since two of its downstream members, EmbA and EmbB, are primary targets of the tuberculosis drug ethambutol. Although there have been efforts to identify members of this pathway, only a fraction of the pathway members are known. Here we propose that Rv1503c and Rv1504c may be important members of the arabinogalactan biosynthesis pathway, possibly organized into a shared operon, encoding functionally-linked proteins.
The 'probable glutamine synthase' gene Rv1878 is linked to the 'conserved hypothetical protein' Rv1879 by both the Operon Method and the Rosetta Stone method, possibly indicating a functional relationship between these two genes. In Arabidopsis thaliana, both domains encoded by M. tuberculosis genes Rv1878(glnA3) and Rv1879 occur as a single fused protein, as seen in Figure 10. A closer look at the phylogenetic distribution of proteins homologous to Rv1879 reveals that homologs of this protein are present in only a handful of prokaryotic genomes, and as fusion proteins with glutamine synthase-like domains in some fungi and plants. Glutamine synthetase homologs, on the other hand, are found in all living organisms. We speculate that Rv1879 may have a functional relationship to the 'probable glutamine synthase' gene Rv1878, and may encode for a protein that links this glutamine synthase homolog to a previously undescribed pathway. The absence of an M. tuberculosis Rv1879 homolog in mammals, and the possible association with a glutamine synthetase homolog, may make this protein a promising drug target candidate.
In this study we have focused on the Rosetta Stone, Phylogenetic Profiles and conserved Gene Neighbor linkages that overlap with the Operon linkages. The Operon method links genes to other genes at a particular genetic locus. In contrast, the Rosetta Stone, Phylogenetic Profiles and conserved Gene Neighbor methods can link genes that lie near or far along the chromosome. Thus by combining these methods we are able to identify functional linkages involving genes that are not part of operons as well as genes that are.
The organization of the M. tuberculosis genome holds many clues to the possible function of hundreds of previously uncharacterized proteins. The use of the Operon method, based on distance between genes in the same orientation, appears to provide a useful tool for the prediction of protein function as well as the identification of possible operon members. The coverage of linkages at various distances may be inferred by the distance profile of genes known to be functionally linked (Figure 5a), and the accuracy of these functional links may be represented by the maximum false positive rate (Table 2).
Taken together, using the Operon method alone with a threshold distance of 100 bp, we may expect to link over 90% of the true operon members in the M. tuberculosis genome. The accuracy of the 8,468 functional links established at that distance may be inferred from the maximum false positive fractions in Table 2: roughly 60% ([1-max. false positive fraction] × 100%) of these links may represent links between genes that have at least some functional similarity, many probably representing true operon members. We expect that the maximum false positive fractions represent the upper limit of true false positive pairwise linkages, since the annotations of many genes may be incomplete.
Further support for the idea that the functional similarity between genes in the same orientation is due primarily to operon structure comes from the observation that common function is related to the distance between genes in the same orientation, as depicted in Figure 3. Genes separated by a combined intergenic distance of more than 250 bp are no more likely to share a common function than randomly selected pairs. In contrast to operon prediction methods based on conserved gene strings [8, 22], methods based on intergenic distance thresholds allow identification of operon members without depending on identifiable homologues in other sequenced genomes [3, 23].
From our examination of experimentally documented operons in E. coli we expect that the Operon method would be able to identify functional relationships among proteins involved in a wide variety of functional categories. For example, in E. coli we observe operons containing genes involved in common metabolic pathways, multi-protein complexes, membrane-bound transport complexes, as well as genes involved in cell structure, cell adaptation, DNA replication, transcription, translation, regulatory functions and a number of other cellular activities.
Although the coverage of the Operon method alone allows us to identify thousands of potentially functionally-linked genes, a combined approach with the Rosetta Stone, Phylogenetic Profile and conserved Gene Neighbor methods allows us to establish higher confidence links, as demonstrated in Figure 4 and Table 5. The Operon method in combination with any of the other methods results in an increase in the keyword recovery and a decrease in the maximum false positive fraction. The combination of the 100 bp threshold Operon inferences with any one of the Rosetta Stone, Phylogenetic Profile or conserved Gene Neighbor methods exemplifies this. The 100 bp threshold Operon inferences alone have a keyword recovery of 45% and a maximum false positive fraction of 0.35; combining them with the Rosetta Stone, Phylogenetic Profile or conserved Gene Neighbor method increases the keyword recovery to 60-67% and decreases the maximum false positive fraction to 0.05-0.16, depending on the combination. Especially notable are the Operon inferences that overlap with Rosetta Stone inferences. The high keyword recovery and low maximum false positive fractions for this combination may indicate that these links represent not only genes with similar functions that are organized into a common operon, but also proteins that may physically interact.
Many linkages inferred by the Rosetta Stone, Phylogenetic Profile and conserved Gene Neighbor methods overlap with those of the Operon method. The highest overlap is with the conserved Gene Neighbor method. In many cases this would be expected, since genes organized in an operon would have a tendency to coevolve as a single unit rather than as separate units; these genes would therefore be observed as 'Neighbors' in multiple prokaryotic genomes. The Phylogenetic Profile overlap would result from the observation that operon members are often involved in shared pathways or complexes, and therefore would be expected to evolve in a correlated fashion. Finally, the overlap with the Rosetta Stone links may reflect the observation that the Rosetta Stone method, like the Operon method, often links proteins that either physically interact or are in the same pathway. The Phylogenetic Profile and conserved Gene Neighbor methods have also been used previously to confirm operon predictions based on intergenic distances [3, 23].
There are a number of potential applications for this combined method, ranging from the prediction of protein function based on functional linkages to annotated proteins, to the reconstruction of biochemical pathways. Zheng et al. have employed a combination of gene proximity and phylogenetic profiles to examine the co-evolution of gene clusters in E. coli, while Pellegrini et al. have used similar methods to those described here to construct a network of interconnected proteins within the Mycoplasma genitalium genome.
Here we have applied a combined method to investigate the genome of the pathogenic bacterium M. tuberculosis, and have demonstrated functional links from a number of previously uncharacterized proteins to specific biochemical pathways. Among these, we have identified five novel proteins that may participate in pathways for the biosynthesis of components of the mycobacterial cell wall. By applying these methods to the entire genome of M. tuberculosis, we have identified many other novel genes that are linked to numerous biochemical pathways, some of which may eventually serve as drug targets. Combined, these methods will also enable the genome-wide analysis of other prokaryotic genomes, and will aid in the identification of novel partners in both characterized and uncharacterized biochemical pathways. Newly inferred functional linkages are given at .
Materials and methods
M. tuberculosis gene coordinates
Gene name, length, coordinates and orientation were downloaded from the Pasteur Institute TubercuList web server. Gene coordinates were adjusted to include the stop codon of each gene.
Sanger Institute Functional Annotations
Sanger Institute M. tuberculosis H37Rv Functional Annotations were obtained from the Sanger M. tuberculosis web server.
SWISS-PROT Functional Annotations
M. tuberculosis SWISS-PROT Keywords were obtained from the Swiss Institute of Bioinformatics and European Bioinformatics Institute (EBI) SWISS-PROT web server.
Pairwise links between functionally-linked proteins were evaluated by a keyword recovery scheme using the SWISS-PROT annotation for each of the M. tuberculosis proteins. For keyword recovery scores, pairs were evaluated only when both members of the pair had at least one SWISS-PROT keyword. The uninformative keywords 'hypothetical protein', 'three-dimensional structure', 'transmembrane' and 'complete proteome' were discarded. Each pair of functionally-linked proteins received a preliminary score for the number of corresponding keywords which were shared between the two linked proteins. For example, consider the functional link at the 0 bp threshold between Rv0350 and Rv0351. Rv0350 has three SWISS-PROT keywords: ATP-binding, Chaperone and Heat shock. Rv0351 has two SWISS-PROT keywords: Chaperone and Heat shock. This link is assigned a preliminary score of two. This process was repeated for all linked genes, and a global measure of keyword recovery was derived by summing the individual link keyword scores and dividing by the total number of query keywords, as shown in Table 2.
The maximum false positive fraction was calculated by dividing the number of pairwise functional links that had no keywords in common by the total number of pairwise links. This estimate of false positives is presumably an upper limit because two linked genes might have related functions even in the absence of overlapping annotated functions.
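The two scores described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the scoring code used in the study: the function names are hypothetical, the first member of each pair is treated as the 'query' whose keywords form the denominator, and the "at least one keyword" filter is applied after uninformative keywords are discarded. The Rv0350 and Rv0351 keywords are those given in the text; `geneX` and `geneY` are invented for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Keywords the text describes as uninformative (compared case-insensitively).
UNINFORMATIVE = {
    "hypothetical protein",
    "three-dimensional structure",
    "transmembrane",
    "complete proteome",
}

def informative(keywords: Iterable[str]) -> Set[str]:
    """Drop uninformative keywords from one protein's annotation."""
    return {k for k in keywords if k.lower() not in UNINFORMATIVE}

def score_links(
    links: List[Tuple[str, str]], keywords: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Return (keyword recovery, maximum false positive fraction)."""
    shared_total = 0   # sum of per-link preliminary scores
    query_total = 0    # ASSUMPTION: keywords of the first gene in each pair
    no_overlap = 0     # links with no keyword in common
    evaluated = 0      # links where both members are annotated
    for a, b in links:
        kw_a = informative(keywords.get(a, ()))
        kw_b = informative(keywords.get(b, ()))
        if not kw_a or not kw_b:
            continue  # skip pairs with an unannotated member
        shared = kw_a & kw_b
        evaluated += 1
        shared_total += len(shared)
        query_total += len(kw_a)
        if not shared:
            no_overlap += 1
    recovery = shared_total / query_total if query_total else 0.0
    max_fp = no_overlap / evaluated if evaluated else 0.0
    return recovery, max_fp

keywords = {
    "Rv0350": {"ATP-binding", "Chaperone", "Heat shock"},
    "Rv0351": {"Chaperone", "Heat shock"},
    "geneX": {"Kinase"},      # hypothetical gene and keyword
    "geneY": {"Transport"},   # hypothetical gene and keyword
}
links = [("Rv0350", "Rv0351"), ("geneX", "geneY")]
recovery, max_fp = score_links(links, keywords)
print(recovery, max_fp)  # 0.5 0.5
```

In this toy case the Rv0350-Rv0351 link contributes a preliminary score of two out of three query keywords, and the invented link contributes zero out of one, so one of the two evaluated links counts toward the maximum false positive fraction.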
The keyword recovery of random links was calculated by establishing the same number of random pairwise links between SWISS-PROT annotated genes as there were real links, and calculating the keyword recovery as described above.
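A random baseline of this kind might be sketched as below. The sampling scheme (each random link is a pair of distinct annotated genes drawn uniformly), the fixed seed, the function names, and the choice of the first gene in each pair as the query are assumptions made for illustration; the toy keyword sets are invented.

```python
import random
import statistics
from typing import Dict, List, Set, Tuple

def keyword_recovery(
    links: List[Tuple[str, str]], keywords: Dict[str, Set[str]]
) -> float:
    """Shared keywords summed over links, divided by the query keywords."""
    shared = sum(len(keywords[a] & keywords[b]) for a, b in links)
    query = sum(len(keywords[a]) for a, _ in links)
    return shared / query if query else 0.0

def random_baseline(
    keywords: Dict[str, Set[str]], n_links: int, trials: int = 100, seed: int = 0
) -> Tuple[float, float]:
    """Mean and standard deviation of keyword recovery over random link sets."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    genes = sorted(keywords)    # annotated genes only
    scores = []
    for _ in range(trials):
        # Draw n_links random pairs of distinct annotated genes.
        links = [tuple(rng.sample(genes, 2)) for _ in range(n_links)]
        scores.append(keyword_recovery(links, keywords))
    return statistics.mean(scores), statistics.pstdev(scores)

# Invented annotations for four hypothetical genes.
toy_keywords = {
    "g1": {"Chaperone", "Heat shock"},
    "g2": {"Chaperone"},
    "g3": {"Transport"},
    "g4": {"Kinase", "ATP-binding"},
}
mean_recovery, sd_recovery = random_baseline(toy_keywords, n_links=3)
print(mean_recovery, sd_recovery)
```

The mean and standard deviation over the trials correspond to the random-link columns reported alongside the real keyword recovery scores.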
A database consisting of the gene name, start coordinate, end coordinate and gene orientation was constructed and employed to determine functional links between genes using distance and orientation parameters. Two adjacent genes were considered functionally linked if they were in the same orientation and the nucleotide distance between them was less than or equal to a specified distance threshold. Multiple genes were linked if a series of genes in the same orientation all had intergenic distances less than or equal to the defined distance threshold, as shown in Figure 1b.
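The linking rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the `Gene` record, the function names, the toy coordinates, and the convention that intergenic distance equals the downstream start minus the upstream end minus one (so overlapping genes give negative distances) are all assumptions made for the example.

```python
from typing import List, NamedTuple

class Gene(NamedTuple):
    name: str
    start: int   # first base of the gene (hypothetical 1-based convention)
    end: int     # last base of the gene, stop codon included
    strand: str  # '+' or '-'

def intergenic_distance(upstream: Gene, downstream: Gene) -> int:
    """Bases between two adjacent genes; negative when they overlap."""
    return downstream.start - upstream.end - 1

def link_operons(genes: List[Gene], threshold: int) -> List[List[str]]:
    """Group runs of adjacent same-strand genes spaced <= threshold apart."""
    ordered = sorted(genes, key=lambda g: g.start)
    groups: List[List[str]] = []
    current = [ordered[0]] if ordered else []
    for prev, gene in zip(ordered, ordered[1:]):
        same_strand = prev.strand == gene.strand
        if same_strand and intergenic_distance(prev, gene) <= threshold:
            current.append(gene)   # extend the current run
        else:
            if len(current) > 1:   # a single gene is not a linked group
                groups.append([g.name for g in current])
            current = [gene]       # start a new run
    if len(current) > 1:
        groups.append([g.name for g in current])
    return groups

# Toy coordinates, not taken from the M. tuberculosis annotation.
toy = [
    Gene("geneA", 100, 400, "+"),
    Gene("geneB", 401, 900, "+"),    # 0 bp after geneA
    Gene("geneC", 981, 1500, "+"),   # 80 bp after geneB
    Gene("geneD", 1510, 2000, "-"),  # opposite strand, never joins geneC
]
print(link_operons(toy, 0))    # [['geneA', 'geneB']]
print(link_operons(toy, 100))  # [['geneA', 'geneB', 'geneC']]
```

Raising the threshold from 0 bp to 100 bp pulls `geneC` into the group, which mirrors the trade-off between coverage and accuracy discussed for the distance thresholds.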
Rosetta Stone method
Proteins were functionally linked by the Rosetta Stone method if individual proteins were found to be present as a single fused protein in another organism, as described by Marcotte et al. In this case, if individual M. tuberculosis proteins have significant homology to distinct regions of a single 'fusion' protein in another organism then they are indicated as functionally linked by this method. A probabilistic score is calculated by estimating the likelihood of observing Rosetta Stone proteins given the number of homologs each protein has.
Phylogenetic profile method
Phylogenetic profiles were used to identify proteins that evolved in a correlated fashion, as described by Pellegrini et al. A phylogenetic profile for each M. tuberculosis protein was created in the form of a bit vector, by searching for the presence or absence of homologs in each of the available fully-sequenced genomes. The presence of an identifiable homolog in a particular genome was indicated by the integer 1 in the bit vector at the position corresponding to that genome, while the absence of a homolog was indicated by the integer 0. Phylogenetic profiles were then clustered based on the similarity of profiles, resulting in clusters of genes with similar profiles and likely related functions.
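The bit-vector construction can be illustrated with a short Python sketch. The clustering procedure is not specified in the text, so the example substitutes a simple stand-in: it links pairs of proteins whose profiles differ in at most a chosen number of positions (Hamming distance). The function names, the genome labels and the homolog assignments are all invented for illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

def build_profile(homolog_genomes: Set[str], genomes: List[str]) -> List[int]:
    """1 where a homolog was found in that genome, 0 where it was not."""
    return [1 if g in homolog_genomes else 0 for g in genomes]

def hamming(p: List[int], q: List[int]) -> int:
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

def similar_pairs(
    profiles: Dict[str, List[int]], max_mismatches: int = 0
) -> List[Tuple[str, str]]:
    """Stand-in for clustering: pairs whose profiles nearly match."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if hamming(profiles[a], profiles[b]) <= max_mismatches
    ]

# Hypothetical reference genomes and homolog search results.
genomes = ["genome1", "genome2", "genome3", "genome4"]
homologs = {
    "protA": {"genome1", "genome3"},
    "protB": {"genome1", "genome3"},
    "protC": {"genome2"},
}
profiles = {name: build_profile(found, genomes) for name, found in homologs.items()}
print(profiles["protA"])        # [1, 0, 1, 0]
print(similar_pairs(profiles))  # [('protA', 'protB')]
```

Here `protA` and `protB` share an identical pattern of presence and absence and are therefore linked, while `protC` is not.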
Conserved Gene Neighbor method
Functional links were established by the conserved Gene Neighbor method where genes appear as chromosomal neighbors in multiple genomes, as described by Overbeek et al. and Dandekar et al. For all possible pairs of M. tuberculosis genes, the nucleotide distance between homologs of these genes in all available sequenced genomes was calculated. Genes that were in close proximity in multiple genomes were indicated as functionally linked by this method. A probabilistic score reflects the likelihood of observing the intergenic distance between a pair of genes across all sequenced genomes.
Estimated fraction of adjacent gene pairs within the same direction (WD) that belong to operons
We employed the equation given by Moreno-Hagelsieb et al. to estimate the fraction of M. tuberculosis WD pairs that are in common operons. The fraction of M. tuberculosis WD pairs with an intergenic distance between -20 bp and 30 bp was divided by the fraction of E. coli WD pairs with an intergenic distance between -20 bp and 30 bp. This number was then multiplied by 0.5, which was previously estimated to be the fraction of E. coli WD pairs that are in operons [3, 4].
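The calculation described above amounts to a ratio of two fractions scaled by the E. coli estimate, and can be sketched as follows. The function names are hypothetical, the window is assumed to include both endpoints, and the distance lists are invented toy values rather than genome data.

```python
from typing import Sequence

def fraction_in_window(distances: Sequence[int], low: int = -20, high: int = 30) -> float:
    """Fraction of WD pairs whose intergenic distance lies in [low, high]."""
    if not distances:
        raise ValueError("at least one WD pair distance is required")
    return sum(low <= d <= high for d in distances) / len(distances)

def estimate_operon_fraction(
    target_distances: Sequence[int],
    reference_distances: Sequence[int],
    reference_operon_fraction: float = 0.5,  # E. coli estimate cited in the text
) -> float:
    """Scale the reference operon fraction by the ratio of short-distance fractions."""
    target = fraction_in_window(target_distances)
    reference = fraction_in_window(reference_distances)
    return reference_operon_fraction * target / reference

# Invented intergenic distances (bp) for WD pairs in two genomes.
target_wd = [-4, 0, 10, 25, 200, 150, 5, 60]       # 5 of 8 in the window
reference_wd = [0, 12, 300, 90, 20, -1, 70, 500]   # 4 of 8 in the window
print(estimate_operon_fraction(target_wd, reference_wd))  # 0.625
```

With these toy values the target genome has a larger share of short-distance WD pairs than the reference, so its estimated operon fraction exceeds the reference value of 0.5.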
M.S. is supported by a USPHS National Research Service Award GM07185.
- Madigan M, Martinko J, Parker J: Brock Biology of Microorganisms. 2000, New Jersey: Prentice Hall, 9th
- Lodish H, Baltimore D, Berk A, Zipursky SL, Matsudaira P, Darnell J: Molecular Cell Biology. 1995, New York: Scientific American Books, 3rd
- Moreno-Hagelsieb G, Collado-Vides J: A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics. 2002, 18 Suppl 1: S329-S336.
- Salgado H, Moreno-Hagelsieb G, Smith T, Collado-Vides J: Operons in Escherichia coli: genomic analysis and predictions. Proc Natl Acad Sci USA. 2000, 97: 6652-6657. 10.1073/pnas.110147297.
- Yada T, Nakao M, Totoki Y, Nakai K: Modeling and predicting transcriptional units of Escherichia coli genes using hidden Markov models. Bioinformatics. 1999, 15: 987-993. 10.1093/bioinformatics/15.12.987.
- Salgado H, Santos-Zavaleta A, Gama-Castro S, Millan-Zarate D, Diaz-Peredo E, Sanchez-Solano F, Perez-Rueda E, Bonavides-Martinez C, Collado-Vides J: RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res. 2001, 29: 72-74. 10.1093/nar/29.1.72.
- Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA. 1999, 96: 2896-2901. 10.1073/pnas.96.6.2896.
- Ermolaeva MD, White O, Salzberg SL: Prediction of operons in microbial genomes. Nucleic Acids Res. 2001, 29: 1216-1221. 10.1093/nar/29.5.1216.
- Pellegrini M, Thompson M, Fierro J, Bowers P: Computational method to assign microbial genes to pathways. J Cell Biochem Suppl. 2001, Suppl 37: 106-109. 10.1002/jcb.10071.
- Marcotte EM, Pellegrini M, Ho-Leung N, Rice D, Yeates T, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science. 1999, 285: 751-753. 10.1126/science.285.5428.751.
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA. 1999, 96: 4285-4288. 10.1073/pnas.96.8.4285.
- Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci. 1998, 23: 324-328. 10.1016/S0968-0004(98)01274-2.
- Marcotte EM, Pellegrini M, Thompson MJ, Yeates T, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature. 1999, 402: 83-86. 10.1038/47048.
- Moreno-Hagelsieb G, Trevino V, Perez-Rueda E, Smith TF, Collado-Vides J: Transcription unit conservation in the three domains of life: a perspective from Escherichia coli. Trends Genet. 2001, 17: 175-177. 10.1016/S0168-9525(01)02241-7.
- Yanai I, Wolf YI, Koonin EV: Evolution of gene fusions: horizontal transfer versus independent events. Genome Biol. 2002, 3: research0024.1-0024.13. 10.1186/gb-2002-3-5-research0024.
- Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, et al: The complete genome sequence of Escherichia coli K-12. Science. 1997, 277: 1453-1474. 10.1126/science.277.5331.1453.
- Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A: On the total number of genes and their length distribution in complete microbial genomes. Trends Genet. 2001, 17: 425-428. 10.1016/S0168-9525(01)02372-1.
- Fultz P, Kemper J: Wild-type isopropylmalate isomerase in Salmonella typhimurium is composed of two different subunits. J Bacteriol. 1981, 148: 210-219.
- Tamakoshi M, Yamagishi A, Oshima T: The organization of the leuC, leuD and leuB genes of the extreme thermophile Thermus thermophilus. Gene. 1998, 222: 125-132. 10.1016/S0378-1119(98)00482-X.
- Hatfull GF, Jacobs WR: Molecular Genetics of Mycobacteria. 2000, Washington DC: ASM Press.
- Harth G, Horwitz MA: An inhibitor of exported Mycobacterium tuberculosis glutamine synthetase selectively blocks the growth of pathogenic mycobacteria in axenic culture and in human monocytes: extracellular proteins as potential novel drug targets. J Exp Med. 1999, 189: 1425-1436. 10.1084/jem.189.9.1425.
- Wolf Y, Rogozin IB, Kondrashov AS, Koonin EV: Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res. 2001, 11: 356-372. 10.1101/gr.GR-1619R.
- Moreno-Hagelsieb G, Collado-Vides J: Operon conservation from the point of view of Escherichia coli, and inference of functional interdependence of gene products from genome context. In Silico Biol. 2002, 2: 87-95.
- Zheng Y, Roberts RJ, Kasif S: Genomic functional annotation using co-evolution profiles of gene clusters. Genome Biol. 2002, 3: research0060.1-0060.9. 10.1186/gb-2002-3-11-research0060.
- Computational Functional Linkages. [http://www.doe-mbi.ucla.edu/TB/computational/linkages]
- Pasteur Institute TubercuList. [http://genolist.pasteur.fr/TubercuList/index.html]
- The Sanger Institute: M. tuberculosis. [http://www.sanger.ac.uk/Projects/M_tuberculosis/Gene_list/]
- SWISS-PROT. [http://www.expasy.org/sprot/sprot_details.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.