Characterization of probiotic Escherichia coli isolates with a novel pan-genome microarray
© Willenbrock et al.; licensee BioMed Central Ltd. 2008
Received: 30 July 2007
Accepted: 18 December 2007
Published: 18 December 2007
Microarrays have recently emerged as a novel procedure to evaluate the genetic content of bacterial species. So far, microarrays have mostly covered single or few strains from the same species. However, with cheaper high-throughput sequencing techniques emerging, multiple strains of the same species are rapidly becoming available, allowing for the definition and characterization of a whole species as a population of genomes - the 'pan-genome'.
Using 32 Escherichia coli and Shigella genome sequences we estimate the pan- and core genome of the species. We designed a high-density microarray in order to provide a tool for characterization of the E. coli pan-genome. Technical performance of this pan-genome microarray based on control strain samples (E. coli K-12 and O157:H7) demonstrated a high sensitivity and relatively low false positive rate. A single-channel analysis approach is robust while allowing the possibility for deriving presence/absence predictions for any gene included on our pan-genome microarray. Moreover, the array was sufficient to investigate the gene content of non-pathogenic isolates, despite the strong bias towards pathogenic E. coli strains that have been sequenced so far.
This high-density microarray provides an excellent tool for characterizing the genetic makeup of unknown E. coli strains and can also deliver insights into phylogenetic relationships. Its design poses a considerably larger challenge and involves different considerations than the design of single strain microarrays. Here, lessons learned and future directions will be discussed in order to optimize design of microarrays targeting entire pan-genomes.
Bacterial isolates are traditionally classified into species by bacteriological methods, and subtyped within the species by phenotypic or genotypic characterization. For the identification and subtyping of Escherichia coli isolates, a wide variety of typing methods have been developed. A recent addition to this spectrum is array comparative genomic hybridization (aCGH). Thus, microarray hybridization is becoming a standard procedure to evaluate the genetic content of a bacterial species. For E. coli, a microarray covering the gene content of seven strains was recently developed for the characterization of emerging pathogens. However, since then, many additional E. coli strains and plasmids have been sequenced, and the total number of genes potentially present in E. coli strains, the so-called 'pan-genome' [3, 4], increases with each new E. coli genome sequenced. A microarray chip approximating the complete pan-genome of E. coli would provide optimal sensitivity to characterize isolates. Here, we present a novel design of a microarray covering the complete currently known genome content of 32 sequenced genomes. Such a pan-genome microarray can be used for more precise characterization of novel strains, including emerging pathogens, and can also deliver insights into phylogenetic relationships.
Phylogenetic relationships are commonly determined by bacterial subtyping. Due to the complex sexual behavior of bacteria, phylogenetic trees obtained with individual genes often do not correspond to each other. Although multilocus sequence typing is now regarded by many as a good standard to determine phylogenetic relationships between and within bacterial species, it does not always reflect the true genetic diversity of members of a species; trees based on multilocus sequence typing may, therefore, differ significantly from a tree based on whole gene content. A pan-genome microarray may offer a suitable alternative to complete genome sequencing for extracting the necessary gene content to construct a realistic phylogenetic tree based on conserved gene content. The recent technological development in sequencing and the consequent price drop have led to an explosion of available genome sequences and perhaps within a few years will lead to sequencing being a faster and cost effective alternative to CGH microarray analysis. However, at the moment, sequencing is still more costly and less time efficient than hybridization experiments, while hybridization experiments potentially also can provide information regarding gene expression.
Here, we determine an approximate E. coli pan-genome, based on 24 E. coli and 8 Shigella genomes available at the time of analysis (November 2006). The inclusion of Shigella genomes was justified as the genus division between Shigella and Escherichia is historical but taxonomically incorrect [5, 6]. For simplicity, the Shigella and E. coli genomes are collectively referred to as E. coli. From these genomes we construct an E. coli pan-genome microarray. The technical performance of this pan-genome microarray is assessed by the correct identification of present and absent genes from the completely sequenced genome of the MG1655 isolate of E. coli strain K-12 (hereafter referred to as MG1655) and strain O157:H7 EDL933 (EDL933 for short), collectively referred to as the control strains. Pathogenic E. coli isolates are highly overrepresented in the available genome sequences and, hence, on our pan-genome chip. We assessed whether this chip could nevertheless be useful for characterization of non-pathogenic isolates by hybridizing four probiotic E. coli isolates to the chip. These isolates are part of a commercially available product (Symbioflor2) marketed for human use as an enhancer of the immune system. The product contains viable bacteria comprising at least four genotypes of commensal E. coli. By characterizing their gene content, we investigated the phylogenetic relationship of these isolates to other E. coli strains.
Defining the E. coli core-genome and pan-genome
Sequences included in the microarray design
NCBI Proj ID
E. coli 042 chromosome
E. coli 042 plasmid
E. coli 101-1 chromosome
E. coli 53638 chromosome
E. coli 536 chromosome
E. coli B chromosome
E. coli B171 chromosome
E. coli B171 plasmid
E. coli B7A chromosome
E. coli CFT073 chromosome
E. coli E11019 chromosome
E. coli E22 chromosome
E. coli E2348 chromosome
E. coli E2348 pB171 plasmid
E. coli E2348 p9123 plasmid
E. coli E2348 pGEPAT plasmid
E. coli E24377A chromosome
E. coli F11 chromosome
E. coli H10407 chromosome
E. coli HS chromosome
E. coli K12-MG1655 chromosome
E. coli K12-W3110 chromosome
E. coli O103Oslo chromosome†
E. coli O157RIMD0509952 chromosome
E. coli O157RIMD0509952 pO157
E. coli O157RIMD0509952 pOSAK1
E. coli RS218 chromosome
E. coli RS218 plasmid
E. coli UTI189 chromosome
E. coli UTI189 plasmid
E. coli VR50 chromosome†
E. coli APEC-O1 chromosome
E. coli O157EDL933 chromosome
E. coli O157EDL933 plasmid
S. boydii Sb227 chromosome
S. dysenteriae M131649 chromosome
S. dysenteriae Sd197 chromosome
S. dysenteriae Sd197 pSD1197
S. flexneri 2457T chromosome
S. flexneri 301 chromosome
S. flexneri 301 pCP301 plasmid
S. flexneri 8401 chromosome
S. sonnei 53G chromosome
S. sonnei Ss046 chromosome
S. sonnei Ss046 pSS plasmid
In designing the E. coli pan-genome microarray, genes were grouped based on their nucleotide sequences since the probes are based on DNA oligonucleotides. Moreover, the parameters used to group genes by similarity were adapted from those used for protein similarity when defining the core and pan-genome, in order to improve differentiation between the nucleotide sequences of similar E. coli genes found in different strains. For this purpose, the '50% sequence similarity of 50% of the sequence' conservation criterion was found to be sub-optimal. Instead, genes were grouped using slightly different and somewhat stricter homology criteria (see Materials and methods for details), producing a higher number of groups. This resulted in a total of 11,872 gene groups across the 32 genomes, compared to the smaller pan-genome of 9,433 gene groups resulting from comparison at the protein sequence level. Of the 11,872 gene groups, 2,041 consisted of genes found in all 32 strains. Thus, the stricter grouping criteria applied here produced a lower number than the currently estimated core genome size of 2,241 protein gene groups for 32 E. coli genomes.
In the presented design strategy, the inclusion of 32 E. coli strains in the microarray design necessitated the employment of a common standardized gene prediction strategy since some of the genomic sequences had poor or non-existing gene annotations. One option is to either include as many open reading frames as possible as potential genes (in a 'more is better' strategy) or, alternatively, to use EasyGene, a well performing and conservative gene predictor. One can argue that a 'more is better' strategy is preferred to the conservative gene prediction so that fewer genes would be missed. However, including spurious hypothetical genes in the design would potentially obstruct the probe design phase both in the grouping of gene families and in excluding otherwise perfect probes due to cross-hybridization to these false genes. Furthermore, in case of prediction of gene content in control and novel strains by hybridizing genomic DNA to the array, such false positives are just as unwelcome as false negatives. Nonetheless, absence of too many important E. coli genes is not desirable either. We therefore compared the genes predicted by EasyGene with the high-quality annotation of the K-12 MG1655 strain (version U00096.3). This revealed that of the 238 protein encoding genes not predicted by EasyGene, 206 were hypothetical genes, leader peptides, frameshifts, gene fragments or pseudogenes. Of the remaining 32 genes, 12 were present in at least one other E. coli strain considered in the design. Consequently, only 20 genes of potential interest were missed by EasyGene. Since this is less than half a percent of the genome (20/4,331 = 0.46%), we considered that the advantages of conservative standardized gene finding outweighed the disadvantages of missing a small minority of genes.
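The arithmetic behind this assessment can be retraced in a few lines. The sketch below is in Python and uses only the counts quoted above; the variable names are illustrative and not part of the original analysis.

```python
# Counts quoted in the text for the K-12 MG1655 comparison.
not_predicted = 238            # protein-encoding genes not predicted by EasyGene
dubious = 206                  # hypothetical genes, leader peptides, frameshifts,
                               # gene fragments or pseudogenes
rescued_by_other_strains = 12  # present in at least one other strain in the design
annotated_total = 4331         # denominator used in the text

remaining = not_predicted - dubious            # genes of possible interest
missed = remaining - rescued_by_other_strains  # genes absent from the design
fraction_missed = missed / annotated_total

print(f"{missed} genes missed ({fraction_missed:.2%} of {annotated_total})")
```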
Benchmarking the chip design
Following the filtering step, several gene groups were left with only few probes targeting them, and we found it necessary to remove groups that were targeted by three or fewer probes from further analysis. This reduced the average number of false positives from 267 to 87 (for MG1655) and from 638 to 405 when analyzing all control samples with regard to genes found to be present from analysis of log2 hybridization signals compared to genes predicted present from the genome sequence. On the other hand, gene groups represented by few probes were not as likely to result in false negatives since removal of these groups did not change the average number of false negatives significantly (data not shown).
Sensitivity and false discovery rate based on analysis of log2 intensities
In contrast to the MG1655 control strain, we did not observe enrichment in hypothetical genes among false positives for EDL933. In this case we suspect that the 'false positives' were actually true genes mistakenly missed by EasyGene. In support of this, EasyGene did actually predict only 4,664 genes for the EDL933 main chromosome compared to the 5,349 annotated in GenBank, possibly due to a number of unknown nucleotides still present in the published genome sequence. Gene expression profiling of these genes would confirm if these are in fact true genes that are expressed and thus incorrectly missed by EasyGene. Preliminary data from a gene expression study run in parallel with this work demonstrated that the gene expression profile of these genes indeed resembled that of other genes present in the EDL933 genome (Sekse C, Friis C, Wasteson Y, Ussery DW and Willenbrock H, unpublished results). This observation supports our interpretation that they are actually not false positives generated by bad chip manufacturing, hybridization artifacts or poor analysis approaches, but a consequence of an ambiguous DNA sequence that any gene predictor would have ignored. Ideally, they should have been categorized as true positives. Consequently, the low FDR obtained from the other control strain, MG1655, is a better indicator of our pan-genome chip performance.
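For readers who want to reproduce this kind of benchmark, the following Python sketch shows how sensitivity and FDR could be computed from two sets of gene groups: those called present from the hybridization signal and those predicted present from the genome sequence. The text does not spell out the formulas, so the standard definitions (sensitivity = TP/(TP+FN), FDR = FP/(TP+FP)) are assumed here, and the function name and toy identifiers are hypothetical.

```python
def sensitivity_and_fdr(called_present, sequence_present):
    """Compare array-based calls with sequence-based expectations.

    Assumes the standard definitions; the identifiers are gene-group names.
    """
    called = set(called_present)
    truth = set(sequence_present)
    tp = len(called & truth)   # called present and present in the sequence
    fp = len(called - truth)   # called present but absent from the sequence
    fn = len(truth - called)   # present in the sequence but not called
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    fdr = fp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, fdr


# Toy data, not taken from the study.
toy_truth = {"g1", "g2", "g3", "g4"}
toy_calls = {"g1", "g2", "g3", "g9"}
toy_sens, toy_fdr = sensitivity_and_fdr(toy_calls, toy_truth)
print(toy_sens, toy_fdr)
```

Under these definitions, a gene missed by the gene predictor but truly present in the strain counts as a false positive, which is the effect described above for EDL933.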
Log2 intensity results versus log2 ratio results for test samples MG1655 and EDL933
Analysis of probiotic E. coli strains
Comparison of Symbioflor2 isolates to predictions for control strain samples (table rows: No. of predicted genes; No. of genes in common with, based on log2 intensities; 'Novel' sample genes not in, based on log2 intensities)
Apart from the hemolysin genes and a gene annotated as 'putative iron-regulated outer membrane virulence gene', no other virulence genes were detected in the probiotic isolates. The observed genetic relatedness of probiotic strains to a virulent strain illustrates that both pathogenic and non-pathogenic E. coli strains use common strategies for adaptation to their niche. Of the genes found to be present in the probiotic isolates but not in a non-pathogenic E. coli strain (MG1655), many were bacteriophage-derived. Nevertheless, complete prophages were not present, and variation in phage gene content between and within the four probiotic isolates suggested that these bacteriophages were introduced in independent events. Transposases and other insertion sequence-related genes provided further evidence of the influence of mobile DNA on introducing genetic variation in a bacterial population. Of interest were genes present in the probiotic isolates but absent in MG1655 that were annotated as having general metabolic functions. A closer analysis of these findings would be necessary to assess if such genes provide improved fitness for colonization of the human gut, and so could explain the probiotic nature of the isolates. Also, one should keep in mind that the K-12 isolates represent a reduced E. coli genome, and some essential metabolic genes are known to be missing in these isolates. Complete lists of annotated genes found in each of the four Symbioflor2 isolates but not in the MG1655 control strain are provided as Additional data file 2.
The design of a microarray covering more than 30 genomes proved to be a considerable challenge. Multiple aspects had to be considered but the greatest difficulty was to filter out false positives, at the risk of introducing additional false negatives. The level of similarity between gene sequences should justify conserved annotation, but the borders of significance are diffuse and poorly defined. This is a consequence of biological processes that undergo gradual genetic changes. On one hand, probes should cover all versions of the same gene, but at the same time they should be able to distinguish between different genes and even, when relevant, distinct versions of the same gene in different strains. In light of this, conventional microarray design strategies, such as inclusion of mismatch probes for background estimation, will not work when dealing with multiple genomes. One can never ensure that a perfect match is absent for such probes in novel strains. Moreover, because the array should be able to interrogate for the presence of genes at the DNA level (as presented in this paper), the number of probes per gene should be allowed to vary. A higher number of probes is required for a sufficient coverage of long genes, whereas low quality probes would result if attempting to design the same number of probes for very short genes. Consequently, the challenge is to define, in a sensible way, such goals and to search for the best possible solution. Our pan-genome approach proved to be a suitable solution.
Recently, the idea of an 'open pan-genome' was introduced, where each newly sequenced strain would continue to add novel genes to the pan-genome of the species. It was suggested that Streptococcus agalactiae would have an open pan-genome, with the consequence that despite sequencing hundreds of strains, novel genes would still be discovered. E. coli is likely to also have an open pan-genome since it colonizes multiple environments, complex microflora biotopes, and, therefore, has multiple ways and sources for exchanging and obtaining genetic material. In line with this expectation, Chen and co-workers predicted that each new E. coli genome would add 441 genes to the E. coli pan-genome pool. However, this prediction of 'new genes' is possibly too high, since it was based on seven very diverse E. coli genomes only. Genome size differs considerably within the species, from the relatively small K-12 strains to the larger pathogenic O157:H7 strains. Therefore, we believe that our estimate of the E. coli pan-genome and the core genome is closer to what might be expected in the 'real world', since it is based on a much larger number and variety of strains. Thus, the number of added new genes per genome has dropped to about 79 genes when including data from 32 different strains, and may decrease further with improved genome annotations. This smaller estimate is in the same order of magnitude as predicted for other pan-genomes, such as Streptococcus (27 per genome for group A and 33 for group B). Still, our estimate for E. coli may be too conservative if the true diversity of E. coli is still insufficiently covered by the current genome sequences, that is, environmental and non-mammalian strains are under-represented, and the addition of these may initially add a significant number of novel genes to the E. coli pan-genome.
Furthermore, we were able to come up with a more accurate prediction of the E. coli core genome. Previously, the size of the E. coli core genome was assessed, based on seven different E. coli strains, to consist of between 2,865 and 3,475 genes [2, 19]. Based on the 32 genomes included in this study, we predict that the size of the core genome will approach approximately 1,560 essential genes, about half that of the previous estimates. We believe the current estimate to be more accurate, as it is based on a much larger number of genomes. However, in the present study, several unfinished genome sequences were included. Improving these both in terms of sequencing and assembly and in gene annotation quality, may result in an increased core genome size if the current partly finished genome sequences are missing core genes.
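To make the terms concrete, the Python sketch below counts pan-genome size, core-genome size and the number of gene groups newly contributed by each genome, given a mapping from genome name to its set of gene-group identifiers. It only counts what is observed; it does not attempt the extrapolation behind the estimates quoted above, and the function names and toy data are hypothetical.

```python
def pan_and_core(genomes):
    """genomes: dict mapping genome name -> set of gene-group identifiers."""
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)                              # found in any genome
    core = set.intersection(*gene_sets) if gene_sets else set()  # found in all genomes
    return pan, core


def new_groups_per_genome(genomes, order):
    """Count gene groups not seen in any genome earlier in `order`."""
    seen = set()
    added = []
    for name in order:
        added.append((name, len(genomes[name] - seen)))
        seen |= genomes[name]
    return added


# Toy data, not taken from the study.
toy_genomes = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5, 6},
    "C": {1, 2, 7},
}
toy_pan, toy_core = pan_and_core(toy_genomes)
toy_added = new_groups_per_genome(toy_genomes, ["A", "B", "C"])
print(len(toy_pan), len(toy_core), toy_added)
```

The per-genome counts depend on the order in which genomes are added, which is one reason estimates based on few genomes can differ from those based on many.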
To assess the performance of the chip as well as to identify the best way of analyzing data from it, control sample hybridizations were analyzed. Comparative hybridizations on dual channel microarrays have the advantage of reduced noise due to limited variations of probe hybridization efficiencies. However, a dual channel analysis is limited to probes covering the control sample so that noise reduction applies only to probes hybridizing to genes present in the control sample. Although the false positive rate was slightly higher for the single-channel analysis approach, we demonstrate that sensitivity is only marginally lower than for the dual channel approach while information can also be extracted regarding genes not present in the control sample. Consequently, this analysis approach offers a favorable possibility for deriving predictions for any gene present on the pan-genome microarray.
Pathogenic E. coli genomes are highly overrepresented on this pan-genome chip because the majority of the E. coli genomes sequenced to date are from pathogenic strains, and few originate from environmental sources or are commensal strains. Nonetheless, we found that the chip was widely useful for characterizing the gene content of non-pathogenic E. coli isolates and for investigating the non-pathogenic nature of these E. coli isolates.
Lessons learned from this microarray can be used to design better arrays in the future. Although we considered all designed probes for the chip, including probes with low specificity to all strains in a given gene group, based on our analysis of experimental results, we have found that a filtering step is necessary to remove less specific probes. Moreover, gene groups for which only few probes could be designed (above the probe score cutoff) were not as reliable as gene groups represented by a larger number of probes. While this is not surprising, it nonetheless makes it a difficult task to accurately probe for these genes.
Based on 32 E. coli and Shigella genome sequences, we have developed an E. coli pan-genome microarray representing the current pan-genome of E. coli. Although any individual E. coli genome contains between 4,000 and 5,000 genes, we find about twice as many distinct gene groups in the total gene pool examined. High-density pan-genome microarrays can be quite useful for characterizing either DNA content or gene expression from unknown E. coli strains. Thus, we found the technique sufficient to investigate the gene content of four non-pathogenic E. coli isolates despite the strong bias for pathogenic strains represented on the pan-genome array. The four analyzed probiotic E. coli isolates share a gene pool very similar to the E. coli K-12 strains, and additional strain-specific genes were often phage genes, transposases, insertion elements and metabolic genes. It remains to be seen to what degree these genes contribute to the probiotic nature of the isolates. Generally, we conclude that our high-density pan-genome array provides an excellent tool for characterizing the genetic makeup of unknown E. coli strains and can also deliver insights into phylogenetic relationships.
Materials and methods
Twenty-four E. coli chromosome sequences that were publicly available at the time of analysis (as one or multiple contigs) and nine plasmid sequences belonging to seven of the sequenced strains were included in this study. In addition, eight Shigella chromosomes were included (two S. sonnei, three S. flexneri, two S. dysenteriae and one S. boydii) with their three corresponding plasmids (Table 1).
Probe and microarray design
All considered genome and plasmid sequences (Table 1) were searched for genes using EasyGene version 1.0 or 1.2 [7, 8] in order to standardize gene finding. Genes were screened for homology using BLAST in order to prevent redundancy of the probes. Genes were placed in a group when homologous by the following criteria: E-value < 10^-5, bitscore > 55, and alignment constituting 50% or more of the longer of the two aligned sequences. Genes with no homology were represented as 'singles'. Groups of genes ('multiples') were aligned using ClustalW with default settings and a consensus sequence was derived using the most frequent nucleotide at each position, weighted by its background frequency in all genes. The probe design strategy employed by OligoWiz was used for probe selection. Two additional scores were introduced as parameters for the probe design software dealing with consensus sequences: a weighted conservation score and a gap score. The weighted conservation score uses Shannon's information measure for conservation at each nucleotide position in a probe. The influence of a mismatch on measured hybridization intensities has been reported to vary with its position, with positions towards either end having less influence. Therefore, each position was weighted according to a second order polynomial function. The probe's weighted conservation score is the product of the weighted position scores for mismatch basepairs. The gap score was used to identify probes that targeted gaps in the alignment of multiples. This score was used to design probes that specifically identified conserved regions of all genes in each group (thus attempting to avoid gaps).
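The grouping criteria above can be illustrated with a short Python sketch. It takes pairwise BLAST hits as plain tuples and merges genes into groups with a union-find structure. Note the assumptions: the text does not state how pairwise homology was turned into groups, so single-linkage merging is assumed here, and the function name, the tuple layout and the toy data are hypothetical.

```python
def group_genes(gene_lengths, hits, max_evalue=1e-5, min_bitscore=55.0, min_cover=0.5):
    """Group genes connected by hits that pass the homology criteria.

    gene_lengths: dict gene id -> sequence length (bp)
    hits: iterable of (query, subject, evalue, bitscore, alignment_length)
    Single-linkage merging is an assumption of this sketch.
    """
    parent = {gene: gene for gene in gene_lengths}

    def find(gene):
        # Follow parent links to the group representative (with path halving).
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for query, subject, evalue, bitscore, aln_len in hits:
        if query == subject:
            continue
        longer = max(gene_lengths[query], gene_lengths[subject])
        if evalue < max_evalue and bitscore > min_bitscore and aln_len >= min_cover * longer:
            parent[find(query)] = find(subject)

    groups = {}
    for gene in gene_lengths:
        groups.setdefault(find(gene), set()).add(gene)
    return list(groups.values())


# Toy data, not taken from the study.
toy_lengths = {"a": 1000, "b": 900, "c": 300, "d": 1000}
toy_hits = [
    ("a", "b", 1e-50, 800.0, 850),  # passes all three criteria
    ("a", "c", 1e-20, 200.0, 280),  # alignment covers < 50% of the longer gene
    ("b", "d", 1e-3, 60.0, 900),    # E-value too high
]
toy_groups = group_genes(toy_lengths, toy_hits)
multiples = [g for g in toy_groups if len(g) > 1]
singles = [g for g in toy_groups if len(g) == 1]
print(multiples, singles)
```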
All probes were designed as 55-60 mers, with variable length and sequence to optimize for the same melting temperature. Only standard nucleotides (GATC) were considered in the probe design. In total, 305,285 probes covering 11,768 gene groups and singles were designed. A detailed description of the microarray design may be found in Additional data file 3. The probe design was given the NimbleGen design_id 5137 and is available upon request.
Filtering of probes
Probes were aligned against all predicted gene sequences included using NCBI-BLAST blastn version 2.2.11 and the identity of each probe with each gene sequence was determined in base pairs. Sequences were extracted for which the ratio [bp identity/probe length] was >0.8. Probes that either matched more than one group or failed to match all genes in the design group were excluded from further analysis. Furthermore, groups for which three or fewer probes remained after filtering were removed from the subsequent analysis due to their increased risk of generating false positives (see Results). This resulted in a reduction to 224,805 probes covering 9,252 gene groups and singles. Consequently, the number of probes targeting each gene ranged from 4 to 29 with a median coverage of 27 probes per gene.
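The filtering rules in this section can be sketched in Python as follows. The thresholds come from the text (identity ratio above 0.8, at least four probes per retained group); the data layout, function name and toy probes are hypothetical and only serve to illustrate the logic.

```python
def filter_probes(probes, gene_to_group, min_identity=0.8, min_probes=4):
    """Return dict group -> list of retained probe ids.

    probes: dict probe id -> {"length": int, "group": design group,
                              "hits": {gene id: identical base pairs}}
    gene_to_group: dict gene id -> gene group
    """
    members = {}
    for gene, group in gene_to_group.items():
        members.setdefault(group, set()).add(gene)

    kept = {}
    for probe_id, probe in probes.items():
        matched = {gene for gene, identical_bp in probe["hits"].items()
                   if identical_bp / probe["length"] > min_identity}
        if {gene_to_group[g] for g in matched} != {probe["group"]}:
            continue  # matches another group (or nothing at all)
        if matched != members[probe["group"]]:
            continue  # fails to match every gene in its design group
        kept.setdefault(probe["group"], []).append(probe_id)

    # Drop groups left with three or fewer probes.
    return {group: ids for group, ids in kept.items() if len(ids) >= min_probes}


# Toy data, not taken from the study.
toy_map = {"a1": "A", "a2": "A", "b1": "B"}
toy_probes = {f"p{i}": {"length": 60, "group": "A", "hits": {"a1": 58, "a2": 57}}
              for i in range(1, 5)}
toy_probes["p5"] = {"length": 60, "group": "A", "hits": {"a1": 60, "a2": 40}}  # misses a2
toy_probes["p6"] = {"length": 60, "group": "A",
                    "hits": {"a1": 60, "a2": 60, "b1": 55}}  # also matches group B
for i in range(7, 10):
    toy_probes[f"p{i}"] = {"length": 60, "group": "B", "hits": {"b1": 60}}  # only 3 probes
filtered = filter_probes(toy_probes, toy_map)
print(filtered)
```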
Annotation of gene groups
Each gene group in the probe design was annotated according to the results from hits against the UniProtKB/Swiss-Prot release 52.5 and UniProtKB/TrEMBL release 35.5 protein databases using NCBI-BLAST Blastp version 2.2.11. Only alignments covering >50% of the gene lengths and having 50% or better identity within the alignment were included. Among all the sequences within each group, the hit producing the highest percent identity was chosen. In this way, 5,348 of our 11,872 gene groups could be annotated against Swiss-Prot and 9,320 of our 11,872 gene groups could be annotated against TrEMBL. Thus, while Swiss-Prot generally produces more reliable annotations, the number of annotations produced was quite low. Consequently, each gene group was assigned the more reliable Swiss-Prot annotation when available; otherwise, it was assigned the TrEMBL annotation if one was available. Gene groups that could not be assigned an annotation were labeled as hypothetical proteins.
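The annotation priority described here (Swiss-Prot first, then TrEMBL, then 'hypothetical protein') can be expressed compactly. The Python sketch below assumes each hit is already summarized by its coverage of the gene, its percent identity and a description; that record layout and the function name are hypothetical.

```python
def annotate_group(swissprot_hits, trembl_hits, min_cover=0.5, min_identity=50.0):
    """Pick an annotation for one gene group.

    Each hit is a dict with "coverage" (fraction of the gene length aligned),
    "identity" (percent identity within the alignment) and "description".
    """
    def best(hits):
        usable = [h for h in hits
                  if h["coverage"] > min_cover and h["identity"] >= min_identity]
        if not usable:
            return None
        return max(usable, key=lambda h: h["identity"])["description"]

    # Prefer Swiss-Prot, fall back to TrEMBL, then to a generic label.
    return best(swissprot_hits) or best(trembl_hits) or "hypothetical protein"


# Toy data, not taken from the study.
toy_sp = [{"coverage": 0.4, "identity": 99.0, "description": "too short to count"}]
toy_tr = [
    {"coverage": 0.9, "identity": 70.0, "description": "putative transposase"},
    {"coverage": 0.8, "identity": 95.0, "description": "phage tail protein"},
]
print(annotate_group(toy_sp, toy_tr))
print(annotate_group([], []))
```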
Strain selection, DNA preparation and hybridization
Control strain K-12 MG1655 was kindly provided by Flemming G Hansen (CBS, BioCentrum-DTU, The Technical University of Denmark) and genomic DNA from control strain O157:H7 EDL933 was kindly provided by Camilla Sekse (Norwegian Veterinary school, Oslo). As test strains, Kurt Zimmermann from Symbiopharm (Herborn, Germany) supplied four probiotic E. coli isolates, designated G1/2, G3/10, G4/9 and G5, from their commercially available Symbioflor2 product. G1/2 has previously been serotyped as O rough:K-:H- and O rough:H-, G3/10 as O 35,129:K-:H-, G4/9 as O rough:K-:H-, and G5 as O rough:K-:H-.
All test strains were grown overnight in Luria-Bertani (LB) broth with continuous agitation, and DNA was isolated as described previously. The genomic DNA was labeled with Cy3 or Cy5 and hybridized to NimbleGen custom arrays according to NimbleGen standard protocols for CGH (prepared and hybridized by NimbleGen (Madison, Wisconsin, USA)). The raw data are available from the Gene Expression Omnibus (GEO) database with series accession number GSE8595.
The probes were mapped to each gene group including position according to the design and analyzed as described previously with minor modifications. Briefly, a position-dependent segmentation algorithm was employed to partition data points into present and absent sequence segments constituting any given gene. For this, we used circular binary segmentation with default settings as implemented in DNAcopy developmental version 1.2.1 written for the R statistical language. As recommended by the authors, the data were first smoothed and subsequently segmented. Segmentation was followed by merging the output with MergeLevels with a fixed threshold at the standard deviation between segmented log2 intensities and observed log2 intensities, or the standard deviation of segmented log2 ratios.
Consequently, following noise reduction by segmentation and merging, the cutoff for log2 intensities was found at the merged value between these two distribution maxima with the least segments assigned to it. All segments with merged values above this cutoff were predicted as present. Since ratios are calculated only for genes present in the control sample, and given the likely high similarity between a test sample and control sample of the same species, most genes are assumed present. Consequently, here the present level was estimated as the merged level to which most segments had been assigned. Moreover, for a gene to be called present, at least 90% of its probes should be found to be present. Accordingly, the samples were both analyzed individually as log2 intensities and combined with the appropriate control experiment, as log2 ratios.
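The final calling rule (a gene is present when at least 90% of its probes are present) is simple to state in code. The Python sketch below assumes that segmentation and merging have already assigned a merged level to every probe and that a cutoff has been chosen as described above; the function name and toy values are hypothetical, and the segmentation itself is not reproduced here.

```python
def call_gene_presence(probe_levels, gene_probes, cutoff, min_fraction=0.9):
    """Call each gene present or absent from merged per-probe levels.

    probe_levels: dict probe id -> merged segment level (log2 intensity)
    gene_probes: dict gene id -> list of probe ids targeting that gene
    A probe counts as present when its merged level lies above `cutoff`.
    """
    calls = {}
    for gene, probe_ids in gene_probes.items():
        if not probe_ids:
            calls[gene] = False  # no probes, no evidence
            continue
        present = sum(1 for p in probe_ids if probe_levels[p] > cutoff)
        calls[gene] = present / len(probe_ids) >= min_fraction
    return calls


# Toy data, not taken from the study: ten probes per gene, cutoff at 10.0.
toy_gene_probes = {
    "geneX": [f"x{i}" for i in range(10)],
    "geneY": [f"y{i}" for i in range(10)],
}
toy_levels = {f"x{i}": (12.0 if i < 9 else 8.0) for i in range(10)}       # 9 of 10 above
toy_levels.update({f"y{i}": (12.0 if i < 8 else 8.0) for i in range(10)})  # 8 of 10 above
toy_calls = call_gene_presence(toy_levels, toy_gene_probes, cutoff=10.0)
print(toy_calls)
```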
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a table providing a ranked list of each Symbioflor2 isolate's similarity to chip design strains. Additional data file 2 contains complete lists of annotated genes found in each of the four Symbioflor2 isolates but not in the MG1655 control strain. Additional data file 3 contains a detailed description of the microarray design.
CGH: comparative genomic hybridization
FDR: false discovery rate.
The authors are grateful for support from the Danish Research Councils, as well as The Danish Center for Scientific Computing. We also wish to thank Flemming G Hansen, Kurt Zimmermann and Camilla Sekse for contributing E. coli strains. We would also like to acknowledge Carsten Friis, Aron C Eklund, Jon Bohlin and colleagues at CBS for many helpful contributions and discussions.
- Dorrell N, Hinchliffe SJ, Wren BW: Comparative phylogenomics of pathogenic bacteria by microarray analysis. Curr Opin Microbiol. 2005, 8: 620-626. 10.1016/j.mib.2005.08.012.PubMedView ArticleGoogle Scholar
- Willenbrock H, Petersen A, Sekse C, Kiil K, Wasteson Y, Ussery DW: Design of a seven-genome Escherichia coli microarray for comparative genomic profiling. J Bacteriol. 2006, 188: 7713-7721. 10.1128/JB.01043-06.PubMedPubMed CentralView ArticleGoogle Scholar
- Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, et al: Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae : implications for the microbial "pan-genome". Proc Natl Acad Sci USA. 2005, 102: 13950-13955. 10.1073/pnas.0506758102.PubMedPubMed CentralView ArticleGoogle Scholar
- Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R: The microbial pan-genome. Curr Opin Genet Dev. 2005, 15: 589-594. 10.1016/j.gde.2005.09.006.PubMedView ArticleGoogle Scholar
- Yang J, Nie H, Chen L, Zhang X, Yang F, Xu X, Zhu Y, Yu J, Jin Q: Revisiting the molecular evolutionary history of Shigella spp. J Mol Evol. 2007, 64: 71-79. 10.1007/s00239-006-0052-8.PubMedView ArticleGoogle Scholar
- Lan R, Reeves PR: Escherichia coli in disguise: molecular origins of Shigella. Microbes Infect. 2002, 4: 1125-1132. 10.1016/S1286-4579(02)01637-4.PubMedView ArticleGoogle Scholar
- Larsen TS, Krogh A: EasyGene - a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics. 2003, 4: 21-10.1186/1471-2105-4-21.PubMedPubMed CentralView ArticleGoogle Scholar
- Nielsen P, Krogh A: Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics. 2005, 21: 4322-4329. 10.1093/bioinformatics/bti701.
- Tannock GW: Molecular assessment of intestinal microflora. Am J Clin Nutr. 2001, 73: 410S-414S.
- Hartl DL, Dykhuizen DE: The population genetics of Escherichia coli. Annu Rev Genet. 1984, 18: 31-68. 10.1146/annurev.ge.18.120184.000335.
- Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004, 5: 557-572. 10.1093/biostatistics/kxh008.
- Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005, 21: 4084-4091. 10.1093/bioinformatics/bti677.
- Perna NT, Plunkett G, Burland V, Mau B, Glasner JD, Rose DJ, Mayhew GF, Evans PS, Gregor J, Kirkpatrick HA, et al: Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature. 2001, 409: 529-533. 10.1038/35054089.
- Roos V, Nielsen EM, Klemm P: Asymptomatic bacteriuria Escherichia coli strains: adhesins, growth and competition. FEMS Microbiol Lett. 2006, 262: 22-30. 10.1111/j.1574-6968.2006.00355.x.
- Damian M, Usein CR, Tatu-Chitoiu D, Palade AM, Popovici N, Ciontea S, Nica M, Grigore L: Incidence of virulence-encoding genes among enteric Escherichia coli strains isolated from healthy subjects. Roum Arch Microbiol Immunol. 2005, 64: 34-38.
- Bettelheim KA, Kuzevski A, Gilbert RA, Krause DO, McSweeney CS: The diversity of Escherichia coli serotypes and biotypes in cattle faeces. J Appl Microbiol. 2005, 98: 699-709. 10.1111/j.1365-2672.2004.02501.x.
- Schierack P, Steinruck H, Kleta S, Vahjen W: Virulence factor gene profiles of Escherichia coli isolates from clinically healthy pigs. Appl Environ Microbiol. 2006, 72: 6680-6686. 10.1128/AEM.02952-05.
- Chen Q, Savarino SJ, Venkatesan MM: Subtractive hybridization and optical mapping of the enterotoxigenic Escherichia coli H10407 chromosome: isolation of unique sequences and demonstration of significant similarity to the chromosome of E. coli K-12. Microbiology. 2006, 152: 1041-1054. 10.1099/mic.0.28648-0.
- Chen SL, Hung CS, Xu J, Reigstad CS, Magrini V, Sabo A, Blasiar D, Bieri T, Meyer RR, Ozersky P, et al: Identification of genes subject to positive selection in uropathogenic strains of Escherichia coli: a comparative genomics approach. Proc Natl Acad Sci USA. 2006, 103: 5977-5982. 10.1073/pnas.0600938103.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673.
- Wernersson R, Nielsen HB: OligoWiz 2.0 - integrating sequence feature annotation into the design of microarray probes. Nucleic Acids Res. 2005, 33: W611-615. 10.1093/nar/gki399.
- Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990, 18: 6097-6100. 10.1093/nar/18.20.6097.
- Hughes TR, Mao M, Jones AR, Burchard J, Marton MJ, Shannon KW, Lefkowitz SM, Ziman M, Schelter JM, Meyer MR, et al: Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat Biotechnol. 2001, 19: 342-347. 10.1038/86730.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- O'Donovan C, Martin MJ, Gattiker A, Gasteiger E, Bairoch A, Apweiler R: High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief Bioinform. 2002, 3: 275-284. 10.1093/bib/3.3.275.
- Sambrook J, Fritsch EF, Maniatis T: Molecular Cloning: a Laboratory Manual. 1989, Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, 2nd edition.
- Grimberg J, Maguire S, Belluscio L: A simple method for the preparation of plasmid and chromosomal E. coli DNA. Nucleic Acids Res. 1989, 17: 8893. 10.1093/nar/17.21.8893.
- Barrett T, Edgar R: Gene expression omnibus: microarray data storage, submission, retrieval, and analysis. Methods Enzymol. 2006, 411: 352-369. 10.1016/S0076-6879(06)11019-8.
- Bioconductor. [http://www.bioconductor.org]
- Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW: A DNA structural atlas for Escherichia coli. J Mol Biol. 2000, 299: 907-930. 10.1006/jmbi.2000.3787.
- Hallin PF, Binnewies TT, Ussery DW: Genome update: chromosome atlases. Microbiology. 2004, 150: 3091-3093. 10.1099/mic.0.27582-0.
- Zoomable Hybridization and Blast Atlas for 'Characterization of Probiotic Escherichia coli Isolates Using a Novel Pangenome Microarray'. [http://www.cbs.dtu.dk/services/GenomeAtlas/suppl/zoomatlas/?zpid=ecoli_pangenome]
- NCBI GenomeProjects. [http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi]
- EasyGene 1.2. [http://servers.binf.ku.dk/cgi-bin/easygene/search]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.