Molecular signature of hypersaline adaptation: insights from genome and proteome composition of halophilic prokaryotes

A comparative genomic and proteomic study of halophilic and non-halophilic prokaryotes identifies specific genomic and proteomic features typical of halophilic species that are independent from genomic GC-content and taxonomic position.


Background
Halophiles are organisms adapted to thrive in extreme conditions of salinity. There is a wide range of halophilic microorganisms belonging to the domains Archaea and Bacteria. The intra-cellular machinery of these prokaryotes has evolved to function at very high salt concentrations [1][2][3][4][5]. A detailed understanding of the molecular mechanisms involved in the halophilic adaptation not only provides insight into the factors responsible for genomic and proteomic stability under high salt conditions, but also has importance for potential applications in the field of protein engineering [6,7].
The stable and unique native structure of a protein is a basic requirement for its proper functioning [8][9][10][11]. To understand molecular adaptation in hypersaline environments, it is important to address fundamental problems involving protein stabilization and solubility. An apparent way to achieve protein stability is to choose and arrange amino acid residues in their primary sequences in a specific or selective way. Several earlier works have revealed the elevated frequencies of negatively charged residues on protein surfaces as one of the most prominent features of halophilic organisms [1,4,12-16]. The higher usage of negatively charged amino acid residues leads to organization of a hydrated salt ion network at the surface of the protein [17] and formation of salt bridges with strategically positioned basic residues [18], regulating the stability of proteins. But an increase of acidic residues on protein surfaces is not the only possible adaptation to high salinity [13,19]. Earlier works have also pointed towards relatively low hydrophobicity as another adaptation to high salt environments [4,20]. Therefore, a clear and comprehensive picture of protein signatures for halophilic adaptation remains elusive.
Several studies have suggested that high genomic GC-content (well above 60%) is also a common feature of extreme halophiles, presumably to avoid UV induced thymidine dimer formation and possible accumulation of mutations [14,19]. The newly sequenced genome of the extreme halophilic organism Haloquadratum walsbyi is so far the only exception, with a remarkably low genomic GC-content of 47.9% [21]. At the codon usage level, a strong GC-bias was observed for Halobacterium sp. NRC1 [14], but not for H. walsbyi [21]. Thus, at the genomic level, the GC-bias is not a universal feature for adaptation to high salinity and other specific features of nucleotide selection may also be involved.
The current report presents an extensive and systematic analysis of the genome and proteome composition of halophilic organisms, along with a comparative study of non-halophiles, with a view to characterize the molecular signatures of halophilic adaptation. We consider 6 completely sequenced obligatory halophiles and compare them with 24 non-halophiles from various phyla of both Archaea and Bacteria with comparable GC-content to minimize the phylogenetic influence and the effect of mutational bias on their nucleotide/amino acid usage patterns. We examine the preferences, if any, in amino acid replacements from non-halophile to halophile orthologs in an attempt to understand which residues are instrumental for halophilic adaptation. Finally, we show how observed patterns of change in amino acid compositions in response to extreme conditions of the environment are related to physical principles that govern stability of proteins under such conditions. This study examines in detail the genome and proteome-wide adaptations to extreme environments, knowledge of which has important potential applications in various fields, including the engineering of industrial biomolecules.

Clustering of halophiles by amino acid composition
Clustering on the relative abundances of different amino acid residues reveals a clear segregation of the halophilic organisms from the non-halophiles (Figure 1). The left panel of Figure 1 depicts the unweighted pair group average clustering on the relative abundances of different amino acid residues in the encoded proteins of the 6 extreme halophilic and 24 non-halophilic organisms under study (Table 1) with respect to those of Escherichia coli, while the right panel offers a pictorial representation of relative amino acid usage in the respective organisms. As the relative abundances of the residues increase from 0.35 to 1.80, the color of the respective block changes from red to green, that is, the greener the color, the more abundant is the residue in that organism compared to E. coli. Halophilic organisms show quite distinct usage of amino acid residues compared to non-halophiles, elucidated by the presence of either more red or green blocks in Figure 1. Among the prominent trends are significant increases in Asp, Glu, Val, and Thr residues and decreases in Lys, Met, Leu, Ile, and Cys in halophilic proteomes. Usage of Ile is lower in all halophiles except H. walsbyi, probably due to its significantly lower genomic GC-content (Table 1). The increase in negatively charged (Asp and Glu) and Thr residues and the decrease in Lys and strong hydrophobic residues (Ile, Met, Leu) are consistent with earlier reports [4,12,14,18,22]. A relatively higher frequency of Val in extreme halophiles compared to non-halophiles supports the observation of Madern et al. [15], but contradicts earlier propositions on under-representation of all strong hydrophobic residues in halophiles [4,23].
Similar to the cluster analysis, correspondence analysis (COA) on amino acid usage also segregates the halophilic organisms from the non-halophiles along the second principal axis (Figure 2a). The first two principal axes of the COA contribute 16.29% and 13.79%, respectively, to the total variability. A strong negative correlation (r² = 0.57, p < 10⁻⁷) of axis 1 with the GC-content of the respective genomes identifies GC-bias as one of the major sources of inter-species variation in global amino acid composition, while the contributions to axis 2 come from hydrophobicity (negative correlation, with r² = 0.65, p < 10⁻⁷) and the ratio of negatively versus positively charged amino acid residues (positive correlation, with r² = 0.26, p < 10⁻⁷) of the encoded gene products of the organisms. This indicates, therefore, that the proteins of halophilic organisms are characterized by less hydrophobicity (or higher hydrophilicity) and relatively higher usage of negatively charged amino acids compared to non-halophile proteins. Figure 2b, c also supports the corollary that the features of halophilic proteomes are unique and quite distinct from those of non-halophiles with respect to hydrophobicity and usage of negatively charged amino acids (as predicted by isoelectric point distribution of encoded proteins).
All these trends are specifically exhibited by halophiles irrespective of their taxonomic origins and their genomic GC-content (Additional data file 1 and Table 1). For instance, five archaeal halophiles appear in a distinct cluster, far away from other closely related archaeal species like Methanosaeta thermophila, Thermoplasma acidophilum and so on (Figure 1). The salt-adapted bacteroidetes/chlorobi Salinibacter ruber also intermingles with these halophilic archaea, far apart from Pelodictyon luteolum, its closest non-halophilic taxonomic relative. H. walsbyi, a halophile with relatively low GC-content (47.9%), appears in the same cluster along with the GC-rich halophiles, while the three non-halophilic species with similar GC-content (E. coli, 50%; Shigella boydii, 47.4%; and Yersinia pestis, 47.8%) cluster with the other non-halophiles, most of which are characterized by much higher GC-content. It is worth mentioning at this point that organisms with high growth temperature also cluster together (Figure 1), of which two methanogenic organisms (M. thermophila and Methanothermobacter thermautotrophicus) share the same node. The distinct branching pattern of three thermophiles with relatively low genomic GC-content (T. acidophilum, Thermotoga maritima and Thermococcus kodakarensis) suggests that the overall GC-content also plays a significant role in shaping the amino acid composition of such organisms, as observed previously by Kreil and Ouzounis [24]. The exact topology of the cluster and values indicated by the colored blocks depend on the choice of standardization and the algorithm used for their construction, but the resulting grouping of the organisms in Figure 1 does not change significantly from that obtained using actual amino acid compositions of the respective organisms.
These observations point towards convergent evolution of halophilic proteomes for specific amino acid composition, despite their varying GC-bias and widely disparate taxonomic positions.

Comparison with non-halophilic orthologs
A comparison of orthologous proteins (cytosolic and membrane proteins separately) between halophilic and non-halophilic organisms was performed to identify the underlying factors for halophilic adaptation in more detail. Table 2 summarizes different proteomic properties of four sets of orthologous cytosolic proteins between halophiles and non-halophiles. In all cases, there is a significant increase in negatively charged, hydrophobic (Val) and borderline hydrophobic (Thr) residues and a decrease in positively charged, large hydrophobic and Cys residues (Table 2). Among negatively charged residues, the abundance of Asp (44% for set I, 65% for set II, 69% for set III and 55% for set IV) is higher than that of Glu (16% for set I, 43% for set II, 26% for set III and 34% for set IV). Similar trends were observed for the membrane proteins, although fairly large differences in amino acid usage were not found (data not shown).
Figure 1 Grouping of halophiles and non-halophiles according to their standardized amino acid usage. Standardized amino acid composition of halophiles and non-halophiles grouped by unweighted pair group average clustering. The left panel depicts the unweighted pair group average clustering on the relative abundances of different amino acid residues in the encoded proteins of organisms with respect to those of E. coli. The distance in the clustering is Euclidean distance. The right panel is a pictorial representation of relative amino acid usage in the respective organisms. The over-representation or under-representation of amino acid residues in the organisms are shown in green and red colored blocks, respectively. Archaeal species are denoted in pink color and the species adapted to high temperature (optimum growth temperature ≥ 65°C) are underlined. Organism abbreviations are listed in Table 1. Dendrogram axis: linkage distance.
We determined the frequencies of all possible amino acid replacements (that is, (20 × 19)/2 = 190 possible pairs of replacements) between the orthologous sequences in the direction from non-halophile to halophile proteins (Additional data files 2-5). There are 59 (31% of all possible pairs), 51 (26% of all possible pairs), 81 (42% of all possible pairs) and 76 (40% of all possible pairs) pairs of amino acids for sets I, II, III and IV, respectively, that have significant directional replacement bias (p < 10⁻² for set II; p < 10⁻³ for set I; and p < 10⁻⁶ for sets III and IV). They contribute 56%, 52%, 66% and 63% of the replacements for set I (28,267 of the 50,403 observed replacements), set II (10,815 of the 20,685 observed replacements), set III (69,974 of the 105,771 observed replacements) and set IV, respectively (Additional data files 6-9). The top 20 replacements in all these sets suggest that there are two clear trends in amino acid substitution patterns in terms of highest gain as well as highest ratio (Table 3). These are: Lys (non-halophile) substituted by other residues (halophile); and other residues (non-halophile) substituted by acidic residues, especially Asp (halophile). Lys→Asp topped the list of most significantly biased substitutions in terms of ratio in all the sets under study, indicating that this trend is independent of GC-composition and phylogeny. Another notable trend is Ile/Leu (non-halophile) substituted by Val/other residues (halophile). In set II, where the orthologs are of similar GC-composition, there is a prevalence of overall gain in Asp, Glu, Val and Thr, which are also gained in sets I, III and IV in halophile from non-halophile orthologs (Table 3). Thus, there is a prevalence of overall gain in Asp, Glu, Val and Thr and the most prominent losses common in all four groups are Lys, Ile, Met, Leu and Cys in halophile from non-halophile orthologs.
This result suggests that such gains and losses indeed represent an imprint of halophilic adaptation, and not the dragging effect of mutational bias or taxonomic differences.
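The directional replacement counts described above can be illustrated with a minimal sketch. It assumes that orthologs are already available as gapped pairwise alignments; the function names and the toy alignment are hypothetical and are not the software or data used in this study. For a pair such as Lys→Asp, the "gain" is the forward count minus the backward count, and the "ratio" is forward divided by backward.

```python
from collections import Counter


def count_replacements(aligned_pairs):
    """Count residue replacements in the direction non-halophile -> halophile.

    aligned_pairs: iterable of (non_halophile_seq, halophile_seq) strings of
    equal length taken from pairwise alignments; '-' marks a gap.
    """
    counts = Counter()
    for non_halo, halo in aligned_pairs:
        for a, b in zip(non_halo, halo):
            # Ignore identical positions and positions aligned to a gap.
            if a != b and a != "-" and b != "-":
                counts[(a, b)] += 1
    return counts


def directional_bias(counts, a, b):
    """Return (forward, backward, gain, ratio) for the replacement a -> b."""
    forward = counts[(a, b)]
    backward = counts[(b, a)]
    gain = forward - backward
    ratio = forward / backward if backward else float("inf")
    return forward, backward, gain, ratio


# Toy alignment (invented data, for illustration only).
pairs = [("MKKLIK", "MDDVID"), ("AKDLK", "ADKVD")]
counts = count_replacements(pairs)
print(directional_bias(counts, "K", "D"))
```

Assessing whether a given bias is significant would need a statistical test on the forward and backward counts, which this sketch does not attempt.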

Secondary structure comparison of orthologous sequences
The results of various traits observed from predicted secondary structure for four sets of orthologs are shown in Table 4. For all sets there are higher propensities for the formation of random coil regions and lower propensities for the formation of helical structures in the encoded proteins of halophiles compared to non-halophile proteins. We measured all nine types of secondary structure replacements of amino acid residues between four sets of orthologous protein sequences from non-halophilic organisms to halophilic organisms (Table 5). In all data sets, residues having higher propensities for helix or sheet formation in non-halophile proteins are replaced by residues having higher propensities for coil formation in halophile orthologs. The differences in the contributions of individual amino acids to the predicted secondary structures between halophiles and non-halophiles for four sets of proteins are given in Figure 3. The large hydrophobic (Leu, Met) and positively charged (Lys) amino acids with higher helical propensity are significantly underrepresented, whereas the Asp residue, with higher coil forming propensity, is greatly over-represented in halophile proteins. There is also a significant decrease in Ile and an increase in Val and Thr residues, all of which have higher sheet-forming propensities.

Comparison between known protein structures
One pair of crystal structures of the protein malate dehydrogenase (MDH) from halophilic Haloarcula marismortui and its ortholog from non-halophilic Chlorobium vibrioforme was selected and the secondary structures of these proteins were calculated with the help of the program MolMol. The proportion of helix-forming regions is markedly lower in H. marismortui MDH (43.7%) than in C. vibrioforme MDH (48.5%). The comparison of aligned sequences of secondary structure regions using the DSSP program also lends support to this notion (Figure 4). In the MDH of H. marismortui (pI = 4.2; Hydrophobicity = -0.408), the cumulative frequency of Asp and Glu is 20.5%, whereas in C. vibrioforme MDH (pI = 5.3; Hydrophobicity = 0.136) it is 12.9%.
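The two per-protein quantities quoted above, the cumulative frequency of Asp and Glu and an average hydrophobicity, can be sketched as follows. The text does not state which hydrophobicity scale was used; this sketch assumes the Kyte-Doolittle scale and the mean hydropathy per residue (often called GRAVY), so its numbers need not match the values reported here. The function names are illustrative.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not confirmed by the text).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def acidic_fraction(protein):
    """Fraction of residues that are Asp (D) or Glu (E)."""
    protein = protein.upper()
    return (protein.count("D") + protein.count("E")) / len(protein)


def mean_hydropathy(protein):
    """Mean Kyte-Doolittle hydropathy per residue (standard residues only)."""
    protein = protein.upper()
    return sum(KYTE_DOOLITTLE[res] for res in protein) / len(protein)


# Toy peptide (invented, for illustration only).
peptide = "MDEDVTDELK"
print(acidic_fraction(peptide), mean_hydropathy(peptide))
```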

Amino acid preference in halophiles is not a consequence of mono-nucleotide composition bias
The distinct amino acid usage pattern in halophiles might have originated from compositional bias operating at the nucleotide level, or from the preference for, or avoidance of, specific amino acid residues as a tool for halophilic adaptation. To distinguish between these two possibilities, we randomly re-shuffled the nucleotides in the coding sequences of all genomes and calculated the average amino acid composition of the hypothetical protein sequences of halophiles and non-halophiles obtained from the theoretical translation of the reshuffled gene sequences. If the selection had operated at the mono-nucleotide level, proteins translated from such randomly reshuffled hypothetical sequences of halophiles should feature similar trends as depicted by their true proteomes, since the nucleotide bias of the reshuffled sequences would have remained the same as those of the real gene sequences. On the contrary, if the distinct amino acid composition of halophile proteomes had evolved due to environmental adaptation of these extremophiles, the trends in amino acid usage in reshuffled hypothetical sequences would differ from those of actual halophilic proteins. In Figure 5, the striking difference between average amino acid compositions of halophilic and non-halophilic organisms for real proteomes and hypothetical proteomes simulated from reshuffled DNA suggests that some factors, other than the mono-nucleotide usage, influence the amino acid composition of proteins to maintain structure and function under halophilic conditions.
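The reshuffling control can be sketched in a few lines. This is only an illustration under stated assumptions: the standard genetic code is used, stop codons arising in the shuffled sequence are skipped (the text does not say how they were handled), and all function names are hypothetical. Shuffling preserves the mono-nucleotide composition of each coding sequence exactly, so any difference between the real and the reshuffled amino acid compositions cannot be attributed to mono-nucleotide bias alone.

```python
import random
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a DNA coding sequence, skipping stop codons (an assumption)."""
    usable = len(cds) - len(cds) % 3
    residues = (CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))
    return "".join(aa for aa in residues if aa != "*")


def shuffle_nucleotides(cds, rng):
    """Return a random permutation of the nucleotides of cds."""
    bases = list(cds)
    rng.shuffle(bases)
    return "".join(bases)


def composition(protein):
    """Relative amino acid frequencies of a protein sequence."""
    counts = Counter(protein)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


# Toy coding sequence (invented, for illustration only).
rng = random.Random(0)
real_cds = "ATGGATGAAGACGTCACCGATGAATAA"
print(composition(translate(real_cds)))
print(composition(translate(shuffle_nucleotides(real_cds, rng))))
```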

Genomic signature of halophiles
We calculated the dinucleotide abundance of all genomes to find out whether any specific nucleotide composition has significant influence on the genomic signature of obligatory halophiles. Clustering on dinucleotide abundance by city-block (Manhattan) distance clearly segregates the halophilic organisms (with over-representation of GA/TC, CG and AC/GT dinucleotides) from the non-halophiles (Figure 6a; Additional data file 10) irrespective of their archaeal or bacterial origin. In other words, the dinucleotide abundance profiles of halophilic genomes bear some common characteristics, which are quite distinct from those of non-halophiles and, hence, these may be regarded as specific genomic signatures for salt-adaptation. Cluster analysis on dinucleotide frequencies at the first and second codon positions of genes for all organisms also yielded separate clusters for halophiles and non-halophiles (Figure 6b). The higher frequencies of occurrence of GA, AC and GT dinucleotides at the first and second codon positions (Additional data file 11) undoubtedly reflect the requirements for Asp, Glu, Thr and Val residues in halophile protein sequences. Therefore, halophiles have a specific genome signature at the dinucleotide level, and this trend seems to be linked to a specific amino acid composition of proteins for halophilic adaptation. The high temperature adapted organisms seem to cluster together according to their overall dinucleotide relative abundance value, except Thermoplasma acidophilum. However, on the basis of dinucleotide frequencies at the first and second codon positions of genes, these organisms cluster together irrespective of any phylogenetic relationship. To assess the possible impact of the relative abundance of specific dinucleotides on the mechanical properties of halophilic genomes, we calculated the likelihood of their sequences forming a Z-DNA structure, using ZHunt software [25].
We found that there is a significant correlation (r² = 0.54, p < 10⁻⁴) between the propensity of DNA to flip from the B-form to the Z-form per kilobase of genome and the relative abundance of the CG dinucleotide.
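A minimal sketch of a dinucleotide relative abundance profile and of the city-block distance between two profiles is given below. The exact abundance measure is not defined in this section; the sketch assumes the commonly used odds ratio, observed dinucleotide frequency divided by the product of the two mononucleotide frequencies, computed on a single strand. Pairs such as GA/TC in the text suggest that the study's values were strand-symmetrized, which this sketch does not do. Names and sequences are illustrative.

```python
from collections import Counter
from itertools import product


def dinucleotide_abundance(seq):
    """Odds ratio f(XY) / (f(X) * f(Y)) for all 16 dinucleotides.

    Assumes seq has at least two bases and contains only A, C, G and T.
    """
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = len(seq)
    n_di = len(seq) - 1
    rho = {}
    for x, y in product("ACGT", repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        rho[x + y] = observed / expected if expected else 0.0
    return rho


def city_block(p, q):
    """City-block (Manhattan) distance between two abundance profiles."""
    return sum(abs(p[key] - q[key]) for key in p)


# Toy sequences (invented, for illustration only).
profile_a = dinucleotide_abundance("GACGACGTCGACACGT")
profile_b = dinucleotide_abundance("ATTATAAATTTAATAT")
print(city_block(profile_a, profile_b))
```

Values above 1 indicate a dinucleotide that occurs more often than expected from the base composition alone, and values below 1 indicate under-representation.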

Synonymous codon usage bias in halophiles
In an attempt to examine whether the pattern of synonymous codon usage in halophiles follows any specific signature, COA was performed on the relative synonymous codon usage (RSCU) of 82,927 predicted open reading frames (ORFs) from 30 microbial genomes (listed in Table 1). The axis 1-axis 3 plot in Figure 7a of the COA on RSCU values exhibits two distinct clusters, the halophile and non-halophile genomes being segregated along the third major axis, whereas the axis 1-axis 2 plot in Figure 7c separates thermophilic organisms from mesophiles, indicating distinct usage of synonymous codons in thermophiles, as reported earlier [8,26]. This is the first report that the pattern of synonymous codon usage in the halophilic prokaryotes is different from that in the non-halophilic prokaryotes. Axis 1 values show highly significant correlation with the GC3 values (r² = 0.85, p < 10⁻⁷), indicating separation of genomes according to their genomic GC-content.
While differences in genomic GC-content and high temperature adaptation explain variations along the first and second major axes (representing 19.4% and 11.1% of total variation, respectively) of the COA of RSCU, the variation along the third major axis (representing 9.1% of total variation) separates the halophiles from the non-halophiles. The distribution of codons along axis 3 (Figure 7b) depicts that the major contributors to this pattern are the distinct usage of synonymous codons encoding Arg (CGA and CGG being preferred by halophiles), Val (GUC is most preferred by halophiles), Thr (ACG is preferred by halophiles), Leu (CUC is the most preferred codon in halophiles) and Cys (UGU is generally preferred by halophiles). Comparison of codon usage values of 5,000 genes from both extremes of axis 3 shows that there are 18 and 14 codons, usage of which is significantly higher in the genes from the positive extreme and the negative extreme, respectively (Additional data file 12). Of the genes at the positive extreme, 97.5% are from halophiles, whereas 99.9% of the genes at the negative extreme are from non-halophiles. This means that in spite of their long-term evolutionary history, genes of the halophiles, in general, have converged to similar patterns of codon usage, which is quite distinct from the patterns followed by genes of non-halophilic organisms.
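For readers unfamiliar with RSCU, a minimal sketch of the usual definition follows: the observed count of a codon divided by the mean count of all codons that encode the same amino acid, so that a value of 1 indicates no bias. The sketch assumes the standard genetic code and the DNA alphabet (GTC here corresponds to GUC in the text); the function name and the toy sequence are illustrative, and the correspondence analysis itself is not shown.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group codons by the amino acid they encode.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)


def rscu(cds):
    """RSCU per codon for one in-frame coding sequence.

    Stop codons and single-codon amino acids (Met, Trp) are left out, as are
    amino acids that do not occur in the sequence.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for aa, codons in SYNONYMS.items():
        if aa == "*" or len(codons) == 1:
            continue
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue
        for c in codons:
            # observed count / (total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total
    return values


# Toy coding sequence (invented, for illustration only).
values = rscu("ATGGTCGTCGTTACGACGACCTAA")
print(values["GTC"], values["ACG"])
```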

Discussion
The present study discerns the nucleotide and amino acid biases in extreme halophiles and thereby characterizes the genomic/proteomic determinants of halophilic adaptation in prokaryotes. From this study, it appears that specific trends in amino acid usage are required for halophilic adaptation of organisms, irrespective of their genomic GC-content and taxonomic position. Evidence in favor of specific selection on dinucleotide and synonymous codon usage is apparent for halophiles (Figures 6a and 7a). Also, with regard to protein secondary structure, residues having lower propensities for forming alpha helical regions and higher propensities for forming coil regions are preferred more in halophiles than non-halophiles (Table 4). All of these findings strongly support the notion of convergent evolution not only at the level of proteome composition, but also at the level of genome organization of the microorganisms adapted to high salt environments.
Table 3 Top 20 amino acid pairs of 4 orthologous groups according to differences and ratios in number of forward (non-halophiles to halophiles) and backward (halophiles to non-halophiles) replacements. All the replacements of amino acid pairs are significant at p < 10⁻³ for set I, p < 10⁻² for set II, and p < 10⁻⁶ for sets III and IV, except replacements marked with asterisks. Organism abbreviations are listed in Table 1.
In order to subtract out the phylogenetic influence, we have included both bacterial and archaeal organisms in the dataset (Table 1). The dataset of halophilic organisms contains all available completely sequenced halophilic archaeal and bacterial species in the public domain, while the dataset of non-halophiles contains genome sequences of eight archaeal and sixteen bacterial species of diverse taxonomic origins. Among the archaeal non-halophiles are M. thermophila from Methanogens group II and T. acidophilum from Thermoplasmatales, organisms very close to haloarchaea as per the 16S rRNA tree (Additional data file 1). Among bacterial non-halophiles, we chose members from different bacterial phyla, such as proteobacteria, firmicutes, cyanobacteria, actinobacteria and especially P. luteolum from the bacteroidetes/chlorobi group to which the halophilic bacterium S. ruber belongs (Additional data file 1). It can be concluded, therefore, that the determinants of genomic/proteomic architecture in halophilic organisms are high salt adaptation specific, and transcend the boundary of phylogenetic relationships and the genomic GC-content of the species.

We have considered two chromosomes of H. marismortui in our analysis and found that the amino acid usage, dinucleotide relative abundance and synonymous codon usage of chromosome II are quite different from those of chromosome I (Figures 1, 2a, 6, and 7a), whereas the two chromosomes are relatively close to each other in the 16S rRNA tree (Additional data file 1). This observation supports the earlier notion [27] that almost the entire chromosome II of H. marismortui might have been acquired later during evolution, while its rRNA operon might have originated through duplication and subsequent divergence from the rRNA operons of chromosome I.
Our study clearly indicates that halophile proteins prefer to use Asp, Glu, Val and Thr at the expense of Lys, Met, Leu, Ile and Cys. Among the residues favored in the halophilic proteome, Asp and Glu are negatively charged and may localize in patches on protein surfaces. By binding a network of hydrated cations, they help in the maintenance of protein activities at high salt concentrations [12,14,18,22]. The less common residues in high salt-adapted organisms include the positively charged residue Lys and several large and strongly hydrophobic residues like Leu, Ile and Met. An empirical correlation between halophilic adaptations of some proteins and their relatively low hydrophobicity was reported earlier [28]. It is interesting to note that although halophilic proteomes are, in general, characterized by lower hydrophobicity compared to non-halophiles, the usages of Val and Thr are significantly higher in them (Figure 1, Table 2). Usage of the strong hydrophobic residue Ile is also relatively higher in H. walsbyi, possibly due to its significantly lower genomic GC-content. At high salt concentrations, proteins are, in general, destabilized [29]. Halophile proteins have, therefore, evolved specific mechanisms that allow them to be both stable and soluble in the high cytoplasmic NaCl/KCl concentration. In this environment the hydrophobic residues of newly synthesized proteins are exposed to high salt concentrations, leading to non-specific inter- or intramolecular interactions of their side chains, which may compete with proper intramolecular burial within the correct conformation [30]. Probably to minimize this possibility, all soluble halophilic proteins have a lower number of hydrophobic residues. The increase in negative charge on the surface of halophile proteins counteracts the lower dielectric constant at high salinity and thus provides for enhanced protein solubility.
Our results show that there is a marked, significant difference in the predicted secondary structures of halophile and non-halophile proteins. In proteins with higher percentages of helix structure, there is an increased overall packing that imparts more rigidity [31] and, hence, a decrease in regions with helix-forming propensities in halophile proteins probably makes them more flexible. As protein flexibility and protein function are strongly linked [...], a decrease in helix-forming regions or, in other words, an enhancement in protein flexibility, might be a strategy of halophile proteins for adaptation to extreme salt environments. It is worth mentioning in this context that Radivojac et al. [34] divided native proteins into four flexibility categories and found that flexible but ordered proteins are characterized by higher average hydrophilicity and higher occurrence of negatively charged residues, especially Asp. From a structural viewpoint, Asp is recognized as an alpha-helix breaker, whereas Glu is favorable for alpha-helix formation [35]. The coiled regions of proteins are known to prefer Asp over Glu in general [36]. These could be plausible reasons why halophile proteins use Asp residues more than Glu residues.
Table 5 Secondary structure replacements of four sets of halophile proteins and their non-halophile orthologs. Values represent ratios in number of forward (non-halophiles to halophiles) and backward (halophiles to non-halophiles) replacements. Entries in bold are significant at p < 10⁻³. Organism abbreviations are listed in Table 1.
Figure 3 Variations in amino acid content of different secondary structural regions.
A striking observation that has not been reported earlier and deserves mention is the consistent lower usage of Cys residues in halophile proteins. Cys residues are usually overrepresented in non-flexible regions due to the formation of rigid disulfide-bridges [33,37]. Avoidance of Cys residues in halophile gene products might give them more flexibility in high salt environments. Increased usage of Val in halophile proteins may also make them more flexible, because the strong hydrophobic residue Val has a lower helix formation propensity than other strong hydrophobic residues such as Leu, Met and Ile [36]. Thus, halophile proteins might have evolved to be more flexible but ordered and exhibit distinct secondary structure composition that has helped them to avoid aggregation and/or loss of function in extreme salt environments.
Like proteome composition, halophilic adaptation is also associated with a specific genome signature. The obligatory halophiles generally contain GC-rich genomes (well above 60%), except for H. walsbyi (genomic GC-content of 47.9%). A high GC-content in halophilic genomes is thought to help in avoiding UV-induced thymidine dimer formation and the possible accumulation of mutations in their specialized habitat (shallow coastal lagoons), characterized by high levels of UV irradiation [14,19]. In H. walsbyi, the disadvantage of a low GC-genome is thought to be partly compensated for by the presence of a relatively higher number (four copies) of photolyases [21]. Our analysis reveals that the genomes of all obligatory halophiles show definite dinucleotide abundance (higher abundance values for CG, GA/TC and AC/GT) compared to non-halophiles (Figure 6a, Additional data file 10). The genomic signature revealed by dinucleotide abundance analysis for halophiles, in general, is not species-specific, but salt adaptation specific, and hence may be an outcome of convergent evolution. The higher frequency of GA, AC and GT dinucleotides at the first and second codon positions undoubtedly reflects the requirement for Asp, Glu, Thr and Val residues in halophile protein sequences. The higher occurrence of the CG dinucleotide leads to a higher stacking energy, thus imparting stability to genomic DNA [38]. It is also known that high salt concentrations have a strong influence on the transition of B-DNA to Z-DNA and the relative stabilization of Z-DNA increases with increasing salt concentration [39]. Hence, the enhancement of the total stacking interaction (base-pair stacking and deoxyribose purine stacking) could contribute to the propensity of short d(CG)n sequences to adopt the Z-conformation [40].
A significant correlation (r² = 0.54, p < 10⁻⁴) between the propensity of Z-DNA formation per kilobase of genome and the relative abundance of CG dinucleotides supports this notion. We also observed that the pattern of synonymous codon usage in halophiles is significantly different from that in non-halophiles (Figure 7, Additional data file 12). Essentially, our results show that codon usage pattern among the 30 genomes (6 halophiles and 24 non-halophiles) is determined by three major factors: overall GC-bias (explained by the first major axis); a temperature dependent factor (explained by the second major axis); and a salinity dependent factor (explained by the third major axis). The COA on RSCU thus provides convincing evidence that synonymous codon usage in halophiles follows a similar trend, which is quite distinct from the trends observed in non-halophiles. Since the difference in synonymous codon usage between halophiles and non-halophiles is not due to a simple difference in the nucleotide content of the genomes, it seems that natural selection may be linked to the codon usage pattern of halophilic prokaryotes.

Conclusion
The present study demonstrates the generality of the mechanisms of macromolecular adaptation of extreme salt-loving organisms, irrespective of their genomic GC-content and taxonomic position. At the protein level, these include: convergent evolution towards a specific proteome composition, characterized by low hydrophobicity; over-representation of acidic residues, especially Asp; higher usage of Val and Thr; lower usage of Cys; and a lower propensity for helix formation and a higher propensity for coil structure. Among the signatures of halophilic adaptation at the DNA level, the abundance of GA, AC and GT dinucleotides may partly be coupled with the specific amino acid requirements, while CG dinucleotide abundance may be an additional halophilic signature of DNA stability at high salt concentration. The synonymous codon usage in halophiles also seems to have converged to a single pattern regardless of their long-term evolutionary history.

Sequence retrieval
All protein coding sequences of the chromosomes of 6 extreme halophiles (which grow optimally in approximately 3.5 M NaCl) and 24 non-halophiles from Archaea (both euryarchaeota and crenarchaeota) and Bacteria (including proteobacteria, firmicutes, cyanobacteria, actinobacteria, bacteroidetes/chlorobi, and so on) were retrieved from NCBI GenBank (version 145.0) and Halolex databases [41] (listed in Table 1). Except for H. walsbyi, all the extreme halophilic organisms are GC-rich, so to minimize the GC-compositional effect on amino acid usage comparison (as well as on codon usage), most of the chosen non-halophilic organisms are similarly GC-rich, while some others have GC-content comparable to that of H. walsbyi.

Cluster analysis and correspondence analysis on amino acid usage
To find the differences in amino acid usage between extreme halophilic and non-halophilic organisms, cluster analysis on amino acid composition was carried out using STATISTICA (version 6.0, published by Statsoft Inc., Tulsa, Oklahoma, USA) for all 30 organisms (Table 1). The amino acid usage of E. coli was chosen as a well-defined reference for standardization. Subsequently, using a program developed in Visual Basic, a 31 × 20 matrix was generated, where the rows and the columns correspond to data sources (that is, organisms in the cluster) and standardized amino acid usage values, respectively. Over-representation and under-representation of standardized amino acid usage values of the organisms in the matrix are shown as green and red colored blocks, respectively, in Figure 1.
COA on amino acid usage was performed using the program CODONW 1.4.2 [42] to identify the major factors influencing the variation in amino acid frequencies. These analyses generate a series of orthogonal axes to identify trends that explain the variation within a dataset, with each subsequent axis explaining a decreasing amount of the variation.

Dinucleotide analysis and reshuffling of DNA sequences
In order to identify any halophile-specific genome signature, dinucleotide abundance values [38,43] of genomes of halophiles and non-halophiles were calculated. Clustering of organisms on dinucleotide abundance values was done by the single linkage method and the nearest neighbor analysis was carried out using city-block (Manhattan) distance, calculated by summing the absolute differences between point coordinates. Dinucleotide frequencies at all three codon positions of each gene were also calculated and clustering was done using the single linkage method with Euclidean distance, which corresponds to the length of the shortest path between two points. Reshuffling of DNA sequences of ORFs was performed by swapping two randomly chosen nucleotides [44] in the sequence, excluding start and stop codons (we rejected shuffling in cases where stop codons appeared within the ORFs), and repeating this swapping procedure 3N times, where N is the length of the sequence.
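The calculations described above can be illustrated with a minimal Python sketch. It is not the program used in this study. It assumes that the dinucleotide abundance is the odds ratio f(XY)/(f(X)f(Y)) computed on a single strand (the cited definition may additionally symmetrize over both strands), and that a shuffled ORF containing an internal stop codon is discarded as a whole and the shuffle repeated. The function names, the `max_tries` parameter and the retry strategy are hypothetical.

```python
import random
from collections import Counter
from itertools import product

BASES = "ACGT"
STOPS = {"TAA", "TAG", "TGA"}


def dinucleotide_abundance(seq):
    """Odds ratio f(XY) / (f(X) * f(Y)) for all 16 dinucleotides (single strand)."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = sum(mono[b] for b in BASES)
    n_di = sum(di[x + y] for x, y in product(BASES, repeat=2))
    rho = {}
    for x, y in product(BASES, repeat=2):
        fx, fy = mono[x] / n_mono, mono[y] / n_mono
        fxy = di[x + y] / n_di
        # A dinucleotide whose bases are absent gets abundance 0 by convention here.
        rho[x + y] = fxy / (fx * fy) if fx and fy else 0.0
    return rho


def manhattan(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a[k] - b[k]) for k in a)


def has_internal_stop(orf):
    """True if any in-frame codon between the start and stop codons is a stop."""
    return any(orf[i:i + 3] in STOPS for i in range(3, len(orf) - 3, 3))


def shuffle_orf(orf, rng, max_tries=100):
    """Swap two random interior nucleotides 3N times; start and stop codons stay fixed.

    Assumption: a shuffle that creates an internal stop codon is rejected
    entirely and the procedure restarts from the original ORF.
    """
    orf = orf.upper()
    n = len(orf)
    for _ in range(max_tries):
        s = list(orf)
        for _ in range(3 * n):
            i = rng.randrange(3, n - 3)
            j = rng.randrange(3, n - 3)
            s[i], s[j] = s[j], s[i]
        candidate = "".join(s)
        if not has_internal_stop(candidate):
            return candidate
    raise RuntimeError("no stop-free shuffle found within max_tries")
```

Because swapping only permutes positions, the shuffled ORF keeps the nucleotide composition of the original while its dinucleotide and codon order is randomized, which is what makes it usable as a composition-matched control. Passing a seeded `random.Random` instance keeps the shuffle reproducible.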

Amino acid exchange bias and secondary structure prediction with orthologous sequences
Four sets of orthologous sequences between halophiles and non-halophiles were identified (according to the comparable GC-content of the species and also according to the close phylogenetic relationships) using the BlastP program [45] with a cutoff of E = 1.0 × 10⁻¹⁰. Hits less than 60% similar and having more than 20% difference in length with the query were removed from the dataset. Putative membrane proteins and proteins likely to be secreted or localized to the cell surface, predicted using TMHMM2.0 [46] and SignalP3.0 [47], were also separated out. Using these criteria we identified four sets of orthologs. Set I included 287 orthologous proteins of two closely related species, the halophile S. ruber and the non-halophile P. luteolum, both belonging to the phylum bacteroidetes/chlorobi. Set II contained 104 orthologous sequences from two species with similar GC-content (Table 1), the halophilic archaeon H. marismortui (Ch-I) and the non-halophilic bacterium Pseudomonas putida. Set III contained 584 orthologous proteins from a halophilic and a thermophilic archaeon, namely H. marismortui (Ch-I) and M. thermophila. Set IV incorporated 574 orthologous proteins of the halophilic archaeon Natronomonas pharaonis and uncultured methanogenic archaeon RC-I.
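The hit-filtering criteria above can be sketched as follows. This is an illustration only, not the pipeline used in this study: the `Hit` record and its field names are hypothetical, the length difference is assumed to be measured relative to the query length, and each criterion is assumed to be sufficient on its own to discard a hit.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BlastP hit (hypothetical record; field names are illustrative)."""
    query_id: str
    subject_id: str
    evalue: float
    similarity: float  # percent similarity reported for the alignment
    query_len: int
    subject_len: int


def keep_hit(hit, max_evalue=1e-10, min_similarity=60.0, max_len_diff=0.20):
    """Apply the E-value, similarity and length-difference thresholds."""
    if hit.evalue > max_evalue:
        return False
    if hit.similarity < min_similarity:
        return False
    # Assumption: length difference is expressed as a fraction of the query length.
    rel_diff = abs(hit.subject_len - hit.query_len) / hit.query_len
    return rel_diff <= max_len_diff


def filter_hits(hits):
    """Return only the hits that pass all three thresholds."""
    return [h for h in hits if keep_hit(h)]
```

Membrane and secreted proteins would still have to be removed in a separate step from the TMHMM and SignalP predictions, which this sketch does not attempt to reproduce.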
The amino acid sequences of these four sets of orthologous genes were aligned using ClustalW [48] and the amino acid replacements were arranged in a 20 × 20 matrix using the Substitution Pattern Analysis Software Tool (SPAST), a C++ program developed in-house [49]. Secondary structure prediction of orthologous protein sequences was carried out using the Predator program [50]. The content of amino acid residues in helix, sheet and random coil regions was computed. Secondary structure replacements were calculated by aligning orthologous protein sequences. All these calculations were performed by a C++ program developed in-house.
While examining the trends in amino acid or secondary structure replacements, the direction of conversion of non-halophile proteins to extreme halophile proteins was taken by convention as the 'forward' direction. Under unbiased conditions, the ratio of forward to reverse replacements was expected to be 1:1 for each pair of replacements. To test this hypothesis, the observed and expected numbers (based on a 1:1 ratio) were recorded for each pair of residues belonging to a particular group. In all cases, the chi-square test was applied to assess the significance of the directional bias, if any, at significance levels of 10⁻³ to 10⁻⁶. For each pair of replacements, the first and second rows of the 2 × 2 contingency table represented the number of replacements from one particular residue (say, i) to another (say, j) of the pair and the total count of the remaining replacements (say, k) from the residue i (where k ≠ j), respectively.
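The directional-bias test can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the code used in this study. The first function tests observed forward and reverse counts against the 1:1 expectation; the second computes the Pearson chi-square statistic for a 2 × 2 table without continuity correction, where the column layout (one column per residue of the pair) is an assumption, since only the rows are described in the text. Both use the identity that, for one degree of freedom, the upper-tail probability equals erfc(√(χ²/2)).

```python
import math


def chi2_one_to_one(forward, reverse):
    """Goodness-of-fit test of forward vs reverse counts against a 1:1 ratio."""
    expected = (forward + reverse) / 2
    chi2 = ((forward - expected) ** 2 + (reverse - expected) ** 2) / expected
    # Upper-tail p-value of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction.

    Assumed layout: row 1 holds i->j and j->i counts, row 2 holds the
    remaining replacements from i and from j, respectively.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * (a * d - b * c) ** 2 / denom
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p
```

For example, 60 forward against 40 reverse replacements gives χ² = 4.0, which is significant at the conventional 0.05 level but far from the 10⁻³ to 10⁻⁶ thresholds applied here.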

Indices used to identify the trends in codon and amino acid usage
Indices like RSCU [51], GC-content at third codon position, amino acid frequencies and average hydrophobicity (Gravy score) [52] of protein coding sequences were calculated to find out the factors influencing codon and amino acid usage. The isoelectric point (pI) of each protein was calculated using the Expasy proteomics server [53]. Calculation of the likelihood of a DNA sequence forming a Z-DNA structure was done using the ZHunt server [25].
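Three of these indices can be written down compactly, as in the following Python sketch. It is not the software used in this study. RSCU is taken here as the observed count of a codon divided by the mean count over its synonymous family under the standard genetic code; GC3 is computed over all codons (some tools restrict it to synonymous third positions); and the Gravy score is the mean Kyte-Doolittle hydropathy. The function names are hypothetical, and the choice to skip single-codon families (Met, Trp) is an assumption made because their RSCU is uninformative.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}

# Kyte-Doolittle hydropathy values.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}


def codons(cds):
    """Split a coding sequence into in-frame codons, dropping any trailing bases."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def rscu(cds_list):
    """RSCU per codon: count * (family size) / (total count of the family)."""
    counts = Counter(c for cds in cds_list for c in codons(cds) if c in CODON_TABLE)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for fam in families.values():
        total = sum(counts[c] for c in fam)
        if total == 0 or len(fam) == 1:  # unused amino acid, or Met/Trp
            continue
        for c in fam:
            values[c] = counts[c] * len(fam) / total
    return values


def gc3(cds):
    """Fraction of G or C at the third position, over all codons of the sequence."""
    thirds = [c[2] for c in codons(cds)]
    return sum(b in "GC" for b in thirds) / len(thirds)


def gravy(protein):
    """Mean Kyte-Doolittle hydropathy; residues outside the 20 standard ones are skipped."""
    vals = [KD[a] for a in protein.upper() if a in KD]
    return sum(vals) / len(vals)
```

An RSCU of 1 means a codon is used exactly as often as expected under equal usage of its synonyms, so values above or below 1 indicate preferred or avoided codons independently of amino acid composition.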

Comparison with known protein secondary structures
We obtained one pair of protein structures for extreme halophilic and non-halophilic organisms from the Protein Data Bank. The pair comprises the MDHs (BLAST p-value 1e-38) from H. marismortui (1D3A) [54] and C. vibrioforme (1GV1) [55]. The secondary structures of the modeled proteins were calculated using MolMol [56] and DSSP [57].

Authors' contributions
SP and SKB made substantial contributions to the conception of the study, devised the overall strategy, performed genome sequence analysis, developed relevant programs for sequence analysis, performed sequence alignment and drafted the manuscript. SD participated in the initial work, development of the work plan and manuscript preparation. ETH made thoughtful and constructive suggestions during preparation of the manuscript. CD participated in the design and coordination of the study and revised the manuscript critically for important intellectual content. All authors read and approved the final manuscript.

Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a figure demonstrating the phylogenetic relationship with 16S rRNA. Additional data file 2 is a table listing the trends in amino acid replacements in P. luteolum and S. ruber orthologs. Additional data file 3 is a