The proteome of Toxoplasma gondii: integration with the genome provides novel insights into gene expression and annotation

A proteomics analysis identifies one third of the predicted Toxoplasma gondii proteins and integrates proteomics and genomics data to refine genome annotation.


Background
Toxoplasma gondii is an obligate intracellular protozoan parasite that infects a wide range of animals, including humans. It is a member of the phylum Apicomplexa, which includes parasites of considerable clinical relevance, such as Plasmodium, the causative agent of malaria, as well as important veterinary parasites, such as Theileria, Eimeria, Neospora and Cryptosporidium, some of which like Toxoplasma are zoonotic. In common with the other Apicomplexa, T. gondii has a complex life-cycle with multiple life-stages. The asexual cycle can occur in almost any warm-blooded animal and is characterized by the establishment of a chronic infection in which fast dividing invasive tachyzoites differentiate into bradyzoites that persist within the host tissues. Ingestion of bradyzoites via consumption of raw infected meat is an important transmission route of Toxoplasma. By contrast, the sexual cycle, which results in the excretion of infectious oocysts in feces, takes place exclusively in felines.
The genome of Toxoplasma has been sequenced, with draft genomes of three strains of Toxoplasma (ME49, GT1, VEG) as well as chromosomes Ia and Ib of the RH strain available via ToxoDB [1]. ToxoDB is a functional genomic database for T. gondii that incorporates sequence and annotation data and is integrated with other genomic-scale data, including community annotation, expressed sequence tags (ESTs) and gene expression data. It is a component site of ApiDB, the Apicomplexan Bioinformatics Resource Center, which provides a common research platform to facilitate data access among this important group of organisms [2]. ToxoDB reflects pioneering efforts that have been made toward the annotation of the Toxoplasma genome. Nevertheless, although the assembly and annotation of the Toxoplasma genome is far in advance of most other eukaryotic pathogens, significant deficiencies still remain; in common with many other genome projects, annotation has thus far not taken into account information provided by global protein expression data and neither have these data been available to the user community in the context of other genome resources.
There is now an abundance of transcriptional expression data for Toxoplasma, including expression profiling of the three archetypal lineages of T. gondii. Transcriptional studies have also provided evidence for stage-specific expression via EST libraries, microarray analysis and SAGE (serial analysis of gene expression) [3][4][5][6]. Clusters of developmentally regulated genes, dispersed throughout the genome, have been identified that vary in both temporal and relative abundance, some of which may be key to the induction of differentiation [4,6]. Global mRNA analysis indicates that gene expression is highly dynamic and stage-specific rather than constitutive [6]. However, the study of individual proteins has also implicated the involvement of both post-transcriptional and translational control [7][8][9] and the potential regulation of ribosome expression has also been proposed [10]. Evidence may also point to possible epigenetic control of gene expression, following observations of a strong correlation between regions of histone modification and active promoters [11,12].
Until now the study of global gene expression in T. gondii and the use of expression data to inform gene annotation has been almost exclusively confined to transcriptional analyses. Whilst a relatively small number of proteins have been studied in considerable detail, published proteomic expression data are limited to small studies employing two-dimensional electrophoresis (2-DE) separation of tachyzoite proteins [13,14], or to specific analysis of Toxoplasma sub-proteomes that have been implicated in the invasion and establishment of the parasite within the host cell [15][16][17][18].
This paper reports the first multi-platform global proteome analysis of Toxoplasma tachyzoites resulting in the identification of nearly one-third of the entire predicted proteome of T. gondii and represents a significant advance in our understanding of protein expression in this important pathogen. We also describe the development of a proteomics platform within ToxoDB to act as a public repository for these, and other, proteomic datasets for T. gondii. Our data are now available as a public resource and add a vital hitherto missing dimension to the expression data within ToxoDB. Moreover, the addition of detailed protein expression information within an integrated genomic platform highlights the value of protein expression data not only in interpreting transcriptional data (both ESTs and microarray data), but also in providing valuable insights into the annotation of the genome of T. gondii.

Two-dimensional electrophoresis proteome map of T. gondii tachyzoites
Urea-soluble lysates from cultured T. gondii tachyzoites were resolved using broad (pH 3-10) and narrow (pH 4-7) range 2-DE gels (Figures 1 and 2; Additional data files 1 and 2). The protein identity of individual protein spots was obtained using electrospray mass spectrometry (Additional data files 3 and 4). In total, 1,217 individual protein spots were identified by 2-DE analysis, 783 detected by the pH 3-10 separation and 434 by the pH 4-7 separation. In many instances proteins from separate spots shared the same identity. Examples of clusters of proteins with the same identification are shown boxed in Figures 1 and 2, and these most likely represent isoenzymes, or proteins with post-translational modification. Many gel plugs contained more than one protein and this is represented by overlapping boxes in the figures. Accounting for redundancy between gels and assuming post-translational variants are the products of a single gene, these data represent the expression of 616 non-redundant Toxoplasma genes, of which 547 correspond to release4 gene annotation and 69 are described by alternative gene models or open reading frames (ORFs) that do not correspond to a release4 annotation (discussed further in the 'Genome annotation' section below). Forty release4 genes (which exhibited a range of masses, isoelectric points and functional annotations) were uniquely identified using 2-DE analysis; that is, they were not detected by either the gel liquid chromatography (LC)-linked tandem mass spectrometry (MS/MS) or multidimensional protein identification technology (MudPIT) approaches described in the following sections.
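The reduction from individual spot identifications to non-redundant genes described above can be sketched as a simple grouping step. This is a minimal illustration only, not the procedure used in the study: the function name `collapse_spots`, the tuple layout and the gene identifiers in the usage note are hypothetical.

```python
from collections import defaultdict

def collapse_spots(spot_ids):
    """Group spot-level identifications into non-redundant genes.

    spot_ids: iterable of (gel, spot_number, gene_id) tuples (hypothetical layout).
    Returns (spots_by_gene, multi_spot_genes, mixed_spots).
    """
    spots_by_gene = defaultdict(set)   # gene -> spots in which it was identified
    genes_by_spot = defaultdict(set)   # spot -> genes identified in that gel plug
    for gel, spot, gene in spot_ids:
        spots_by_gene[gene].add((gel, spot))
        genes_by_spot[(gel, spot)].add(gene)
    # Genes seen in several spots: candidate isoenzymes or modified forms.
    multi_spot_genes = {g for g, s in spots_by_gene.items() if len(s) > 1}
    # Spots (gel plugs) that contained more than one protein.
    mixed_spots = {s for s, g in genes_by_spot.items() if len(g) > 1}
    return dict(spots_by_gene), multi_spot_genes, mixed_spots

# Toy data with made-up identifiers.
toy = [
    ("pH3-10", 1, "geneA"),
    ("pH3-10", 2, "geneA"),   # same gene, second spot
    ("pH4-7", 1, "geneA"),    # same gene, other gel
    ("pH4-7", 2, "geneB"),
    ("pH4-7", 2, "geneC"),    # plug containing two proteins
]
by_gene, multi, mixed = collapse_spots(toy)
print(len(toy), "identifications ->", len(by_gene), "non-redundant genes")
```

In this toy input, five spot-level identifications collapse to three genes; the same bookkeeping underlies a reduction such as the 1,217 spots to 616 genes reported above.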

MudPIT analysis of T. gondii tachyzoites
Whole tachyzoite protein was partitioned into Tris-soluble and Tris-insoluble fractions, and each was processed for MudPIT analysis; this resulted in 1,300 and 2,328 protein identifications, respectively, and a total non-redundant dataset of 2,409 proteins, comprising 2,121 release4 genes and 288 alternative gene models (Additional data files 10 and 11). Of the release4 genes identified, 15.3% were identified uniquely in the Tris-soluble fraction and 48.0% were identified uniquely in the Tris-insoluble fraction.
When the results using all three proteomic platforms were combined, a total of 2,252 non-redundant release4 protein identifications were obtained from the tachyzoite stage of the parasite. This represents expression from approximately 29% of the total number of currently predicted release4 genes. Figure 4 illustrates the degree of overlap between the datasets derived using each of the three proteomic platforms. MudPIT generated the largest number of identifications; however, a number of proteins were uniquely identified using the gel-based approaches (59 for 1-DE; 40 for 2-DE). Other studies have also highlighted the benefits of a multi-platform proteomic approach and the advantages and disadvantages of each platform have been discussed extensively elsewhere [19]. Notably, the gel-based proteomic platforms detected, on average, more peptides per protein identification than MudPIT.
Overall across all platforms, only approximately 6% of the 2,252 proteins identified were based on single peptide evidence; this represents a relatively low proportion compared to other apicomplexan proteomic studies [19][20][21] and is probably accounted for partly by the extensive data from gel-based proteomics in addition to the MudPIT analysis. In addition to the release4 genes, 394 non-redundant alternative gene models and ORFs were also identified from the entire dataset. These data represent sets of peptides that map more comprehensively to alternative models and ORFs than the release4 gene models, and have considerable implications for genome annotation, as discussed below.
Figure 2. 2-DE proteome map (pH 4-7) of T. gondii tachyzoite proteins. Protein spots were visualized using colloidal Coomassie. Spots with the same protein identification are boxed (for detailed numbering, see Additional data file 2). Abbreviations (also refer to Figure 1): PSAT, phosphoserine amino transferase; IF4E, translation initiation factor 4E; BCDC E1, branched-chain alpha-keto acid dehydrogenase; SOD, superoxide dismutase; OGDC E2, dihydrolipoamide succinyltransferase component of 2-oxoglutarate dehydrogenase complex; EGF1b, elongation factor 1 beta; ubiquitin-E2, ubiquitin-conjugating enzyme E2; F-1,6 bisP aldolase, fructose 1,6 bis phosphate aldolase; PGK, phosphoglycerate kinase; F1,6 b Pase, fructose 1,6 bis phosphatase; U5 snRNP, U5 snRNP-specific 40 kDa protein (hPrp8-binding); Dihydrolipoyl DH, dihydrolipoyl dehydrogenase, third enzyme of PDC, OGDC, BCDC.
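The overlap between the three proteomic platforms (the kind of comparison summarized in Figure 4) amounts to partitioning the union of identifications into mutually exclusive regions. The sketch below shows one way to do this with plain sets; the function name `exclusive_regions` and the gene identifiers are hypothetical, and the toy numbers bear no relation to the real dataset.

```python
def exclusive_regions(platforms):
    """Partition identifications into mutually exclusive Venn regions.

    platforms: dict mapping platform name -> set of gene IDs.
    Returns a dict mapping frozenset(platform names) -> IDs detected by
    exactly that combination of platforms.
    """
    regions = {}
    all_ids = set().union(*platforms.values())
    for gene in all_ids:
        key = frozenset(name for name, ids in platforms.items() if gene in ids)
        regions.setdefault(key, set()).add(gene)
    return regions

# Toy identifications with made-up gene IDs.
toy_platforms = {
    "2-DE": {"a", "b", "c"},
    "1-DE": {"b", "c", "d"},
    "MudPIT": {"c", "d", "e", "f"},
}
regions = exclusive_regions(toy_platforms)
for combo, ids in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(combo), sorted(ids))
```

Regions keyed by a single platform name correspond to the "uniquely identified" counts quoted in the text (for example, the 59 proteins found only by 1-DE and the 40 found only by 2-DE).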

Functional analyses and key pathways of the tachyzoite proteome
Each individual protein detected by proteomics was submitted to the motif prediction algorithms SignalP [22] and TMHMM [23] and also to subcellular localization prediction programs, for example, PATS (apicoplast) [24], PlasMit (mitochondrion) [25], WoLF PSORT (general) [26] and Gene Ontology (GO) cellular component prediction downloaded from ToxoDB. Toxoplasma genome predictions suggest that 11% of proteins contain a signal peptide and 18% contain transmembrane domains (information available at ToxoDB). Virtually identical proportions were detected in this study in the expressed proteome of tachyzoites (10% and 18%, respectively). Analysis of the 394 alternative gene models and ORFs gave closely similar proportions (results not shown). This represents expression of more than one-quarter of the predicted numbers of membrane and secreted proteins within one life-cycle stage of the parasite. Assuming non-biased sampling, these results imply no enrichment for membrane proteins in tachyzoites. Similar proportions of signal peptide and transmembrane-containing proteins were observed in the expressed proteome of Plasmodium falciparum [20]. The Toxoplasma proteins showed a wide distribution of sub-cellular localizations, demonstrating broad sampling, with cytoplasmic, nuclear and mitochondrial locations well represented (Figure 5a; Additional data file 12). Many proteins were also potentially involved in secretory pathways and were assigned to the endoplasmic reticulum-Golgi, the plasma membrane and extracellular locations.
The functional analysis of the expressed proteome presented in Figure 5b (see also Additional data file 13) was constructed using the GO classifications listed on ToxoDB, which are largely based on bioinformatics interpretation. Each release4 gene was then assigned to a specific Munich Information Center for Protein Sequences (MIPS) category within the FunCatDB functional catalogue [27]. Some genes are without a GO classification and were assigned a putative MIPS category using additional information provided by Blast similarities, Pfam domain alignments [28], InterPro [29], orthologs, Toxoplasma paralogs, and from independent literature searches. Functional categories that are highly represented are metabolism, protein fate, protein synthesis, cellular transport, transcription and proteins with binding functions. A large proportion (36%) of the proteins have 'unknown function', indicating the difficulty of obtaining functional information using sequence similarity methods alone. Functional assignments were also constructed for hits to alternative gene models and ORFs, revealing similar relative proportions of functional categories, except for a larger proportion (70%) of proteins with unknown function, presumably due to the sequences being atypical, or incompletely predicted (Additional data file 14). The implications of the functional categories discovered are examined in the Discussion.
Tachyzoites are thought to rely upon both glycolysis and the tricarboxylic acid cycle, unlike the bradyzoites, which are thought to be largely dependent upon glycolysis [7]. Virtually every component of the glycolysis/gluconeogenesis pathway predicted for Toxoplasma was identified as being expressed in tachyzoites by proteomic analysis, as illustrated in Figure 6. Additionally, considerable coverage of the oxidative phosphorylation and tricarboxylic acid cycle pathways was also identified from the expressed proteome dataset (data not shown; see ToxoDB for further details). Several enzymes of the glycolytic pathway have been shown to be modulated during differentiation [6,7], with some showing stage-specific isoforms, such as enolase and lactate dehydrogenase [8]. The level of mRNA expression does not always mirror that of the expressed protein, indicating a degree of translational control or changes in mRNA stability [8]. However, it should be noted that detecting low levels of protein can be problematic. One example is glucose-6-phosphate isomerase (76.m00001). Western analysis detected expressed protein in bradyzoites but not tachyzoites despite the presence of abundant mRNA transcripts in both stages [30]. However, glucose-6-phosphate isomerase was successfully detected in tachyzoites in this whole cell proteome analysis (Additional data file 5, gel slices 40-42), again illustrating the sensitivity of our proteome approach.

Comparison with EST expression data
Figure 7a illustrates the degree of correlation between release4 genes for which EST expression data are available and genes for which the total proteome dataset identified in this study has provided evidence of expression. By including all the tachyzoite and bradyzoite cDNA evidence from RH, ME49, VEG, CAST, COUG and MAS strains (available at ToxoDB), most (91%) of the proteins found in this study were corroborated by EST data. Approximately half of these were confirmed in both bradyzoite and tachyzoite stages by EST analysis, suggesting that many of the proteins may have common, house-keeping functions. Although the EST coverage of the total number of release4 genes listed at ToxoDB is relatively high (68% for tachyzoite ESTs alone), for 266 release4 genes detected in this study using proteomics there was no corresponding tachyzoite EST evidence, apparently reflecting inadequacies in the coverage of the EST data. The distribution of cellular functions amongst these 266 expressed proteins is representative of the entire proteome dataset, indicating that the lack of EST evidence affects many different proteins and is not specific to a particular type or category of function (data not shown).
Conversely, comparison of RH strain-specific tachyzoite ESTs with the proteome dataset revealed that 57% of genes for which there was EST transcript evidence were not corroborated by the detection of expressed protein in this study. This is likely to be explained by a number of contributing factors, including the difficulty in detecting low copy number, transient and unstable proteins. It is also possible that a small number of non-coding ESTs are present in the database for which no protein product would be expected.

Comparison with microarray data
Microarray analysis of the RH strain of T. gondii has been performed previously (data available through ToxoDB; A Bahl and DS Roos unpublished). The analysis provides extensive coverage of the genome (99.5% of release4 genes were assayed), and the results have been cross-referenced with the proteins identified. As it is difficult to determine the correct signal:noise ratio above which mRNA levels can be considered to be indicative of a gene being switched on (all genes represented on the array exhibit some signal, yet not all are expressed), the microarray results were divided into quartiles of mRNA expression level for the purposes of this comparison. Those genes in the bottom 25% were described as zero detectable mRNA above baseline, and alternatively those in the bottom 50% were described as having zero or low detectable mRNA level. The Venn diagrams in Figure 7b illustrate the degree of overlap between release4 genes, for which ≥ 25th percentile and ≥ 50th percentile mRNA expression was detected by microarray analysis, and the genes identified by our proteomic study.
The results illustrate that some genes with zero or low mRNA can still be identified in a proteome study (204 proteins matching the < 25% group and 632 proteins matching the < 50% group). The detection of these proteins is intriguing and there may be several possible explanations. [...] been detected, and only 204 have been detected from the bottom quartile), indicating, as expected, that there is some correlation between mRNA abundance and protein abundance.
Figure 6. Metabolic pathway coverage: glycolysis/gluconeogenesis. Component enzymes of the glycolysis/gluconeogenesis pathways predicted to be present in Toxoplasma from genome analysis are colored. Virtually every component of the glycolysis/gluconeogenesis pathway predicted for Toxoplasma was identified as being expressed in tachyzoites by proteomic analysis. Green and blue indicate genes for which expression has been confirmed in tachyzoites in this study by mass spectrometric data; blue also signifies genes for which post-translational modification is likely as indicated by the evidence from two-dimensional gels. Red indicates genes for which expression of predicted components has not been confirmed in this study. Genome Biology 2008, 9:R116.
[Figure: number of genes also identified by proteomics, plotted against percentile of microarray expression.]
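The quartile-based comparison described in this section can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the analysis pipeline used in the study: genes are ranked by a single microarray signal value, split into four equal-sized bins by rank, and the number of proteome-detected genes is counted per bin. The function names, gene IDs and signal values are all hypothetical.

```python
def quartile_bins(signal):
    """Assign each gene a quartile index 0 (lowest signal) to 3 (highest).

    signal: dict mapping gene ID -> microarray signal value.
    Ties are broken by gene ID so the assignment is deterministic.
    """
    ranked = sorted(signal, key=lambda g: (signal[g], g))
    n = len(ranked)
    return {gene: (rank * 4) // n for rank, gene in enumerate(ranked)}

def detected_per_quartile(signal, detected):
    """Count proteome-detected genes in each mRNA expression quartile."""
    counts = [0, 0, 0, 0]
    for gene, quartile in quartile_bins(signal).items():
        if gene in detected:
            counts[quartile] += 1
    return counts

# Toy values only; real arrays cover thousands of genes.
toy_signal = {"g1": 5, "g2": 10, "g3": 50, "g4": 80,
              "g5": 200, "g6": 400, "g7": 900, "g8": 1500}
toy_detected = {"g2", "g5", "g7", "g8"}
print(detected_per_quartile(toy_signal, toy_detected))  # [1, 0, 1, 2]
```

In the toy data, one protein is detected even though its transcript falls in the lowest quartile, mirroring the 204 proteins reported for the bottom quartile in the text. Binning by rank sidesteps the choice of an absolute signal threshold, which is the difficulty the quartile approach is meant to address.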

Genome annotation and generation of a public proteome interface for Toxoplasma
The mass spectrometry data in this study were searched against a database containing the current set of predicted proteins from ToxoDB (referred to here as release4), predicted proteins derived from alternative gene models (GLEAN, TigrScan, TwinScan and Glimmer), ESTs and a translation of all six ORFs (see Materials and methods). As such, the proteome data can provide evidence that an alternative gene model is the correct prediction, or that a gene has not been predicted at all in the genome. For the majority of annotated genes, integration of the expressed peptide data has provided direct confirmation of the correct prediction of ORFs and positioning of exon-intron boundaries, including a large number of hitherto 'hypothetical proteins'. The further significance and importance of this corroboratory evidence become more apparent when considering the minority of cases where the peptide expression data are in conflict with the gene prediction algorithms. Approximately 15% of the complete proteome dataset consists of peptide hits to regions of the scaffold where there are discrepancies with the new gene annotation and peptides mapped more convincingly to alternative gene models or ORFs (that is, 394 protein coding sequences). Of the 394 alternative gene models and ORFs detected, most are described as 'hypothetical' with minimal information available and were detected using MudPIT analysis. These hits can be viewed at ToxoDB using the queries and tools option that guides the user to a main menu page from which gene expression confirmation via mass spectrometry can be accessed. The option of refining the search to a single or combination of proteomic approaches, and of searching either annotated genes or ORFs, is available. By adopting the GBrowse viewing option, the user can examine in detail individual ORFs and the integrated peptide sequence data.
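To illustrate why searching translated genome sequence can expose frame or strand errors in a gene model, the sketch below translates a DNA sequence in all six reading frames and reports the frames in which a peptide occurs. It assumes that "a translation of all six ORFs" refers to the six reading frames, and it uses the standard genetic code; it is not the search engine or database construction used in the study, and the function names and toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return {(strand, offset): protein} for the six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames

def frames_matching(dna, peptide):
    """List the reading frames whose translation contains the peptide."""
    return [frame for frame, protein in six_frames(dna).items()
            if peptide in protein]

toy_dna = "ATGGCCAAATGA"                     # made-up sequence: M A K stop
print(frames_matching(toy_dna, "MAK"))       # [('+', 0)]
print(frames_matching(reverse_complement(toy_dna), "MAK"))  # [('-', 0)]
```

A peptide that matches only a reverse-strand frame, while the annotated gene lies on the forward strand, is the kind of conflict between peptide evidence and a gene prediction discussed in this section.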
An example is illustrated in Figure 8 of a region of the scaffold where peptide evidence supports the presence of an expressed ORF but the new prediction algorithm has not assigned a gene in the corresponding region. Eleven peptides map to TgGlmHMM_3355 and TgTigrScan_5280 but the release4 annotation does not predict an exon in this region. Additional peptides in this region map to exons of the neighboring gene 46m.02877; however, these peptides could also be assigned to the coding sequence of TgGlmHMM_3355 and/or TgTigrScan_5280. In this case, the peptide evidence appears to indicate that gene 46m.02877 could have an incorrect start methionine and be missing an amino-terminal exon.
In other cases, peptide identifications are able to identify errors in the predicted reading frame or strand orientation as illustrated in Figure 9. Here 12 peptides derived from 35 individual spectra originating from both 1-DE and MudPIT approaches provided matching hits to TgGlmHMM_1717, TgTwinScan_4462 and TgGLEAN_7850, whereas the new gene prediction algorithm (assigned 50.m05694) is predicted to lie on the opposite strand and TgTigrScan_8273 uses a different reading frame. The various algorithms also differ in the predictions of the length and number of exons, although peptide evidence supports a single exon. In this example, the peptide expression data have provided supporting evidence for the correct reading frame and the large number of peptide hits to one region only indicates that the gene is likely to comprise a single exon.
Other discrepancies involving the positioning of the exon-intron boundaries exist and, in some cases, the alternative gene annotation models such as TgGlmHMM, TgTigrScan, TgTwinScan and TgGLEAN correlate more closely with the co-ordinates of the peptide data. In Figure 10, 12 peptides from MudPIT analysis map to a region of the scaffold (X: 3917326-3920484) that is annotated with gene 28.m00300, comprising two exons. Five of the twelve peptides match the second exon of gene 28.m00300. While it appears that peptides match the scaffold in the region of 28.m00300 exon 1, these peptides have been predicted from a different frame translation. Of further note is that one peptide maps to the predicted intron region of gene 28.m00300. Alternative gene models vary considerably in this region of the scaffold in both the number and positioning of the exons and all 12 peptides only appear in TgGlmHMM_2666, which does not have an intron at this location, providing evidence that this model is most likely to be correct.
An important use of peptide identification is to confirm that intron-exon (splice) boundaries have been correctly predicted; these are notoriously difficult to predict accurately in genome sequence using informatics approaches alone. If a peptide sequence spans an intron, matching regions from the splice donor and acceptor of two exons, this provides strong evidence that splicing has been correctly predicted for these exons. In total, our study identified 2,477 intron spanning peptides in the official release4 annotation, providing supporting evidence that these splice sites have been correctly predicted. In addition, peptides aligning across 421 splice boundaries predicted from alternative gene models only have been identified. This number is highly significant, as the identifications provide strong evidence that the alternative gene model is correct for this region, allowing the genome annotation to be improved. One example of a peptide spanning an intron is shown in Figure 8, where peptides have been identified that span an intron between exons predicted by TwinScan and Glimmer only.
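The notion of an intron-spanning peptide can be made concrete with a small coordinate check. The sketch below is a simplified illustration and not the mapping method used in the study: it assumes the predicted protein sequence and the coding length (in nucleotides) of each exon are known, converts the peptide's position in the protein to coding-sequence coordinates, and reports any exon-exon junction that falls strictly inside the peptide. Only the first occurrence of the peptide is considered, and the sequence and exon lengths are made up.

```python
def junction_positions(exon_cds_lengths):
    """Coding-sequence offsets (in nucleotides) where one exon ends and the next begins."""
    positions, total = [], 0
    for length in exon_cds_lengths[:-1]:
        total += length
        positions.append(total)
    return positions

def spanned_junctions(protein, peptide, exon_cds_lengths):
    """Return the exon-exon junctions that lie strictly inside the peptide.

    protein: predicted protein sequence of the gene model.
    peptide: identified peptide sequence.
    exon_cds_lengths: coding length of each exon in nucleotides, in order.
    """
    start = protein.find(peptide)          # first occurrence only
    if start < 0:
        return []
    nt_start = 3 * start
    nt_end = 3 * (start + len(peptide))
    return [j for j in junction_positions(exon_cds_lengths)
            if nt_start < j < nt_end]

# Hypothetical two-exon model: 16 residues = 48 coding nucleotides.
toy_protein = "MKTAYIAKQRQISFVK"
toy_exons = [20, 28]                       # junction after nucleotide 20
print(spanned_junctions(toy_protein, "YIAKQR", toy_exons))  # [20]
print(spanned_junctions(toy_protein, "QISFVK", toy_exons))  # []
```

A peptide with a non-empty result could only have been observed if the two exons were joined as predicted, which is why such peptides count as supporting evidence for a splice boundary.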

Discussion
Draft genomes now exist for the majority of clinically important protozoa, including most Apicomplexa. Providing an accurate interpretation of gene annotation and expression from these genomes is essential to understanding the biology of host-pathogen interactions and in gaining a better understanding of the relationship between gene transcription and protein expression. Of particular importance is an appreciation of the limitations that transcriptional data alone place on our interpretation of how pathogens respond as they develop through different life-stages, or during key processes such as invasion and establishment within their hosts. Such an observation has potentially huge implications for expression profiling and for the reliance on microarray data to describe changes in gene expression. In this paper we describe how global proteomic data for T. gondii provides [...]
Proteomic data enable us to understand what is actually expressed, as opposed to what might be, or has the potential to be, expressed in an organism. In general, the functional characterization and protein localization profile detected in T. gondii in this study fits well with that of the rapidly dividing and invasive tachyzoites, which would be expected to be highly metabolically active, with gene expression, protein synthesis, remodeling and degradation all necessary processes involved in active parasite cell division and required for successful host cell invasion. A similar profile was recently obtained for the expressed proteome of the invasive form of Cryptosporidium [19]. Penetration and maintenance within the host cell would require expression of many apical organelle proteins involved in invasion (category: cell rescue, defense and virulence), as has been observed for the invasive stages of Plasmodium and Cryptosporidium [19,20,31]. In agreement, 44 proteins were assigned to an apical organelle location in Figure 5a.
Recent work has also shown the recruitment of host endoplasmic reticulum, mitochondria and networks of intimately proximal microtubules facilitating active transport of host nutrients to the parasite [32][33][34][35]. Notably, proteins involved in cellular transport are well represented, with more than 200 expressed in this life cycle stage. A significant proportion of proteins falls into the broad category 'proteins with binding functions', including proteins involved in the cytoskeleton that are also required for motility, an important function during invasion. Many proteins were also detected that would be expected to be expressed at low or temporal levels within the cell, such as those involved in cell cycle control (641.m01576, 38.m00005) or signal transduction (65.m01199, 59.m06067, 55.m04992, 49.m05708, 50.m05649). This suggests that the sensitivity of our proteomic analyses was high.
Perhaps most notable was the large number of proteins (36%) for which no information is available; these proteins are listed as unclassified. A similarly large proportion (39%) of proteins with unknown function were detected in just one life cycle stage (the sporozoites) of Cryptosporidium by proteomic analysis [19] and in the proteome of four life cycle stages of P. falciparum (that is, 51%) [20]. More than half the predicted genes of Toxoplasma are annotated as 'hypothetical' in the genome. In this analysis, around 800 genes annotated as 'hypothetical protein' were identified, allowing these annotations to be updated to 'confirmed protein'. Functional analysis was also carried out on the 394 alternative gene models and ORFs and revealed a far greater proportion of proteins for which a functional assignment could not be determined (70% compared to 36%). This result reflects the limited annotation available for alternative gene models and ORFs, partially due to the short length of many of these sequences and difficulties obtaining functional information by sequence similarity search if the predicted ORF or alternative gene models do not closely resemble the correct gene sequence.
[Figure: Peptide evidence indicating alternative frame shift.]
Toxoplasma has a complex life cycle comprising four additional life cycle stages not studied here: the infective sporozoite, two sexual stages and the encysted bradyzoite. Many house-keeping proteins will be common to all stages, although the proportion of shared proteins is not currently known. In this analysis, approximately one-third of the predicted number of release4 genes were detected in the proteome of the tachyzoite, although it is important to remember that these predicted genes will include stage-specific genes not expressed in the tachyzoite stage, so the actual proportion of proteins detected compared to those expected is likely to be considerably higher, although how much higher is impossible to determine at this stage. Whole cell proteome analysis of the related apicomplexan parasite, Cryptosporidium parvum, indicated expression of a similar proportion of the genome from the infective sporozoite stage [19], and this parasite also exhibits multiple life cycle stages. Whether the protein set detected is close to the complete proteome of the life cycle stage or limited by the detection levels of the mass spectrometric techniques is not yet clear. Previous microarray analysis of sporozoites, gametocytes and blood stage life cycle stages of Plasmodium indicated 35% of genes were shared [36] whereas this figure decreased to 6% at the proteome level [20,37]. It is likely that some of this discrepancy results from technical limitations associated with detecting low abundance proteins, although it is possible that post-transcriptional regulation also plays a role. In Toxoplasma, analysis of 568 EST assemblies from three life cycle stages, tachyzoites, bradyzoites and oocysts, indicated 16% of genes are stage-specific and, hence, that a large proportion of the genes is shared [5]. A similar figure of 18% was obtained via SAGE analysis [6].
Figure 10. Peptide evidence indicating alternative exon positioning and sequence annotation. The position of ORF X-1-3917326-3920484 in the genome scaffold is indicated by a red line on the grey track at the top of the figure and this region is expanded below, the red triangle demarking the ORF length. Predicted exons are indicated as blue boxes, linked by zigzag lines to indicate the position of exon/intron boundaries. Gene 28.m00300 is shown with two exons. ESTs are shown as dark blue or brown boxes. Peptides aligning with this region are shown in yellow. The predicted sequence for ORF X-1-3917326-3920484 is shown as an insert and sequence that matches exon 2 of gene 28.m00300 is shown in blue. Sequence for which there is matching peptide evidence is shown in red. Purple lettering indicates the positioning of the 'intron-located' peptide, mass spectrometric evidence for which is shown in the right hand insert.
The comparison of the detected proteome with microarray results also reveals some interesting discrepancies. Of the least abundant 25% mRNA values, which would usually be described as no measurable mRNA signal above baseline, 204 proteins are detected. In contrast, of the genes with most abundant mRNA (top 25%, approximately 1,900 genes), only half of these are detected by proteome analysis. The most abundant proteins are likely to have been sampled preferentially in this analysis, and as such, we can hypothesize that many of the genes expressing high mRNA levels do not exhibit similarly high abundances of protein product. Without an in-depth absolute quantitative study of the complete Toxoplasma proteome, which is highly challenging with current technology, these results should not be over-interpreted. However, it appears that there is a considerable degree of control that regulates the level of protein abundance, independent of the rate of transcription in tachyzoites.
Our proteome data have been integrated and aligned with the genome sequence at ToxoDB. The interface provided enables visual inspection of peptides matched to the most current (in this case 'release4') gene models, as well as to alternative gene models and ORFs. The facility to visualize and query peptide data, in tandem with EST and microarray data, allows users of ToxoDB to place confidence in particular gene assignments and to explore those genes that are expressed in tachyzoites. As demonstrated above, the proteome data will enable continued improvement in gene models through the confirmation of the correct reading frame and intron-exon boundaries. More fundamentally, the proteome analysis raises several issues in relation to the correct determination of gene models. Many gene prediction algorithms work on the basis of sequence similarity to cDNA or protein sequence databases, EST sequences or other genome sequences (where conserved regions are more likely to correspond to genes). As such, gene finders are relatively successful at identifying 'typical' genes that are similar to gene structures previously observed in other organisms. However, where genes are atypical in structure, or have no EST data, gene finding algorithms may miss such sequences altogether. Large-scale proteome scans are able to contribute significantly in this area, by demonstrating peptide hits to regions of the genome where genes have only been weakly predicted or missed completely. Others have recently also recognized the value of so-called 'proteogenomic annotation' of genomes [38][39][40][41][42]. As more proteome data are produced, and querying algorithms improve, it is likely that the majority of protein-coding genes expressed in Toxoplasma will be confirmed by mass spectrometry based evidence.

Conclusion
This study represents an unprecedented integration of proteomic and genomic data for Toxoplasma, which we suggest might serve as a model well beyond this present field. As well as providing novel information on the functional aspects of the proteome, our data demonstrate how proteomics can inform gene predictions and help discover new genes. Moreover, the data reveal some surprising, but potentially highly significant, discrepancies between protein expression and transcript expression data as assessed by both EST analysis and microarrays. We believe that this has important implications for how we interpret transcriptional expression data in the Apicomplexa, such as that derived from microarray experiments, and points to the fact that determining both absolute protein expression and post-translational events will be a key factor in gaining a more complete understanding of the biology of these pathogenic organisms.

Parasite culture
Tachyzoites of T. gondii strain RH were maintained in confluent layers of Vero cells (ECACC, Salisbury, UK). T. gondii tachyzoites were harvested 3 or 4 days post-infection as previously described [13].

Manual validation of Mascot results
Additional manual validation of the proteins identified by Mascot was carried out on the 1-DE and 2-DE results. Protein identifications that were based on a single peptide, and proteins that returned a Mascot score < 60, were accepted if either of two rules was met: a matching peptide possessed an individual ion score above the significance threshold for identity or extensive homology (typically > 44); or, upon manual inspection of individual peptide MS/MS spectra, at least 60% of the candidate y-ions were at a minimum signal to noise ratio of 10%. Identifications that failed to pass either rule were regarded as false positives, which can result from an accumulation of several peptides with low ion scores.
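The acceptance rules above can be expressed as a short sketch. It assumes that the fraction of candidate y-ions passing the signal to noise criterion has already been recorded from manual inspection of each spectrum. The class and function names are hypothetical, and the default thresholds are the typical values quoted in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Peptide:
    ion_score: float       # individual Mascot ion score
    y_ion_fraction: float  # fraction (0 to 1) of candidate y-ions passing
                           # the signal to noise criterion, judged manually


def needs_manual_validation(protein_score: float,
                            peptides: List[Peptide],
                            score_cutoff: float = 60.0) -> bool:
    """Single-peptide hits and hits scoring below the cut-off are flagged."""
    return len(peptides) == 1 or protein_score < score_cutoff


def accept_identification(protein_score: float,
                          peptides: List[Peptide],
                          identity_threshold: float = 44.0,
                          min_y_fraction: float = 0.60) -> bool:
    """Apply the two manual validation rules to a flagged identification."""
    if not needs_manual_validation(protein_score, peptides):
        return True
    return any(p.ion_score > identity_threshold
               or p.y_ion_fraction >= min_y_fraction
               for p in peptides)
```

For example, a single-peptide hit with an ion score of 50 would be accepted under the first rule, whereas a protein scoring 35 whose peptides all fall below both thresholds would be rejected.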

Sample preparation for MudPIT
A pellet of 10^9 tachyzoites, resuspended to approximately 800 μg/ml in 500 μl of 100 mM Tris buffer pH 8.5, was lysed by three cycles of freeze/thaw, and the Tris-soluble and insoluble protein fractions were separated at 16,000 g for 30 minutes. Digestion of soluble fractions: the MS-compatible detergent Invitrosol was added to 1% (v/v), and the solution was heated to 60°C for 5 minutes, vortexed for 2 minutes, denatured with 2 M urea, reduced with 5 mM Tris(2-carboxyethyl)phosphine hydrochloride (TCEP) and carboxyamidomethylated with 10 mM iodoacetamide. This was followed by addition of 1 mM CaCl2 and trypsin at a ratio of 1:100 (enzyme:protein) and incubation at 37°C overnight. Digestion of insoluble fractions: 10% (v/v) Invitrosol was added to the pellet, which was heated to 60°C for 5 minutes, vortexed for 2 minutes and sonicated for 1 h. The sample was diluted to 1% (v/v) Invitrosol with 8 M urea/100 mM Tris/HCl pH 8.5, reduced and carboxyamidomethylated as before, and digested with endoproteinase Lys-C for 6 h. The solution was diluted to 4 M urea with 100 mM Tris/HCl pH 8.5 and digested with trypsin as described above.

Mass spectrometric analysis by MudPIT
Five soluble replicates and four insoluble samples were each subjected to MudPIT analysis with modifications to the method of Link et al. [44], using a quaternary Agilent 1100 series HPLC coupled to a Finnigan LTQ ion trap mass spectrometer (Thermo, San Jose, CA, USA) with a nano-LC electrospray ionization source [45]. Peptide mixtures were resolved by strong cation exchange LC upstream of reverse phase LC as described [46]. Each sample (approximately 100 μg) was loaded onto a separate microcolumn and resolved by fully automated 12-step chromatography. Protein databases: a Toxoplasma database was assembled (see above). To identify contaminant host proteins, the parasite database was supplemented with a contaminant database (the complete prokaryote and mammalian databases from NCBI). To estimate the number of false positives, a reverse database was added [47]. Poor quality spectra were removed from the dataset using an automated spectral quality assessment algorithm [48]. Tandem mass spectra remaining after filtering were searched with the SEQUEST algorithm version 27 [49]. All searches were run in parallel on a Beowulf computer cluster consisting of 100 1.2 GHz Athlon CPUs [50].
No enzyme specificity was considered for any search. SEQUEST results were assembled and filtered using the DTASelect (version 2.0) program [51], which uses a quadratic discriminant analysis to dynamically set XCorr and DeltaCN thresholds for the entire dataset to achieve a user-specified false positive rate (< 5% of peptides false positive in this analysis). The false positive rates are estimated by the program from the number and quality of spectral matches to the decoy database.
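The principle of estimating false positives from decoy matches can be sketched as follows. This is a simplified illustration and not the DTASelect procedure: it uses a single score per spectral match rather than a discriminant analysis of XCorr and DeltaCN, and it assumes one common convention in which the false positive rate is estimated as the number of decoy matches divided by the number of target matches above the threshold. All names are hypothetical.

```python
from typing import List, Optional, Tuple

# Each spectral match is (score, is_decoy); the single score is a stand-in
# for the combined XCorr/DeltaCN criteria used by the real software.
Match = Tuple[float, bool]


def estimated_fp_rate(matches: List[Match], threshold: float) -> float:
    """Estimate the false positive rate as decoy hits / target hits."""
    targets = sum(1 for score, decoy in matches
                  if score >= threshold and not decoy)
    decoys = sum(1 for score, decoy in matches
                 if score >= threshold and decoy)
    if targets == 0:
        return 0.0 if decoys == 0 else float("inf")
    return decoys / targets


def most_permissive_threshold(matches: List[Match],
                              max_rate: float = 0.05) -> Optional[float]:
    """Return the lowest observed score whose estimated rate is below max_rate."""
    for threshold in sorted({score for score, _ in matches}):
        if estimated_fp_rate(matches, threshold) < max_rate:
            return threshold
    return None


# Toy matches for illustration only; not data from this study.
toy_matches = [(5.0, False), (4.5, False), (4.0, False), (3.5, False),
               (3.0, True), (2.5, False), (2.0, True), (1.5, False)]
toy_threshold = most_permissive_threshold(toy_matches)
```

With these toy matches, every threshold below 3.5 admits at least one decoy hit and exceeds the 5% limit, so 3.5 is the most permissive threshold that passes.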

Bioinformatics prediction
The following prediction programs were used: SignalP, to predict proteins that contain signal peptides, and TMHMM, to predict transmembrane domains. Results returned from PATS, PlasMit and WoLF PSORT were combined with the release4 gene descriptions and the GO cellular component predictions provided by ToxoDB to obtain subcellular localization predictions for the proteins.

Mapping of proteome data to the genome scaffold
Peptides that hit release4 gene annotation could be directly mounted upon the ToxoDB genome scaffold. Where the database search identified preferentially an alternative gene model or an ORF, the sequences were mapped onto the genome using the following algorithm: rule 1, if all the peptides from the alternative models could be mapped to a release4 gene, the release4 annotation is adopted and this is termed a 100% match; rule 2, if more than 50% of the peptides from an alternative model can be mapped to an official release4 gene, this is considered a valid mapping and the matching peptides are aligned with the corresponding release4 gene; rule 3, if a certain set of peptides from an alternative model can be mapped to more than one release4 gene, the gene that can host most peptides will be reported; rule 4, alternative models not conforming to rule 2 will then be mapped to ORFs; rule 5, an alternative model will be mapped to an ORF only if 100% of the peptides can be mapped to that ORF. If 100% of the peptides from the alternative model cannot be mapped to a single release4 gene (rule 1) or to a single ORF (rule 5), the peptides are also mapped to the alternative gene model (for example, TgTwinscan, TgGLEAN, and so on), which can be viewed in GBrowse by selecting the relevant option. This enables ToxoDB users to directly visualize proteomics evidence for alternative gene annotation. All raw data associated with this manuscript may now be downloaded from the Tranche Project [52], using the following hash: Ulv/yTYTaaHin5Tv4InpsgoUY1uTJQtdoLRi9HbdtypXqztv+BiVE/wZieBkqu6d3kU20Vyejo0HYCfswgwiGyPHQPAAAAAAAAOhng==
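The five mapping rules can be read as a small decision procedure, sketched below. It assumes that the peptides matching each release4 gene and each ORF have already been collected into sets. The function, type and identifier names are hypothetical, and breaking ties between equally good genes by identifier order is an assumption that the rules do not specify.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple, Optional, Set


class Mapping(NamedTuple):
    target_kind: str                 # "release4", "orf" or "alternative"
    target_id: Optional[str]         # gene or ORF identifier, if any
    peptides: FrozenSet[str]         # peptides aligned to the target
    keep_on_alternative_model: bool  # also display on the alternative model


def map_alternative_model(peptides: Iterable[str],
                          gene_hits: Dict[str, Set[str]],
                          orf_hits: Dict[str, Set[str]]) -> Mapping:
    """Map the peptides of one alternative gene model following rules 1 to 5."""
    model = frozenset(peptides)
    if not model:
        raise ValueError("an alternative model needs at least one peptide")

    # Rule 3: among release4 genes, report the one hosting most peptides.
    best_gene, best = None, frozenset()
    for gene_id in sorted(gene_hits):
        matched = model & gene_hits[gene_id]
        if len(matched) > len(best):
            best_gene, best = gene_id, matched

    # Rule 1: every peptide maps to one release4 gene (a 100% match).
    if best == model:
        return Mapping("release4", best_gene, best, False)
    # Rule 2: more than 50% of the peptides map to one release4 gene.
    if len(best) > len(model) / 2:
        return Mapping("release4", best_gene, best, True)

    # Rules 4 and 5: otherwise accept an ORF only if it hosts all peptides.
    for orf_id in sorted(orf_hits):
        if model <= orf_hits[orf_id]:
            return Mapping("orf", orf_id, model, False)

    # No complete match: peptides stay on the alternative gene model.
    return Mapping("alternative", None, model, True)
```

The final flag records the cases in which fewer than 100% of the peptides fit a single release4 gene or ORF, so that the peptides also remain visible on the alternative gene model.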