Lineage-specific expansion of proteins exported to erythrocytes in malaria parasites

A new software was used to predict exported proteins that are conserved between malaria parasites infecting rodents and those infecting humans, revealing a lineage-specific expansion of exported proteins.

Plasmodium are obligate intracellular parasites, switching between an arthropod vector and their respective vertebrate host, where they undergo cycles of asexual reproduction in erythrocytes. The infected erythrocytes are subject to an extensive remodeling process induced by the parasite, which facilitates surface exposition of various ligands for host cell receptors, nutrient import into the parasite and asexual reproduction within the host cell. Host cell remodeling includes the development of electron dense protrusions on the infected red blood cell surface called knobs. Knob-associated histidine-rich protein (KAHRP) is a structural knob component that anchors the major virulence factor Plasmodium falciparum erythrocyte surface protein 1 (PfEMP1) on the knob surface [1]. PfEMP1 is encoded by the epigenetically regulated var multigene family, and is implicated in cytoadherence of infected red blood cells to various host cells; a causative factor in the severe pathology of the disease [2][3][4]. Recently, a second gene family encoding surface antigens has been described, the repetitive interspersed family (Rif) and it is believed that Rifins are also subject to antigenic variation [5].
Once inside the infected erythrocyte the parasite resides in a parasitophorous vacuole, which acts as a biochemical barrier between parasite and host through which parasite proteins must be translocated to reach the parasite-infected erythrocyte cytosol and the host cell membrane. It has recently been shown that transport of parasite proteins via the parasitophorous vacuole and into the host cell depends on a short aminoterminal sequence, R/KxLxE/Q [6,7], which we have termed PEXEL (for Plasmodium export element).
This sequence is functionally conserved across the genus Plasmodium, indicating the presence of a conserved export mechanism across the parasitophorous vacuole membrane in malaria parasites. The PEXEL sequence has allowed the prediction of proteins exported into the host erythrocyte, which are likely to be important to both erythrocyte remodeling and virulence. The availability of genome sequences from many different species of the genus Plasmodium now provides an opportunity for the genus-wide discovery of exported proteins and for the identification of specific protein domains representing conserved functions in these different organisms.
Here we have developed and applied a method to systematically identify exported proteins in the genus Plasmodium and to allow characterisation of the 'exportome' in the three most characterised Plasmodium lineages: P. falciparum/P. reichenowi (the 'P. falciparum lineage') and P. vivax/P. knowlesi (the 'P. vivax lineage'), encompassing parasites that infect primates, and P. berghei/P. yoelii/P. chabaudi (the 'P. berghei lineage') with parasites infecting rodents. We identified a core set of exported proteins conserved across the genus Plasmodium that are predicted to play key functions in the host cell remodeling process. Additionally, we describe a set of novel gene families encoding exported proteins likely to be important in the differential properties of the genus Plasmodium in their respective host cells.

Results
ExportPred: algorithmic prediction of the P. falciparum exportome Previous strategies [6,7] to determine the complement of Plasmodium proteins exported to the parasite-infected erythrocyte by predicting the presence of a signal sequence and a functional PEXEL element have seriously underestimated the full complement of exported proteins. A significant number of secreted P. falciparum proteins have a hydrophilic spacer of up to 50 amino acids preceding the hydrophobic signal sequence, referred to as a recessed signal sequence. Functional P. falciparum signal sequences, especially those that are recessed, can be mispredicted by SignalP [8], resulting in a large deficit in the number of exported proteins [7]. Other methods to determine the full exportome have limitations and do not provide a statistic that can be used to gauge the likelihood of export. To identify the exportome of P. falciparum, and other species of the genus Plasmodium, we constructed an algorithm for export prediction. This algorithm, named ExportPred, uses a generalised hidden Markov model (GHMM) [9] to model simultaneously the signal sequence and PEXEL motif features required for protein export.
Figures 1b and 2a demonstrate that ExportPred is able to distinguish exported proteins from those that are not exported. To test both the effect of our simplified signal sequence model and PEXEL motif, we substituted the signal sequence portion of the ExportPred GHMM with the HMM used in SignalP and the motif portion with the weight matrix [7]. Combinations of these substitutions gave rise to three new versions of Export-Pred. Table 1 lists the discriminatory power of these various model configurations and positive and negative sets as measured by area under the respective Receiver Operating Characteristic (ROC) curve. Variants of ExportPred tend to perform less well than the standard ExportPred model, even after augmenting the SignalP model to allow for recessed signal sequences. The inclusion of the alternative weight matrix does not improve discrimination in any of the cases examined and, in fact, appears to result in a decrease in accuracy in many cases.

Validation of ExportPred
To provide in vivo support for the ExportPred predictions, we generated a series of green fluorescent protein (GFP) fusions to unknown proteins conserved in Plasmodium that were ranked highly in the ExportPred output. Proteins were chosen to test various properties of exported proteins, including number of exons in the encoding gene, motif composition and presence of multiple transmembrane domains. As in our initial study [6], we fused the native amino terminus including the predicted PEXEL plus 11 amino acids downstream of it to GFP, since it has been shown that a spacer between the motif and a reporter is needed for correct export [10]. Figure 2b shows the seven GFP chimeras created in this study in the context of nine known exported proteins and the positive and negative ExportPred predictions from the P. falciparum proteome. Each protein sequence is represented by a point in two-dimensional space determined by the contributions to the ExportPred score of the predicted signal sequence and the PEXEL.
PF14_0607 is predicted to be a multispanning membrane protein encoded by a 14-exon gene, both features suggesting it was unlikely that the protein was exported (Figure 3). The protein has a negative ExportPred score because of a suboptimal signal sequence prediction and an unusual amino acid (phenylalanine) in position 4 of the motif. The fusion protein accumulated in the parasitophorous vacuole rather than being exported, demonstrating that the amino terminus could not mediate export ( Figure 2c). Next, we tested two proteins encoded by single exon genes located in tandem in the central region of chromosome 5. PFE0355c encodes the putative serine protease PfSubtilisin 3, the least characterised member of the Plasmodium subtilisin protease family. Both PfSubtilisin 1 and 2 have been described as merozoite proteins and, at least for PfSubtilisin 1, there is accumulating evidence for localisation of the mature protein in the dense granules [11]. PFE0355c has an unusually long spacer between the signal sequence and the predicted PEXEL motif, which resulted in a negative ExportPred prediction; in agreement with this the fusion protein accumulated in the parasi-tophorous vacuole (Figure 2c). The single exon gene PFE0360c encodes a protein of unknown function and had a positive ExportPred score (3.49 in PFE0360c) for all Plasmodium species where the ortholog was found. However, the motif has an unusual amino acid, glutamic acid, in position 4 and the fusion protein accumulates in the parasitophorous vacuole rather than being exported ( Figure 2c). PF10_0321 is also a single exon gene encoding a protein of unknown function. Although the export motif was close to the consensus, the short hydrophobic amino terminus was not predicted to be a signal sequence. The fusion protein localised to the mitochondrion ( Figure 2c) and, indeed, the amino terminus is predicted to be a mitochondrial transit peptide (91% predicted with PlasMit [12]). We also tested a number of positively predicted export motifs. PFE0055c is a four-exon gene encoding a putative type I DnaJ protein (that is, containing all three DnaJ domains, see below). It had a high PEXEL score and the fusion protein was exported into the parasite-infected erythrocyte. PFI1780w has a two-exon structure and encodes a protein of unknown function, which may have multiple transmembrane domains. Importantly, it contains one of the few predicted PEXEL motifs with a lysine rather than an arginine in position 1 (except for the PfEMP1-type motif, where it is the rule). The fusion protein was clearly exported and distributed evenly in the host cell cytoplasm. Finally, we made a GFP fusion to PFI1755c, one of the most highly expressed asexual stage proteins [13,14]. It is encoded by a two-exon gene located adjacent to PFI1780w on chromosome 9; the encoded protein has a high ExportPred score and, as expected, the GFP chimera was efficiently exported to the   infected-erythrocyte cytoplasm ( Figure 2c). As expected, the presence of a signal sequence in the absence of a predicted PEXEL resulted in accumulation of the reporter protein in the parasitophorous vacuole. Taken together, these data show that ExportPred can accurately predict functional PEXEL motifs.

The P. falciparum exportome
Using as input all P. falciparum annotations and automatic gene prediction, ExportPred predicted 797 sequences as being exported, many of which represent overlapping gene predictions and annotations. To address the issue of misannotation of genes, we selected the highest scoring model in each overlapping group and some were inspected manually. After curation, 59 predictions with a PfEMP1-type motif (the whole PfEMP1 set encoded in 3D7 except var2CSA) and 396 predictions with a generic motif with score ≥ 4.3 remained (see Additional data file 2 for a detailed list). The structures of the 396 predicted genes show a strong tendency towards two exons ( Figure 3a) and, in 93% of cases, the first intron occurs in phase 0 ( Figure 3b). Inspecting the GHMM state in which the first intron occurs indicates that in 90% of cases the first intron occurs in the spacer between the signal sequence and the PEXEL motif, or, less commonly, late in the hydrophobic stretch (>75% of signal sequence in the first exon), confirming that the majority of PEXEL containing genes have a similar structure, with the signal sequence in the first exon divided from the export motif by an intron in phase 0 ( Figure  3c). Many proteins in the exported proteome of P. falciparum have one or two predicted transmembrane domains ( Figure  3d). Only four sequences were predicted to possess more than three transmembrane regions.
To cluster the 396 predicted genes into putative families, we performed an all by all comparison to generate pairs of reciprocal BLAST hits (see Material and methods). This approach yielded 26 families shown in Table 2: 16 families encode hypothetical proteins containing novel domains, while others have been previously described, such as the Rifin [5,15] and Stevor [16] families, a family of Maurer's clefts localised proteins termed PfMC-2TM (Maurer's clefts two transmembrane protein family [17]) and a family of putative protein kinases (denoted FIKK kinases) [18,19]. Two of the novel families encode DnaJ domains and another two a/b hydrolase domains. In total, at least 287 of 396 exported proteins are members of families -approximately 75% of the exportome.

A core set of proteins are conserved in the Plasmodium exportome
One of the major goals of this study was to determine whether a subset of exported proteins conserved across the genus Plasmodium exists. Since PEXEL-mediated protein export appears to be functionally conserved across Plasmodia [6,7], it could be expected that the motif involved does not differ significantly across species. We rationalised, therefore, that ExportPred could be applicable for prediction of exported proteins in the genus Plasmodium. To test whether the PEXEL export mechanism is also conserved across the phylum Apicomplexa, we used ExportPred to make predictions on the two other completely sequenced and annotated apicomplexan species, Cryptosporidium hominis [20] and Theileria parva, and also on a preliminary sequence of Toxoplasma gondii. Examination of the small number of positive predictions (Cryptosporidium, 20; Theileria, 9 (Additional data file 4); Toxoplasma, 36 (data not shown)) indicated that in each species only a few proteins were neither conserved across eukaryotes or were orthologous to a Plasmodium protein lacking an export motif. In addition, none of the predicted sequences from Cryptosporidium, Theileria or Toxoplasma form paralogous clusters, as could be expected for proteins exposed to the host immune system. We concluded, therefore, that PEXEL-mediated export into the host cell is most likely specific to the genus Plasmodium.
We investigated the potential presence of a 'core set' by performing a reciprocal BLAST search for ortholog clusters of the Plasmodium and Cryptosporidium sequence sets. Out of 6,396 ortholog clusters, 277 had at least one ortholog with a predicted PEXEL score of ≥ 4.3. We further reduced this number by requiring that all members of the cluster had either a positive ExportPred score or a correctly aligned PEXEL motif but lacked a positive prediction due to a missing signal sequence (in case the first exon of the associated gene model was misannotated), and by ensuring that the motif was not contained in a functional domain. This resulted in 36 ortholog clusters conserved between at least two studied species in the genus Plasmodium ( Table 3). None of these clusters had an ortholog in Cryptosporidium hominis, and we could also not find any in the other apicomplexan genomes of Toxoplasma gondii and Theileria parva. The P. falciparum 'core' complement follows the expression pattern of exported proteins as described previously [6], with a peak in late schizonts, merozoite and ring stages consistent with a role in ExportPred: Architecture and performance Figure 1 (see previous page) ExportPred: Architecture and performance. (a) The architecture of the ExportPred GHMM. The GHMM progresses from left to right, beginning with an amino-terminal methionine and terminating at a stop codon. Length probability densities are shown for non-geometric states. Tail states and the KLD spacer state are modelled by geometric distributions. (b) ROC curves for the ExportPred model comparing the training against the five described negative sets. (c) False discovery rate as a function of score threshold, calculated using the training set and the P.f negative set, and assuming 10% of the P. falciparum proteome is exported. (d) Comparison of predictions made by ExportPred using the default threshold of 4.3 with those published in [6,7]. The -rifin set is exclusive of any sequence annotated as rifin or stevor, whereas the +rifin set includes these sequences. erythrocyte remodeling. While all 36 genes share an ortholog between P. falciparum and P. vivax, only 10 are also present in the genome of malaria parasites of the P. berghei lineage. Twenty-two belong to novel gene families identified in the course of this study: sixteen genes belong to the PHISTc subfamily, four belong to HYP11 and two to HYP16. In addition, one conserved gene, PFE0055c, encodes a DnaJ protein, and PFB0915w encodes a previously described liver stage anti-gen, LSA-3 [21]. The genes are clustered in the subtelomeric regions of P. falciparum chromosomes 1, 2, 3, 9, 10 and 11, respectively ( Figure 4a). An alignment of the subtelomeric regions on chromosome 2 with P. vivax contigs demonstrates that synteny breaks down around PFB0100c (Figure 4b), which encodes KAHRP. KAHRP is the major structural knob component and chromosome breaks in the KAHRP locus occur frequently in P. falciparum and result in reduced [61] [61] cytoadherence [22][23][24]. On the other subtelomeric end of chromosome 2, synteny between P. falciparum and P. vivax and P. yoelii breaks down just upstream of a gene encoding an exported DnaJ protein (PFB0920w). We also investigated P. falciparum chromosome 10, since it contains 7 conserved genes encoding putatively exported proteins. Interestingly, the conserved subtelomeric cluster (except the most telomeric PHISTc gene PF10_0021) is syntenic with a large P. vivax contig that otherwise maps to the subtelomeric region of P. falciparum chromosome 3. This apparent chromosomal rearrangement event inserted approximately 50 genes between PF10_0021 and PF10_0163 (another PHISTc) and, therefore, moved this part of the conserved cluster towards the centromere in P. falciparum.
In addition to orthologous clustering, we examined exported proteins in other Plasmodium species where gene predictions were available. The close evolutionary relationship between the three studied species of the P. berghei lineage (P. yoelli,P. berghei and P. chabaudi) and between the two species of the P. vivax lineage (P. vivax and P. knowlesi) motivated our decision to combine predictions from individual species into 'hybrid' exportomes [25]. The predicted hybrid exportomes are considerably smaller than the P. falciparum complement (Figure 3e and Additional data file 3). Most significantly, both hybrid exportomes appear to contain only one large (>ten paralogs) lineage-specific family of exported proteins, an as yet unidentified one in the P. vivax/P. knowlesi cluster and the pyst-b family in the P. berghei lineage [26]. Intriguingly,  members of the well-described P. vivax vir family [27] (except the virD subtype) of surface antigens and the related yir/bir/ cir family of the P. berghei lineage lack a discernible PEXEL motif.

Lineage-specific radiation of conserved proteins
Comparison of the P. falciparum exportome with the combined exportomes of the three species from the P. berghei lineage and the two studied from the P. vivax lineage clearly indicates an expansion of exported proteins in the P. falci- parum lineage. This is reflected in the large number of P. falciparum gene families that encode exported proteins. While some gene families appear to be unique to this species (and the closely related P. reichenowi), others are present in the other two lineages either as single copy genes, or, in a few cases (for example PHIST, HYP11, HYP16) as an already radiated gene family.

The FIKK kinases: a novel family of exported P. falciparum proteins
Recently, the identification of a novel class of putative protein kinases has been reported, termed FIKK kinases, in the phylum Apicomplexa [18,19]. The FIKK kinases are expanded in the P. falciparum lineage with at least 6 paralogs in (the incompletely sequenced genome of) P. reichenowi and 20 in P. falciparum (strain 3D7). Although enzymatic activity has not been demonstrated, the presence of most of the conserved residues of the catalytic domain suggests they are functional protein kinases [19]. The 20 P. falciparum paralogs all contain a PEXEL motif following an amino-terminal signal sequence (encoded in a short first exon) [18]. In contrast, the single orthologs from species of the P. berghei and P. vivax lineages lack the first exon encoding the signal sequence, as well as the PEXEL motif. Surprisingly, we found an additional FIKK paralog lacking the first exon and a PEXEL motif in the genome of another P. falciparum strain, a Ghanian isolate that is being sequenced at the Sanger centre (currently eightfold coverage) [28]. This suggests that radiation of the FIKK family was the result of PEXEL conversion of a sequence arising from an ancient gene duplication event in the P. falciparum lineage, with subsequent loss of the ancestral version occurring recently in the 3D7 strain.

A novel family of exported proteins shared between two malaria lineages
As depicted in Table 3, 16 out of 36 genes shared between the two Plasmodium lineages that infect primates belong to a novel gene family we have named PHIST. Initial alignments indicate the presence of a conserved domain of approximately 150 amino acids in length. We used a collection of HMMs constructed from subgroupings of domain sequences to the different Plasmodium species and identified 71 paralogs in P. falciparum, 39 in P. vivax, 27 in P. knowlesi, 3 in P. gallinaceum and 1 each in P. yoelli, P. berghei and P. chabaudi ( Figure 5). The domain itself is predicted to consist of four consecutive alpha helices and does not appear similar to any Chromosomal location of exported P. falciparum proteins and synteny with P. vivax and P. yoelii contigs  known protein sequence (based upon searches of the NR protein database with PHIST HMMs) or structure (as determined by structural modelling based upon the structures of the top five weak similarities detected by the structure threading algorithm Fugue [29]). Interestingly, PHIST domains cluster into three distinct subgroups (termed PHISTa, PHISTb and PHISTc), which are distinguished by the presence and position of several conserved tryptophans. In addition, the three subtypes show different overall structures: PHISTa proteins are very short and consist only of a signal sequence, an export motif and the PHIST domain; PHISTb proteins show more length variability in the carboxyterminal portion following the PHIST domain, and a subset of seven PHISTb proteins, including the well-characterised vaccine candidate RESA (Ring-infected surface antigen [30,31]), contains an additional DnaJ domain; PHISTc proteins are the most diverse group and the only one that radiated before the separation of the P. falciparum and P. vivax lineages ( Figure  5). The genes encoding the three subfamilies are located in subtelomeric regions of all chromosomes, except chromosome 3. Microarray data from P. falciparum strain 3D7 indicate the typical expression pattern of exported proteins for PHISTb and PHISTc; a peak during schizogony and after invasion in the ring stage parasites. In contrast, PHISTa genes are generally not detectably transcribed: only two genes show significant transcription levels.

(b) (a)
PHIST phylogeny and domain map    encodes 24 proteins with DnaJ domains, which is similar to the complement in the yeast Saccharomyces cerevisiae (22 plus 3 J-like proteins [33]). A phylogenetic analysis of all Jproteins from S. cerevisiae, C. hominis and different Plasmodium species indicated two clusters with a lineage-specific expansion of DnaJ proteins (Figure 6a). One cluster represents the radiation of type I or type II (no clear prediction of the central G domain) DnaJ proteins before the divergence of the P. berghei lineage from the P. falciparum and P. vivax lineages (Figure 6b). One P. yoelii, two P. vivax and three P. falciparum genes share the same four-exon structure with a signal sequence encoded in exon 1 and a predicted PEXEL motif encoded in exon 2. In addition, the amino-terminal J domain contains the conserved HPD motif, which is an essential feature of known J-domains in enabling the regulation of Hsp70 activity [33]. A second cluster, including RESA, represents expansion of type III DnaJ proteins in P. falciparum only (Figure 6c). These 15 genes encode highly divergent proteins with various amino acid differences in the crucial HPD motif. S. cerevisiae has three type III DnaJ proteins with similar mutations in the HPD domain (for example, H to Y in position 1, D to E in position 3), which were consequently classified as J-like proteins [33]. Since the function of the three yeast proteins is unknown, it remains to be determined whether the exported P. falciparum proteins with mutations in the HPD domain are capable of stimulating ATP hydrolysis of HSP70 and, therefore, whether they are functional DnaJ proteins. As mentioned earlier, seven proteins of the PHIST protein family are also part of these unusual DnaJ proteins due to the presence of a carboxy-terminal J-domain ( Figures  5 and 6b).

Discussion
Identification of a conserved signal mediating protein export from P. falciparum into the host red blood cell has facilitated the systematic identification of the 'exportomes' of malaria parasites. One goal of this study was the development of a program to characterise the exportomes of the three malaria parasites infecting rodents, P. yoelii, P. berghei and P. chabaudi, and the two malaria parasites infecting primates, P. vivax and P. knowlesi; this has been used to determine the 'core' exportome of the genus Plasmodium. Interestingly, we have identified novel gene families that have radiated to differing extents in these parasites. It is likely that the conserved core exported proteins are important in the remodeling of the host erythrocyte for the different malaria species, while the genes that are specific for the much larger exportome of P. falciparum probably encode proteins that are directly or indirectly involved in the different properties of this parasite.

ExportPred: prediction of Plasmodium exportomes
HMMs [9,35], used here to develop ExportPred for the prediction of the Plasmodium exportome, have been successfully applied to a number of problems in sequence analysis, including signal sequence prediction (SignalP [8], Phobius [36]), transmembrane domain prediction (TMHMM [37]) and secondary structure prediction. In the majority of these cases, the presence of a significant volume of biochemically verified data is offset by the weakness of the signal they aim to detect. Although the situation for Plasmodium export prediction is somewhat reversed -paucity of verified data is made up for by the strength of the signal -the PEXEL motif provides an opportunity for accurate algorithmic prediction. GHMMs extend HMMs by augmenting states with explicit length distributions, allowing the use of a smaller number of states to capture a more general set of features at the expense of an increase in computational complexity and the lack of a procedure for automated training. Features such as the periodicity observed in the signal sequence length are well captured by a single state in a GHMM, whereas to represent the observed set of lengths in a HMM would require a complex group of states with a predetermined topology. The choice of a GHMM based architecture for PEXEL prediction improves upon rulebased methods by utilising a probabilistic framework, which captures all the observed sequence features of PEXEL containing proteins simultaneously, while not excessively increasing complexity. Some differences are observed between previous exportome predictions and those of Export-Pred. In one study the exportome had been hand curated and, therefore, a large proportion of the predictions unique to it can be ascribed to cases in which an incorrect gene model was reannotated, generally to correct a missing or incorrect first exon [6]. Given the similar performance of the motif published in another study to the ExportPred framework [7] ( Table 1) coupled with the discrepancy in predictions, we surmise that their exportome is unnecessarily conservative, and stems to some degree from the reliance on positive SignalP prediction, which does not allow for recessed signal peptides, a common feature of exported proteins.
The vast majority of the predicted exportome ( Figure 3a) is encoded by genes that are located in the subtelomeric regions of most P. falciparum chromosomes, and that tend to have a two-exon structure with the signal sequence in the first exon and the PEXEL motif at the beginning of the second exon. In addition, a large proportion of exported proteins encoded by genes with more than two exons show the same configuration for the first and second exons and of the 36 exported proteins conserved across Plasmodia, 33 are encoded by 2 exon genes ( Table 3). Two competing models may explain this surprising finding. The two-exon structure may represent the ancestral status of genes encoding exported proteins, and consequently the present situation is the result of a large radiation from a single ancestral gene. Alternatively, this specific gene structure could be the consequence of a recombination mechanism, which inserts either the first exon only or, more likely, the whole 'pre-sequence' encoding the signal sequence and the PEXEL motif into existing genes. Pre-sequence acquisition has been described for chloroplast and mitochondriaderived genes transferred to and encoded in the nucleus, and which, therefore, are forced to acquire pre-sequence encoding exons to target the corresponding proteins back to their respective organelle [38][39][40]. Considering the case of the exported FIKK kinases and DnaJ proteins, which are both encoded by genes with this typical structure and which have apparently evolved via an initial gene duplication of a oneexon gene, pre-sequence acquisition by exon-shuffling is the most parsimonious mechanism. A similar model, termed semi-exon shuffling, has recently been proposed for the acquisition of plastid-targeting transit peptides in diatoms [40]. Interestingly, a recent study found an excess of phase 1 introns around signal sequence cleavage sites in human genes [41], while the introns in genes encoding exported P. falciparum proteins are predominantly in phase 0.

The 'core' exportome: novel families and conserved loci
The ExportPred output from a whole genome analysis of C. hominis, and the lack of conserved exported proteins encoded in other apicomplexan genomes such as T. gondii and T. parva (T Sargeant, unpublished observation), indicate that the evolution of the PEXEL motif, in its described form, is specific to the Plasmodium genus. Recently, a connection has been drawn between the Plasmodium PEXEL motif and a short amino-terminal sequence (RxLR) observed in three effector proteins of the oomycete plant pathogen Phytophthora [42]. Thus far, no experimental evidence supporting its function in Phytophthora exists and the lack of any similar motif in organisms much more closely related to the genus Plasmodium makes it unlikely that the Phytophthora motif is evolutionarily related to the PEXEL motif.
Although the export motif is conserved in the majority of orthologous groups, there are a few exceptions where the PEXEL has apparently been acquired in a lineage-specific manner. The radiation of FIKK kinases and two subfamilies of putative a/b hydrolases (our unpublished observation) in the P. falciparum lineage suggest an initial gene duplication event of a conserved, non-secreted protein specifically in this lineage. Interestingly, we have identified predicted proteins in P. vivax and P. knowlesi with high homology to the central region of KAHRP. The KAHRP amino terminus consists of a signal sequence and a PEXEL motif for entry into the secretory pathway and subsequent export into the host cell ( Figure  7a). A histidine-rich region (His-domain) downstream of the PEXEL motif and two charged repeat regions in the carboxyl terminus of the protein are implicated in Maurer's clefts binding and association of KAHRP with PfEMP1, respectively [1,43]. In contrast, the putative P. vivax and P. knowlesi KAHRP orthologs contain neither a discernible PEXEL motif or His-domain, and the repeat region is much shorter and of different amino acid composition. However, a KAHRP ortholog from P. fragile with 97% identity to PfKAHRP across the whole length has recently been deposited to GenBank (accession number CAB96390) [44]. The lack of a nonexported KAHRP ancestor in P. reichenowi and P. falciparum suggests a lineage-specific conversion of a common KAHRP ancestor into an exported protein associated with the knob structure, or gene loss after an initial gene duplication event. It is interesting to note that knob formation and cytoadherence have been demonstrated in P. fragile but not in P. vivax or P. knowlesi [45]. In P. falciparum, it has recently been shown that absence of knobs due to deletion on chromosome 2 results in reduced surface expression of PfEMP1 and, therefore, reduced cytoadherence [46]. However, it remains to be shown which of the four genes apart from rif and var genes (KAHRP itself, PfEMP3, the two DnaJ typeI/II gene products PFB0085c and PFB0090c, or the PHISTc paralog PFB0080c) on the deleted chromosome end are responsible for this effect.
Radiation of DnaJ proteins in Plasmodium appears to have occurred twice independently. Our presumption is that initial gene duplication in a common ancestor of all three parasite lineages generated one DnaJ type I/II protein exported into the host erythrocyte. Duplications occurring after the divergence of the P. berghei lineage gave rise to a second copy in the P. falciparum and P. vivax lineages and a third copy in P. falciparum -though whether one of these duplications occurred before P. falciparum speciation is not clear. Although orthology is not able to be determined based upon protein sequence, indications from synteny observations suggest PFE0055c to be the ancestral exported type I DnaJ protein. Expansion of exported DnaJ type III proteins (including those with an additional PHIST domain), on the other hand, has occurred in P. falciparum only. Although the strict conservation of the crucial HPD motif in all type I/II and four type III proteins predicts them to be functional co-chaperones of HSP70, no such P. falciparum protein has a putative export motif. The P. falciparum genome encodes six proteins with HSP70 domains: two appear to be cytoplasmic proteins (PF08_0054, chr7_000093.glm_5), one contains a predicted mitochondrial transit peptide (PF11_0351), two contain carboxy-terminal ER retrieval sequences (including PfBIP [47]; PFI0875w, MAL13P1.540), and one is the P. falciparum ortholog (PF07_0033) to the cytoplasmic HSP105. A recent report demonstrated the recruitment and ATP-dependent function of host-derived HSP70 on the surface of infected red blood cells [48]. Moreover, the chaperone co-migrated in a large complex with KAHRP, suggesting localisation to the knobs. It is possible, therefore, that exported DnaJ proteins function as co-chaperones and with host HSP70 are involved in refolding highly complex molecules such as the virulence factor PfEMP1. Interestingly, a subgroup of DnaJ type III proteins also contains a PHISTb domain. The proteins in this subgroup, which includes RESA, contain a slightly relaxed PEXEL motif, RxLxxE (which is not currently modelled by ExportPred) and an unusually recessed hydrophobic signal sequence, which SignalP does not predict to be proteolytically cleaved. The expression of RESA (and all the other paralogs (see Additional data file 2)) peaks during late schizogony, and it has been demonstrated that RESA is resident in the dense granules before accumulating in the par- asitophorous vacuole and being translocated into the host cell after invasion. In light of this, it is possible that RESA, and the other paralogs, are translocated into the ER as a signalanchored protein and subsequently trafficked to dense granules. Cleavage of the anchor presumably then occurs in the dense granules or after entry into the parasitophorus vacuole, allowing recognition of the PEXEL by the parasitophorus vacuole membrane translocation machinery [49].
The PHIST protein family, identified for the first time in this work, is organised into three subfamilies, of which PHISTc is entirely shared between the P. falciparum and P. vivax lineages and, in single copy form, in the three species of the P. berghei lineage. The PHISTb subfamily appears to contain only two members in P. vivax and P. knowlesi, but has radiated extensively in P. falciparum. The PHISTb subfamily contains a P. falciparum specific subgroup including RESA that is distinguished by the additional presence of a DnaJ domain. In contrast to PHISTb and PHISTc, PHISTa is entirely P. falciparum specific. It is difficult to assign possible functions to PHIST proteins as none of the members of this family have previously been described (apart from the unusual RESA antigen), and the helical structure of the prototypical PHIST domain appears dissimilar to any known structure. It is clear, however, that PHISTc performs a function common to the genus Plasmodium, but which is expanded in the P. falciparum and P. vivax lineages, while the function of PHISTa is specific to P. falciparum. A recent yeast-two hybrid based analysis of a global protein interaction network in P. falciparum indicates a PHISTb and PHISTc paralog (PFD1140w and MAL7P1.7, respectively) plus two PHISTb paralogs with a DNAJ domain (RESA and PF11_0509) being part of a interaction network with skeleton-binding protein [50] in the centre [51]. While most PHISTb and PHISTc genes show transcriptional peaks mainly in early asexual stages, three recently published microarray studies [52][53][54] indicate that a number of genes from these two subfamilies (plus most paralogs of the PfMC-2TM family, and an infected red blood cell membrane protein PIESP2 [55]) are specifically upregulated in early gametocyte stages (see Additional data file 2). Interestingly, the same PHISTb and PHISTc genes also appear upregulated in parasites directly extracted from infected patients compared to the culture-adapted 3D7 [56]. Genes of the PHISTa subgroup on the other hand appear transcriptionally silent in 3D7 (with the exception of the unusal PFD0090c and one other member, PFL2565w), and it is therefore possible that PHISTa genes are subject to mutually exclusive expression (as is the case for the var genes). Proteomic data show a consistent presence of PHIST proteins in parasite-infected erythrocyte membrane fractions [55,57,58], and the PHISTc paralog PFI1780w has been detected in detergent-resistant membrane fractions [59]. PHIST HMMs do not detect sequence homology with any non-Plasmodium protein in the NR database, and a sequence-structure homology search, followed by structure modelling using the best hits, also produced no acceptable results. As such, it is possible that this domain represents a novel protein fold specific to Plasmodium, although it should be noted that the weak homology between members of the PHIST family in P. falciparum alone suggests that any structural similarity is likely to be hard to detect on the basis of primary sequence. Finally, the topology of the PHIST tree is a strong indication for a closer phylogenetic relationship between the P. falciparum and the P. vivax lineage than either of those and the P. berghei lineage.
Apart from the conserved families, only a few genes encoding hypothetical exported proteins are conserved in P. falciparum and one or both of the two other studied lineages. Importantly, both single copy genes and paralogs from the three conserved families PHIST, HYP11 and HYP16 are predominantly clustered in the subtelomeric regions of P. falciparum chromosomes 2, 9 and 10. Completion of the whole genome sequencing and chromosomal mapping of P. falciparum [60] and P. yoelii [61] confirmed previous chromosomal mapping studies that indicated a high degree of synteny among different Plasmodium species. Apart from the subtelomeric regions, gene synteny and order appear to be conserved over large stretches between species within the P. berghei lineage, and to a lesser extent between P. falciparum and P. vivax [62,63]. On the other hand, it has been demonstrated that genes encoding variant surface antigens and exported proteins are located subtelomerically, and that these regions exhibit the highest genomic plasticity [6,64]. A comparison of chromosome 2 has shown that single nucleotide polymorphism (SNPs) between different P. falciparum isolates are clustered in the subtelomeric regions [65]. In this study, we have analysed the subtelomeric regions of chromosomes 2 and 10, which contain many of the conserved genes encoding exported proteins. Synteny between large P. vivax contigs and P. falciparum chromosome 2 breaks down where a large increase in SNPs between isolates has been reported [65]. In addition, one subtelomeric region of chromosome 10 appears to be the result of a lineage-specific chromosomal rearrangement with a subtelomeric portion of chromosome 3 (Figure 4b).
Plasmodium KAHRP orthologs and structural similarities of exported proteins Figure 7 (see previous page) Plasmodium KAHRP orthologs and structural similarities of exported proteins. (a) Structural alignment of P. falciparum KAHRP and orthologs from P. fragile and P. vivax. Note that PfrKAHRP and PvKAHRP-c represent incomplete sequences (amino terminus represented as dashed line). (b) Structural representation of P. falciparum exported protein families. Secondary structure prediction was performed using the consensus method Jnet [84,85]. Exported proteins: conserved structure -conserved function?
The clustering of exported P. falciparum proteins into paralogous groups revealed a large number of novel families, some of which are even conserved between different Plasmodium lineages (for example, Hyp11, Hyp16, PHIST). Strikingly, the majority of these proteins, including Rifin, Stevor, PfMC-2TM and PHIST, are between 250 and 300 amino acids in length and include mainly helical stretches (Figure 7b). Altogether, approximately 75% of all exported proteins (excluding PfEMP1) conform to this structure. Moreover, four novel protein families (Hyp7/8/13/15) as well as Rifin, Stevor and PfMC-2TM contain two transmembrane domains, which are separated by three (PfMC-2TM) to 170 amino acids (Rifin). Based on the presence of the two closely positioned transmembrane domains, a P. falciparum-specific superfamily encoded by subtelomeric multicopy gene families has been postulated [17]. In addition, another study suggested a common ancestry for the large antigenic families present in the three species of the P. berghei lineage (yir/bir/cir), P. vivax (vir) and P. falciparum (Rifin) based on their structural similarities [66]. In this context, it is interesting to note that unlike yir/bir/cir and vir proteins, which all lack an export motif, the virD subfamily paralogs show the same overall structure as Rifin, including the amino-terminal signal sequence and export motif (also predicted by ExportPred with a score ≥ 4.3). However, it remains to be shown whether virD is indeed the link between Rifin and the rest of this superfamily, that is, whether this postulated relationship is genuine. The presence of two (carboxy-terminal) transmembrane domains raises questions about the orientation and function of these proteins in their respective target membrane. It has been proposed that the highly variable portion flanked by these two hydrophobic stretches forms a loop, which is exposed to the cytosol (in the case of Maurer's clefts resident proteins) or the host environment (in the case of Rifin and probably Stevor) [17]. In general, the similarity in structure of these protein families suggests that they have either been derived from the same ancestral gene or, alternatively, that these similarities are the result of convergent evolution reflecting the structural constraints of proteins exported to the host cell.

Conclusion
Synteny of conserved clusters between P. vivax and P. falciparum (and to a certain extent P. yoelii) demonstrates the presence of an ancestral set of exported proteins located on a few conserved loci present in a common ancestor. The fact that most exported proteins in the different exportomes share a common secondary structure with the few conserved ones (short, soluble, one or two carboxy-terminal transmembrane domains, multiple alpha helices) strongly suggests that these ancestral core proteins were the template for subsequent lineage-and species-specific radiations. Conservation of the core set of exported proteins in Plasmodium strongly sug-gests that they are important in remodeling of the parasiteinfected erythrocyte. In contrast, the large set of proteins that appear to be unique to specific Plasmodia are likely to provide distinct functions for survival in host specific erythrocytes. It is interesting that P. falciparum appears to have a greatly expanded exportome compared to other Plasmodia and the most likely reason for this striking observation is the presence of PfEMP1 in P. falciparum that is trafficked to the surface of the infected erythrocyte. PfEMP1 is a complex multidomain protein and the presence of a family of putative co-chaperones containing DnaJ domains may be involved in the translocation process and correct conformation of this functionally important virulence protein. The identification of a conserved set of exported proteins in Plasmodium strongly supports the idea that these are functionally critical and are potential drug targets for the development of a novel class of drugs against this important infectious disease.

Training set
The training set of 166 sequences was taken from [6] and contained manually curated exported protein sequences encoded by genes with two exons, expressed early in the intra-erythrocytic cycle and/or during late schizogony. Of these, 23 have been directly validated, or have proteomic evidence for infected red blood cell (iRBC) localisation. A further 35 are paralogous to these 23. This set of sequences did not contain any members of the exported Rifin or Stevor families, which were used as a test set. For training the KLD path, all annotated PfEMP1 sequences from P. falciparum 3D7 were used. We have produced a retrained version of ExportPred using this conservative subset; examination of the output of this retrained version revealed only negligible differences from the version trained on the full set of sequences.

Simulated negative sets
Simulated negative sets were constructed using a second order Markov model trained on P. falciparum protein sequences. Acknowledging the possibility of contamination, complete single exon genes with a non-subtelomeric chromosomal location (>500 kilobases from a telomere) were used as a non-simulated negative set (PfNegative; 1,292 sequences). Simulated test sets containing signal sequences but no PEXEL were also created by four methods: using the second order Markov model (NoSS; 1,000 sequences); using the Sig-nalP model to simulate signal sequences, concatenated with a tail simulated using the second order Markov model (SpSS; 1,000 sequences); using amino termini from real proteins with a predicted signal sequence, one exon, and a non-subtelomeric chromosomal location followed by a simulated tail (PfSS; 165 sequences); and using the ExportPred signal sequence model to simulate signal sequences, followed by a tail simulated using the second order Markov model (EpSS; 1,000 sequences).

ExportPred
ExportPred utilises a GHMM to score protein sequences according to their likelihood of export. The architecture of the ExportPred GHMM (Figure 1a) consists of three paths. Two paths model, respectively, RLE (RxLxE/Q) PEXELs (of which Rifins and Stevors are members) and KLD (KxLxD) PEXELs (of which PfEMP1 is the only described example) and the third provides a background model. A simple model for signal peptides was selected, capturing only the observed leader and hydrophobic signal sequence length distributions and amino acid composition. RLE type PEXELs were thus modelled by five states corresponding to the leader, the hydrophobic stretch of the signal sequence, spacer, the RxLxE/Q motif and the carboxy-terminal portion of the protein. KLD type PEX-ELs consist of a shorter leader sequence, followed immediately by an instance of a KxLxD motif, which is more highly conserved than the RxLxE/Q motif. It is hypothesised that the transmembrane domain immediately amino-terminal of the acidic terminal sequence (ATS) acts in ER targeting, and has, therefore, also been included in the model. The null model consists of a single state modelling the average length and amino acid composition of the P. falciparum proteome. All three paths are tied together by a state representing the amino-terminal methionine, and transition probabilities were chosen to reflect our prior belief concerning the relative abundances of proteins in the three captured classes. An RLE (or KLD) score is derived by taking the log ratio of the joint probability of the sequence and the final state being RLE-tail over the joint probability of the sequence and the final state being NULL-tail.
Due to the lack of well annotated and validated training data, as well as the lack of parameter estimation procedures for GHMMs, all model parameters for ExportPred were derived empirically. Amino acid distributions for the five positions of the RxLxE/Q and KxLxD motifs, as well as for the amino acid positions immediately preceding and following, were derived from the multiple alignment presented in [6]. Hydrophobic regions were modelled initially by an amino acid composition extracted from the hydrophobic stretches of the training sequences using a windowed approach, and with a length distribution allowing, with equal probability, any length between 10 and 25 residues. All other states were modelled with background amino acid compositions and approximate geometric length distributions. Using this initial model, residues in the sequences of the training set were classified according to the most likely state path for the sequence, and these classifications were then used to update the length and emission distributions of the model states. Length distributions for nongeometric states were smoothed using a gaussian kernel [68]. A kernel bandwidth of 1.0 was used for all states except for the RLE leader, where a bandwidth of 2.0 was used due to the greater variability in observed lengths. Variants of the ExportPred model were also produced that replaced the signal sequence or PEXEL portion of the model with the signal peptide model from SignalP (with the addition of a more flexible leader state to allow for recessed signal peptides) and/or the 11 residue motif described in [7], respectively. ROC curves [69] for the standard ExportPred model are shown in Figure  1b.
To compute a score threshold, we used the training set and the PfNegative set for positives and negatives, respectively, and, after correcting for our prior belief that approximately 10% of the P. falciparum is exported, generated a curve describing false discovery rate (FDR) as a function of score (Figure 1c). From this curve, a score threshold of 4.3 was selected, corresponding to a 5% FDR [70], meaning that we expect 5% of the resulting set of ExportPred predictions above this threshold to be false positives. A 5% FDR corresponds in this case to a false positive rate for individual sequences of approximately 0.5%. A threshold of 0.0 for the KLD path correctly identifies all var genes with a zero false positive rate. ExportPred is implemented in C++, and has been successfully compiled using recent versions of g++ on Linux and OSX. Source code is available on request.
Compilation of the P. falciparum exportome P. falciparum automated predictions and annotations were scored using ExportPred. Automated predictions and annotated genes were grouped where their chromosomal locations were congruent. Ties in which an automated prediction overlapped more than one annotation were broken by selecting the annotation with the greatest sequence overlap. To address the problem of misannotation of first exons, the gene model with the highest PEXEL score was selected from each group as a representative.

Ortholog clusters
Ortholog pairs were determined from an all by all BLAST comparison (ignoring pairwise comparisons within a species) of the Plasmodium and Cryptosporidium sequence sets using the reciprocal best hit method [71,72]. Ortholog pairs in which one or both sequences had a positive PEXEL prediction were selected. Sequences were then grouped based upon these pairwise orthology relationships. Clusters were chosen such that all members were connected to all other members by orthology relationships either directly, or through a connected chain of orthology relationships (for example, inferring orthology between sequences in species A and C, where the sequence from species A is orthologous to a sequence from species B, which is also orthologous to the sequence in species C). Sequences in each cluster were aligned using MUSCLE [73].

Clustering of P. falciparum protein sequence into families
Pairs of P. falciparum sequences were selected from the all by all comparison where there existed between the two sequences a reciprocal pair of BLAST hits with an E-value <10 -5 and with subject residues aligned to at least 50% of query residues, and where at least one of the pair was predicted to be exported. Connected subgraphs were selected to produce family groups in a manner analogous to that described above for ortholog clusters. Multiple alignments of family groups were constructed with MUSCLE and inspected to verify clustering. A number of already annotated families (Rifins [5,15], Stevors [16], PfMC-2TM proteins [17], FIKK kinases [18,19]) were checked to ensure that whole families had been retrieved by this method. Families for which no annotation was available for any members were assigned a HYPn (for HYPothetical family number n) temporary identifier.

Determination of P. falciparum exportome expression patterns
Raw Affymetrix expression data from [14,52] were reanalysed using the RMA [74] and GCRMA [75] algorithms. For GCRMA, we used only the probe affinities in the background correction step, as mismatch probes were not available for this array. The analysis was carried out using the Bioconductor packages affy and gcrma [76]. For all stages, the maximum expression from sorbitol and heat treatments was chosen. Additionally, early and late stages of ring, trophozoite and schizont stages were summarised by maximisation. A probe intensity of 5 was taken as a conservative estimate of expression.

Hybrid exportomes for the P. berghei and the P. vivax lineage
Proteins from P. yoelli, P. berghei, P. chabaudi, P. vivax and P. knowlesi predicted to be exported by ExportPred were curated to remove false positives caused by poor gene models. Each set was BLASTed against ORFs from genomes of the species represented in the set. Pairs of sequences with homology (<10 -5 ) to similar sets of ORFs (determined by requiring that at least 50% of hits to ORFs by either sequence were shared by both) were clustered to simultaneously detect putative orthologs and families.

DnaJ domain phylogeny
Sequences of DnaJ domains, including 10 amino acids amino-and carboxy-terminal to the domain were extracted from Plasmodium and Cryptosporidium sequence sets using hmmsearch (from the hmmer [35] package) with DnaJ HMM (PFAM accession: PF00226. 16) and e-value cut off of 0.01 and aligned using hmmalign. Bootstrapped neighbour joining was used to choose a tree topology, and branch lengths were assigned using least squares error minimisation. Two subtrees were noted to contain sequences from P. falciparum proteins predicted to be exported. These subtrees were extracted, ensuring that one C. hominis sequence remained as an outgroup. Full length sequences were realigned using MUSCLE, and the resulting multiple alignment used to construct new trees using both neighbour joining and maximum likelihood puzzling (using TREE-PUZZLE [77]). TREE-PUZ-ZLE was preferred over neighbour joining for small phylogenies, where it was computationally tractible and where neighbour joining did not produce meaningful results. For the exported DnaJ type I subtree, it was necessary to manually correct gene models for the three P. vivax and two P. yoelii sequences used. Secondary structure of individual domain instances was predicted using PSIPRED [78], from which a consensus structure prediction was derived for the multiple alignment.

Bioinformatic analysis of the PHIST family
The PHIST family was initially detected as a single family (with the temporary identifier HYP3) by the family clustering algorithm described. From this initial group, consisting of 44 members, a multiple alignment was created using MUSCLE, and the conserved portion (which corresponds to a domain predicted by PSIPRED to consist of 4 α-helices separated by short coils) was selected to build a HMM using hmmbuild.
Observations based upon the content of the PEXEL motif and upon phylogenetic analysis led us to further divide this group into a-and b-subfamilies, and HMMs covering the same conserved stretch were built for these two subfamilies. The three resulting HMMs were used to search the P. falciparum proteome, expanding the family membership. Any new sequence matching the superfamily HMM but not either of the a-and b-subfamily HMMs was assigned to a new c-subfamily. A HMM was built for the c-subfamily and used to expand its membership. Once complete lists of sequences in the three subfamilies were obtained, the three subfamily HMMs were recreated utilising sequence from all their respective members. Limited cross-recognition was observed between the band c-subfamily HMMs, and no cross-recognition was observed between the a-subfamily HMM and either of the band c-subfamily HMMs. The three subfamily HMMs were used to search the combined Plasmodium and Cryptosporidium database previously described, as well as the National Center for Biotechnology Information (NCBI) nr database. No significant hits existed outside the Plasmodium genus. Domains from all sequenced Plasmodium species were aligned using MUSCLE and used to create a bootstrapped neighbour joining tree. Branch lengths were assigned using least squares error minimisation, and P. gallinaceum sequences were used as an outgroup.

Generation and analysis of GFP fusion proteins
Sequences encoding the amino-terminal portion of PFI1755c, PFE0055c, PFI1780w, PFE0360c, PF10_0321, PF14_0607 and PFE0355c were amplified from cDNA of an asynchronous P. falciparum culture, strain 3D7. Each of the sequences encodes the native amino terminus of the protein including the signal sequence, the predicted PEXEL motif plus an additional 11 amino acids downstream to allow for proper folding of GFP [10]. PCR products were subcloned (for primers see Additional data file 3) into the pENTR/D TOPO vector as part of the MultiSite Gateway™ cloning strategy (Invitrogen, Carlsbad, CA, USA), which has recently been adapted for P. falciparum expression vectors [79]. Expression vectors were generated by recombination of the destination vector pCHDR-3/4 with the three entry clones PfHSP86 5'-pENTR4/1, eYFP-pENTR2/3 and the gene-specific pENTR/ D TOPO vector (for details of vector construction see [79]). Expression vectors were transfected into the P. falciparum line 3D7 and selected on 10 nM of the antifolate drug WR99210 as described previously [80]. Transgenic parasites were cultured under standard conditions [81] and analysed using a Zeiss Axioskop fluorescence microscope (Zeiss, Jena, Germany).