ngLOC: an n-gram-based Bayesian method for estimating the subcellular proteomes of eukaryotes
© King and Guda; licensee BioMed Central Ltd. 2007
Received: 7 November 2006
Accepted: 1 May 2007
Published: 01 May 2007
We present a method called ngLOC, an n-gram-based Bayesian classifier that predicts the localization of a protein sequence over ten distinct subcellular organelles. A tenfold cross-validation result shows an accuracy of 89% for sequences localized to a single organelle, and 82% for those localized to multiple organelles. An enhanced version of ngLOC was developed to estimate the subcellular proteomes of eight eukaryotic organisms: yeast, nematode, fruitfly, mosquito, zebrafish, chicken, mouse, and human.
Subcellular or organellar proteomics has gained tremendous attention of late, owing to the role played by organelles in carrying out defined cellular processes. Several efforts have been made to catalog the complete subcellular proteomes of various model organisms (for review [1, 2]), with the aim being to improve our understanding of defined cellular processes at the organellar and cellular levels. Although such efforts have generated valuable information, cataloging all subcellular proteomes is far from complete. Experimental methods can be expensive, often generating conflicting or inconclusive results because of inherent limitations in the methods [3, 4]. To complicate matters, computational methods rely on these experimental data, and therefore they must be resilient to noisy or inconsistent data found in these large datasets. These dilemmas have made the task of obtaining the complete set of proteins for each subcellular organelle a highly challenging one.
In this study we address the task of estimating the subcellular proteome through development of a computational method that can be used to annotate the subcellular localization of proteins on a proteomic scale. A fundamental goal of computational methods in bioinformatics research is to annotate newly discovered protein sequences with their functional information more efficiently and accurately. Protein subcellular localization prediction has become a crucial part of establishing this important goal. In this task, predictive models are inferred from experimentally annotated datasets containing subcellular localization information, with the objective being to use these models to predict the subcellular localization of a protein sequence of unknown localization.
The methods developed for predicting subcellular localization have varied significantly, ranging from the seminal work by Nakai and Kanehisa on PSORT, which is a rule-based system derived by considering motifs and amino acid compositions; to the purely statistical methods of Chou and Elrod, which employed covariant discriminant analysis; to the numerous methods available today, which are based on a variety of machine learning and data mining algorithms, including artificial neural networks and support vector machines (SVMs) [7, 8]. All methods must choose a set of features to represent a protein in the classification system. Although the majority of methods use various facets of information derived from the sequence, others use phylogenetic information, structure information, and known functional domains. Some methods scan documents and annotations related to the proteins in their dataset in search of discriminative keywords that can be used as predictive indicators [12, 13]. Regardless of the representation, the sequence of a protein contains virtually all of the information needed to determine the structure of the protein, which in turn determines its function. Therefore, it is theoretically possible to derive much of the information needed to resolve most protein classification problems directly from the protein sequence. Furthermore, it has been proposed that a significant relationship exists between sequence similarity and subcellular localization, and the majority of protein classification methods have capitalized on this assumption.
In addition to different classification algorithms and protein representation models, subcellular localization prediction methods also differ in exactly what they classify. Some consider only one or a few organelles in the cell [15, 16]. Others consider all of the major organelles [5, 6, 8, 11]. Methods often limit the species being considered, such as the PSORTb classifier for gram-negative bacteria. Others limit the type of proteins being considered, such as those related to apoptosis. We refer the interested reader to a review by Dönnes and Höglund, which provides an overview of the various methods used in this vast field.
High-throughput proteomic studies continue to generate an ever-increasing quantity of protein data that must be analyzed. Hence, computational methods that can accurately and efficiently elucidate these proteins with respect to their functional annotation, including subcellular localization, at the level of the proteome are urgently needed. Although a variety of computational methods are available for this task, very few of them have been applied on a proteome-wide scale. The PSLT method, a Bayesian method that uses a combination of InterPro motifs, signaling peptides, and human transmembrane domains, was used to estimate the subcellular proteome on portions of the proteome of human, mouse, and yeast. The method of Huang and Li, a fuzzy k-nearest neighbors algorithm that uses dipeptide compositions obtained from the protein sequence, was used to estimate the subcellular proteome for six species over six major organelles.
Despite the availability of an array of methods, most of these are not suitable for proteome-wide prediction of subcellular localization for the following reasons. First, most methods predict only a limited number of locations. Second, the scoring criteria used by most methods are limited to subsets of proteomes, such as those containing signal/target peptide sequences or those with prior structural or functional information. Third, the majority of methods predict only one subcellular location for a given protein, even though a significant number of eukaryotic proteins are known to localize to multiple subcellular organelles. Fourth, many methods lack a balance between sensitivity and specificity. Fifth, the datasets used to train these programs are not sufficiently robust to represent entire proteomes, and in some cases they are outdated or altered. Finally, many methods require the use of additional information beyond the primary sequence of the protein, which is often not available on a proteome-wide scale.
In this report we present ngLOC, a Bayesian classification method for predicting protein subcellular localization. Our method uses n-gram peptides derived solely from the primary structure of a protein to explore the search space of proteins. It is suitable for proteome-wide predictions, and is also capable of inferring multi-localized proteins, namely those localized to more than one subcellular location. Using the ngLOC method, we have estimated the sizes of ten subcellular proteomes from eight eukaryotic species.
We use a naïve Bayesian approach to model the density distributions of fixed-length peptide sequences (n-grams) over ten different subcellular locations. These distributions are determined from protein sequence data that contain experimentally determined annotations of subcellular localizations. To evaluate the performance of the method, we apply a standard validation technique called tenfold cross-validation, in which sequences from each class are divided into ten parts; the model is built using nine parts, and predictions are generated and evaluated on the data contained in the remaining part. This process is repeated for all ten possible combinations. We report standard performance measures over each subcellular location, including sensitivity (recall), precision, specificity, false positive rate, Matthews correlation coefficient (MCC), and receiver operating characteristic (ROC) curves. MCC provides a measure of performance for a single class being predicted; it equals 1 for perfect predictions on that class, 0 for random assignments, and less than 0 if predictions are worse than random. For a measure of the overall classifier performance, we report overall accuracy as the fraction of the data tested that were classified correctly. (All of our formulae used to measure performance are briefly explained in the Materials and methods section [see below], with details provided in Additional data file 1.) To demonstrate the usefulness of our probabilistic confidence measures, we show how these measures can be used to consider situations in which a sequence may have multiple localizations, as well as to consider alternative localizations when confidence is low.
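As an illustration of the per-class measures listed above, the following minimal sketch computes them from one-vs-rest confusion counts using the standard textbook definitions. The exact formulae used in the study are given in Additional data file 1 and may differ in detail; the function names and the zero-denominator convention here are assumptions made for the example.

```python
import math


def class_metrics(tp, fp, tn, fn):
    """Standard one-vs-rest measures for a single class.

    tp, fp, tn, fn are the true/false positive/negative counts for that class.
    Ratios with a zero denominator are reported as 0.0 (an assumption).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),   # recall
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
        "false_positive_rate": ratio(fp, fp + tn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of tested sequences whose predicted class is correct."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

With these definitions, MCC is 1 for a perfectly predicted class, 0 when predictions are no better than random, and negative when they are worse than random, matching the interpretation given in the text.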
Evaluation of different size n-grams
Prediction performance using a 7-gram model
Table: Results for 7-gram model using entire dataset. Summary rows: single-localized % overall accuracy; multi-localized % overall accuracy (at least 1 correct); multi-localized % overall accuracy (both correct).
Referring to Table 1, precision was high across all classes (0.81 to 0.96), whereas sensitivity ranged from 0.75 to 0.96, with the exception of golgi (GOL; 0.55) and cytoskeleton (CSK; 0.45), which is probably due to their low representation in the dataset. Although CSK and GOL had the lowest sensitivity, their precision was very good, which is typical when a class is under-predicted. Specificity was very high across all classes (0.95 to 1.0), although the classes with the largest representation in the dataset, namely extracellular (EXC), plasma membrane (PLA), nuclear (NUC), and cytoplasm (CYT), had the lowest specificity, which is typical for highly represented classes that are prone to over-prediction. Regardless, the MCC values for these four classes were still between 0.78 and 0.92. At the other end are the classes with the smallest representations in the dataset, including lysosome (LYS), peroxisome (POX), CSK, and GOL, whose MCC values ranged between 0.63 and 0.90. Surprisingly, LYS and POX, the two classes with the smallest representation in the dataset, had good MCC values (0.902 and 0.836, respectively). We determined the percentage of n-grams that were unique (occurred in only one organelle) in each of these four organelles (LYS, POX, CSK, and GOL) and discovered that LYS and POX had the highest percentage of unique n-grams with respect to the total number of n-grams in the organelle (data not shown). This suggests that the proteins in these locations are highly specific and distinctive compared with those proteins localized elsewhere, and could explain the superior performance of these locations despite their having the smallest representation in the training dataset. We also observed that CSK and GOL had the lowest percentage of unique n-grams of any class in the data, suggesting that n-grams in these organelles are more likely to be shared with other organelles, and therefore that the proteins in these organelles will be difficult to predict. The remaining classes performed well, with MCC values of 0.87.
To obtain the sensitivity for multi-localized sequences, we consider two types of true positive measures: at least one of the two localizations had the highest probability, and both localizations had the top two probabilities. The overall accuracy of at least one localization being correctly predicted was 81.88%, and for both localizations being correctly predicted it was 59.7%. When considering the accuracy of both localizations being predicted to be within the top three most probable classes, the accuracy increased to 73.8%, suggesting that this method is useful in predicting multi-localized sequences.
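The two true-positive criteria for multi-localized sequences can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation; the function names, and the representation of a prediction as a dictionary of class probabilities, are assumptions.

```python
def rank_classes(probabilities):
    """Class labels sorted from most to least probable."""
    return sorted(probabilities, key=probabilities.get, reverse=True)


def at_least_one_correct(probabilities, true_pair):
    """True if the single most probable class is one of the two annotated ones."""
    return rank_classes(probabilities)[0] in true_pair


def both_within_top(probabilities, true_pair, k=2):
    """True if both annotated classes are among the k most probable classes."""
    return set(true_pair) <= set(rank_classes(probabilities)[:k])


def multi_localized_accuracy(predictions, k=2):
    """predictions: list of (probabilities, true_pair) tuples.

    Returns (fraction with at least one correct, fraction with both in top k).
    """
    total = len(predictions)
    one = sum(at_least_one_correct(p, t) for p, t in predictions)
    both = sum(both_within_top(p, t, k) for p, t in predictions)
    return one / total, both / total
```

Calling `multi_localized_accuracy(predictions, k=3)` gives the relaxed measure in which both localizations need only fall within the top three most probable classes.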
Evaluation of the confidence score
Table: Benchmarking the performance of ngLOC (7-gram) against its confidence score. Headings: % of dataset; % overall accuracy; cumulative % of data; cumulative % overall accuracy.
Table: Rank of correct class for single-localized and multi-localized sequences using a 7-gram model. Headings: rank of correct class; CYT-NUC: 1 correct; CYT-NUC: both correct; all multi-localized: 1 correct; all multi-localized: both correct.
Evaluation of the multi-localized confidence score
Table: Evaluation of MLCS against single-localized and multi-localized sequences. Headings: % of single-localized data; cumulative %, single-localized data; % of multi-localized data; % overall accuracy, multi-localized sequences only; cumulative %, multi-localized data; cumulative % accuracy, multi-localized sequences only.
Examples of prediction for multi-localized sequences
Comparing ngLOC with other methods
We evaluated the performance of ngLOC by comparing it with that of existing methods. Comparisons were made in three ways: by using the ngLOC dataset to train and test other methods; by testing ngLOC on another dataset; and by training and testing ngLOC on another dataset.
For our next comparative test, we found a similar dataset that has been used by the research community, namely PLOC (Protein LOCalization prediction). The primary differences between our data and PLOC's are in the version of the Swiss-Prot repository from which the sequences were acquired, the level of sequence identity assumed in the dataset, and the multi-localized annotation in our dataset. Sequences with up to 80% identity were allowed in the PLOC dataset, whereas all sequences with less than 100% identity were allowed in the ngLOC dataset. We disregarded sequences from the PLOC dataset that are localized to the chloroplast and vacuole, because we do not consider plant sequences. We built both a 6-gram and a 7-gram model using our entire dataset, and used the PLOC dataset for testing purposes. We obtained overall accuracies of 88.04% and 85.64%, respectively, both of which compared favorably with the 78.2% overall accuracy reported by PLOC. It is important to note that the optimal value of n in ngLOC is dependent on the amount of redundancy in the data being tested. The 6-gram model performed better than the 7-gram model, which is consistent with the lower redundancy in the PLOC dataset than in the ngLOC dataset. We observed that some predictions with a CS of 90 or greater were nevertheless misclassified by ngLOC. We discovered that all misclassifications at this level of confidence were due to incorrect annotation, probably because the PLOC dataset is outdated (see Additional data file 1 [Supplementary Table 1] for some examples). Each one was verified in the latest Swiss-Prot entry as matching our prediction. We also found instances in which sequences counted as misclassified by ngLOC were actually multi-localized and should have been considered correct as well (Additional data file 1 [Supplementary Table 2]). Our performance results are reported without correcting any annotations in the PLOC dataset. We believe that updated annotations in the PLOC dataset, as well as updates that label multi-localized sequences, would further improve the accuracy of ngLOC on the PLOC dataset.
For our final comparative test, we modified ngLOC to predict 12 distinct classes, and used the complete PLOC dataset (with original annotations and all 12 localizations) for both training and testing on our method, using a 10-fold cross-validation for performance analysis. On a 6-gram model, the overall accuracy was 82.6%, which again compared favorably with PLOC's accuracy of 78.2%. We found numerous misclassifications that had a correct second-highest prediction (see Additional data file 1 [Supplementary Table 3] for example predictions). In fact, out of 12 possible classifications, ngLOC predicted the correct localization to be within the top two most probable classes 88.7% of the time. It is interesting to note that even in this test we discovered some sequences that were misclassified according to PLOC annotations, but for which the prediction by ngLOC was consistent with the latest release of Swiss-Prot (Swiss-Prot:P40541 and Swiss-Prot:P33287). We also discovered instances where the sequence is multi-localized, and ngLOC predicted the location that was not annotated in the PLOC dataset (for instance, Swiss-Prot:P40630 and Swiss-Prot:P42859). Nevertheless, we believe that these annotations were correct at the time the PLOC dataset was constructed. These results underscore the robustness of our method and the usefulness of its CS, because we were able to identify outdated annotations in the PLOC dataset, identify potential multi-localized proteins in data not annotated accordingly, and consider alternate localizations besides the predicted class when the CS is low, as suggested by the high accuracy when considering the top two classifications.
Evaluating ngLOC-X for proteome-wide predictions
We extended the core ngLOC method to allow classification of proteins from a single species. We call this method ngLOC-X, which is based on the model depicted in equation 9 (see Materials and methods, below). Assessing the performance of ngLOC-X proved challenging, because only a small percentage of each proteome has subcellular localizations annotated by experimental means, and therefore it is impossible to infer an exact accuracy measurement on proteome-wide predictions. However, subsets of these proteomes are represented in the ngLOC dataset, and so performance analysis can be inferred from these subsets. We chose two species for performing extensive analysis: mouse (3,596 represented sequences out of 23,744) and fruitfly (753 represented sequences out of 9,997). (Human had the largest set, with 5,945 represented sequences; we did not test this subset because of the amount of data that would need to be removed from the core ngLOC dataset.) For each species, we extracted the represented protein sequences from the ngLOC dataset and trained ngLOC on the remaining data. After training, we ran a 10-fold cross-validation on the extracted data, comparing the performance results between the standard ngLOC model against ngLOC-X. For this test, we examined the predictions of only single-localized sequences, resulting in 3,214 sequences from mouse and 683 sequences from fruitfly for analysis.
The standard ngLOC model achieved overall accuracies of 93.5% and 79.5% for mouse and fruitfly, respectively. For ngLOC-X, the overall accuracy stayed the same for mouse, and increased to 80.5% for fruitfly. The average sensitivity (often reported as normalized overall accuracy) improved as well, increasing from 86.9% to 87.5% in mouse, and from 72.6% to 74.0% in fruitfly. Although the gains in overall accuracy and sensitivity are not significant, we noted a significant increase in the number of sequences predicted with high confidence. For mouse, ngLOC predicted 39.1% of the data with a CS above 90 at 99.8% accuracy, whereas ngLOC-X predicted 52.9% of the data in the same range at the same accuracy. Fruitfly exhibited the same effect, with ngLOC predicting 28.1% of the data with a CS above 70 at 99.0% accuracy, whereas ngLOC-X predicted 38.7% of the data in the same range at 99.2% accuracy. We attribute this to adjusting the n-gram probabilities to reflect the proteome being predicted. Nevertheless, this test showed us that incorporating the proteome for species X in the model, as required for ngLOC-X, did not have a negative effect on the performance compared with the standard ngLOC model, while improving the coverage of the proteome predicted with high confidence.
We sought to determine how the predictions would be affected when ngLOC-X was trained on the proteome of one species, and tested on a different species. When testing the mouse sequences on ngLOC-X trained for fruitfly, the overall accuracy and normalized accuracy again stayed the same. However, when testing fruitfly on ngLOC-X trained for mouse, the overall accuracy dropped from 80.5% to 79.2%, which was slightly worse than the standard ngLOC model. These tests showed us that a species with high representation in the training data will not result in any improvement in overall accuracy by tuning the model for a specific proteome, but that a species with low representation will yield the greatest benefit when the model parameters are tuned specifically for that species.
Our next test was to examine the instances in these proteome subsets in which ngLOC and ngLOC-X generated different predictions. For the mouse data, we found 62 sequences out of the 3,214 single-localized sequences predicted that resulted in different predictions between the two methods. The standard ngLOC method had 15 of these sequences predicted correctly, whereas ngLOC-X had 16. For the fruitfly predictions, there were 38 sequences out of the 683 sequences with different predictions. Of these, ngLOC had 10 instances that were predicted correctly, whereas ngLOC-X had 17 correct predictions.
Although most of the improvements demonstrated by ngLOC-X are statistically insignificant, fruitfly exhibited a relatively greater improvement from the ngLOC-X method than did mouse. We also discovered in both cases that almost all sequences with different predictions between the two methods were instances predicted with a low CS (for example, a CS value <40). These results may be explained by recognizing that low-confidence predictions are more likely for sequences from a species that does not have a high representation of an evolutionarily close species in the training data. The ngLOC dataset has a higher number of proteins from species closely related to mouse (the mammalian proteins) than to fruitfly. This is confirmed by the overall accuracies reported from ngLOC for mouse and fruitfly, which were 93.5% and 79.5%, respectively; it is also confirmed by the fact that 90.8% of the mouse data were predicted with a CS of 40 or greater, whereas fruitfly had only 66.6% of the data predicted in the same CS range. Moreover, we believe that ngLOC-X will have the most benefit on the predictions from a species with low representation in the training data. This is confirmed by the following observations. First, there was a noticeable increase in the overall and normalized accuracy between ngLOC and ngLOC-X on fruitfly, whereas mouse did not benefit from ngLOC-X. Second, our cross-species test showed that testing mouse predictions on ngLOC-X trained for fruitfly did not affect the accuracy, whereas fruitfly showed slightly worse performance than the standard ngLOC method when tested on ngLOC-X trained for mouse. Based on these findings, it is evident that ngLOC-X will show improvement in the accuracy of low-confidence predictions over ngLOC. If the sequences from a species being predicted have a high representation of evolutionarily closer species in the training data (such as mouse), then ngLOC-X has little value in final predictive accuracy. In either case, ngLOC-X never resulted in a decrease in performance compared with ngLOC, and resulted in a significant increase in high-confidence predictions; hence, it is the method of choice for proteome-wide prediction of subcellular localizations.
Table: Comparison of location-wise prediction percentages for mouse (M. musculus) and fruitfly (D. melanogaster).
We can only offer educated speculation regarding the results, because accurate annotation is not available. However, the proteome-wide predictions obtained by ngLOC-X are closer to what we expect than those obtained by ngLOC. For example, in our previous work, in which we used a completely different method, we estimated that 6.3% of the proteome of the fruitfly and 4.6% of the proteome of the mouse is localized in the mitochondria. Our 5.4% and 4.8% predicted with ngLOC-X for fruitfly and mouse, respectively, compared favorably with our former results, and showed significant improvement for mitochondrial estimates over ngLOC in both cases. All of our comparative tests of ngLOC versus ngLOC-X showed that ngLOC-X was a valuable addition to the core ngLOC method.
Estimation of subcellular proteomes of eight eukaryotic species
We have used ngLOC-X to estimate the subcellular proteomes of eight different eukaryotic species. With the exception of yeast, proteomes of eukaryotic model organisms have a significant portion of hypothetical proteins (about 25% to 40%). To avoid predictions on hypothetical proteins, we generate predictions on a subset of the proteome containing at least one annotated GO concept, namely those proteins that have been experimentally validated or are closely related to proteins with experimental validation at some level. We then use these predictions to generate estimates of the subcellular proteome for each species.
Table: Estimation of the subcellular proteomes of eight eukaryotic organisms. Headings: Yeast (S. cerevisiae); Worm (C. elegans); Fruitfly (D. melano); Mosquito (A. gambiae); Zebrafish (D. rerio); Chicken (G. gallus); Mouse (M. musculus); Human (H. sapiens); % ngLOC coverage.
Overall, the fractions of subcellular proteomes scaled consistently across the different species, as shown in the last column of Table 7. We observed that the percentages of sequences localized to the endoplasmic reticulum (END), golgi apparatus (GOL), and peroxisome (POX) tend to remain relatively consistent across species, with average percentages of 3.0%, 1.44%, and 0.5%, respectively. In contrast, the fractions of the subcellular proteomes with relatively large percentages (cytoplasm [CYT], mitochondria [MIT], nuclear [NUC], plasma membrane [PLA], and extracellular [EXC]) varied widely across different species. This variation is expected, because as multicellular eukaryotes evolved with higher complexity, consolidation of specific cellular functions to defined organelles took place, resulting in the sequestering of corresponding proteins to these organelles. As a result, more variation is observed in the proteome sizes of larger organelles. Nevertheless, the fractions of subcellular proteomes reported for mouse and human are very similar, which is expected because of their close evolutionary distance. The size of the yeast mitochondrial proteome estimated in this study (9.55%) agrees with those previously reported (about 10%) by computational methods [9, 16], and closely matches the experimental estimates reported (13%). Similarly, about 1,500 nucleus-encoded mitochondrial proteins have been estimated in the human mitochondria [4, 26] and our estimate of 4.8% corresponds to 1,730 proteins (Table 7 and Additional data file 1 [Supplementary Table 4] contain numeric proteome estimates), suggesting that ngLOC-X estimates are on par with those obtained by other computational and experimental approaches.
Some of the organelles indicate a trend related to the evolutionary complexity of the species being predicted. The fractions of proteomes localized to the cytoskeleton (CSK) and golgi (GOL) appear to exhibit an increasing trend with the evolutionary complexity of the species, whereas mitochondria (MIT) and nucleus (NUC) indicate a slight decreasing trend. For the other organelles, such trends are not noticeable. Nevertheless, we should like to point out that the proteomes compared in this study are not evolutionarily equidistant, which makes it difficult to infer trends in the evolution of organellar proteomes.
A matrix showing estimated fractions of subcellular proteomes on the human proteome
We compared our estimates with those generated using the PSLT method. Our estimates of the human subcellular proteome largely agree with those reported by PSLT, with the exception of a difference in the number of multi-localized sequences (16.0% versus 9.7% reported by ngLOC-X), which is probably due to our conservative choice for MLCSthresh (≥60). (For comparison, an MLCSthresh ≥50 resulted in 13.4% of the predictions being multi-localized.) We also show a significant difference for those proteins targeted to the plasma membrane (17.1% versus 24.1% reported by ngLOC-X). This may be significant, because our predictions are based on 24,638 sequences from the human proteome, as compared with PSLT's predictions on 9,793 sequences. Moreover, proteins localized to the plasma membrane have large coverage in the ngLOC dataset. These reasons suggest that our estimates are plausible. Additional data file 1 (Supplementary Tables 6 to 21) provides the complete prediction matrices generated for all eight eukaryotic species.
Biological significance of discriminatory n-grams
It is well known that functional domain regions of proteins are highly conserved because they define a vital part of the overall functionality of the protein. From our previous studies, we observed that about 74% of the functional domains are localized exclusively to only one of the 10 subcellular locations. Hence, we wondered whether we could observe any relationship between discriminatory n-grams and their occurrence in the domain regions in a protein. To perform this test, we downloaded domain definitions from the InterPro database. Only domain definitions that were at least as long as the n-gram length used in the ngLOC model were considered. We mapped these domain definitions onto the single-localized proteins in the ngLOC dataset. This resulted in 15,109 protein sequences that had some portion of their sequence mapped to a functional domain. Overall, 75.5% of the n-grams in these sequences were mapped to a domain. (We say that an n-gram is mapped to a domain only if the entire n-gram falls within the bounds of a domain.) Different localization classes had different coverage of n-grams in domain regions, ranging from 53.7% (nuclear [NUC]) to 86.8% (lysosome [LYS]). Additional data file 1 (Supplementary Table 5) provides the complete results of this analysis.
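The mapping rule stated above (an n-gram counts as mapped only if it lies entirely within a domain, and only domains at least n residues long are considered) can be sketched as follows. The coordinate convention (0-based, half-open intervals) and the function names are assumptions made for this example; real InterPro match coordinates would need to be converted to this convention first.

```python
def is_mapped(start, n, domains):
    """True if the n-gram covering [start, start + n) lies inside one domain.

    domains: list of (domain_start, domain_end) half-open, 0-based intervals.
    """
    end = start + n
    return any(d_start <= start and end <= d_end for d_start, d_end in domains)


def domain_coverage(sequence_length, n, domains):
    """Fraction of overlapping n-grams in a sequence that map to a domain."""
    # Keep only domains at least as long as the n-gram, as described in the text.
    usable = [(s, e) for s, e in domains if e - s >= n]
    starts = range(max(sequence_length - n + 1, 0))
    if not starts:
        return 0.0
    mapped = sum(is_mapped(start, n, usable) for start in starts)
    return mapped / len(starts)
```

For a 20-residue sequence with one domain spanning residues 0 to 9, `domain_coverage(20, 7, [(0, 10)])` counts the four 7-grams starting at offsets 0 to 3 as mapped, out of 14 in total.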
In this study, an n-gram is said to be highly discriminatory if its occurrence in a protein sequence is highly correlated with a specific localization. We consider a very conservative, strict definition of a discriminatory n-gram as an n-gram that occurs at least five times over all sequences but in only one localization class in the ngLOC dataset. Based on this definition, we found that only 15.0% of all n-grams were highly discriminatory. However, 91.4% of all discriminatory n-grams occurred entirely within a domain region, suggesting that the discriminatory n-grams indeed originate from the domain regions. It should be noted that the number of discriminatory n-grams found in domain regions varies among different subcellular classes (ranging from 80.2% to 97.5%). Nevertheless, all of these occurrences are statistically significant compared with their expected values, as shown in Additional data file 1 (Supplementary Table 5).
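The strict definition of a discriminatory n-gram can be expressed as a simple filter over n-gram counts. The sketch below is illustrative only: it assumes overlapping n-grams and counts every occurrence (rather than once per sequence), neither of which is specified in the text, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def extract_ngrams(sequence, n):
    """All overlapping n-grams of a sequence (assumed windowing scheme)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def discriminatory_ngrams(labelled_sequences, n, min_count=5):
    """Map each discriminatory n-gram to the single class in which it occurs.

    labelled_sequences: iterable of (sequence, localization_class) pairs.
    An n-gram qualifies if it occurs at least min_count times overall
    and is observed in exactly one localization class.
    """
    total = Counter()
    classes_seen = defaultdict(set)
    for sequence, label in labelled_sequences:
        for gram in extract_ngrams(sequence, n):
            total[gram] += 1
            classes_seen[gram].add(label)
    return {
        gram: next(iter(classes_seen[gram]))
        for gram, count in total.items()
        if count >= min_count and len(classes_seen[gram]) == 1
    }
```

The `min_count` threshold of five follows the definition above; lowering it admits rarer n-grams at the cost of a less conservative definition.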
ngLOC method development
A multinomial naïve Bayes model is a simplistic yet effective model when used in conjunction with the n-gram model for representing proteins. The n-gram model is able to capture sequence homology while allowing for differences due to insertion, deletion, or mutation. This model effectively shrinks the protein sequence space, thereby allowing a higher degree of redundancy between proteins of different classes that could not be achieved by considering the entire protein sequence. It should be noted that the optimal value of n chosen is highly dependent on two factors: the number of sequences in the training data and the measure of sequence similarity in the training data. Generally, both large datasets and datasets with high sequence similarity will need longer n-grams for effective classification, although larger values of n will result in a model that overfits the training data. Additionally, this has an effect on the CS. If the dataset is large and highly similar, we found that short n-grams lead to probabilities that are all relatively close in value, which results in CSs that all fall within a very tight range. The reason for this is that the total number of n-grams in equation 4 is proportionally large with respect to the size of the dataset. For example, when using a 2-gram model on our dataset, the scores for the entire dataset all ranged between 8.41 and 11.72, but when we use a 7-gram model the range is 0.0 to 99.21. Although the scores were in a tight range for the 2-gram model, we observed the same relationship between relative score value and overall accuracy. It would be easy to re-scale the scores for performance analysis to fall within similar ranges across all models.
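To make the combination of a multinomial naïve Bayes model with n-gram features concrete, the following is a minimal generic sketch. It is not the ngLOC implementation: the add-alpha smoothing, the sequence-count class prior, the posterior normalization, and the class name `NgramNaiveBayes` are all assumptions, and the paper's own probability estimates (equation 4) and confidence score are not reproduced here.

```python
import math
from collections import Counter


def extract_ngrams(sequence, n):
    """All overlapping n-grams of a sequence (assumed windowing scheme)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


class NgramNaiveBayes:
    """Generic multinomial naive Bayes over n-gram counts (illustrative)."""

    def __init__(self, n, alpha=1.0):
        self.n = n
        self.alpha = alpha            # add-alpha smoothing (assumption)
        self.class_counts = Counter()  # number of training sequences per class
        self.ngram_counts = {}         # class -> Counter of n-gram occurrences
        self.vocabulary = set()

    def fit(self, labelled_sequences):
        for sequence, label in labelled_sequences:
            grams = extract_ngrams(sequence, self.n)
            self.class_counts[label] += 1
            self.ngram_counts.setdefault(label, Counter()).update(grams)
            self.vocabulary.update(grams)
        return self

    def log_scores(self, sequence):
        """Unnormalized log posterior of each class for one sequence."""
        total_sequences = sum(self.class_counts.values())
        vocabulary_size = len(self.vocabulary)
        grams = extract_ngrams(sequence, self.n)
        scores = {}
        for label, counts in self.ngram_counts.items():
            denominator = sum(counts.values()) + self.alpha * vocabulary_size
            score = math.log(self.class_counts[label] / total_sequences)
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denominator)
            scores[label] = score
        return scores

    def posteriors(self, sequence):
        """Class probabilities, normalized in log space for stability."""
        scores = self.log_scores(sequence)
        peak = max(scores.values())
        weights = {c: math.exp(s - peak) for c, s in scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}

    def predict(self, sequence):
        scores = self.log_scores(sequence)
        return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when a sequence contributes hundreds of n-gram probabilities, which is one practical reason this style of model scales to proteome-wide prediction.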
From the protein feature space point of view, a different sized n-gram will map protein surface features differently. We believe that the high performance exhibited by 6- to 8-gram models (Figure 1) is due to the fact that these n-gram peptides are ideal for mapping the secondary structure space of proteins. Secondary structure elements are vital for attaining a proper fold of a protein, and consequently are vital for its function. Hence, these secondary structures are distinctly conserved across proteins with different functions and from different subcellular locations.
Comparison with other methods
Many recent methods, including PLOC, were based on SVMs [8, 28, 29]. As successful as some of these models have been, we determined that SVMs were not suitable for our needs. First, we plan to explore the most discriminatory n-grams in proteins between different subcellular organelles. With ngLOC, it will be easy to extract n-grams of interest from the model, because the relation between each n-gram and the integer identifier generated for use by the classifier is one-to-one. With SVM-based methods, however, the kernel projects the features of the data to a higher dimensional space to increase the likelihood of making the data linearly separable. Although one might discover excellent SVM parameters for a particular classification problem, it will be difficult to understand how the translated feature space is discriminating between classes. Second, as we have illustrated, a probabilistic measure ought to be considered a crucial part of any predictive model. Therefore, we determined that a purely probabilistic model was desired. Deriving this measure is difficult for SVMs because of their nonprobabilistic output. Any attempt to derive such a measure with SVMs can be made only by creating another layer of classification to simulate a probability measure from the output of the SVMs. The risk of the simulated probability distribution overfitting the data used to generate the distribution is a known artifact of these methods [30]. A confidence measure derived directly from the probabilities calculated by a method such as ngLOC will be more consistent and scalable.
Estimation of subcellular proteomes
Our model, ngLOC, was enhanced to allow dynamic adjustment of the model parameters specific to the proteome being estimated. This model, termed ngLOC-X, is useful for predicting the subcellular localization of the proteome of any species. Our proteome prediction results showed that although a single model can be used on a variety of species, better results are obtained if the model is tuned for the specific species being considered (Table 6). If a single model is used across numerous species, it is very important to include a broad spectrum of data across all species. Unfortunately, this is not possible because of the imbalanced nature of protein sequence data in the public domain. ngLOC-X essentially extends the core ngLOC model by introducing a bias toward the single species being predicted. The accuracy and coverage of our model across species should continue to improve as the proteomes of new eukaryotic species become available. The eukaryotic species selected in this study represent a broad spectrum in the eukaryotic superkingdom (not including plants). Despite this, corresponding fractions of each subcellular proteome fall within a reasonable range across species (Table 7). This suggests that the proteome size corresponding to the core functionality of an organelle remains unchanged across species, whereas the observed variation in size allows for functionalities required by specific species for their adaptation. This hypothesis can be tested by studying the organellar proteomes at the domain level, and we aim to continue this work in the future.
Discriminatory n-grams and functional domains
It is known that targeting signals such as KDEL/HDEL (for endoplasmic reticulum) and SKL (for peroxisomes) play a distinct role in transporting a protein to its destination in the cell. Nevertheless, ngLOC does not require prior knowledge regarding such signal peptides, and neither does it explicitly consider such information in the prediction process. Despite this, ngLOC is able to perform better than methods that explicitly use such information because each discriminatory n-gram is analogous to such signals. To support this argument, we demonstrated that 91.4% of all discriminatory n-grams originate from the domain regions of proteins (Additional data file 1 [Supplementary Table 5]), which define the core function of a protein. These observations suggest that ngLOC predictions are based on functionally significant regions (domains) of the protein sequences, which are represented by n-grams covering the entire sequence space. In contrast, methods that rely on target signals generally scan only the amino-terminal or carboxyl-terminal regions of protein sequences, where such signals are located. It is likely that if the targeting signals are shorter than the n-gram, then the discriminatory n-grams represent both the signal and its neighborhood (which is often very important for transport). Similarly, if protein transport requires motifs that are longer than the n-gram, such motifs would be represented by multiple and mostly contiguous n-grams. Therefore, ngLOC need not have prior knowledge of specific targeting signals, because it is likely that analogous signals (discriminatory n-grams) are inherently identified de novo and used in establishing the localization prediction of the protein. Because of this plasticity, the ngLOC method has the ability to perform well on a number of locations, and hence it is highly suitable for proteome-wide prediction of subcellular localization.
In this new age of proteomics there is great need for computational methods that can classify newly discovered proteins using information derived only from the primary sequence. Methods that predict the subcellular localization of a protein are an important part of meeting this need. In our study we have developed the ngLOC method, a Bayesian classifier that can predict the subcellular localization of a protein with superior performance against other methods of similar scope.
Because ngLOC is a probabilistic method, we were able to generate an extremely useful probabilistic confidence score (CS) that places a measure of credibility on each prediction. We have shown how this measure was used to determine the most likely localization for new proteins and to flag possible annotation errors on known proteins. From this score, we were also able to develop a confidence measure to aid in identifying multi-localized proteins, which is an important need in this area because a significant part of the proteome is known to localize to multiple compartments. The scores developed in this study are sound and useful for predicting sequences localized to either single or multiple locations with high accuracy.
We extended the core ngLOC method to create ngLOC-X, and showed how it improved coverage for proteome-wide predictions over a single species by performing a comparative analysis of the results from both methods. We applied ngLOC-X to estimate ten distinct subcellular proteomes for eight eukaryotic model organisms. To our knowledge, this study presents the first estimate of ten distinct subcellular proteomes across eight eukaryotic species at this level of coverage.
As with most computational models, the accuracy of ngLOC is completely dependent on the quality and coverage of the dataset used to train the model. Although many methods are unable to use large datasets because of computational limitations, ngLOC does not have these limitations. Clearly, modern day proteomics will continue to produce increasing amounts of experimentally determined data. The simplicity of the ngLOC model will enable it to easily incorporate these new data as they become available, thereby increasing the accuracy and coverage of ngLOC in the future. ngLOC can play a significant role in this field, when used in conjunction with experimental methods, to help meet the needs of the research community.
Materials and methods
[Table: Distribution of proteins over subcellular localizations. Table body not recovered; surviving labels: "Number of sequences", "2 localizations annotated".]
We downloaded the proteomes of eight eukaryotic model organisms from the Integr8 database [32]: Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fruitfly), Anopheles gambiae (mosquito), Danio rerio (zebrafish), Gallus gallus (chicken), Mus musculus (mouse), and Homo sapiens (human). Because 25% to 40% of these proteomes are hypothetical proteins (with the exception of yeast), we separated out the curated proteome subsets containing annotation for at least one of the three GO concepts, excluding those with the GO evidence codes ND (no biologic data available), RCA (reviewed computational analysis), and NAS (nontraceable author statement).
We report standard performance measures over each subcellular location class, denoted $c_j$, including the following: sensitivity (recall), which is the fraction of data in class $c_j$ that were correctly predicted; precision, which is the fraction of data predicted to be in class $c_j$ that were actually correct; specificity, which is the fraction of data not in class $c_j$ that were correctly predicted; false positive rate, which is the fraction of data not in class $c_j$ that were incorrectly predicted to be in class $c_j$; and MCC. The latter provides a measure of performance for a single class being predicted, where it equals 1 for perfect predictions on that class, 0 for random assignments, and less than 0 if predictions are worse than random [23].
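As an illustration of the per-class measures just listed, the following sketch computes them from one-versus-rest confusion counts for a single class. It uses the standard textbook definitions; the formulas actually used in the study are in Additional data file 1 and are not reproduced here, and the function name and example counts are hypothetical.

```python
import math


def class_metrics(tp, fp, tn, fn):
    """Per-class measures from one-vs-rest confusion counts.

    tp: in the class and predicted in it     fn: in the class, predicted elsewhere
    fp: not in the class, predicted in it    tn: not in the class, predicted elsewhere
    """
    def ratio(num, den):
        # Return 0.0 when a measure is undefined (empty denominator).
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
        "false_positive_rate": ratio(fp, fp + tn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts for one class out of 1,000 test sequences.
example = class_metrics(tp=80, fp=10, tn=890, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Note that specificity and false positive rate always sum to 1, so reporting both is a matter of convenience rather than extra information.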
We also report overall accuracy, defined as the fraction of data that were classified correctly, as a comparative measure of the overall performance of the classifier. Finally, we show a ROC curve as a graphical means of measuring the performance for each class. (All of our formulas used to measure performance are detailed in Additional data file 1.) All performance measurements are based on a standard 10-fold cross validation unless otherwise stated.
The n-gram model for protein representation
Letting S denote the feature space used to represent all proteins, we develop S and our predictive model in light of the significant work that has been accomplished in the field of document classification. Cheng and coworkers [33] showed that using document classification techniques on the primary sequence can achieve good results in classifying protein families. In a typical document classification model, S is constructed by considering all possible words that may appear throughout the entire set of documents. Here, we consider subsequences of a protein of fixed length n as the equivalent of words in a document. In the literature, these protein subsequences have commonly been called n-grams, n-mers, n-peptides, or simply words or subsequences of length n [34, 35]. We adopt the term n-gram. In protein classification tasks using the n-gram model, S is constructed by considering all possible n-grams.
Formally, we let $\Sigma$ represent the set of all possible amino acids, with $|\Sigma| = 20$. Given a dataset of protein sequences $D$, let $d_i$ be a sequence in $D$ that is $k$ residues in length, where $d_i = (s_1 s_2 \ldots s_k)$ and each residue is an element of $\Sigma$. In an n-gram model, the size of the feature space grows exponentially with $n$, because $|S| = |\Sigma|^n$. To illustrate, an integer variable typically requires four bytes of memory. If such a variable were used to keep track of the frequency of each of the possible 5-grams, the model would require 4 bytes × $20^5$ features = 12.8 MB of memory; a 6-gram model would require 256 MB, and a 7-gram model would require 5.1 GB. Fortunately, for large values of $n$, relatively few n-grams actually occur in nature because of the evolutionary selection process, which requires a delicate mixture of various amino acid combinations in a peptide to sustain a fold. A simple analysis of the entire National Center for Biotechnology Information nonredundant dataset (2.7 million protein sequences at the time of this analysis) showed that even for an n-gram length as small as $n = 5$ there were n-grams that never occurred. Therefore, to allow n-gram models for $n > 5$, we take advantage of the sparse nature of higher order n-grams by developing a one-to-one mapping between unique n-grams and the set of integers to be used as indices, as needed. This requires memory allocation only for n-grams that occur in the training data, thereby allowing exploration of large values of $n$.
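The memory arithmetic and the sparse n-gram-to-integer mapping described above can be sketched as follows. This is a minimal illustration, not the ngLOC implementation: the class name `NgramIndex`, its methods, and the toy sequence are assumptions made for this example, and a plain dictionary stands in for whatever data structure the authors used.

```python
def dense_table_bytes(n, alphabet_size=20, bytes_per_count=4):
    """Memory needed to hold one integer count per possible n-gram."""
    return bytes_per_count * alphabet_size ** n


class NgramIndex:
    """One-to-one map between observed n-grams and integer indices.

    Indices are assigned on first sight, so memory grows with the number
    of distinct n-grams in the data, not with 20**n.
    """

    def __init__(self, n):
        self.n = n
        self._to_id = {}     # n-gram -> integer index
        self._to_gram = []   # integer index -> n-gram (inverse map)

    def id_for(self, gram):
        if gram not in self._to_id:
            self._to_id[gram] = len(self._to_gram)
            self._to_gram.append(gram)
        return self._to_id[gram]

    def gram_for(self, index):
        return self._to_gram[index]

    def encode(self, sequence):
        """Replace a sequence by the indices of its overlapping n-grams."""
        last_start = len(sequence) - self.n + 1
        return [self.id_for(sequence[i:i + self.n]) for i in range(last_start)]

    def __len__(self):
        return len(self._to_gram)


for n in (5, 6, 7):
    print(n, dense_table_bytes(n) / 1e6, "MB")

index = NgramIndex(3)
ids = index.encode("MKDELKDEL")   # toy sequence, for illustration only
print(ids, len(index))
```

Because the inverse list is kept alongside the dictionary, any index of interest can be turned back into its n-gram, which is the property that makes extracting informative n-grams from a trained model straightforward.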
Naïve Bayes classifiers and subcellular localization
Bayesian predictive models have been used effectively in a variety of classification problems, including both document and protein classification tasks [33, 36–38]. We give a brief derivation of the model in the context of protein subcellular localization prediction, using notation similar to that of McCallum and Nigam [36].
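For orientation, a generic multinomial naïve Bayes classifier over n-grams can be sketched as below. It follows the textbook multinomial event model with add-one (Laplace) smoothing and should be read as an assumption-laden illustration, not as the ngLOC equations: the class name, the smoothing over the observed n-gram vocabulary, and the toy training sequences are all choices made for this example.

```python
import math
from collections import Counter, defaultdict


def ngrams(sequence, n):
    """Return every overlapping n-gram in a sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


class NgramNaiveBayes:
    """Textbook multinomial naive Bayes with n-grams as the 'words'."""

    def __init__(self, n, smoothing=1.0):
        self.n = n
        self.smoothing = smoothing   # Laplace pseudo-count (an assumption)

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)       # sequences per class
        self.gram_counts = defaultdict(Counter)   # class -> n-gram counts
        self.vocab = set()
        for sequence, label in zip(sequences, labels):
            grams = ngrams(sequence, self.n)
            self.gram_counts[label].update(grams)
            self.vocab.update(grams)
        self.totals = {c: sum(self.gram_counts[c].values())
                       for c in self.class_counts}
        return self

    def log_scores(self, sequence):
        """log P(c) + sum over n-grams of log P(w | c), for every class c."""
        n_sequences = sum(self.class_counts.values())
        scores = {}
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / n_sequences)
            denominator = self.totals[c] + self.smoothing * len(self.vocab)
            for gram in ngrams(sequence, self.n):
                count = self.gram_counts[c][gram]   # 0 if never seen in c
                score += math.log((count + self.smoothing) / denominator)
            scores[c] = score
        return scores

    def predict(self, sequence):
        scores = self.log_scores(sequence)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
train = ["MKDELKDEL", "AAKDELKD", "MSKLSKLSKL", "GGSKLSKL"]
labels = ["ER", "ER", "PEROX", "PEROX"]
model = NgramNaiveBayes(n=3).fit(train, labels)
print(model.predict("TTKDELKD"), model.predict("QQSKLSK"))
```

Working in log space avoids numerical underflow, which matters because a real protein contributes hundreds of n-gram probabilities to each product.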
Probabilistic confidence score
An advantage of using a probabilistic model, such as a naïve Bayes classifier or a hidden Markov model, over nonprobabilistic approaches is that the probability of the model generating a given sequence is inherently reported for every class. Although the prediction is based on the model with the highest probability, the probability can also be used as a comparative measure against other classes. If the predicted class had a low probability, it might suggest that the second or third highest predictions should also be considered. If the top two classes predicted were relatively high compared to the rest of the classes, it might suggest the possibility that the sequence is localized to both locations. We develop a probabilistic CS for sequence $d_i$ for each possible class $c_j$, denoted $\mathrm{CS}(d_i \mid c_j)$, and show how this is used to address each of the above issues. We also derive a multi-localized confidence score, denoted $\mathrm{MLCS}(d_i)$, that gives a probabilistic measure of sequence $d_i$ being multi-localized.
The range for each score will always be between 0 and 100, with the sum of the scores over all classes totaling 100. This CS can also be interpreted as an estimate of the conditional probability of class $c_j$, given sequence $d_i$ and the n-gram model used.
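One simple way to obtain per-class scores with the two properties just stated (each between 0 and 100, summing to 100) is to normalize the class posteriors, as sketched below. This is plain posterior normalization and is offered only as an illustration: the CS formula used by ngLOC is not reproduced in this text and may well differ, and the function name, class labels, and log-score values are hypothetical.

```python
import math


def confidence_scores(log_scores):
    """Turn per-class log scores into percentages that sum to 100.

    log_scores: dict mapping class -> log P(c) + log P(d | c).
    Subtracting the maximum before exponentiating avoids underflow,
    since log scores for whole proteins are large negative numbers.
    """
    peak = max(log_scores.values())
    weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
    total = sum(weights.values())
    return {c: 100.0 * w / total for c, w in weights.items()}


# Hypothetical log scores for one sequence over three classes.
scores = confidence_scores({"NUC": -1200.0, "CYT": -1201.0, "MIT": -1210.0})
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for label, value in ranked:
    print(f"{label}: {value:.2f}")
```

Ranking the classes by score makes the situations discussed above easy to inspect: a low top score suggests looking at the runners-up, and two comparably high scores hint at possible dual localization.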
Application of ngLOC to proteome-wide predictions
Most protein classification models, including ngLOC, are built using datasets of sequences from many species across the eukaryotic superkingdom. In fact, the ngLOC dataset exhibited in Table 9 contains proteins from 1,923 different species. This introduces another variable that can be observed. Although some methods indirectly observe the relationship between the species and the dependent variable by incorporating phylogenetic information in their model, it is usually not directly observed. However, it is known that in the case of subcellular localization the distribution between classes varies among species. For example, one study of mitochondrial proteins estimates that 9.9% of the yeast proteome is localized in the mitochondria, as compared with an estimated 4.8% of the human proteome. Another study estimated that as much as 13% of the yeast proteome is localized in the mitochondria. It is clear that the predictions for the proteome of a specific species can be improved if the prior probability $P(c_j)$ is known for the species being predicted. Unfortunately, this information is not available before classification.
We only need to consider the two terms in the denominator for reasons stated previously. We solve for $P(d_i \mid c_j, X)$, the probability of a protein sequence given subcellular localization $c_j$ and species $X$, by assuming a mixture model of two independent conditional probability distributions over the space of protein sequences. One distribution is over proteins with known subcellular localization but unknown species, and the other distribution is over proteins of a known species but unknown subcellular localization. This forms a mixture model of two distributions, formally stated as follows:
$$P(d_i \mid c_j, X) = \alpha_j \, P(d_i \mid c_j) + (1 - \alpha_j) \, P(d_i \mid X) \qquad (10)$$
The mixture model allows us to consider the distribution of n-grams over an entire proteome by adjusting the probabilities of each n-gram in the training data to represent more accurately the distributions of n-grams in the proteome being classified. The result is that the distribution of $P(w_t \mid c_j)$ over all n-grams will be more similar to that of the proteome, while retaining the relative probabilities of each class within individual n-grams learned from the labeled data.
This method can cause the probability estimates for $P(c_j \mid X)$ to become more uniform in proportion to the size of the dataset of the proteome ($D_X$) being considered. We accept this tendency, because it implicitly factors in a measure of uncertainty proportional to the size of the proteome being considered, meaning the larger the proteome, the more uncertainty there is in regard to the exact prior probabilities of each subcellular localization. In this case, the priors should not be based solely on the exact percentages of the ngLOC training data.
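Equation 10 is a two-component mixture, which can be written directly as a small function. The sketch below is illustrative only: the function name and the probability values are hypothetical, and how the mixing weight for each class is chosen is not specified in this text, so it is simply passed in as a parameter.

```python
def mixture_likelihood(p_seq_given_class, p_seq_given_species, alpha):
    """Equation 10: alpha * P(d | c) + (1 - alpha) * P(d | X).

    p_seq_given_class:   P(d_i | c_j), from localization-labeled data
    p_seq_given_species: P(d_i | X), from the unlabeled target proteome
    alpha:               mixing weight for class c_j, in [0, 1]
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_seq_given_class + (1.0 - alpha) * p_seq_given_species


# Hypothetical probabilities chosen only to show the behaviour.
p_class, p_species = 0.25, 0.75
for alpha in (1.0, 0.5, 0.0):
    print(alpha, mixture_likelihood(p_class, p_species, alpha))
```

With alpha equal to 1 the species-specific term vanishes and the estimate reduces to the labeled-data likelihood alone; smaller values pull the estimate toward the n-gram distribution of the proteome being classified.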
Additional data files
This work has been supported by the startup funds to CG from the State University of New York (SUNY) at Albany and the graduate student fellowship provided by the Gen*NY*sis Center for Excellence in Cancer Genomics, SUNY at Albany.
1. Yates JR, Gilchrist A, Howell KE, Bergeron JJ: Proteomics of organelles and large cellular structures. Nat Rev Mol Cell Biol. 2005, 6: 702-714. 10.1038/nrm1711.
2. Andersen JS, Mann M: Organellar proteomics: turning inventories into insights. EMBO reports. 2006, 7: 874-879. 10.1038/sj.embor.7400780.
3. Simpson JC, Pepperkok R: The subcellular localization of the mammalian proteome comes a fraction closer. Genome Biol. 2006, 7: 222. 10.1186/gb-2006-7-6-213.
4. Taylor SW, Fahy E, Ghosh SS: Global organellar proteomics. Trends Biotechnol. 2003, 21: 82-88. 10.1016/S0167-7799(02)00037-9.
5. Nakai K, Kanehisa M: A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics. 1992, 14: 897-911. 10.1016/S0888-7543(05)80111-9.
6. Chou KC, Elrod DW: Protein subcellular location prediction. Protein Eng. 1999, 12: 107-118. 10.1093/protein/12.2.107.
7. Reinhardt A, Hubbard T: Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res. 1998, 26: 2230-2236. 10.1093/nar/26.9.2230.
8. Park KJ, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003, 19: 1656-1663. 10.1093/bioinformatics/btg222.
9. Marcotte EM, Xenarios I, van der Bliek AM, Eisenberg D: Localizing proteins in the cell from their phylogenetic profiles. Proc Natl Acad Sci USA. 2000, 97: 12115-12120. 10.1073/pnas.220399497.
10. Nair R, Rost B: Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins. 2003, 53: 917-930. 10.1002/prot.10507.
11. Guda C, Subramaniam S: pTARGET: a new method for predicting protein subcellular localization in eukaryotes. Bioinformatics. 2005, 21: 3963-3969. 10.1093/bioinformatics/bti650.
12. Höglund A, Blum T, Brady S, Dönnes P, Miguel JS, Rocheford M, Kohlbacher O, Shatkay H: Significantly improved prediction of subcellular localization by integrating text and protein sequence data. Pac Symp Biocomput. 2006: 16-27.
13. Eskin E, Agichtein E: Combining text mining and sequence analysis to discover protein functional regions. Pac Symp Biocomput. 2004: 288-299.
14. Nair R, Rost B: Sequence conserved for subcellular localization. Protein Sci. 2002, 11: 2836-2847. 10.1110/ps.0207402.
15. Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 2000, 300: 1005-1016. 10.1006/jmbi.2000.3903.
16. Guda C, Fahy E, Subramaniam S: MITOPRED: a genome-scale method for prediction of nucleus-encoded mitochondrial proteins. Bioinformatics. 2004, 20: 1785-1794. 10.1093/bioinformatics/bth171.
17. Gardy JL, Spencer C, Wang K, Ester M, Tusnady GE, Simon I, Hua S, deFays K, Lambert C, Nakai K, Brinkman F: PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res. 2003, 31: 3613-3617. 10.1093/nar/gkg602.
18. Zhou GP, Doctor K: Subcellular location prediction of apoptosis proteins. Proteins. 2003, 50: 44-48. 10.1002/prot.10251.
19. Dönnes P, Höglund A: Predicting protein subcellular localization: past, present, and future. Genomics Proteomics Bioinformatics. 2004, 2: 209-215.
20. Hamady M, Cheung TH, Resing K, Cios KJ, Knight R: Key challenges in proteomics and proteoinformatics. Progress in proteins. IEEE Eng Med Biol Mag. 2005, 24: 34-40. 10.1109/MEMB.2005.1436456.
21. Scott MS, Thomas DY, Hallett MT: Predicting the subcellular localization via protein motif co-occurrence. Genome Res. 2004, 14: 1957-1966. 10.1101/gr.2650004.
22. Huang Y, Li Y: Prediction of protein subcellular locations using fuzzy k-NN method. Bioinformatics. 2004, 20: 21-28. 10.1093/bioinformatics/btg366.
23. Matthews BW: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 1975, 405: 442-451.
24. Nakai K, Horton P: PSORT: a program for detecting the sorting signals of proteins and predicting their subcellular localization. Trends Biochem Sci. 1999, 24: 34-35. 10.1016/S0968-0004(98)01336-X.
25. Kumar A, Agarwal S, Heyman JA, Matson S, Heidtman M, Piccirillo S, Umansky L, Drawid A, Jansen R, Lui Y, et al: Subcellular localization of the yeast proteome. Genes Dev. 2002, 16: 707-719. 10.1101/gad.970902.
26. Lopez MF, Kristal BS, Chernokalskaya E, Lazarev A, Shestopalov AI, Bogdanova A, Robinson M: High-throughput profiling of the mitochondrial proteome using affinity fractionation and automation. Electrophoresis. 2000, 21: 3427-3440. 10.1002/1522-2683(20001001)21:16<3427::AID-ELPS3427>3.0.CO;2-L.
27. The InterPro Database. [http://www.ebi.ac.uk/interpro]
28. Höglund A, Dönnes P, Blum T, Adolph HW, Kohlbacher O: MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics. 2006, 22: 1158-1165. 10.1093/bioinformatics/btl002.
29. Yu CS, Chen YC, Lu CH, Hwang JK: Prediction of protein subcellular localization. Proteins. 2006, 64: 643-651. 10.1002/prot.21018.
30. Drish J: Obtaining Calibrated Probability Estimates from Support Vector Machines: Technical Report. 2001, San Diego, CA: University of California at San Diego.
31. The UniProtKB/Swiss-Prot Protein Knowledgebase. [http://www.ebi.ac.uk/swissprot]
32. Integr8: Complete genome and proteome database. [http://www.ebi.ac.uk/integr8]
33. Cheng BY, Carbonell JG, Klein-Seetharaman J: Protein classification based on text document classification techniques. Proteins. 2005, 58: 955-970. 10.1002/prot.20373.
34. Ganapathiraju M, Weisser D, Rosenfeld R, Carbonell J, Reddy R, Klein-Seetharaman J: Comparative n-gram analysis of whole-genome protein sequences. Proceedings of the Human Language Technology Conference (HLT 2002): 24-27 March; San Diego. 2002.
35. Fofanov Y, Luo Y, Katili C, Wang J, Belosludtsev Y, Powdrill T, Belapurkar C, Fofanov V, Li TB, et al: How independent are the appearances of n-mers in different genomes?. Bioinformatics. 2004, 20: 2421-2428. 10.1093/bioinformatics/bth266.
36. McCallum A, Nigam K: A comparison of event models for naïve Bayes text classification. AAAI-98 Workshop on Learning for Text Classification: 26-30 July; Madison. 1998, AAAI Press.
37. Mitchell T: Machine Learning. 1997, Boston, MA: WCB/McGraw-Hill.
38. Duda R, Hart P, Stork D: Pattern Classification. 2001, New York, NY: John Wiley & Sons, Inc.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.