NetPath: a public resource of curated signal transduction pathways

NetPath, a novel community resource of curated human signaling pathways, is presented and its utility demonstrated using immune signaling data.


Background
Complex biological processes such as proliferation, migration and apoptosis are generally regulated through responses of cells to stimuli in their environment. Signal transduction pathways often involve binding of extracellular ligands to receptors, which trigger a sequence of biochemical reactions inside the cell. Generally, proteins are the effector molecules, which function as part of larger protein complexes in signaling cascades. Cellular signaling events are generally studied systematically through individual experiments that are widely scattered in the biomedical literature. Assembling these individual experiments and putting them in the context of a signaling pathway is difficult, time-consuming and cannot be automated.
The availability of detailed signal transduction pathways that can easily be understood by humans as well as be processed by computers is of great value to biologists trying to understand the working of cells, tissues and organ systems [1]. A systems-level understanding of any biological process requires, at the very least, a comprehensive map depicting the relationships among the various molecules involved [2]. For instance, these maps could be used to construct a complete network of protein-protein interactions and transcriptional events, which would help in identifying novel transcriptional and other regulatory networks [3]. These can be extended to predict how the interactions, if perturbed singly or in combination, could affect individual biological processes. Additionally, they could be used to identify possible unintended effects of a candidate therapeutic agent on any clusters in a pathway [4]. We have developed a resource called NetPath that allows biomedical scientists to visualize, process and manipulate data pertaining to signaling pathways in humans.

Results and discussion
Development of NetPath as a resource for signal transduction pathways
NetPath [5] is a resource for signaling pathways in humans. As an initial set, we have curated ten immune signaling pathways. These include T and B cell receptor signaling pathways in addition to several interleukin signaling pathways, as shown in Table 1. A query system facilitates searches based on protein/gene names or accession numbers to obtain the list of cellular signaling pathways involving the queried protein (Figure 1).

Signaling pathway annotation
To facilitate annotation of pathway data, we first developed a tool called 'PathBuilder' [6]. PathBuilder is a signal transduction pathway annotation tool that allows annotation of pathway information, storage of data, easy retrieval and export into community standardized data structures such as BioPAX (Biological Pathways Exchange) [7], PSI-MI (Proteomics Standards Initiative - Molecular Interactions) [8] and SBML (Systems Biology Markup Language) [9] formats. PathBuilder facilitates the entry of information pertaining to protein interactions, enzyme-regulated reactions, intracellular translocation events and genes that are transcriptionally regulated.
Protein-protein interactions are annotated either as a 'direct interaction', when two proteins interact directly with each other (binary), or as a 'complex interaction', when the proteins are present in a complex of proteins. Both types of protein interactions are comprehensively collected from the literature. We provide PubMed identifiers, experiment type and host organism in which the interaction has been detected.
Enzyme-regulated reactions such as post-translational modifications (for example, phosphorylation, proteolytic cleavage, ubiquitination, prenylation or sulfation) are annotated as catalysis interactions. For each catalysis or modification event, the upstream enzyme, downstream targets and the site of the modification for a protein are annotated, if available. Proteins that translocate from one compartment (for example, the cytoplasm) to another (for example, the nucleus) are represented as transport events. For all reactions, a brief comment describing the reaction is also provided.
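The annotation categories described above can be pictured as a small record structure. The following is a minimal sketch only: the class and field names are hypothetical and do not reflect PathBuilder's actual schema, and the example records are illustrative rather than taken from NetPath.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Evidence:
    """Literature support recorded for a reaction (hypothetical fields)."""
    pubmed_id: str
    experiment_type: str
    host_organism: str


@dataclass
class Reaction:
    """Base record: every reaction carries a brief comment and its evidence."""
    comment: str
    evidence: List[Evidence] = field(default_factory=list)


@dataclass
class PhysicalInteraction(Reaction):
    participants: List[str] = field(default_factory=list)
    kind: str = "direct"  # "direct" (binary) or "complex"


@dataclass
class Catalysis(Reaction):
    enzyme: str = ""
    target: str = ""
    modification: str = ""      # for example, "phosphorylation"
    site: Optional[str] = None  # stays None when the site is not available


@dataclass
class Transport(Reaction):
    molecule: str = ""
    source: str = ""            # for example, "cytoplasm"
    destination: str = ""       # for example, "nucleus"


def count_by_category(reactions: List[Reaction]) -> Dict[str, int]:
    """Count reactions per category, keyed by class name."""
    counts: Dict[str, int] = {}
    for reaction in reactions:
        name = type(reaction).__name__
        counts[name] = counts.get(name, 0) + 1
    return counts


# Illustrative records only; names and categories are assumptions.
example = [
    PhysicalInteraction(comment="Protein A binds protein B",
                        participants=["PROTEIN_A", "PROTEIN_B"], kind="direct"),
    Catalysis(comment="Kinase K phosphorylates protein B",
              enzyme="KINASE_K", target="PROTEIN_B",
              modification="phosphorylation"),
    Transport(comment="Protein C moves to the nucleus",
              molecule="PROTEIN_C", source="cytoplasm", destination="nucleus"),
]
print(count_by_category(example))
```

Keeping the modification site optional mirrors the statement that the site is annotated only if available.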

Display of pathway information
The homepage of any given pathway contains a brief description of the pathway, a summary of the reaction statistics and a list of the molecules involved in the pathway. Reactions in a pathway are provided under three distinct categories: physical interactions, enzyme catalysis and transport. The pathway data are also provided in PSI-MI, BioPAX and SBML formats, which can be visualized through external network visualization software such as Cytoscape [10].

Cataloging transcriptionally regulated genes
In addition to the above pathway annotations, information on genes that are transcriptionally regulated is provided in NetPath. This is important because addition of most extracellular growth factors or ligands leads to an alteration in the transcriptome of the cell. Often, some of the transcriptionally regulated genes are used as 'reporters' in biological experiments where the pathway is being studied. We have cataloged a number of genes that are up- or down-regulated by the particular ligand involved in each pathway. These up/down-regulated genes can be considered as 'signatures' for that particular pathway. We have incorporated both microarray and non-microarray (for example, Northern blot, quantitative RT-PCR, serial analysis of gene expression (SAGE), and so on) experiments for gene expression. In each case, the type of experiment (that is, microarray, non-microarray or both) used to obtain the data is indicated. Additionally, we have annotated the transcription factors that are responsible for transcriptional regulation of the downstream genes where such information is available. Given the large number of transcriptionally regulated genes for each pathway, we have also developed a query system that permits users to search such genes using gene symbols or accession numbers. This feature will be valuable for shortlisting genes that are common to several pathways or specific to any given pathway.
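Shortlisting genes that are common to several pathways or specific to one pathway amounts to set intersection and set difference over the pathway signatures. A minimal sketch follows; the pathway names, gene symbols and function names are placeholders, not NetPath data or its query interface.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy signatures: pathway name -> regulated gene symbols.
signatures: Dict[str, Set[str]] = {
    "PATHWAY_A": {"GENE1", "GENE2", "GENE3"},
    "PATHWAY_B": {"GENE2", "GENE3", "GENE4"},
    "PATHWAY_C": {"GENE3", "GENE5"},
}


def common_genes(sigs: Dict[str, Set[str]], pathways: Iterable[str]) -> Set[str]:
    """Genes present in the signature of every listed pathway."""
    sets = [sigs[p] for p in pathways]
    return set.intersection(*sets) if sets else set()


def specific_genes(sigs: Dict[str, Set[str]], pathway: str) -> Set[str]:
    """Genes found in one pathway's signature and in no other."""
    others: Set[str] = set()
    for name, genes in sigs.items():
        if name != pathway:
            others |= genes
    return sigs[pathway] - others


print(common_genes(signatures, ["PATHWAY_A", "PATHWAY_B"]))
print(specific_genes(signatures, "PATHWAY_A"))
```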

Pathway statistics
At present the 10 annotated immune signaling pathways comprise 703 proteins and 1,572 reactions. The reactions can be grouped into 740 molecular association events, 727 enzyme catalysis events and 105 translocation events. Our pathways provide a list of 2,004 and 889 genes that are up- or down-regulated, respectively, at the level of mRNA expression. Including 10 cancer signaling pathways that are also available through Cancer Cell Map [11], NetPath now contains 1,682 proteins and 3,219 reactions, which can be grouped into 1,800 molecular association events, 1,218 enzyme catalysis events and 201 transport events. Table 1 shows the overall immune signaling pathway statistics as of 1 November 2009.
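The reaction totals quoted above are the sums of the three event categories, which can be checked directly. The snippet below uses only the figures from the text; the variable names are arbitrary.

```python
# Event counts as reported in the text.
immune = {"association": 740, "catalysis": 727, "translocation": 105}
combined = {"association": 1800, "catalysis": 1218, "transport": 201}

immune_total = sum(immune.values())
combined_total = sum(combined.values())

# Both totals match the reported reaction counts.
print(immune_total)    # 1572
print(combined_total)  # 3219
```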

Comparison with other signaling databases
Although over 310 resources [12] provide some form of pathway-related information, many of these currently available resources are databases for protein-protein interactions, metabolic pathways, transcription factors/gene regulatory networks, and genetic interaction networks. Pathway resources include the Kyoto Encyclopedia of Genes and Genomes (KEGG) [13], BioCarta [14], Science's Signal Transduction Knowledge Environment (STKE) Connections Maps [15], Reactome [16], National Cancer Institute's Pathway Interaction Database (PID) [17], Pathway database from Cell Signaling Technology [18], Integrating Network Objects with Hierarchies (INOH) [19], Signaling Pathway Database (SPAD) [20], GOLD.db [21], PATIKA [22], pSTIING [23], TRMP [24], WikiPathways [25] and PANTHER [26]. However, many of these pathway resources are not primary; that is, they combine data from many other sources. Thus, we have compared NetPath with eight other signaling pathway resources that contain manually curated human pathway data derived from experiments. Among the resources compared, NetPath stands out for three unique features. The first is that it includes annotation of transcriptionally regulated genes. Such a catalog of transcriptionally regulated genes pertaining to a given pathway should be highly useful in exploring pathway-specific expression signatures. The second unique feature is that NetPath provides manually curated textual descriptions of each pathway reaction, which should make these pathways easier to understand and give biomedical scientists an overview of the pathway reactions in a central repository. The third unique feature of NetPath is that these data can be searched using SPARQL, the recommended query language for the semantic web.

Interleukin-2 pathway as a prototype
One of the best studied immune signaling pathways is the interleukin (IL)-2 signaling pathway [27]. IL-2 is a multifunctional cytokine with pleiotropic effects on several cells of the immune system [27,28]. IL-2 was originally discovered as a T cell growth factor [29], but it was also found to have actions related to B cell proliferation [30], and the proliferation and cytolytic activity of natural killer cells [31]. IL-2 also activates lymphokine activated killer cells [32]. In contrast to its proliferative effects, IL-2 also has potent activity in a process known as activation-induced cell death [33]. More recently, IL-2 was shown to promote tolerance through its effects on regulatory T cell development [34]. IL-2 clinically has anti-cancer effects [35] as well as utility in supporting T cell numbers in HIV/AIDS [36].
There are three classes of IL-2 receptors, binding IL-2 with low, intermediate, or high affinity [37]. The low affinity receptor (IL-2Rα alone) is not functional; signaling by IL-2 involves either the high affinity heterotrimeric receptor containing IL-2Rα, IL-2Rβ and the common cytokine receptor gamma chain (originally named IL-2Rγ and now generally denoted as γc) or the intermediate affinity heterodimeric receptor composed of IL-2Rβ and γc [37,38]. Mutations in the IL2RG gene result in X-linked severe combined immunodeficiency disease [39]. IL-2 stimulation induces the activation of the Janus family tyrosine kinases JAK1 and JAK3, which associate with IL-2Rβ and γc, respectively. These kinases in turn phosphorylate IL-2Rβ and induce tyrosine phosphorylation of STATs (signal transducers and activators of transcription) and various other downstream targets [40]. The downstream signaling pathway also involves mitogen-activated protein kinase and phosphoinositide 3-kinase signaling modules [41], leading to both mitogenic and anti-apoptotic signals [40-42].
The IL-2 signaling pathway currently comprises 68 proteins and 155 reactions, with 68 molecular association events, 76 enzymatic catalysis events and 11 translocation events. Importantly, 840 transcriptionally regulated events (that is, genes up- or down-regulated by IL-2) have been annotated from the published literature. In all, the reactions in the IL-2 pathway are supported by 1,289 links to research articles. Figure 2 shows the pathway page of the IL-2 pathway.

Integration of pathway information with other resources
The pathways developed by us have been integrated with the Human Protein Reference Database (HPRD) [43,44]. The integration of pathways in HPRD helps identify each component of the pathway in the context of its detailed proteomic annotations [45]. As part of our community participation with other databases/resources, we hope to establish connections with other pathway databases such as KEGG [13] and Reactome [16] in the future.
Figure 2. The IL-2 pathway page in NetPath. Hyperlinks to pathway-specific information, such as reactions, transcriptionally regulated genes, molecular associations, and catalysis events, are listed. There is also an option to download pathway information in various data exchange formats from this page.

Availability of pathway data
A digital representation of pathways is essential to be able to manipulate the large amount of available information [4]. The diversity among pathway databases is also reflected in differences in data models, data access methods and file formats. This leads to the incompatibility of data formats for the analysis of pathway data. To avoid this, data standards are adopted by many of the pathway databases [12,46]. Data standards reduce the total number of translation operations needed to exchange data between multiple sources. To facilitate easy information retrieval from a wide variety of pathway resources, a broad effort in the biological pathways community called BioPAX was initiated. Since many less-detailed data types in a pathway database are difficult to represent in a very detailed format, BioPAX ontology uses hierarchical entity classes to present multiple levels of data resolution. All pathways in NetPath are available for download in BioPAX level 2, version 1.0. The PSI-MI format was developed to exchange molecular interaction data between databases containing protein-protein interactions. PSI-MI data representation facilitates data comparison, exchange and verification [8]. The molecular interaction subset of NetPath pathways is also available in PSI-MI version 2.5. SBML was developed as a medium for representation and exchange of biochemical network models [9]. NetPath provides all pathway data in SBML version 2.1 format. All data are made available under the Creative Commons license version 2.5 [47], which stipulates that the pathways may be freely used if adequate credit is given to the authors. Support for these data standards and free license enables the integration of knowledge from multiple sources in a coherent and reliable manner.
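Because the downloads follow community standards, they can be read with general-purpose XML tools. The sketch below lists the species and reactions in a small SBML-style document using only the Python standard library. It assumes that 'SBML version 2.1' denotes SBML Level 2 Version 1, whose XML namespace is `http://www.sbml.org/sbml/level2`; the embedded document, its identifiers and the function name are hypothetical and are not an excerpt of a NetPath file.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for SBML Level 2 Version 1.
SBML_NS = {"sbml": "http://www.sbml.org/sbml/level2"}

# Hypothetical toy document, not taken from NetPath.
TOY_SBML = """<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="toy_pathway">
    <listOfCompartments>
      <compartment id="cell"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="s1" name="PROTEIN_A" compartment="cell"/>
      <species id="s2" name="PROTEIN_B" compartment="cell"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="r1" name="association">
        <listOfReactants>
          <speciesReference species="s1"/>
          <speciesReference species="s2"/>
        </listOfReactants>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def summarize_sbml(text):
    """Return (species names, reaction ids) found in an SBML string."""
    root = ET.fromstring(text)
    species = [s.get("name") for s in root.findall(".//sbml:species", SBML_NS)]
    reactions = [r.get("id") for r in root.findall(".//sbml:reaction", SBML_NS)]
    return species, reactions


species_names, reaction_ids = summarize_sbml(TOY_SBML)
print(species_names, reaction_ids)
```

A dedicated SBML or BioPAX library would be preferable for real analyses; the standard-library parser is used here only to keep the example self-contained.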

Enabling semantic web for NetPath
The semantic web envisions an internet where specific information can be obtained from the web automatically using computers. Because providing computers with the intuitiveness of humans is nearly impossible as of now, creation of meta-data (data about data) can help computers identify what is being sought less ambiguously. However, annotating more data does not automatically imply that the data can be made easily accessible by the user. For instance, although many resources permit direct querying of individual molecules in the respective databases, queries based on 'relationships' between different entries in the databases cannot be handled. One possible solution to enable searching by such 'concepts' is to incorporate semantic web features that explicitly describe the inter-relationship between entries in the databases.
The W3C has established SPARQL as the standard semantic query language. Pathway data in BioPAX uses the web ontology language (OWL) format, which is highly descriptive in nature and can be used to make pathways semantically 'queryable'. In this regard, we have implemented an application programming interface (API) for NetPath that accepts SPARQL over HTTP to query the BioPAX files describing NetPath pathways. The results are returned in SPARQL Query Results XML format. Although biologists cannot be expected to write SPARQL queries, the ability to send SPARQL queries over HTTP allows bioinformaticians to write client applications that can retrieve NetPath resources, taking advantage of the descriptive richness of SPARQL and BioPAX.
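A client application of the kind described might look like the following sketch. It builds an HTTP GET URL carrying a SPARQL query in the standard `query` parameter and parses a response in SPARQL Query Results XML format. The endpoint URL, the sample response and the function names are hypothetical; the query assumes BioPAX Level 2 class and property names (`bp:protein`, `bp:NAME`), and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint; the real NetPath API address is not given in the text.
ENDPOINT = "http://example.org/netpath/sparql"

# List proteins and their names, assuming BioPAX Level 2 vocabulary.
QUERY = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bp: <http://www.biopax.org/release/biopax-level2.owl#>
SELECT ?protein ?name
WHERE {
  ?protein rdf:type bp:protein .
  ?protein bp:NAME ?name .
}
"""

# Namespace of the W3C SPARQL Query Results XML format.
SRX_NS = {"sr": "http://www.w3.org/2005/sparql-results#"}


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of an HTTP GET request."""
    return endpoint + "?" + urlencode({"query": query})


def parse_results(xml_text):
    """Turn a SPARQL Query Results XML document into a list of dicts."""
    root = ET.fromstring(xml_text)
    rows = []
    for result in root.findall("sr:results/sr:result", SRX_NS):
        row = {}
        for binding in result.findall("sr:binding", SRX_NS):
            value = binding.find("sr:uri", SRX_NS)
            if value is None:
                value = binding.find("sr:literal", SRX_NS)
            row[binding.get("name")] = value.text if value is not None else None
        rows.append(row)
    return rows


# Hypothetical response, used in place of a live request.
SAMPLE_RESPONSE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="protein"/><variable name="name"/></head>
  <results>
    <result>
      <binding name="protein"><uri>http://example.org/netpath#protein1</uri></binding>
      <binding name="name"><literal>PROTEIN_A</literal></binding>
    </result>
  </results>
</sparql>
"""

url = build_request_url(ENDPOINT, QUERY)
rows = parse_results(SAMPLE_RESPONSE)
print(rows)
```

In practice the URL would be fetched with an HTTP client and the response body passed to `parse_results`.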

Analyzing impact factor for pathways
It is becoming clear that pathway information can be used in the context of genome-scale gene expression experiments. A novel approach has been recently reported to measure the biological impact of perturbation of pathways in genome-wide gene expression experiments [48]. This approach considers the topology of genes in a pathway in conjunction with classical statistics for microarray analysis. The impact factor is a statistical approach that can capture the magnitude of the expression changes of each gene, the position of the differentially expressed genes on the given pathways, the topology of the pathway that describes how these genes interact, and the type of signaling interactions between them. Our previous results using KEGG pathways were found to correlate with known biological events that were missed by other widely used classical analysis methods. However, this approach could not be applied to study immune responses because of the limited availability of data on such pathways in humans.
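The idea of combining each gene's expression change with its position in the pathway and the type of interaction can be illustrated with a toy propagation. This is not the published impact factor statistic [48]: the propagation rule, the function name, the pathway and all values below are assumptions chosen for illustration. Here each gene's score is its own expression change plus the signed scores of its upstream regulators, each divided by that regulator's number of downstream targets.

```python
# Hypothetical toy pathway: (upstream, downstream, sign),
# with sign +1 for activation and -1 for inhibition.
EDGES = [("A", "B", +1), ("A", "C", +1), ("B", "D", -1)]

# Hypothetical expression changes (for example, log fold changes).
DELTA_E = {"A": 2.0, "B": 0.0, "C": 1.0, "D": 0.5}


def perturbation_scores(edges, delta_e):
    """Propagate expression changes downstream through a pathway.

    Assumes the pathway graph is acyclic; a cycle would recurse forever.
    """
    upstream = {}
    n_downstream = {}
    for u, g, sign in edges:
        upstream.setdefault(g, []).append((u, sign))
        n_downstream[u] = n_downstream.get(u, 0) + 1

    scores = {}

    def score(gene):
        # Memoized: each gene is scored once, after its regulators.
        if gene not in scores:
            inherited = sum(
                sign * score(u) / n_downstream[u]
                for u, sign in upstream.get(gene, [])
            )
            scores[gene] = delta_e.get(gene, 0.0) + inherited
        return scores[gene]

    for gene in delta_e:
        score(gene)
    return scores


scores = perturbation_scores(EDGES, DELTA_E)
print(scores)
```

In this toy case gene B is unchanged itself but inherits half of A's change, and the inhibitory edge from B lowers the score of D, which shows how topology and interaction type both enter the result.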
As a proof of principle, we selected publicly available mRNA expression datasets from Gene Expression Omnibus (GEO), a repository for gene expression data [49]. Datasets that include expression analysis of immune cells under different experimental conditions were selected for this purpose.
One of the datasets used [GEO:GDS2214] (as described in [50]) was an experimental study of mRNA expression analysis of neutrophils isolated from blood of patients with sepsis-induced acute lung injury. The neutrophils were cultured with either lipopolysaccharide (LPS) or high mobility group box protein 1 (HMGB1), both of which are known to be mediators of the inflammatory response. Gene expression analysis was carried out using the Affymetrix GeneChip Human Genome U133 Array Set HG-U133A oligonucleotide gene chip. The authors found enhancement of nuclear translocation activity of NF-kappaB and phosphorylation of Akt and p38 mitogen-activated protein kinase upon stimulation with LPS or HMGB1. We carried out impact factor analysis using this dataset on all ten immune signaling pathways. The results corroborate these findings, since the IL-1 and IL-6 pathway scores are highly affected while the rest of the NetPath pathways did not show significant scores.
Another dataset selected [GEO:GDS1407] (described in [51]) was part of a gene expression study that screened a cohort of 102 healthy individuals to investigate the distribution of inflammatory responses to LPS in circulating leukocytes in the normal population. Expression profiling with Affymetrix U95AV2 human oligonucleotide microarrays identified differentially regulated genes between two phenotypic subgroups, described as high LPS responders (lps high) and low LPS responders (lps low), based on the concentration of cytokines produced in response to LPS. Impact factor analysis was carried out using this dataset on all ten immune signaling pathways. Impact factor scores for the IL-1 and IL-6 NetPath pathways have high values in the lps high group, whereas the scores for the lps low group do not show any significant perturbation of NetPath pathways. The scores are consistent with experimental results showing upregulation of IL-1 and IL-6 ligands in the lps high group. The impact factor gives the insight that not only are the ligands upregulated, but the pathway also seems to be highly affected. It should be noted that the impact factor is not the only method to measure the biological impact of perturbation of pathways; other methods will continue to be developed and could be applied to such pathway data.

Outlook
In addition to keeping these pathways updated on a regular basis, we will add further pathways to NetPath. We also hope to involve the biomedical community by allowing researchers to provide feedback as well as to volunteer to become 'pathway authorities' on specific pathways, similar to the successful contribution model of the BioCarta resource [14]. In this regard, we have already recruited several investigators to serve as pathway authorities in our initial effort. Multiple pathway authorities are possible for the same pathway if there are enough interested investigators with expertise who wish to contribute in this fashion. For instance, ten other signaling pathways pertaining to cancer signaling were developed for the Cancer Cell Map project [11], as a collaboration with Memorial Sloan-Kettering Cancer Center, and these data are also available through Pathway Commons [52]. We also intend to map our human-specific pathway data to corresponding mouse orthologs to create the mouse equivalent of our signaling pathways. Since large amounts of human signaling pathway data are modeled using the mouse, this will facilitate biological system modeling that relies on primary experimental data. We also intend to incorporate pathway visualization for all existing pathways in NetPath, as well as those that will be added in the future, using the PathVisio software [53]. PathVisio also supports visualization of gene expression data in the context of pathways, which will enable biologists to display a systems view of the signaling pathway.

Conclusions
We have developed a resource for integration of human cellular signaling events. These pathway-specific protein-protein interaction data can be used to generate larger physical networks of protein-protein interactions that, when coupled with data on genetic interactions, could help in defining novel functional relationships among proteins. In addition, genetic interactions can functionally link proteins that belong to unconnected physical networks. These pathways could also be used to interrogate gene expression signatures in cancers and other human diseases to better understand the mechanisms or to obtain profiles for diagnostic or therapeutic purposes. There is a large amount of known information about different cellular signaling pathways controlling a variety of cellular functions, which is difficult to collect by one group. We support the vision of many data providers collecting data of interest and making them freely available in standard formats as a scalable way to represent all known pathway information in databases for comprehensive analysis. Overall, we hope to engage the biomedical community in keeping the NetPath pathway resource up to date and as error-free as possible.

Materials and methods
The initial annotation process of any signaling pathway involves gathering and reading of review articles to achieve a brief overview of the pathway. This process is followed by listing all the molecules that are reported to be involved in the pathway under annotation. Information regarding potential pathway authorities is also gathered at this initial stage. Pathway experts are involved in initial screening of the molecules listed to check for any obvious omissions. In the second phase, annotators manually perform extensive literature searches using search keys, which include all the alternative names of the molecules involved, the name of the pathway, the names of reactions, and so on. In addition, the iHOP [54] resource is also used to perform advanced PubMed-based literature searches to collect the reactions that were reported to be implicated in a given pathway. The collected reactions are manually entered using the PathBuilder [6] annotation interface and are subjected to an internal review process involving PhD level scientists with expertise in the areas of molecular biology, immunology and biochemistry. However, there are instances where a molecule has been implicated in a pathway in a published report but the associated experimental evidence is either weak or differs from experiments carried out by other groups. For this purpose, we recruit several investigators as pathway authorities based on their expertise in individual signaling pathways. The review by pathway authorities occasionally leads to correction of errors or, more commonly, to inclusion of additional information. Finally, the pathway authorities help in assessing whether the work of all major laboratories has been incorporated for the given signaling pathway.