Constructing a fish metabolic network model

We report the construction of a genome-wide fish metabolic network model, MetaFishNet, and its application to analyzing high throughput gene expression data. This model is a stepping stone to broader applications of fish systems biology, for example by guiding study design through comparison with human metabolism and the integration of multiple data types. MetaFishNet resources, including a pathway enrichment analysis tool, are accessible at http://metafishnet.appspot.com.

Currency metabolites may or may not be included in a reaction description. Thus, we excluded currency metabolites from reaction comparisons and network modularity analysis.

SeaSpider, the sequence analysis tool
Sequence analysis plays several key roles in the MetaFishNet project. During construction, the genes from five fish genomes were annotated with Gene Ontology (GO) terms, and the metabolic genes were then identified by their GO categories. For fish genes without known human homologs, enzyme identification depends on sequence similarity to consensus enzyme sequences. In applications of MetaFishNet, sequence comparison is often the only way to identify the genes submitted by users. As illustrated in Figure 2, SeaSpider has two functions: ab initio annotation, which associates genes with their GO terms wherever possible, and mapping of users' genes onto the MetaFishNet model. Different databases are used for these two functions. For ab initio annotation, new sequences are searched against the zebrafish sequence database first, then against the generic GO sequence database. Sequences that have no matches in these local databases are then queried remotely against NCBI. This last step does not introduce GO information, but it makes SeaSpider a competent standalone application for annotating new gene sequences. For the mapping to MetaFishNet, new sequences are searched against all the metabolic genes used in MetaFishNet, then passed to the next step of pathway analysis.
Figure 2: SeaSpider is used for both ab initio annotation and the mapping to MetaFishNet.
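The tiered search order used for ab initio annotation can be illustrated with a minimal sketch. This is not SeaSpider's actual code: the function `annotate`, the tier names, and the toy dictionary lookups standing in for BLAST searches are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical searcher type: takes a sequence, returns a hit ID or None.
Searcher = Callable[[str], Optional[str]]


def annotate(seq: str, tiers: List[Tuple[str, Searcher]]) -> Tuple[Optional[str], Optional[str]]:
    """Try each (tier_name, searcher) in order and return the first hit.

    Returns (tier_name, hit_id), or (None, None) if no tier matches.
    """
    for name, search in tiers:
        hit = search(seq)
        if hit is not None:
            return name, hit
    return None, None


# Toy stand-ins for the local sequence databases (real searches would use BLAST).
zebrafish_db = {"ATGGCC": "zf_gene_1"}
generic_go_db = {"ATGTTT": "go_seq_7"}


def remote_ncbi(seq: str) -> Optional[str]:
    """Toy stand-in for a remote NCBI query; it adds no GO information."""
    return "ncbi_hit" if seq.startswith("ATG") else None


TIERS = [
    ("zebrafish", zebrafish_db.get),
    ("generic_go", generic_go_db.get),
    ("ncbi_remote", remote_ncbi),
]
```

Calling `annotate("ATGTTT", TIERS)` falls through the zebrafish tier and stops at the generic GO tier, so the remote query is only reached when both local lookups fail.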

Gene Ontology
The Gene Ontology as a whole is modeled as a directed acyclic graph. When a GO term is assigned to a gene, the gene is automatically associated with all upstream (ancestor) terms of that term. These terms can come from any of the three major categories: biological process, molecular function and cellular component. A single gene is commonly associated with dozens of GO terms. The relationships among these GO terms have to be tracked through the database provided by the GO Consortium. Since intensive database queries are involved and the size of the complete GO database is manageable (about 400 MB), we keep and use a local copy of the GO database.
Zebrafish has good GO annotations, which come mostly from the ZFIN (ZebraFish Information Network [100,101]) project. The gene sequences from the genomes of medaka, Takifugu, Tetraodon and stickleback were annotated by SeaSpider. A gene is considered "metabolic" when it is associated with the GO term "metabolic process"; a subsequent step then finds its appropriate Enzyme Commission (EC) number.
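The propagation of GO terms to their upstream terms, and the resulting "metabolic" test, can be sketched as follows. The parent map below is a toy fragment invented for illustration (it is not taken from the GO database), and the function names `ancestors` and `is_metabolic` are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Toy child -> parents map (assumption: a simplified, illustrative DAG fragment).
PARENTS: Dict[str, List[str]] = {
    "glycolytic process": ["carbohydrate catabolic process"],
    "carbohydrate catabolic process": ["carbohydrate metabolic process"],
    "carbohydrate metabolic process": ["metabolic process"],
    "metabolic process": ["biological_process"],
    "cell adhesion": ["biological_process"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable upward from `term` (excluding `term` itself)."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_metabolic(direct_terms: Iterable[str], parents: Dict[str, List[str]],
                 root: str = "metabolic process") -> bool:
    """True if any directly assigned term is `root` or has `root` upstream."""
    return any(t == root or root in ancestors(t, parents) for t in direct_terms)
```

A gene annotated only with "glycolytic process" is classified as metabolic here because "metabolic process" is reachable upstream, whereas a gene annotated only with "cell adhesion" is not.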

The SeaSpider program
SeaSpider needs to record the status of its queries internally. This is achieved via Python's shelve module, a persistent store of serialized objects. The most memory-consuming part of SeaSpider is the parsing of BLAST results in large batches. For example, a batch query of 500 sequences may use over 500 MB. This is not a real concern on modern computers, and the batch size can be decreased to accommodate less powerful hardware.
SeaSpider is organized as a Python package. It can be run directly from a command-line shell, or imported into other Python applications. We have used SeaSpider to annotate sequences from Cyprinodon variegatus and Litopenaeus vannamei. The full version of SeaSpider requires support from several databases. A trimmed version that does not require database support, seaspider-lite, is provided to perform the mapping of user sequences to MetaFishNet genes.
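A minimal sketch of recording query status with shelve and processing sequences in adjustable batches is given below. It only illustrates the general pattern; the function `run_in_batches`, its parameters, and the placeholder `search` callable are assumptions, not SeaSpider's actual interface.

```python
import os
import shelve
import tempfile
from typing import Callable, List, Sequence


def run_in_batches(seq_ids: Sequence[str], status_path: str, batch_size: int = 500,
                   search: Callable[[str], str] = lambda seq_id: "done") -> List[str]:
    """Process sequence IDs in batches, persisting each result in a shelve file.

    IDs already present in the shelve file are skipped, so an interrupted run
    can be resumed. Returns the IDs processed during this call.
    """
    processed: List[str] = []
    with shelve.open(status_path) as status:
        pending = [s for s in seq_ids if s not in status]
        for start in range(0, len(pending), batch_size):
            batch = pending[start:start + batch_size]
            for seq_id in batch:
                status[seq_id] = search(seq_id)  # placeholder for a real search
                processed.append(seq_id)
    return processed


# Usage with a temporary status file and a small batch size.
status_file = os.path.join(tempfile.mkdtemp(), "query_status")
first_run = run_in_batches(["s1", "s2", "s3"], status_file, batch_size=2)
second_run = run_in_batches(["s1", "s2", "s3", "s4"], status_file, batch_size=2)
```

On the second call only the new ID is processed, because the first three are already recorded in the shelve file. Lowering `batch_size` bounds how many results are handled at once, which is the lever the text describes for less powerful hardware.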

Data integration and pathway reconsolidation
The key to integrating different data sources is a unified representation of reactions: once all reactions are in place, the new network can be recovered by connecting them. In practical terms, a unified representation means that all enzymes are coded as EC numbers and all compounds as KEGG compatible IDs (KEGG has one of the largest collections of compounds). The nomenclature of compounds is rarely consistent across the literature. The EHMN project reconciled compound names well with KEGG IDs.
For the compounds not found in KEGG, the EHMN project assigned new IDs consistent with KEGG style.
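To illustrate how a network can be recovered by connecting reactions once they share a unified representation, here is a minimal sketch. The record layout (`enzyme`, `substrates`, `products`), the function `connect_reactions`, and the toy reaction records are assumptions for illustration only; the identifiers are placeholders in EC and KEGG style.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# Toy reaction records (assumption: illustrative layout and identifiers).
TOY_REACTIONS: Dict[str, dict] = {
    "R1": {"enzyme": "2.7.1.1", "substrates": {"C00031"}, "products": {"C00092"}},
    "R2": {"enzyme": "5.3.1.9", "substrates": {"C00092"}, "products": {"C00085"}},
    "R3": {"enzyme": "1.1.1.1", "substrates": {"C00469"}, "products": {"C00084"}},
}


def connect_reactions(reactions: Dict[str, dict]) -> Set[Tuple[str, str]]:
    """Return directed edges (r1, r2) where a product of r1 is a substrate of r2."""
    consumers = defaultdict(set)  # compound ID -> reactions that consume it
    for rid, rec in reactions.items():
        for compound in rec["substrates"]:
            consumers[compound].add(rid)

    edges: Set[Tuple[str, str]] = set()
    for rid, rec in reactions.items():
        for compound in rec["products"]:
            for other in consumers.get(compound, ()):
                if other != rid:
                    edges.add((rid, other))
    return edges
```

Because the link is made purely through shared compound IDs, two sources that name the same compound differently would fail to connect, which is why the identifiers must be unified first.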

Integration of two high quality human models
Reactions from the two human models were extracted by a combination of parsing XML files (SBML) and [...]. The integration involved the following steps:
1. Unifying all identifiers to compatible formats, e.g., all enzymes to EC numbers and all compounds to KEGG compatible IDs.
2. Comparing pathways. Pathways were manually inspected to decide whether to merge or change them if they met any of the following criteria: a) sharing more than 4 enzymes; b) sharing more than 60% of enzymes; c) having the same theme.
3. Comparing reactions and removing duplicate reactions. Two reactions were considered identical when they had identical enzymes and identical compounds, excluding currency metabolites, because currency metabolites might or might not be included in the original descriptions.
4. Manual inspection of merged data. E.g., some pathways are functionally identical but differ significantly between source models. Such pathways require manual merging.
Table 2 shows the 49 pathways from the UCSD model (91 total) to be merged into the corresponding pathways in the EHMN model. Transport reactions from the UCSD model were excluded. The "Nucleotides" pathway in the UCSD model was dismantled because it is covered by the "Purine metabolism" and "Pyrimidine metabolism" pathways in the EHMN model. In the merged result, the pathway "CYP Metabolism" was merged into "Xenobiotics metabolism"; "Ascorbate and Aldarate Metabolism" and "Vitamin C metabolism" were merged into "Ascorbate (Vitamin C) and Aldarate Metabolism". The EHMN pathway "Urea cycle and metabolism of arginine, proline, glutamate, aspartate and asparagine" was too large, so the several overlapping smaller pathways in the UCSD model were adopted instead. The remaining pathways were not affected at this stage. In total, 2824 reactions from EHMN and 1859 reactions from UCSD were merged into 3953 reactions and 106 pathways.
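The automatable parts of steps 2 and 3 above can be sketched as follows. Several details are assumptions, not taken from the source: the example currency set (H2O, ATP and ADP by their KEGG IDs), treating a reaction's compounds as an unordered set, and computing the 60% criterion relative to the smaller pathway. The "same theme" criterion and the final merge decision are manual judgments and are not modeled.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Assumed example set of currency metabolites: H2O, ATP, ADP.
CURRENCY: FrozenSet[str] = frozenset({"C00001", "C00002", "C00008"})


def reaction_key(enzymes: Iterable[str], compounds: Iterable[str],
                 currency: FrozenSet[str] = CURRENCY) -> Tuple[FrozenSet[str], FrozenSet[str]]:
    """Identity key: the enzymes plus the compounds with currency metabolites removed."""
    return frozenset(enzymes), frozenset(compounds) - currency


def deduplicate(reactions: List[Dict]) -> List[Dict]:
    """Keep the first reaction seen for each identity key."""
    unique: Dict[Tuple[FrozenSet[str], FrozenSet[str]], Dict] = {}
    for rec in reactions:
        unique.setdefault(reaction_key(rec["enzymes"], rec["compounds"]), rec)
    return list(unique.values())


def flag_for_review(enzymes_a: Iterable[str], enzymes_b: Iterable[str],
                    min_shared: int = 4, min_fraction: float = 0.6) -> bool:
    """Flag a pathway pair for manual inspection (criteria a and b only)."""
    a, b = set(enzymes_a), set(enzymes_b)
    shared = a & b
    smaller = min(len(a), len(b))
    # Assumption: the shared fraction is measured against the smaller pathway.
    fraction = len(shared) / smaller if smaller else 0.0
    return len(shared) > min_shared or fraction > min_fraction
```

For example, two reaction records with the same enzyme whose compound lists differ only by ATP and ADP collapse to a single entry under `deduplicate`, which mirrors the reason given for excluding currency metabolites from the comparison.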

Merging KEGG zebrafish data
The merging of the KEGG zebrafish model with the human model followed the same procedure as above.
Reactions are marked as fish and/or human according to the presence of their enzymes in those species. Spontaneous reactions (those without an enzyme) may be necessary for mass flow in metabolic pathways and were kept in MetaFishNet.
The MetaFishNet core database defines the relationships among genes, enzymes, compounds, reactions and pathways (Figure 4). Primary gene IDs were adopted from Ensembl. The database also includes zebrafish gene IDs from GenBank and ZFIN, so that users can look up genes by these ID systems.
However, because fish genomics is still evolving, most gene identifications have to be established via sequence comparison with SeaSpider. Besides the relational databases, three sequence databases are used with SeaSpider and BLAST (Figure 2): zebrafish sequences, generic sequences associated with Gene Ontologies, and MetaFishNet sequences, which consist of all metabolic genes from the five fish species used in the construction.
We use Google App Engine (GAE) to build our project website [35]. GAE provides a free (within quota) and stable platform, which eliminates the logistic costs of maintaining the website. The web development framework of GAE (similar to the popular Django framework) is state of the art, enabling rapid development and deployment. We ported our database to Google's datastore to support this website (Figure 5). However, the choice of GAE also limits the functionality of the site: extensive use of CPU is disallowed and regular programs cannot be installed. This prevents the deployment of FishEye and SeaSpider on the project site, though both programs can be downloaded and run locally.
Figure 4: Database schema for MetaFishNet. The linkage between "compound" and "reaction" is not made directly through attribute matching; instead, a simple text parsing of reaction.description makes the connection. This trick saves storage space and improves database performance.
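The Figure 4 caption states that compounds are linked to reactions by parsing reaction.description rather than by a join table. A minimal sketch of that idea follows. The description format (a KEGG-style equation string) and the ID pattern (the letter C followed by five digits) are assumptions; the source does not specify either, and IDs in another style would need an extended pattern.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed ID pattern: KEGG-style compound IDs such as C00031.
COMPOUND_ID = re.compile(r"\bC\d{5}\b")


def compounds_in(description: str) -> List[str]:
    """Return the distinct compound IDs in a description, in order of appearance."""
    return list(dict.fromkeys(COMPOUND_ID.findall(description)))


def index_by_compound(descriptions: Dict[str, str]) -> Dict[str, List[str]]:
    """Build a compound ID -> reaction IDs lookup from reaction descriptions."""
    index: Dict[str, List[str]] = defaultdict(list)
    for reaction_id, description in descriptions.items():
        for compound in compounds_in(description):
            index[compound].append(reaction_id)
    return dict(index)


# Hypothetical reaction descriptions used only for this example.
DESCRIPTIONS = {
    "R1": "C00031 + C00002 <=> C00092 + C00008",
    "R2": "C00092 <=> C00085",
}
```

Here `index_by_compound(DESCRIPTIONS)["C00092"]` lists both reactions, so the compound-to-reaction relation is derived on demand from text already stored with each reaction instead of being stored separately.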