A systematic assessment of current genome-scale metabolic reconstruction tools

Mendoza, Sebastián N.; Olivier, Brett G.; Molenaar, Douwe; Teusink, Bas

doi:10.1186/s13059-019-1769-1

Research
Open access
Published: 07 August 2019

A systematic assessment of current genome-scale metabolic reconstruction tools

Sebastián N. Mendoza¹,
Brett G. Olivier^1,2,
Douwe Molenaar¹ &
…
Bas Teusink ORCID: orcid.org/0000-0003-3929-0423¹

Genome Biology volume 20, Article number: 158 (2019) Cite this article

21k Accesses
118 Citations
12 Altmetric
Metrics details

Abstract

Background

Several genome-scale metabolic reconstruction software platforms have been developed and are being continuously updated. These tools have been widely applied to reconstruct metabolic models for hundreds of microorganisms ranging from important human pathogens to species of industrial relevance. However, these platforms, as yet, have not been systematically evaluated with respect to software quality, best potential uses and intrinsic capacity to generate high-quality, genome-scale metabolic models. It is therefore unclear for potential users which tool best fits the purpose of their research.

Results

In this work, we perform a systematic assessment of current genome-scale reconstruction software platforms. To meet our goal, we first define a list of features for assessing software quality related to genome-scale reconstruction. Subsequently, we use the feature list to evaluate the performance of each tool. To assess the similarity of the draft reconstructions to high-quality models, we compare each tool’s output networks with that of the high-quality, manually curated, models of Lactobacillus plantarum and Bordetella pertussis, representatives of gram-positive and gram-negative bacteria, respectively. We additionally compare draft reconstructions with a model of Pseudomonas putida to further confirm our findings. We show that none of the tools outperforms the others in all the defined features.

Conclusions

Model builders should carefully choose a tool (or combinations of tools) depending on the intended use of the metabolic model. They can use this benchmark study as a guide to select the best tool for their research. Finally, developers can also benefit from this evaluation by getting feedback to improve their software.

Background

Genome-scale metabolic models (GSMMs) have been a successful tool in Systems Biology during the last decades [1, 2], largely due to the wide range of areas for which the scientific community has found an application. GSMMs, for example, predict cellular behavior under different biological conditions, or can be used to design drug targets for important pathogens; they help to design improved strains through metabolic engineering strategies or to predict metabolic interactions in microbial communities; they have been used to study evolutionary processes or to give a rationale to lab experiments (see excellent reviews [3, 4]).

The reconstruction process that forms the basis of a GSMM is very time-consuming. Usually, this process starts with the annotation of a genome and the prediction of candidate metabolic functions at a genome-scale. The draft reconstruction is then refined by the user in an iterative manner through an exhaustive review of each reaction, metabolite, and gene in the network. After curation, the genome-scale metabolic reconstruction is transformed into a mathematical structure, an objective function is given, constraints are set to account for specific media conditions and the resulting GSMM is evaluated to try to reproduce the experimental data. This iterative process of manual refinement is the limiting step of the whole process because it continues until the GSMM achieves the desired performance determined by the model builder. Hundreds of GSMMs have been reconstructed using this procedure, for which protocols have been described [5] and reviews are available [6, 7].

Several genome-scale reconstruction tools have been developed over the last 15 years to assist researchers in the reconstruction process [8, 9]. These tools are designed to speed up such process by automating several tasks that otherwise should be performed manually, such as draft network generation or gap-filling, and/or by providing useful information to the user to curate the reconstruction. There has been an outstanding increase in the number of new tools for genome-scale reconstruction which reflects the increasing interest to create high-quality GSMMs [10]. Consequently, there is a need for a systematic assessment of the performance of these tools, as many researchers are uncertain which tool to choose when they want to reconstruct their favorite organisms.

In this work, we installed and applied the most promising genome-scale reconstruction tools to provide a systematic evaluation of their performance and outputs. With each tool we reconstructed draft networks for Lactobacillus plantarum [11] and Bordetella pertussis [12], representatives of gram-positive and gram-negative bacteria, respectively, and for which high-quality GSMMs already exist. We used high-quality manually curated GSMMs as a benchmark to assess the features of the tool-generated draft models. In addition, we also reconstructed draft networks for Pseudomonas putida to confirm our findings.

Current state of genome-scale reconstruction tools

Here, we provide a brief description of the current reconstruction tools (see also Additional file 1: Table S1).

AutoKEGGRec (2018)

AutoKEEGRec [13] is an easy-to-use automated tool that uses the KEGG databases to create draft genome-scale models for any microorganism in that database. It runs in MATLAB and is compatible with COBRA Toolbox v3 [14]. One of the advantages of this tool is that multiple queries (microorganisms) can be processed in one run making it appropriate for cases where several microorganisms need to be reconstructed. The main limitation of this tool, which is directly related to the use of the KEGG database, is the lack of a biomass reaction, transport and exchange reactions in the draft genome-scale models.

AuReMe (2018)

AuReMe [15] (Automatic Reconstruction of Metabolic Models) is a workspace that ensures good traceability of the whole reconstruction process, a feature that makes this tool unique. A Docker image is available for AuReMe, so users are easily able to run AuReMe in any platform without having to pre-install required packages (Windows, Linux or Mac). AuReMe creates GSMMs with a template-based algorithm [16] but it is also designed to incorporate information from different databases such as MetaCyc [17] and BIGG [18].

CarveMe (2018)

CarveMe [19] is a command-line python-based tool designed to create GSMMs, ready to use for Flux Balance Analysis (FBA), in just a few minutes. Its unique top-down approach involves the creation of models from a BIGG-based manually curated universal template. The implementation of its own gap-filling algorithm allows this tool to prioritize the incorporation into the network of reactions with higher genetic evidence. The authors of this tool showed that the performance of the generated models is similar to the manually curated models.

MetaDraft (2018)

MetaDraft [20, 21] is a Python-based user-friendly software designed to create GSMMs from previously manually curated ones. It contains in its internal database BIGG models ready to be used as templates although any other model can be used as a template. Users can define a specific order of templates in order to prioritize the incorporation of information related to reactions if there is a reaction match in two or more templates. One of the advantages of Metadraft is that it supports the latest features of the current SBML standards, i.e., SBML Level 3 [22] including the FBC Version 2 [23] and Groups packages [24].

RAVEN version 2 (2018)

RAVEN [25] (Reconstruction, Analysis and Visualization of Metabolic Networks) is a tool for genome-scale metabolic reconstruction and curation that runs in MATLAB is compatible with COBRA Toolbox v3 [14]. In contrast to the first version which only allowed reconstruction using the KEGG database [26], this evaluated version also allows the novo reconstruction of GSMMs using MetaCyc and from template models. Additionally, algorithms to merge network from both databases are provided inside RAVEN. The addition of MetaCyc allows the incorporation of transporters and spontaneous reactions to the reconstructed networks.

ModelSEED version 2.2 (2018)

ModelSEED [27] is a web resource for genome-scale reconstruction and analysis. This tool allows the creation of GSMMs, not only for microorganisms but also for plants. The first step of its pipeline for genome-scale reconstruction is the genome annotation which is performed by RAST [28]. Users can select or even create a medium to be used for gap-filling. In contrast to the first version, the second version allows the creation of models in less than 10 min (including annotation) and it provides aliases/synonyms of reactions and metabolites in other databases.

Pathway Tools version 22.0 (2018)

Pathway tools [29] is a software environment that supports the creation and curation of organism-specific databases. One of the most useful features is that users can interactively explore, visualize and edit different components of the created databases such as genes, operons, enzymes (including transporters), metabolites, reactions, and pathways. Also, visualization of the whole network is possible by using Cellular Overview diagrams, in which experimental data such as gene expression can be mapped using different colors depending on the expression level.

Merlin version 3.8 (2018)

Merlin [30] is a java application for genome-scale reconstruction based on the KEGG database. One of the most useful resources of Merlin is the re-annotation of genomes through the online service of BLAST (EBI) or HMMER. Several parameters in the annotation algorithms such as the expected value threshold and the maximum number of hits can be changed by the user if required, which makes this tool very flexible. The interface allows to compare gene function agreement between the annotation and UniProt providing information to the user for manual curation.

Kbase (2018)

Kbase [31] (The US Department of Energy Systems Biology Knowledgebase) is an open-source software that allows, among a variety of functions, the reconstruction, and analysis of microbes, plants, and communities. Kbase is a platform that integrates several tasks such as annotation, reconstruction, curation, and modeling, making suitable for the whole process of reconstruction. One of the unique features of this software is the use of narratives which are tutorials where users can interactively learn particular topics and reproduce previous results.

CoReCO (2014)

CoReCo [32] (Comparative Reconstruction) is a novel approach for the simultaneous reconstruction of multiple related species. The pipeline of CoReCo includes two steps: First, it finds proteins homologous to the input set of protein-coding sequences for each species. Second, it generates gapless metabolic networks for each species based on KEGG stoichiometry data. Thus, CoReCo allows a direct comparison between the reconstructed models, e.g., to study evolutionary aspects.

MEMOSys version 2 (2014)

MEMOSys [33] (Metabolic Model Research and development System) is a database for storing and managing genome-scale models, rather than a reconstruction tool. This tool allows tracking of changes during the development of a particular genome-scale model. Twenty genome-scale models are publicly available for exporting and modifying. Child models can be created from the 20 available models and then modified and compared with parent models. All the differences between different versions of the models can be listed to track changes in the networks.

FAME (2012)

FAME [34] (Flux Analysis and Modeling Environment) is a web-based application for creating and running GSMMs. This tool can reconstruct genome-scale models for any microorganism in the KEGG database. One of the most interesting features of FAME is that analysis results can be visualized on familiar KEGG-like maps. It is foremost a tool for running and analyzing models and is used -by us- for educational purposes. One of the limitations of FAME is that models cannot be generated for microorganisms which are not in the KEGG database.

GEMSiRV (2012)

GEMSiRV [35] (Genome-scale Metabolic Model Simulation, Reconstruction and Visualization) is a software platform for network drafting and editing. A manually curated model is used as a template to generate a draft network for the species under study. Among the tools inside the toolbox, MrBac [36] can be used to generate reciprocal orthologous-gene pairs which are then used by GEMSiRV to generate the draft model. One of the limitations of this tool is that only one template can be used per run.

MetExplore (2018)

MetExplore [37] is a web-based application for sharing and curating in a collaborative way previously reconstructed draft metabolic networks. FBA, FVA, gene, and reaction essentiality analyses can also be performed in the same environment to compare predictions with experimental data. One of the main features of this software is that networks can be automatically visualized using the lightest paths algorithm which reduces the complexity of the network.

rBioNet (2011)

This tool [38] allows assembling a metabolic reconstruction. Rather than automatically generating a draft metabolic network from its genome, it allows the user to either create a reconstruction from scratch or load an existing one for curation. In both cases, reference databases are used to facilitate the import of metabolite and reactions into the network. Several tests, such as dead-end metabolite detection and mass and charge reaction balances, can be run to ensure high quality of the reconstruction. Finally, this tool is part of the COBRA toolbox and a tutorial of use is available for beginners.

Other tools

Microbes Flux (2012) [39], Subliminal (2011) [40], and GEMSystem (2006) [41] are no longer maintained, as confirmed by the authors of the corresponding articles.

Results

To assess the reconstruction tools, we performed both a qualitative and quantitative evaluation. As a first step, we created a list of relevant features for genome-scale reconstruction and software quality and we scored each tool depending on the performance (1: poor, 5: outstanding). These features are related to software performance, ease of use, similarity of output networks to high-quality manually curated models and adherence to common data standards. In addition, we evaluated 18 specific features related mostly with the second stage (refinement) of the protocol for generating high-quality genome-scale metabolic reconstructions [5]. The criteria to assign a particular score in each feature is specified in Additional file 1: Table S2. Note that not all the tools were designed for the second stage, so they scored poorly on quite some features. Many of these features have not been assessed in previous reviews [8, 9].

Subsequently, to assess how similar the generated draft networks are to high-quality models, we reconstructed with different reconstruction tools the metabolic networks of two bacteria for which high-quality manually curated genome-scale models already were available. We chose to reconstruct the metabolic network of Lactobacillus plantarum and Bordetella pertussis, representatives of gram-positive and gram-negative bacteria, respectively. These microorganisms were selected because of three reasons. First, the corresponding GSMMs are not stored in the BIGG database, so tools that are able to use the BIGG database (AuReMe, CarveME, MetaDraft, RAVEN) in the reconstruction process cannot use the specific information for these microorganisms. If Escherichia coli or Bacillus subtilis would have been chosen instead we would have favored these tools because high-quality models for E. coli or B. subtilis already exist in the BIGG database and they would have been used as templates or inputs. Second, we chose these microorganisms because we were fully informed of the quality of the reconstructions as we built them ourselves and they have proven to be able to accurately replicate experimental data [11, 12, 42, 43], even by independent researchers [44, 45]. Third, these networks were reconstructed almost entirely in a manual way, so we do not expect any bias for any particular tool.

In addition to the two previous species, we also reconstructed with all the tools draft networks for Pseudomonas putida, for which four lab-independent genome-scale models have been reconstructed. We compared the draft reconstructions with iJP962 [46], a model that is not in the BiGG database, that has been proven to accurately replicate experimental data and to be absent of inconsistencies [47].

The networks were generated using seven tools: AuReMe, CarveMe, Merlin, MetaDraft, ModelSEED, Pathway Tools and RAVEN. These cover most of the freely available software platforms. The general features of these tools are listed in Table 1.

Table 1 List of selected genome-scale metabolic reconstruction tools and their main features

Full size table

General assessment overview

None of the tools got a perfect score for all of the evaluated features and usually, strengths in some tools are weaknesses in others (Fig. 1, Additional file 1: Figure S3, Tables S25 and S26 to see detailed evaluation). For example, on the one hand, ModelSEED and CarveMe were evaluated as outstanding when we checked whether the whole reconstruction process is automatic; Merlin was evaluated as poor because users should interfere more to get a network ready to perform FBA. On the other hand, we consider Merlin as outstanding with respect to a workspace for manual refinement and information to assist users during this step; CarveMe and ModelSEED do not provide further information for manual refinement nor a workspace for manual curation, so they were evaluated as poor in this category.

In some cases, all the tools got the maximum score possible. For instance, all the tested tools are properly supported by specialist teams and also maintain up-to-date databases. In other cases, none of the tools got the maximum score. This was the case for automatic refinement of networks using experimental data. Some of the tools, such as ModelSEED and CarveMe, can use media composition to gap-fill the network. AuReMe and Pathway Tools also can use, in addition to media composition, known metabolic products to gap-fill the network. In spite of that, none of the tools can also use Biolog phenotype arrays, knockout experiments and different types of omics data (transcriptomic, proteomic, metabolomic, etc.) to automatically curate the network. Although some efforts have been done in this area [48,49,50,51], this seems like a major challenge for future tool development that should lead to improved metabolic reconstructions.

Compliance with the latest SBML standards has been pointed as one of the critical points to share and represent models [52]. Consequently, we evaluated if the tools use the latest SBML features in the import (inputs) and export (outputs) of networks. For inputs, we checked if the tools were able to read networks in SBML level 3 [22]. We additionally checked if the output networks satisfy the following three features: use of SBML level 3 [22] with FBC annotations [23], SBML groups [24], and MIRIAM compliant CV annotations [22, 53]. These features are used, for example, for models in the BIGG database and they ensure that the information is stored in a standard way. For inputs, we found that among the tools that are able to import and use networks (AuReMe, MetaDraft, RAVEN) all of them are able to use SBML level 3 but AuReMe generated slightly different networks when using SBML level 2. For outputs, MetaDraft and Merlin and RAVEN were the only ones that exported the networks with all the three features. Be aware that networks created with RAVEN have to be exported to SBML using the specific functions of RAVEN (not COBRA functions as a regular COBRA user would expect) because otherwise there will be no MIRIAM annotations in the SBML files. In addition, AuReMe and CarveMe lack MIRIAM compliant CV annotations and SBML Groups, and Pathway Tools and ModelSEED exported the networks in SBML level 2.

Network comparison

We reconstructed draft networks for Lactobacillus plantarum WCFS1, Bordetella pertussis Tohama I and Pseudomonas putida KT2440 with each reconstruction tool. L. plantarum is a lactic acid bacterium (LAB), used in the food fermentation industry and as a probiotic [54,55,56]. Its GSMM comprises 771 unique reactions, 662 metabolites, and 728 genes, and it has been used to design a defined media for this LAB [43], to explore interactions with other bacteria [57] and as a reference for reconstructing other LAB [58]. In contrast to this LAB, B. pertussis is a gram-negative bacterium, and the causative agent of the Whooping cough, a highly contagious respiratory disease [59]. The metabolic network of this pathogen was recently reconstructed, and it comprises 1672 unique reactions, 1255 metabolites, and 770 genes. As B. pertussis, Pseudomonas putida is also a gram-negative bacterium but the interest in this species relies on its capability as a cell factory to produce a wide variety of bulk and fine chemicals of industrial importance [60]. Its metabolic network comprises 1069 unique reactions, 987 metabolites, and 962 genes. While L. plantarum and B. pertussis are the main subject in the network comparisons, P. putida was used, as a model developed independently from us, to validate the tendencies obtained with the two previous species.

In total, 29 networks were created for L. plantarum, 27 for B. pertussis, and 27 for P. putida. The specific inputs and parameters for creating each network can be found in Additional file 1: File S1. Genes, metabolites, and reactions were extracted from the SBML files and compared with those in the manually curated model. For convenience, the manually curated model of L. plantarum, B. pertussis, and P. putida will be called hereafter iLP728, iBP1870, and iJP962, respectively.

Comparison of gene sets

Genes are the basis from which the genome-scale model is reconstructed. When a gene is included in a metabolic reconstruction, there is at least one biochemical reaction associated with that gene. When a gene is not in the reconstruction, either the reconstruction tool could not find an orthologous gene in the reference database or an orthologous gene was found but no biochemical reaction is associated with that gene. Gene sets are interesting to compare because if a gene present in the manually curated model is absent in a draft reconstruction, that could explain why some biochemical reactions are missing in the draft. Alternatively, if a gene is absent in the manually curated model but present in a draft reconstruction, that could explain the presence of reactions that should not be in the reconstruction. Moreover, gene sets are simple to compare among reconstructions because gene identifiers in all the cases are the same (the locus tag in the genome annotation) and so, in contrast to metabolites and reactions, there is no mapping-related bias in the comparison.

To assess how similar the draft networks were to the corresponding manually curated networks we calculated the Jaccard distance (JD) as well as the ratio between the percentage of covered genes and the percentage of additional genes (R) (Additional file 1: Tables S4–S7). The JD has been used before to measure the distance between genome-scale metabolic reconstructions, based on reaction sets [61]; here, we also applied it to compare reconstructions in terms of genes and metabolites. We called JD_g, JD_r, and JD_m to the JD between two reconstructions when they are compared in terms of genes, reactions and metabolites, respectively. Analogously, we called R_g, R_r, and R_m to the R when reconstructions are compared in terms of genes, reactions and metabolites, respectively. In general terms, a value of 0 in the JD means that the networks are identical and a value of 1 means that the networks do not share any element. For the R, higher values reflect a higher similarity to the original network and lower values reflect a lower similarity with the original network.

The values in the JD_g ranged from 0.38 to 0.60 in L. plantarum and from 0.43 to 0.67 in B. pertussis (Additional file 1: Tables S4 and S5), while values in the R_g ranged from 1.18 to 13.16 in L. plantarum and from 0.84 to 3.52 in B. pertussis (Additional file 1: Tables S6 and S7). Although the similarity of the generated draft networks seems slightly better for L. plantarum than for B. pertussis, we found that it depends on which metric is analyzed. With the exception of one network, the R_g showed that all the draft networks of L. plantarum were more similar to iLP728 than the draft networks of B. pertussis to iBP1870, using the analog parameter settings. In contrast, the JD_g showed that AuReMe, ModelSEED, RAVEN, and Merlin generated draft networks of L. plantarum which are more similar to iLP728 than the draft networks of B. pertussis with regard to iBP1870, and that CarveMe, MetaDraft, and Pathway Tools generated draft networks slightly more similar for B. pertussis. In general, similar values of JD_g and R_g were obtained for P. putida (Additional file 1: File S3).

Additionally, when sorting the values of both metrics, we noticed that the JD_g order does not correspond to that made with the R_g. The lowest JD_g among the draft reconstructions for L. plantarum was obtained in the network generated with AuReMe when the gram-positive set of templates was used; for B. pertussis, it was obtained with MetaDraft. In contrast, the highest R_g among the draft reconstructions for L. plantarum was obtained in the network generated with AuReMe when only Lactococcus lactis was used as template; for B. pertussis, it was obtained with MetaDraft when Escherichia coli template was used.

Although the similarity scores for both metrics are not entirely consistent, some trends were observed. The networks more similar, in terms of genes, to the manually curated models were generated by MetaDraft, AuReMe, and RAVEN (Fig. 2). However, since parameters settings and inputs have a big effect on the similarity scores, the usage of these tools does not automatically ensure obtaining a draft network similar, in terms of genes, to a manually curated model. This is particularly true for RAVEN which also generated some networks with high JD_g and low R_g scores. The same trends were obtained for P. putida (Additional file 1: Figure S2).

We further analyzed the percentage of genes covered in the manually curated models and the percentage of genes not in the manually curated models to explain differences in R_g. For all the species we observed a wide variation in both variables (Figs. 3, 4 and Additional file 1: Figure S7). Among the five networks of L. plantarum with the highest coverage, two were created with AuReMe and three with RAVEN; for B. pertussis, four were created with RAVEN and one with CarveMe. However, the networks created with RAVEN that recovered the highest percentages of genes also added a large number of genes which were not present in the manually curated models, decreasing the values in the R_g. In addition, AuReMe and MetaDraft created conservative draft networks with the lowest number of additional genes, which explains the higher values in the R_g. Finally, tools such as ModelSEED, Pathway Tools, and Merlin consistently created reconstructions with gene coverages not ranging in the highest values (in comparison with other networks) and adding a relatively large number of genes not present in the manually curated models, which explains why they had lower values in the R_g.

For L. plantarum we found 1613 different genes in total with all the tools, of which 885 were not present in iLP728. For B. pertussis, 1888 different genes were found, of which 1118 were not present in iBP1870. In addition, 79 genes were correctly predicted in all the draft networks for iLP728; for iBP1870, this was 131 genes. The distribution of metabolic pathways associated to those genes is wide for both species, with carbohydrate metabolism and amino acid metabolism accounting for more than 50% of the metabolic processes (Additional file 1: Tables S8 and S9). Additionally, 35 and 39 genes were not recovered in any network for iLP728 and iBP1870, respectively. The metabolic functions associated to those genes were very specific, with polysaccharide biosynthesis (63%) and transport (22%) top in the list for L. plantarum and with transport (41%) and ferredoxin/thioredoxin related reactions (30%) for B. pertussis. Finally, one gene in L. plantarum, which was associated with riboflavin biosynthesis, was recovered by all the networks but it was not present in iLP729. For B. pertussis, three such genes were found. These genes were associated with alternate carbon metabolism and cell envelope biosynthesis.

Comparison of reaction sets

Genes and biochemical reactions are connected within a reconstruction through gene-protein-reaction (GPR) associations. However, genes and reactions relationships are ultimately represented in reconstructions as boolean rules known as gene-reaction rules. With the exception of exchange, sink, demand, spontaneous and some transport reactions (e.g., those governed by diffusion), each reaction has a defined gene-reaction rule in the reference database used by each reconstruction tool. During the process of reconstruction, if orthologous genes are found that satisfy the gene-reaction rule of a particular reaction, that reaction is included in the draft reconstruction. Other reactions may be added to the draft reconstruction based on others criteria, such the probability of a particular pathway to exist in the microorganism under study or the need to fill particular gaps in the network in order to produce biomass. Nonetheless, we expect that networks which are more similar in terms of genes will also be more similar in terms of reactions.

In contrast to genes, however, reactions are labeled with different identifiers in different databases. Thus, the same reaction can be stored with two different identifiers in two different databases. During the reconstruction process, reactions are added from the reference database to the draft reconstruction and tools using different databases will generate reconstructions comprising reactions with different identifiers. We, therefore, used MetaNetX [62] to map reactions among reconstructions built with different databases. In this approach, reactions were compared using their identifiers (case sensitive string comparison). In addition, we compared networks using reaction equations, i.e., we compared reactions using their attributes instead of their identifiers. In this second approach, we considered that two reactions were the same if they had the same metabolites with the same stoichiometric coefficients. Some exceptions were made to also match reactions that differ only in proton stoichiometry (due to differences in metabolites charge) or to catch reactions which are written in the opposite direction (reactants in the side of products). We decided to include exchange reactions in the network comparison for completeness because CarveMe and ModelSEED automatically generate them; as they are non-gene associated reactions, this automatically lowers the scores for the other tools that do not add exchange reactions. For most networks, comparison through reaction identifiers resulted in a lower percentage of coverage than through reaction equation comparison (Additional file 1: Tables S10 and S11). This lower coverage was due to some missing relationships between different databases in MetaNetX, which we discovered when comparing with the reaction equations. In total, 220 new unique reaction synonyms pairs were automatically discovered for both species with the second approach (Additional file 1: Table S12). To further overcome the missing relationships in MetaNetX, a semi-automatic algorithm was developed to assist the discovery of new metabolite synonyms. In total, 187 new metabolites synonyms were discovered (Additional file 1: Table S13) which led to the discovery of 282 additional reaction synonyms (Additional file 1: Table S14).

The comparison through reaction equations showed a wide variation in reaction coverage and percentage of additional reactions for all the species (Figs. 5 and 6 and Additional file 1: Figure S8). In addition, for those networks created with RAVEN (KEGG), ModelSEED, and Merlin, we observed a considerable number of reactions with a partial match with the manually curated model. These partial matches emerge from differences in proton stoichiometry, which indicates the existence of metabolites with different charge than those found in the manually curated models. In contrast to the gene sets comparison, where the coverage was as high as 88% and 83%, we only observed a maximum coverage of 72% and 58%, for L. plantarum and B. pertussis, respectively, even when considering partial matches. We classified the reactions that were not recovered in different categories (Additional file 1: Figures S3–S6) and we found that the low reaction coverage can be explained mainly by three reasons.

First, both manually curated models contain a considerable amount of reactions without gene-associations, including spontaneous, transport, exchange reactions, reactions added during the manual gap-filling and biomass-related reactions. For L. plantarum and B. pertussis, there are 241 and 657 of such reactions, representing 31% and 39% of the network, respectively. With the exception of CarveMe and ModelSEED, which can perform automatic gap-filling, all the rest of the tools are not able to recover most of the non-gene associated reactions, mainly because all the tools predict reactions based on genomic evidence. Thus, for both species, around, 50% of the reactions that were not recovered do not have gene-reaction associations in the manually curated model. Without considering exchange reactions, the coverage roughly increased by 15% and 12% for L. plantarum and B. pertussis, respectively, except for CarveMe and ModelSEED. Second, in around 30% of the reactions that were not recovered, there are at least 50% of the associated genes missing in the draft reconstructions. Third, even when all the genes associated with a particular reaction are recovered, specific substrate and cofactor usage is difficult to predict. Many times, the tools predict the correct metabolic activity but they fail in predicting the specific substrate used in the manually curated models. We created a collection of plain text files containing hundreds of examples where the associated genes were recovered by the tool but the reaction does not correspond to the one in the manually curated model because of different substrates (see section availability of data for details).

We again calculated the JD_r and the R_r to assess how similar the networks were, in this case in terms of reactions. The first observation we made is that, independent of the metric and for both species, each reconstruction was less similar in terms of reactions than in terms of genes, which is consistent with the decrease in coverage. In addition, as in the gene comparison, the order of scores for the R_g and the R_r by magnitude was not the same. If we compare the similarity scores for reaction sets with the ones for gene sets, we see almost the same trend but with one difference. AuReMe and MetaDraft are still the tools with the best similarity scores but now CarveMe goes up in the list of scores and RAVEN goes down (Fig. 7, Additional file 1: Tables S4–S7). This was particularly true for B. pertussis where two networks reconstructed with CarveMe got the two first places in the JD_r list. Almost the same trend was observed for P. putida (Additional file 1: Figure S2) being the higher scores for RAVEN instead of CarveMe the main difference.

Although RAVEN generated some reconstructions with high gene sets similarity to the manually curated models, it did not for reaction sets similarity. We, therefore, analyzed one of the networks reconstructed with RAVEN in more detail, one that was consistently in the top 5 list for both species for both metrics. We found one main reason for the decrease in performance. The analyzed network was created based on KEGG, so metabolites were not labeled as intracellular or extracellular. Hence, no transport or exchange reactions were present. Although there are functions to incorporate this kind of reactions in RAVEN, that is considered as manual curation because users must specify which compounds should be transported, and we here only tested how much work would it take to transform these draft networks into high-quality reconstructions.

We further analyzed reactions that were present and absent in all the reconstructions to understand which kind of metabolic processes they were related. Sixty-six reactions in iLP728 and 98 in iBP1870 were always found in all the draft networks. In agreement with the gene sets analysis, the associated metabolic processes are mainly amino acid metabolism, nucleotide metabolism, and carbohydrate metabolism (Additional file 1: Tables S15 and S16). Additionally, 165 reactions in iLP1870 and 598 in iBP1870 were not found by any tool. In both species, around 10% of those reactions were biomass-related reactions and from the rest, most of them were exchange reactions, transport reactions without gene associations and reactions in other categories that were not in the BIGG database (Additional file 1: Tables S17 and S18). Only one reaction, associated to amino acid metabolism, was found in all the draft networks of L. plantarum but not in iLP728; four reactions, associated mainly to carbohydrate metabolism, were found in all the draft networks but not in iBP1870.

Comparison of metabolite sets

Other important elements within metabolic reconstructions are metabolites. When a biochemical reaction is added to the draft network during the reconstruction process, all the reactants and products are added to the network too. As the draft metabolic networks were created with different tools, each of which using its own set of databases, they had different identifiers for the same metabolite. For those networks whose identifiers were different from BIGG, we again used MetaNetX and our own additional dictionary to map metabolites.

We calculated the JD_m and the R_m to assess the metabolite sets similarity. For almost all the draft networks in both species, the values in the JD_m were between the JD_g and the JD_r; we found the same for the R_m (Additional file 1: Tables S4–S7). Again, when sorting the networks according to their metric scores, we found the same trends than for reaction sets. The first position in the lists were networks either reconstructed with MetaDraft, AureMe, or CarveMe. Moreover, independently of the metric and the species, MetaDraft reconstructed 40% of the networks among those in the top 5.

Two hundred six metabolites in iLP728 and 271 in iBP1870 were correctly predicted in all the draft networks. These metabolites were in both cases mainly associated with carbohydrate metabolism and amino acid metabolism (Additional file 1: Tables S19 and S20). Eighty-one metabolites in iLP728 and 278 in iBP1870 were not recovered in any network. Of those, 16 were related to the biomass of L. plantarum and 16 others were not in the BIGG database. For iBP1870, 44 were biomass-related and 47 others were not in the BIGG database. Finally, 9 and 11 metabolites were recovered in all the networks but they were not present in iLP728 and iBP1870, respectively. Mainly, they were associated to the metabolism of cofactors and vitamins and amino acid metabolism in the case of L. plantarum and carbohydrate metabolism and glycan biosynthesis in the case of B. pertussis (Additional file 1: Tables S21 and S22).

Topological analysis

To compare the topological features of each network, we calculated the number of dead-end metabolites, the number of orphan reactions, the number of unconnected reactions and other metrics (Additional file 1: Tables S23 and S24).

iLP728 has 113 dead-end metabolites while iBP1870 has 59. This is consistent with the observation that many pathways are disrupted in L. plantarum leading for example to well-known auxotrophies for many amino acids [42, 43]. With the exception of CarveMe, all the tools generated networks with a high number of dead-end metabolites, ranging from 244 and 999, and from 379 to 976, for L. plantarum and B. pertussis, respectively. The low number of dead-end metabolites in CarveMe is caused by the use of a manually curated universal model as a template which lacks dead-end metabolites.

Without considering exchange and demand/sink reactions, 127 and 449 reactions without gene associations (called orphan reactions) were found in iLP728 and iBP1870, respectively. These reactions are mainly associated with transport amino acid metabolism, and biomass formation. MetaDraft, AuReMe, and RAVEN returned metabolic networks with no orphan reactions. These tools only include reactions with genomic evidence and others lacking this support are not included. ModelSEED returned networks with a low amount of orphan reactions, which are related to exchange reactions. In contrast, CarveMe, Pathway Tools, and Merlin returned networks with a significantly larger number of orphan reactions (ranging from 66 to 491 in L. plantarum and from 115 to 736 in B. pertussis). For CarveMe, this is due to the inclusion of transport and spontaneous reactions as well as reactions needed to create biomass (from gap-filling); for Pathway tools, it is because of the addition of reactions to complete probable pathways and spontaneous reactions; and for Merlin, this is solely due to spontaneous reactions.

Discussion

In this work, we reviewed the current state of all the reconstruction tools we could find in the literature and performed a systematic evaluation of seven of them. None of the tools performed well in all the evaluated categories so users should carefully select the tool(s) that suit the purpose of their investigation. For example, if a high-quality draft is required and models are available for a phylogenetically close species, MetaDraft or AuReMe could be selected, reducing thus the time needed to obtain a high-quality manually curated model. Of these, MetaDraft was the most robust for handling models and since it has a graphical user interface, it is also suitable for non-specialists. AuReMe, on the contrary, offered a command-line workspace where the traceability is the priority. Although we were not able to use RAVEN in the template mode (for details, see error messages in supplementary files), this tool allowed us to automate the generation of several reconstructions, it had a high flexibility with parameters and it offered integration with the KEGG and MetaCyc databases which makes it very appropriate for less-studied species. ModelSEED, CarveMe, and Pathway Tools were the fastest tools to generate reconstructions having a great potential for large-scale studies how it has been proven in previous works [61, 63]. The first two tools provided networks which are ready to perform FBA, however presumably because of the automatic gap-filling procedure, too many reactions that should be manually verified must be expected. Pathway Tools and Merlin provided platforms suitable for manual curation which nicely guide the user through the whole reconstruction process.

The list of features that we defined not only can be used by model builders to select the best tool(s) but also by developers as a guide for improving them. We highlight four features, which are in accordance with the FAIR guiding principles for scientific data management and stewardship [64], that should be considered as a priority by developers to ensure management of reconstructions in a standard way: (1) To be findable: all the genes, metabolites and reaction in a reconstruction should be assigned with unique and persistent identifiers, and synonyms or aliases in other databases should be provided whenever possible. (2) To be accessible: exhaustive control of versions should be implemented so users will be able to submit small but significant changes to draft reconstructions, to trace changes made during the reconstruction process, or to retrieve a particular version if desired. (3) To be interoperable: output (and input if applied) reconstructions should be written with the latest features of the SBML standards. (4) To be reusable: in relationship with providing a detailed provenance, transparency of decisions through the whole reconstruction process should be ensured so users can see why a particular reaction was added and at which stage (draft network generation, gap-filling, refinement, etc.).

Genome-scale reconstructions are usually evaluated after they are converted into genome-scale models [5], i.e., mathematical structures where simulations can be performed under constraints that describe specific experimental conditions. Thus, GSMMs are tested by their accuracy to predict experimental data such as knockouts, nutritional requirements and growth rates on different conditions. However, most of the drafts we generated were not suitable to perform FBA, mainly due to the lack of biomass-related, transport and exchange reactions. Thus, we limited the evaluation of the drafts to the comparison with manually curated, genome-scale reconstructions. The latter are valuable by themselves as knowledgebases because they contain extensive information from the literature. Here, we prescribed that the manually curated reconstructions are the gold standard, which implies that they cannot be improved and that is obviously not true. Many reconstructions of, for example, E. coli, S. cerevisiae, and H. sapiens have gone through multiple rounds of improvements during the years [65,66,67]. As reference databases used by reconstruction tools increase in size and quality, so will the reconstructions which are based on them. Therefore, some of the reactions which were suggested by the tools and which are not in the manually curated models could indeed be reactions which would improve the quality of the reconstructions. Whether one of those reactions should be in the reconstruction or not will depend not only on the genomic evidence but also on the scope and context of the reconstruction. Many reactions are usually not incorporated because they are not needed for modeling purposes [5]. Thus, similarity scores should not be taken alone to assess the quality of draft reconstructions. Indeed, additional reconstructions of Lactobacillus plantarum that we made with CarveMe and ModelSEED and which were gap-filled using a modified version of CDM (Additional file 1: File S2), a media that support the growth of this microorganism in vivo [43], showed a general performance close to the manually curated model, suggesting that although the networks are not so similar as others created with different tools, the core metabolism remains similar. Despite that, the performance of these networks is dependent on the media composition which is used for the gap-filling (Additional file 1: Figure S1), and therefore if there is no experimentally determined media, some false positive and false negative predictions could emerge. For example, if very accurate predictions regarding nutritional requirement are needed to design a microbial community, automatic reconstructions for which an experimentally determined media composition is not provided during gap-filling could result in false predictions.

A correct mapping of identifiers among different databases is crucial to perform a proper comparison between metabolic networks. Important efforts such as MetaNetX [62] and Borgifier [68] have been done to facilitate this titanic task. The first of those tools allowed us to map most of the metabolites and reactions among the different reconstructions but naturally, some relationships were missing. To overcome this limitation, we fully mapped metabolites in the manually curated models to known databases namely BiGG, KEGG, MetaCyc, and SEED. Second, we implemented an algorithm to search reaction equations, even when they have differences in proton stoichiometry due to different protonation states or even if the reactions are written in the opposite direction. As a third step to further reduce the fraction of metabolites which were not mapped and through a semi-automatic and iterative process, we determined 187 new relationships. In spite of our efforts, some relationships were still missing which evidence the complexity of the problem. Since recent efforts have made clearer the type of issues arising in different databases [69], we emphasize the importance of standards, which could make easier the identification of synonyms because of the presence of high-quality information, and the need of an outstanding mapping system.

Systematic assessments of tools for systems biology have become very popular [70, 71] due to the great impact they have in the community of potential users who certainly are searching the best tool to apply in their research. Knowing the strengths and limitations of each tool allow users to select the best tool(s) for their case, to save time in preliminary tests and to focus more on the analysis and modeling using those reconstructions. Moreover, to provide genome-scale models of high quality, in terms of usability and standards, has become a priority during the last years. Efforts such as those done by Memote [52] highlight the need for suites that test the quality of genome-scale models to ensure high-quality outputs, not only in terms of their content as knowledgebases but also in terms of standards.

Conclusions

All the assessed reconstruction tools showed strengths and weaknesses in different areas and none of the tools outperformed the others in all the categories. In particular, template-based reconstruction tools such as AuReMe, MetaDraft, and CarveMe generated networks with a higher reaction sets similarity to manually curated networks than other tools. In addition, tools such as Pathway Tools and Merlin provide a proper workspace and useful information for manual refinement which could be suited for cases where much time can be dedicated to this step. RAVEN provides a platform in which biochemical information from different databases and approaches can be merged, which could be useful for less characterized species. Finally, tools such as CarveMe and ModelSEED provide ready-to-use metabolic networks which can be useful for a fast generation of model-driven hypothesis and exploration but users will have to be aware of potential false results.

There seems to be a trade-off between coverage and similarity, and it remains to be seen how much room for improvement there is. We see three clear features that would improve any tool: better standards that would allow easier integration of the best of tools, exhaustive version control during the reconstruction process, and algorithms that can use experimental data for inclusion of genes and reactions into the models.

Materials and methods

Protein sequences

We used the protein sequences or the GenBank files of the different microorganisms as input to generate the genome-scale metabolic reconstructions with each of the selected tools. All the protein sequences were downloaded from NCBI. For Lactobacillus plantarum strain WCFS1, Bordetella Pertussis strain Tohama I, and Pseudomonas putida KT2440 we used the protein sequences deposited under the NCBI accession numbers NC_004567.2 [72, 73], NC_002929.2 [74, 75] and NC_002947.4 [76, 77] respectively.

Reconstruction

The specific parameters and inputs used to reconstruct the draft networks with each tool can be found in Additional file 1: File S1.

AuReMe

We used AuReMe version 1.2.4, which was downloaded using Docker Toolbox, to generate the draft reconstructions.

To generate the genome-scale metabolic reconstructions of Lactobacillus plantarum we used three different set of templates from the BIGG database: (1) Lactococcus lactis (iNF517). (2) Lactococcus lactis (iNF517), Bacillus subtilis (iYO844), Staphylococcus aureus (iSB619), Clostridium ljungdahlii (iHN637) and Mycobacterium tuberculosis (iNJ661). 3) Lactococcus lactis (iNF517), Bacillus subtilis (iYO844), Staphylococcus aureus (iSB619), Clostridium ljungdahlii (iHN637), Mycobacterium tuberculosis (iNJ661), Escherichia coli (iML1515), Klebsiella pneumoniae (iYL1228), Shigella flexneri (iSFxv_1172), Shigella boydii (iSbBs512_1146), Shigella sonnei (iSSON_1240), Pseudomonas putida (iJN746), Yersinia pestis (iPC815), Helicobacter pylori (iIT341), Geobacter metallireducens (iAF987), Salmonella entérica (STM_v1_0), Thermotoga marítima (iLJ478), Synechocystis sp (iJN678), and Synechococcus elongatus (iJB785).

For Bordetella pertussis we used Escherichia coli as a template (iML1515).

For Pseudomonas putida we used Pseudomonas putida as a template (iJN746).

CarveMe

We used CarveMe version 1.2.1 (downloaded from https://github.com/cdanielmachado/carveme on August 1st, 2018) to generate the draft reconstructions. Two genome-scale metabolic reconstructions were generated for Lactobacillus plantarum using the universal bacterial template and the gram-positive bacterial template, respectively. For B. pertussis, the universal bacterial template and the gram-negative bacterial template were used. For P. putida, the universal bacterial template and the gram-negative bacterial template were used.

Merlin

We used Merlin version 3.8 (downloaded from https://merlin-sysbio.org/index.php/Downloads on August 1st, 2018) to generate the draft reconstructions. For all the networks, we first annotated the genomes with EBI through MERLIN using default parameters. Then, we loaded KEGG metabolic data and integrated the annotation with the model. Finally, we created gene-reaction-protein associations and removed unbalanced reactions to be able to export the network to SBML format.

MetaDraft

We used MetaDraft version 0.9.2, which was obtained from https://systemsbioinformatics.github.io/cbmpy-metadraft/.

To generate the genome-scale metabolic reconstructions of Lactobacillus plantarum we used three different set of templates from the BIGG database: (1) Lactococcus lactis (iNF517). (2) Lactococcus lactis (iNF517), Bacillus subtilis (iYO844), Staphylococcus aureus (iSB619), Clostridium ljungdahlii (iHN637) and Mycobacterium tuberculosis (iNJ661). (3) Lactococcus lactis (iNF517), Bacillus subtilis (iYO844), Staphylococcus aureus (iSB619), Clostridium ljungdahlii (iHN637), Mycobacterium tuberculosis (iNJ661), Escherichia coli (iML1515), Klebsiella pneumoniae (iYL1228), Shigella flexneri (iSFxv_1172), Shigella boydii (iSbBs512_1146), Shigella sonnei (iSSON_1240), Pseudomonas putida (iJN746), Yersinia pestis (iPC815), Helicobacter pylori (iIT341), Geobacter metallireducens (iAF987), Salmonella entérica (STM_v1_0), Thermotoga marítima (iLJ478), Synechocystis sp (iJN678), and Synechococcus elongatus (iJB785).

To generate the genome-scale metabolic reconstructions of Bordetella pertussis we used three different set of templates from the BIGG database: (1) Escherichia coli (iML1515). 2) Escherichia coli (iML1515), Klebsiella pneumoniae (iYL1228), Shigella flexneri (iSFxv_1172), Shigella boydii (iSbBs512_1146), Shigella sonnei (iSSON_1240), Pseudomonas putida (iJN746), Yersinia pestis (iPC815), Helicobacter pylori (iIT341), Geobacter metallireducens (iAF987), Salmonella entérica (STM_v1_0), Thermotoga marítima (iLJ478), Synechocystis sp (iJN678), and Synechococcus elongatus (iJB785). 3) Escherichia coli (iML1515), Klebsiella pneumoniae (iYL1228), Shigella flexneri (iSFxv_1172), Shigella boydii (iSbBs512_1146), Shigella sonnei (iSSON_1240), Pseudomonas putida (iJN746), Yersinia pestis (iPC815), Helicobacter pylori (iIT341), Geobacter metallireducens (iAF987), Salmonella entérica (STM_v1_0), Thermotoga marítima (iLJ478), Synechocystis sp (iJN678), Synechococcus elongatus (iJB785), Lactococcus lactis (iNF517), Bacillus subtilis (iYO844), Staphylococcus aureus (iSB619), Clostridium ljungdahlii (iHN637), and Mycobacterium tuberculosis (iNJ661).

To generate the genome-scale metabolic reconstructions of Pseudomonas putida, we used three different set of templates from the BIGG database: (1) iJN746. (2) iJN746 - iML1515 - iYL1228 - iSFxv_1172 - iSbBS512_1146 - iSSON_1240 - iPC815 - STM_v1_0 - iIT341 - iAF987 - iLJ478 - iJN678 - iJB785 iJN746 - iML1515 - iYL1228 - iSFxv_1172 - iSbBS512_1146 - iSSON_1240 - iPC815 - STM_v1_0 - iIT341 - iAF987 - iLJ478 - iJN678 - iJB785 - iNF517 - iYO844 - iSB619 - iHN637 - iNJ66.

ModelSEED

We used ModelSEED version 2.2 web service on August 16st of 2018 to generate the draft reconstructions of Lactobacillus plantarum and B. pertussis. Version 2.4 was used to generate the draft reconstructions for Pseudomonas putida. Models were created using different template models. No media was specified to create the models.

Pathway Tools

We used Pathway Tools version 22.0 to generate the draft reconstructions. Four networks were created with the Desktop mode using different cutoff values for pathways prediction and one was made with the Lisp-console with default parameters. All the networks were exported manually with the Desktop mode.

RAVEN

We used RAVEN version 2.0.1, which was downloaded from https://github.com/SysBioChalmers/RAVEN, to generate the draft reconstructions. Different models were created using different databases (KEGG and MetaCyc) and different values in the parameters for orthology searches.

Pre-processing of L. plantarum and B. pertussis network

We pre-processed the manually curated networks in order to compare them with the draft networks. We semi-automatically changed metabolite and reaction identifiers to match those of the BIGG database. Also, we removed duplicated reactions (those with the same reaction equation). Before the deletion of a duplicated reaction, the associated gene-reaction rule was transferred to or merged with the gene-reaction rule of the reaction that was kept in the network.