- Open Access
FlyMine: an integrated database for Drosophila and Anopheles genomics
- Rachel Lyne1,
- Richard Smith1,
- Kim Rutherford1,
- Matthew Wakeling1,
- Andrew Varley1,
- Francois Guillier1,
- Hilde Janssens1,
- Wenyan Ji1,
- Peter Mclaren1,
- Philip North1,
- Debashis Rana1,
- Tom Riley1,
- Julie Sullivan1,
- Xavier Watkins1,
- Mark Woodbridge1,
- Kathryn Lilley2,
- Steve Russell1,
- Michael Ashburner1,
- Kenji Mizuguchi2, 3, 4 and
- Gos Micklem1, 3
© Lyne et al.; licensee BioMed Central Ltd. 2007
- Received: 15 November 2006
- Accepted: 5 June 2007
- Published: 05 July 2007
FlyMine is a data warehouse that addresses one of the important challenges of modern biology: how to integrate and make use of the diversity and volume of current biological data. Its main focus is genomic and proteomics data for Drosophila and other insects. It provides web access to integrated data at a number of different levels, from simple browsing to construction of complex queries, which can be executed on either single items or lists.
- Gene Ontology
- Transcription Factor Binding Site
- Data Warehouse
- Complex Query
- Protein Interaction Data
With the completion of increasing numbers of genome sequences has come an explosion in the development of both computational and experimental techniques for deciphering the functions of genes, molecules and their interactions. These include theoretical methods for deducing function, such as analysis of protein homologies, structural domain predictions, phylogenetic profiling and analysis of protein domain fusions, as well as experimental techniques, such as microarray-based gene expression and transcription factor binding studies, two-hybrid protein-protein interaction screens, and large-scale RNA interference (RNAi) screens. The result is a huge amount of information and a current challenge is to extract meaningful knowledge and patterns of biological significance that can lead to new experimentally testable hypotheses. Many of these broad datasets, however, are noisy and the data quality can vary significantly. While in some circumstances the data from each of these techniques are useful in their own right, the ability to combine data from different sources facilitates interpretation and potentially allows stronger inferences to be made. Currently, biological data are stored in a wide variety of formats in numerous different places, making their combined analysis difficult: when information from several different databases is required, the assembly of data into a format suitable for querying is a challenge in itself. Sophisticated analysis of diverse data requires that they are available in a form that allows questions to be asked across them and that tools for constructing the questions are available. The development of systems for the integration and combined analysis of diverse data remains a priority in bioinformatics. Avoiding the need to understand and reformat many different data sources is a major benefit for end users of a centralized data access system.
A number of studies have illustrated the power of integrating data for cross-validation, functional annotation and generating testable hypotheses (reviewed in [1, 2]). These studies have covered a range of data types: some look at the overlap between two different data sets, for example, protein interaction and expression data [3–6] or protein interaction and RNAi screening results, and some combine the information from several different types of data [8–12]. Studies with Saccharomyces cerevisiae, for example, have indicated that combining protein-protein interaction and gene expression data to identify potential interacting proteins that are also co-expressed is a powerful way to cross-validate noisy protein interaction data [3–6]. A recent analysis integrated protein interactions, protein domain models, gene expression data and functional annotations to predict nearly 40,000 protein-protein interactions in humans. In addition, combining multiple data sets of the same type from several organisms not only expands coverage to a larger section of genomes of interest, but can help to verify inferences or develop new hypotheses about particular 'events' in another organism. Alternatively, finding the intersection between different data sets of the same type can help identify a subset of higher-confidence data. In addition to examination of different data sources within an organism, predicted orthologues and paralogues allow cross-validation of datasets between different organisms. For example, identification of so-called interologues (putative interacting protein pairs whose orthologues in another organism also apparently interact) can add confidence to interactions.
Biological data integration is a difficult task and a number of different solutions to the problem have been proposed (for example, see [14, 15] for reviews). A number of projects have already tackled the task of data integration and querying, and the methods used by these different systems differ greatly in their aims and scope (for a review of the different types of systems, see ). Some, for example, do not integrate the data themselves but perform fast, indexed keyword searches over flat files. An example of such a system is SRS. Other systems send queries out to several different sources and use a mediated middle layer to integrate the information (so-called mediated systems such as TAMBIS, DiscoveryLink and BioMoby). Although these systems can provide a great range of data and computational resources, they are sensitive to network problems and data format changes. In addition, such systems run into performance issues when running complex queries over large result sets. Finally, like FlyMine, some systems integrate all the data into one place - a data warehouse (for example, GUS, BioMart, Biozon, BioWarehouse, GIMS, Atlas and Ondex). Our objective was to make a freely available system built on a standard platform using a normalized schema but still allowing warehouse performance. This resulted in the development of InterMine, a generic system that underpins FlyMine. A particular feature of InterMine is the way it makes use of precomputed tables to enhance performance. Another key component is the use of ontologies that provide a standardized system for naming biological entities and their relationships; this aspect is based on the approach taken by the Chado schema. For example, a large part of the FlyMine data model is based on the Sequence Ontology (a controlled vocabulary for describing biological sequences). This underlying architecture is discussed in more detail under 'System architecture'.
Another objective for FlyMine was to provide access to the data for bioinformatics experts as well as bench biologists with limited database (or bioinformatics) knowledge. FlyMine provides three kinds of web-based access. First, the Query Builder provides the most advanced access, allowing the user to construct their own complex queries. Second, a library of 'templates' provides a simple form-filling interface to predefined queries that can perform simple or complex actions. It is very straightforward to convert a query constructed in the Query Builder into a template for future use. Finally, a Quick Search facility allows users to browse the data available for any particular item in the database and, from there, to explore related items. This level of query flexibility combined with a large integrated database provides a powerful tool for researchers.
Below we briefly outline the data sources available in the current release of FlyMine and provide details of how these data can be accessed and queried. This is followed by examples illustrating some of the uses of FlyMine and the advantage of having data integrated into one database. Finally, we describe our future plans, and how to get further help and information.
Summary of data sources available in release 6.0 of FlyMine
D. melanogaster, A. gambiae, D. pseudoobscura, A. mellifera
D. melanogaster, D. pseudoobscura, A. gambiae, A. mellifera, C. elegans, S. cerevisiae
UniProtKB version 8.9
Protein family and domain annotation
D. melanogaster, A. gambiae, C. elegans
InterPro version 12.0
D. melanogaster, C. elegans, S. cerevisiae
Drosophila RNAi screening center
Three-dimensional structural domain predictions
D. melanogaster, A. gambiae, C. elegans, A. mellifera, D. pseudoobscura + others (see )
GO annotation and the Gene Ontology
D. melanogaster, C. elegans, A. gambiae + others (see )
Gene Ontology site
flyreg version 2.0
Transcriptional cis-regulatory modules
Whole genome tiling path
INDAC microarray oligo set
INDAC Version 1.0
P-element insertions and deletions
Human disease to D. melanogaster
Homophila version 2.1
Currently, we can load any data that conform to one of several formats: GFF3 for genome annotation and genomic features (for example, DNase I footprints, microarray oligonucleotides and genome tiling amplimers), PSI-MI [31, 32] for data describing protein interactions or complexes, MAGE [33, 34] for microarray expression data, XML files from the UniProt Knowledgebase (UniProtKB) [35, 36], the OBO flat file format describing the Gene Ontology (GO), and gene association files for GO annotations. In addition, we can also import data from the Ensembl [39, 40], InterPro [41, 42] and DrosDel [43, 44] database schemas to the FlyMine data model, enabling data from these databases to be loaded and updated regularly. Several smaller-scale data sources that currently do not conform to any standard have also been incorporated, such as RNAi data, orthologue data generated by InParanoid [45, 46] and three-dimensional protein structural domain predictions (K Mizuguchi, unpublished).
The number of unique occurrences of some of the main types of objects in FlyMine release 6.0
Number of unique objects in FlyMine
Object details pages
The quick search option provides the simplest way to access FlyMine by allowing users to search all identifiers and synonyms at once (using wild cards if necessary). This takes the user either directly to an object details page (if the search returns just one object) or to a list of objects that match the query, together with links to the corresponding details pages for each object. The quick search option therefore provides a simple way for users to retrieve objects where a name or identifier is known, and to browse all data available for that object and related objects.
New templates can be added at any time: both users and the FlyMine team can derive templates from queries using the web interface, although currently only those from the team are visible to all users. Creation of a template from a query is carried out by completing a short web form in which constraint fields are chosen and labeled, and the template is described. Making and saving templates complements saving queries (see 'Personal FlyMine accounts: MyMine' below) and allows users to build their own library of useful functionality. Templates can be exported and imported as XML, thus facilitating the sharing of templates between users.
The query builder
The query builder is a tool that allows users to navigate the FlyMine data model, choosing what columns of data should be output and applying constraints that will limit the output to that of interest. Internally, the FlyMine database is built on top of an object-based data model that defines the way different types of data are related to each other. This model is made up of classes, each of which describes a particular object-type and its relationship to other objects. Data with similar properties are grouped together into the same class. For example, there are classes to describe sequence features such as genes and exons. Each class has a set of attributes that define the different types of information that can be recorded in each object of that class. In the case of the Gene class, attributes include fields for name, symbol and identifier among others. Each object in the database is a member of a class; for example, the zen gene is an object of the Gene class. The classes are linked together by references that define the relationships between objects in different classes. For example, the Gene class has a reference to the transcript class, which allows a particular gene object to have references to multiple transcript objects.
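The class, attribute and reference structure described above can be pictured with a minimal sketch. This is an illustration only and not FlyMine's or InterMine's actual code: the field names follow the attributes mentioned in the text (name, symbol, identifier), while `run_query`, the identifiers and the transcripts attached to zen are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Transcript:
    identifier: str


@dataclass
class Gene:
    identifier: str
    symbol: str
    name: str = ""
    # Reference from the Gene class to the Transcript class:
    # one gene object can point at several transcript objects.
    transcripts: List[Transcript] = field(default_factory=list)


def run_query(objects: Sequence[Gene],
              constraint: Callable[[Gene], bool],
              output_fields: Sequence[str]) -> List[Tuple]:
    """Keep objects that satisfy the constraint and return the
    chosen attributes as rows of a results table."""
    return [tuple(getattr(obj, f) for f in output_fields)
            for obj in objects if constraint(obj)]


# Placeholder identifiers and transcripts, not real database records.
zen = Gene(identifier="GENE:0001", symbol="zen",
           transcripts=[Transcript("TRANSCRIPT:0001"),
                        Transcript("TRANSCRIPT:0002")])
other = Gene(identifier="GENE:0002", symbol="abc")

# Constrain on the symbol attribute, output symbol and identifier.
rows = run_query([zen, other], lambda g: g.symbol == "zen",
                 ["symbol", "identifier"])
print(rows)
```

Here the lambda plays the role of a constraint on an attribute and the list of field names plays the role of the output columns.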
Importantly, users are also able to configure the results table they would like to view (Figure 7). In the same way that any attribute can be constrained, any attribute can also be added as an output column. Each attribute in the model browser has a 'show' button that can be used to add it to a 'Fields selected for output' list underneath the model browser. Such columns selected for the results table can be removed or their order changed either in the query builder or later, once the results table has been created.
The query builder has the advantage that users are not confined to filling in pre-defined forms that potentially restrict the diversity and complexity of queries that can be constructed. Although this kind of form-filling functionality is in fact provided by template queries, it is always possible to return to the query building page to modify a template. We expect that it will take more effort to learn to use the query builder compared to using template queries, but the reward is greater flexibility.
Often, it is useful to be able to run queries and perform other operations on lists of objects. In addition, the ability to save sets of data from one query for use in another allows complex queries to be built up in stages. In order to enable researchers to constrain their queries to a set of items (which can be any object in the database with a name or identifier), FlyMine includes a facility to collect items into lists. Lists can be generated by selecting columns (or members of columns) from query results, or by loading a list of identifiers directly (by typing, copying and pasting or by file upload). Importantly, lists can be used to constrain both template queries and queries created using the query builder. For example, for a particular list of interesting genes it is possible to run a template query that will return all their GO annotations. In addition, it is possible to perform set operations to combine lists of data in different ways. Currently, it is possible to subtract the contents of lists, or find their union or intersection. Lists of data can be stored in the user's MyMine account (see below) for future use or reference. Some of the applications of lists are described in more detail in the Discussion.
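As a rough illustration of building a query up in stages with lists, the sketch below constrains a template-style query to a user list and then saves one output column as a new list. The annotation table, the gene names and the function `go_terms_for_list` are all hypothetical; the real template queries run against the FlyMine database through the web interface.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical annotation table: gene identifier -> GO terms (placeholders).
GO_ANNOTATIONS: Dict[str, List[str]] = {
    "gene_a": ["GO:term_1", "GO:term_2"],
    "gene_b": ["GO:term_2"],
    "gene_c": ["GO:term_3"],
}


def go_terms_for_list(gene_list: Iterable[str],
                      annotations: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Template-style query: return (gene, GO term) rows,
    constrained to the genes in the supplied list."""
    rows = []
    for gene in sorted(set(gene_list)):
        for term in annotations.get(gene, []):
            rows.append((gene, term))
    return rows


# Stage 1: a list typed or pasted in by the user.
my_genes = ["gene_b", "gene_a", "gene_x"]

# Stage 2: run the query constrained to that list.
go_rows = go_terms_for_list(my_genes, GO_ANNOTATIONS)

# Stage 3: save one column of the results as a new list for later queries.
annotated_genes = sorted({gene for gene, _ in go_rows})
print(go_rows)
print(annotated_genes)
```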
The list upload facility matches the contents of an entered list with objects in the database. Users are asked to select the 'type' of list they wish to create (for example, a 'Gene' list or a 'Protein' list). Items in the list that do not match the 'type' of list to be created, that are found only as synonyms, that are duplicated in the database (for example, the same identifier from two different organisms), or that do not exist at all in the database are reported. Then, before the list is created, the user has the option to add or discard the above non-standard matches. This facility should aid users in producing lists containing the correct objects and will be useful in its own right for checking the contents of large lists.
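The matching categories reported by the list upload facility could be sketched as follows. This is an assumed, simplified model (objects as plain dictionaries with `type`, `organism`, `identifier` and `synonyms` keys, and a hypothetical function `classify_identifiers`), not the facility's real implementation.

```python
from typing import Dict, List


def classify_identifiers(entered: List[str], list_type: str,
                         objects: List[dict]) -> Dict[str, List[str]]:
    """Sort each entered identifier into one reporting category."""
    report: Dict[str, List[str]] = {
        "matched": [], "wrong_type": [], "synonym_only": [],
        "duplicate": [], "not_found": [],
    }
    for ident in entered:
        by_id = [o for o in objects if o["identifier"] == ident]
        typed = [o for o in by_id if o["type"] == list_type]
        by_synonym = [o for o in objects
                      if o["type"] == list_type and ident in o["synonyms"]]
        if len(typed) > 1:
            # Same identifier on several objects, e.g. in two organisms.
            report["duplicate"].append(ident)
        elif len(typed) == 1:
            report["matched"].append(ident)
        elif by_id:
            # Identifier exists, but not on an object of the requested type.
            report["wrong_type"].append(ident)
        elif by_synonym:
            report["synonym_only"].append(ident)
        else:
            report["not_found"].append(ident)
    return report


# Placeholder database contents.
objects = [
    {"type": "Gene", "organism": "org_1", "identifier": "id_1", "synonyms": ["old_1"]},
    {"type": "Gene", "organism": "org_1", "identifier": "id_2", "synonyms": []},
    {"type": "Gene", "organism": "org_2", "identifier": "id_2", "synonyms": []},
    {"type": "Protein", "organism": "org_1", "identifier": "id_3", "synonyms": []},
]

report = classify_identifiers(["id_1", "old_1", "id_2", "id_3", "id_9"],
                              "Gene", objects)
print(report)
```

Only the `matched` entries would go straight into the new list; the other categories correspond to the non-standard matches that the user is asked to add or discard.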
List details pages provide further information on the set of objects in a particular list and also run appropriate template queries automatically on the list contents. List details pages are being further developed to provide additional viewing and analysis tools (see 'Future directions'). For example, a list details page for gene objects currently provides a graphical summary of the expression of the gene set (currently according to the FlyAtlas data). From such a graph it is also possible to create additional lists containing subsets of the data represented by the different graph columns.
Viewing and analysis of results
An essential component of the web interface is the ability to view and further analyze the data generated by a query. Results generated from the query builder or from template queries can be browsed (via the object details pages) or exported, thus allowing for well-defined hypothesis-driven analysis as well as more exploratory data analysis. Results can be exported in either tab-delimited or comma-delimited format, or uploaded directly into Open Office or Microsoft Excel. For objects that have sequences, such as genes and proteins, annotation can be exported in GFF3 format and sequences as FASTA files. We plan to increase the choice of export formats in the future.
In addition to browsing object details pages or result tables and exporting these from the database, FlyMine aims to provide tools for viewing and further analysis of results. The availability of such tools, and the ability to seamlessly upload query results to them, will greatly reduce the time and effort required to find suitable analysis software and re-format data for use with them. So far the GBrowse genome browser [47, 48] has been integrated into the object details pages, allowing users to view a sequence feature (for example, a gene) of interest in the context of its surrounding region and other features. We have also integrated the Jmol three-dimensional structural viewer , enabling users to view the three-dimensional structural domain predictions generated as part of the FlyMine project. To aid viewing of results, a graph of interacting proteins is available on the protein 'object details' page and graphs showing the expression of genes during the D. melanogaster life cycle (from ) are available on the relevant gene object details pages. Additional analysis tools will be added as FlyMine develops (see 'Future directions') and we also intend to increase the range of external data analysis and visualization packages for which one can directly export data.
Personal FlyMine accounts: MyMine
By creating a log-in, users are able to activate a personal FlyMine account called MyMine. MyMine allows users to permanently save lists, queries and templates and mark 'favorite templates'. Saved queries and templates can be run directly, edited or exported (as XML) from a MyMine account. Every query executed (whether a template or from the query builder) is automatically saved to the user's 'query history' with a default name, which can be changed. By default, such queries are maintained only for the duration of a particular session, but can be saved permanently to the user's MyMine account when the user is logged in. Users can also save queries directly to their MyMine account from the query builder. To generate a MyMine account users need to provide only their e-mail address and a password. Finally, queries can be exported and imported as XML, thus providing an alternative mechanism for saving queries between sessions or for sharing queries.
FlyMine is built using InterMine, which was developed as an integral part of the FlyMine project. InterMine is an open source generic system that allows a query-optimized data warehouse and web interface to be quickly built for any data model. The InterMine code is available for download from the InterMine website under the Lesser General Public License. InterMine will be described in more detail elsewhere; a general overview is provided below.
A data model is defined at the object level by an XML file. Java objects, the relational database schema and all model-specific parts of the web application are generated automatically, reducing the maintenance overhead when data model changes are required. Data are loaded as Java objects or XML that conforms to the specified model. Integration of data from multiple sources is configured to define how equivalent objects from different sources should be merged. As different data sources may provide different fields, multiple 'keys' can be defined for a particular type. For example, Gene objects may be merged according to an 'identifier' field or a 'symbol' field. A priority configuration system is used to resolve conflicts between data sources.
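The idea of merging equivalent objects by alternative keys and resolving conflicts by source priority can be sketched as below. This is a simplified illustration under stated assumptions, not InterMine's actual integration code or configuration format: the function `integrate`, the source names and the record layout are all hypothetical.

```python
from typing import Dict, List, Tuple


def integrate(records: List[Tuple[str, dict]], keys: List[str],
              priority: List[str]) -> List[dict]:
    """Merge records from several sources.

    records:  (source name, field dict) pairs, in load order.
    keys:     fields that identify equivalent objects; a match on
              any one of them is enough to merge.
    priority: source names, highest priority first, used when two
              sources disagree on a field value.
    """
    rank = {source: i for i, source in enumerate(priority)}
    merged: List[Dict[str, tuple]] = []  # field -> (value, source)
    for source, rec in records:
        target = None
        for obj in merged:
            if any(k in rec and k in obj and obj[k][0] == rec[k] for k in keys):
                target = obj
                break
        if target is None:
            target = {}
            merged.append(target)
        for fld, value in rec.items():
            # Fill empty fields; overwrite only from a higher-priority source.
            if fld not in target or rank[source] < rank[target[fld][1]]:
                target[fld] = (value, source)
    return [{fld: value for fld, (value, _) in obj.items()} for obj in merged]


records = [
    ("source_a", {"identifier": "G1", "symbol": "abc", "name": "from A"}),
    ("source_b", {"symbol": "abc", "name": "from B", "length": 100}),
]
genes = integrate(records, keys=["identifier", "symbol"],
                  priority=["source_b", "source_a"])
print(genes)
```

In this example the second record carries no identifier, so the two records are merged on the symbol key, and the conflicting name is taken from the higher-priority source.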
InterMine can operate on any data model but we provide an extension specifically for handling biological data, InterMine.bio. This includes a core data model (see below) and a series of 'sources' that include Java code to parse data from a particular data format; for example, UniProtKB, protein interactions and GO annotations each have their own source. The use of data represented by a particular standard facilitates the incorporation of future data into the database. For example, protein interaction data can be represented by the PSI-MI standard [31, 32] and by supporting this standard in InterMine we can easily accommodate future data published in this format. Data can also be loaded from an InterMine XML format, allowing the parsing code to be written in a language other than Java. Each source can add classes and fields to extend the data model if required (for example, the protein interaction source adds a ProteinInteraction class) and defines how the data should be integrated. Construction of an InterMine.bio data warehouse (for example, FlyMine) means configuring which sources should be included and specifying the particular organisms or data files to include. This system reduces the development required to update FlyMine and add new data types. It also makes possible the construction of comparable data warehouses for different organisms and data sets.
The data model
The increasing wealth of data from high-volume biological techniques has driven the development of tools for standardizing the representation of these data to facilitate data comparison, exchange and verification. Huge efforts are currently being put into creating ontologies and other standards to describe different aspects of the data. FlyMine makes use of the Sequence Ontology to define a large part of the data model. This ensures that the relationships between sequence features in the model are biologically meaningful. The Sequence Ontology forms the core of the FlyMine data model, with each term in the ontology becoming a class in the FlyMine data model. A number of additional FlyMine classes allow storing of data, including evidence (the source of the data and publications), experimental and computational results and relationships between objects (for example, orthologues). For data that do not fit the Sequence Ontology (for example, protein interaction data), additional classes can easily be added to the model as appropriate. The FlyMine data model is designed to evolve as new data types need to be represented.
The data sources in FlyMine come from several public-domain databases and sources (Table 1).
Data integration allows connections to be made between otherwise disparate data sources. In FlyMine it is possible to navigate a path between all types of data in the database and thus combine them in different ways. For example, someone primarily interested in protein-based data, such as interactions or structural data, can easily combine these data with gene-based data, such as GO terms. One of the template queries facilitates such an approach, allowing users to find protein interactions for proteins encoded by genes annotated with a particular GO term. Someone interested in certain proteins that are predicted to interact can navigate to the corresponding information on protein domains/families and any structural information that may be available for these domains. In turn, these data can then be related to further gene-based data, such as GO terms annotated to the genes encoding the proteins or human disease-associated homologues. Orthologues allow further extension of such analysis to data from other organisms, in particular, projection of data from data-rich organisms to those that have had their genomes sequenced but are otherwise less studied. As a simple example, GO annotation can be projected to an organism such as D. pseudoobscura for which GO annotations are not yet available: for example, for the D. pseudoobscura gene FBgn0080992, GO annotations from the orthologous genes in D. melanogaster, Caenorhabditis elegans, Mus musculus and S. cerevisiae can be reported. A template query available from the object details page of this gene ('Show GO terms applied to the orthologues of a particular gene') means that this information is very quickly and easily accessible. Below we describe three more examples that illustrate the use of FlyMine: the first describes the use of overlapping genomic features, the second looks at the identification of interologues and finally some applications of 'lists' are described.
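The projection of GO annotation through orthologues amounts to following two links in turn: gene to orthologues, then orthologue to GO terms. A minimal sketch, using placeholder gene names and GO terms and a hypothetical function `project_go` rather than the actual template query:

```python
from typing import Dict, List, Set, Tuple

# Placeholder data: a gene without its own GO annotation and its orthologues.
ORTHOLOGUES: Dict[str, List[str]] = {
    "pseudoobscura_gene_1": ["melanogaster_gene_1", "elegans_gene_1"],
}
GO_TERMS: Dict[str, Set[str]] = {
    "melanogaster_gene_1": {"GO:term_a", "GO:term_b"},
    "elegans_gene_1": {"GO:term_b"},
}


def project_go(gene: str, orthologues: Dict[str, List[str]],
               go_terms: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Report (orthologue, GO term) rows for the orthologues of a gene."""
    rows = set()
    for orthologue in orthologues.get(gene, []):
        for term in go_terms.get(orthologue, ()):
            rows.add((orthologue, term))
    return sorted(rows)


projected = project_go("pseudoobscura_gene_1", ORTHOLOGUES, GO_TERMS)
print(projected)
```

Keeping the orthologue in each row preserves the evidence trail, so a projected term can always be traced back to the organism it came from.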
Overlapping genomic features
Many types of genome data involve the mapping or identification of features in a genome sequence - for example, transcription factor binding sites or transposable element insertion sites. To make such data useful it is often important to know what else has been mapped to, or identified in, that genomic region. Many of the genome features in FlyMine have a reference to a set of overlapping objects, enabling a user to easily retrieve and view anything that overlaps the feature of interest. For any particular chromosome region (for example, a user-defined region, DrosDel deletion or a gene) it is possible to query for features that are mapped to or overlap the region (for example, genes, transcripts, transcription factor binding sites and microarray probes). This allows, for example, the identification of resources that may be available for a particular gene or transcript; for example, P-element insertion sites and any resulting deletions from the DrosDel project [43, 44] or microarray tiling path amplicons. Such queries, starting from either a chromosome location or a DrosDel deletion, are available as template queries and, for instance, were used by the FlyChip Microarray Facility  to identify microarray probes falling within a set of DrosDel deletions, for testing array comparative genomic hybridization (CGH) protocols. Similarly, a template query that returns transcription factor binding sites mapped to a particular genomic region is available. This template also returns the factor that binds to the site (if known) and the gene(s) associated with the site. Representation of these data in GBrowse facilitates visual analysis of such overlapping features. In the future it will also be possible to query for objects that lie within a certain distance upstream or downstream of a particular feature and to distinguish different types of overlap (enclosing, enclosed by, identical, overlapping).
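An overlap query of this kind reduces to an interval comparison on a shared chromosome. The sketch below assumes 1-based inclusive coordinates (as in GFF3) and uses hypothetical feature names; the `overlap_type` helper illustrates the distinctions listed above (enclosing, enclosed by, identical, overlapping) and does not describe existing FlyMine behaviour.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Feature(NamedTuple):
    name: str
    kind: str
    chromosome: str
    start: int  # 1-based, inclusive (assumption)
    end: int    # 1-based, inclusive (assumption)


def overlaps(a: Feature, b: Feature) -> bool:
    """True if both features share a chromosome and at least one base."""
    return a.chromosome == b.chromosome and a.start <= b.end and b.start <= a.end


def overlap_type(region: Feature, feature: Feature) -> Optional[str]:
    """Describe how the region relates to the feature, or None if disjoint."""
    if not overlaps(region, feature):
        return None
    if (region.start, region.end) == (feature.start, feature.end):
        return "identical"
    if region.start <= feature.start and feature.end <= region.end:
        return "enclosing"
    if feature.start <= region.start and region.end <= feature.end:
        return "enclosed by"
    return "overlapping"


def features_in_region(region: Feature, features: List[Feature],
                       kinds: Optional[Set[str]] = None) -> List[Tuple[str, str]]:
    """Return (name, overlap type) for features overlapping the region."""
    return [(f.name, overlap_type(region, f)) for f in features
            if (kinds is None or f.kind in kinds) and overlaps(region, f)]


# Placeholder features on placeholder coordinates.
deletion = Feature("deletion_1", "Deletion", "2L", 1000, 5000)
features = [
    Feature("probe_1", "MicroarrayProbe", "2L", 1200, 1260),
    Feature("gene_1", "Gene", "2L", 4500, 7000),
    Feature("probe_2", "MicroarrayProbe", "2L", 6000, 6060),
    Feature("probe_3", "MicroarrayProbe", "3R", 1200, 1260),
    Feature("tfbs_1", "TFBindingSite", "2L", 900, 6000),
]

probes_in_deletion = features_in_region(deletion, features, {"MicroarrayProbe"})
everything = features_in_region(deletion, features)
print(probes_in_deletion)
print(everything)
```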
Identification of interologues
An example of a more complex query is the identification of interologues (for example, proteins in D. melanogaster that potentially interact, whose orthologues in C. elegans also potentially interact). Since high-throughput two-hybrid protein interaction datasets are prone to a high false positive rate, the identification of interologues leads to increased confidence in particular interactions. For such a query one needs to be able to specify that the orthologues of two proteins that interact also interact with each other. Such a query can be constructed in the FlyMine query builder and is also available as a template query.
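The constraint that the orthologues of two interacting proteins must themselves interact can be written out as a small sketch. The protein names, the orthologue map and the function `find_interologues` are hypothetical, and interactions are treated as unordered pairs; this is not the query that FlyMine actually executes.

```python
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def find_interologues(interactions_a: Iterable[Pair],
                      interactions_b: Iterable[Pair],
                      orthologue_map: Dict[str, Set[str]]) -> List[Tuple[Pair, Pair]]:
    """Return interactions in organism A whose orthologue pair in
    organism B is also reported to interact."""
    # Unordered pairs, so (x, y) and (y, x) count as the same interaction.
    b_pairs = {frozenset(pair) for pair in interactions_b}
    found = []
    for p, q in interactions_a:
        for orth_p in sorted(orthologue_map.get(p, ())):
            for orth_q in sorted(orthologue_map.get(q, ())):
                if frozenset((orth_p, orth_q)) in b_pairs:
                    found.append(((p, q), (orth_p, orth_q)))
    return found


# Placeholder data for two organisms.
fly_interactions = [("fly_p1", "fly_p2"), ("fly_p1", "fly_p3")]
worm_interactions = [("worm_q2", "worm_q1")]
fly_to_worm = {
    "fly_p1": {"worm_q1"},
    "fly_p2": {"worm_q2"},
    "fly_p3": {"worm_q3"},
}

interologues = find_interologues(fly_interactions, worm_interactions, fly_to_worm)
print(interologues)
```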
In addition to enhancing confidence in interaction data through identification of interologues, potential interactions can be transferred to another organism via orthologue mappings. Such transfer of information has, in previous studies, been used to provide detailed potential interaction networks for a number of organisms, for example, human, S. cerevisiae to C. elegans, and between S. cerevisiae, D. melanogaster, C. elegans and Helicobacter pylori. The inclusion in FlyMine of interaction datasets from a number of model organisms (currently D. melanogaster, C. elegans and S. cerevisiae), together with the orthologue predictions between these organisms and other organisms for which large-scale interaction datasets are not yet available, allows FlyMine to be used to infer interactions in organisms without their own protein interaction datasets.
The use of 'lists'
The ability to save lists of objects provides additional power, enabling queries to operate on a particular set of user-defined data. Since many large-scale experiments, such as microarray studies, produce large sets of potentially interesting genes that need to be investigated further, the ability to confine queries to such a set immediately provides researchers with a tool to investigate those genes without having to look each one up in several different databases. In addition, lists allow logical operations such as unions, intersections and subtraction to be performed. For instance, if one wishes to identify all the Anopheles gambiae genes that do not have a predicted orthologue in D. melanogaster, one could create a list of all A. gambiae genes, a second list of A. gambiae genes that have a predicted orthologue in D. melanogaster, and then subtract the second list from the first. This is a very simple three-step analysis, but provides data that can otherwise be difficult to create. Similarly, to find all of the D. melanogaster genes that have orthologues in both A. gambiae and Apis mellifera, one could create a list of D. melanogaster genes with orthologues in A. gambiae and a list of D. melanogaster genes with orthologues in A. mellifera, and then find the intersection of the two lists. In general, the provision of lists means that more complicated queries can be built up in stages, with the output at each stage available for close examination, validation and manual pruning.
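The list operations map directly onto set subtraction, intersection and union. A minimal sketch with placeholder gene names (none of these are real identifiers, and the variable names are illustrative only):

```python
# Genes lacking an orthologue: subtract the genes that have one
# from the full gene list.
all_gambiae_genes = {"ag_1", "ag_2", "ag_3", "ag_4"}
gambiae_with_melanogaster_orthologue = {"ag_1", "ag_2"}
gambiae_without_orthologue = (
    all_gambiae_genes - gambiae_with_melanogaster_orthologue
)

# Genes with orthologues in both other species: intersect two lists.
melanogaster_with_gambiae_orthologue = {"dm_1", "dm_2", "dm_3"}
melanogaster_with_mellifera_orthologue = {"dm_2", "dm_3", "dm_4"}
conserved_in_both = (
    melanogaster_with_gambiae_orthologue
    & melanogaster_with_mellifera_orthologue
)

# Genes with an orthologue in at least one of the two species: union.
conserved_in_either = (
    melanogaster_with_gambiae_orthologue
    | melanogaster_with_mellifera_orthologue
)

print(sorted(gambiae_without_orthologue))
print(sorted(conserved_in_both))
print(sorted(conserved_in_either))
```

Each intermediate set corresponds to a saved list, so it can be inspected or pruned by hand before the next operation is applied.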
Lists also have an application in the comparison of entire data sources. The benefit of combining data sources through their union or intersection depends on the nature of the two datasets being combined. Datasets that have a low false positive but high false negative rate can be combined via a union to increase the overall coverage of positives. Alternatively, for datasets that have high false positive and low false negative rates, analysis of their intersection may enrich for the most reliable data - that is, the subset of the two datasets most likely to be true positives. Each 'aspect' allows easy access to all the data from a particular data source, making creation of specific lists and their comparison a straightforward task.
FlyMine is still in a phase of rapid development. The system is engineered to accommodate additional types of data and we aim to add new functional genomics data sources and increase coverage to further insects as well as other model organisms as data become available. The inclusion of data from other Drosophila species and other insects (for example, the silkworm, Bombyx mori) will allow interesting cross-species comparisons to be performed. Apart from the two already covered by FlyMine, ten other Drosophila species have now been sequenced and assembled: we will load annotation and comparative genomics data for these along with tools to aid in their analysis and visualization. We also plan to add many small data sets so they may be queried and viewed in the context of existing large-scale data.
As well as further development of the web interface, further data sources, templates and tools will be added. In the latter case we are interested in adding both tools to improve data visualization as well as those to allow data mining, and input from user communities is welcomed. A current focus of activity is on increasing the functionality available through list details pages: further viewing and analysis tools will be added. For example, for a list of genes, we will provide a visual representation of the chromosomal locations of the genes and provide information on potential commonalities of genes within the list by looking for statistically enriched use of GO terms within the set. We will increase the number of graphical summaries of object sets, and, as for the FlyAtlas data summary graph, will allow lists containing further subsets of the data to be generated by clicking on different columns of the graph. In addition to tools available on list details pages, other tools will be included, such as graphical viewing of interaction data with the ability to overlay other data sets (for example, expression data and GO annotations). While it will not always make sense to integrate tools closely with FlyMine, it is easy to generate different tabular data formats, and we aim to make it as easy as possible to export data for use in other applications, or to render sets of objects (for example, genes) through other web resources such as KEGG  and Reactome . In this way we hope the utility of FlyMine to browse, query, analyze and visualize diverse integrated datasets will increase. To increase access between FlyMine and other resources we also plan to add support for querying FlyMine via web services.
Availability, contact and help
The FlyMine query interface can be accessed from the FlyMine website. From there, help is available in the form of tutorials and a user manual. The help pages are under continued development. In addition, a feedback form available from the query pages can be used to ask for help with queries or to send us comments or suggestions. The form automatically sends us the query currently being worked on, making it easier to give an accurate response. The FlyMine team includes biologists experienced in using the web interface who will respond to help requests from users and add new templates as required. Further information is available by joining one of the FlyMine electronic mailing lists (details on the website) or by email to firstname.lastname@example.org. Comments and suggestions for improvements, new functionality and additional data sources are welcome.
FlyMine is a new source of integrated data that allows researchers to make use of the huge amounts of high-throughput data currently being generated. The above examples provide a few illustrations of the way data can be manipulated in FlyMine. The number of possible combinations of data is large and will continue to grow and become more comprehensive as new and different types of data are added. The structure of FlyMine means researchers can rapidly accumulate a wealth of information about a particular object or set of objects, facilitating the formulation of new hypotheses for refining subsequent investigations. In addition to refinement and extension of smaller scale investigations, FlyMine can also facilitate whole genome approaches by allowing the investigation of networks and interactions among genome-wide datasets. The addition of graphical viewing and analysis tools and further data export options will greatly improve the ability to analyze data at this level.
FlyMine is funded by a Wellcome Trust grant (number 067205) awarded to M Ashburner, G Micklem, S Russell, K Lilley and K Mizuguchi. We would like to thank the FlyMine Advisory Board for their valuable input and support for the project.
- Ge H, Walhout AJM, Vidal M: Integrating 'omic' information: a bridge between genomics and systems biology. Trends Genet. 2003, 19: 551-560. 10.1016/j.tig.2003.08.009.
- Gerstein M, Lan N, Jansen R: Integrating interactomes. Science. 2002, 295: 284-287. 10.1126/science.1068664.
- Ge H, Liu Z, Church GM, Vidal M: Correlation between transcriptome mapping data from Saccharomyces cerevisiae. Nat Genet. 2001, 29: 482-486. 10.1038/ng776.
- Grigoriev A: A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res. 2001, 29: 3513-3519. 10.1093/nar/29.17.3513.
- Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res. 2002, 12: 37-46. 10.1101/gr.205602.
- Kemmeren P, van Berkum NL, Vilo J, Bijma T, Donders R, Brazma A, Holstege FCP: Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol Cell. 2002, 9: 1133-1143. 10.1016/S1097-2765(02)00531-2.
- Boulton SJ, Gartner A, Reboul J, Vaglio P, Dyson N, Hill DE, Vidal M: Combined functional genomic maps of the C. elegans DNA damage response. Science. 2002, 295: 127-131. 10.1126/science.1065986.
- Marcotte EM, Pellegrini M, Thompson M, Yeates TO, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature. 1999, 402: 83-86. 10.1038/47048.
- Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram S, Ghosh D, Pandey A, Chinnaiyan AM: Probabilistic model of the human protein-protein interaction network. Nat Biotechnol. 2005, 23: 951-959. 10.1038/nbt1103.
- Jansen R, Lan N, Qian J, Gerstein M: Integration of genomic datasets to predict protein complexes in yeast. J Struct Funct Genomics. 2002, 2: 71-81. 10.1023/A:1020495201615.
- Walhout AJM, Reboul J, Shtanko O, Bertin N, Vaglio P, Ge H, Lee H, Doucette-Stamm L, Gunsalus KC, Schetter AJ, et al: Integrating interactome, phenome, and transcriptome mapping data for the C. elegans germline. Curr Biol. 2002, 12: 1952-1958. 10.1016/S0960-9822(02)01279-4.
- Hazbun TR, Malmstrom L, Anderson S, Graczyk J, Fox B, Riffle M, Sundin BA, Aranda DJ, McDonald WH, Chui CH, et al: Assigning function to yeast proteins by integration of technologies. Mol Cell. 2003, 12: 1353-1365. 10.1016/S1097-2765(03)00476-3.
- Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M: Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res. 2001, 11: 2120-2126. 10.1101/gr.205301.
- Wong L: Technologies for integrating biological data. Briefings Bioinformatics. 2002, 3: 389-404. 10.1093/bib/3.4.389.
- Stein LD: Integrating biological databases. Nat Rev Genet. 2003, 4: 337-345. 10.1038/nrg1065.
- Etzold T, Argos P: SRS - an indexing and retrieval tool for flat file data libraries. Comput Appl Biosci. 1993, 9: 49-57.
- Baker PG, Brass A, Bechhofer S, Goble CA, Paton NW, Stevens R: TAMBIS: Transparent access to multiple bioinformatics information sources. Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology: June 28-July 1 1998; Montreal, Canada. Edited by: Glasgow J, Littlejohn T, Major F, Lathrop R, Sankoff D, Sensen C. 1998, AAAI Press, United States, 25-34.
- Haas L, Schwarz P, Kodali P, Rice JE, Schwarz PM, Swope WC: DiscoveryLink: a system for integrated access to life sciences data sources. IBM Syst J. 2001, 40: 489-511.
- Wilkinson MD, Links M: BioMoby: An open source biological web services proposal. Briefings Bioinformatics. 2002, 3: 331-341. 10.1093/bib/3.4.331.
- Davidson SB, Crabtree J, Brunk BP, Schug J, Tannen V, Overton GC, Stoeckert CJ: K2/Kleisli and GUS: experiments in integrated access to genomic data sources. IBM Syst J. 2001, 40: 512-530.
- Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 2004, 14: 160-169. 10.1101/gr.1645104.
- Birkland A, Yona G: BIOZON: a system for unification, management and analysis of heterogeneous biological data. BMC Bioinformatics. 2006, 7: 70. 10.1186/1471-2105-7-70.
- Lee TJ, Pouliot Y, Wagner V, Gupta P, Stringer-Calvert DW, Tenenbaum JD, Karp PD: BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics. 2006, 7: 170. 10.1186/1471-2105-7-170.
- Cornell M, Paton NW, Hedeler C, Kirby P, Delneri D, Hayes A, Oliver SG: GIMS: an integrated data storage and analysis environment for genomic and functional data. Yeast. 2003, 20: 1291-1306. 10.1002/yea.1047.
- Shah SP, Huang Y, Xu T, Yuen MM, Ling J, Ouellette BF: Atlas - a data warehouse for integrative bioinformatics. BMC Bioinformatics. 2005, 6: 34. 10.1186/1471-2105-6-34.
- Köhler J, Baumbach J, Taubert J, Specht M, Skusa A, Rüegg A, Rawlings C, Verrier P, Philippi P: Graph-based analysis and visualization of experimental results with ONDEX. Bioinformatics. 2006, 22: 1383-1390. 10.1093/bioinformatics/btl081.
- InterMine. [http://www.intermine.org]
- Mungall CJ, Emmert DB, The FlyBase Consortium: A Chado case study: an ontology-based modular schema for representing genome-associated biological information. Bioinformatics. 2007.
- Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M: The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 2005, 6: R44. 10.1186/gb-2005-6-5-r44.
- The gff3 Format. [http://www.sequenceontology.org/gff3.shtml]
- The Protein Standards Initiative (PSI). [http://www.psidev.info/]
- Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S, Orchard S, Sarkans U, Von Mering C, et al: The HUPO PSI's Molecular Interaction Format - a community standard for the representation of protein interaction data. Nat Biotechnol. 2004, 22: 177-183. 10.1038/nbt926.
- Microarray and Gene Expression Group (MAGE). [http://www.mged.org/Workgroups/MAGE/mage.html]
- Spellman P, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball C, Lepage M, et al: Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol. 2002, 3: 00461-00469. 10.1186/gb-2002-3-9-research0046.
- UniProt Knowledgebase. [http://www.uniprot.org]
- Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006, 34: D187-D191. 10.1093/nar/gkj161.
- The OBO Flat File Format. [http://www.geneontology.org/GO.format.obo-1_2.shtml]
- GO annotation file format. [http://www.geneontology.org/GO.format.annotation.shtml]
- The Ensembl Database. [http://www.ensembl.org]
- Birney E, Andrews D, Caccamo M, Chen Y, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, et al: Ensembl 2006. Nucleic Acids Res. 2006, 34: D556-D561. 10.1093/nar/gkj133.
- Interpro. [http://www.ebi.ac.uk/interpro]
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, et al: InterPro, progress and status in 2005. Nucleic Acids Res. 2005, 33: D201-D205. 10.1093/nar/gki106.
- DrosDel. [http://www.drosdel.org.uk]
- Ryder E, Blows F, Ashburner M, Bautista-Llacer R, Coulson D, Drummond J, Webster J, Gubb D, Gunton N, Johnson G, et al: The DrosDel Collection: a set of P-element insertions for generating custom chromosomal aberrations in Drosophila melanogaster. Genetics. 2004, 167: 797-813. 10.1534/genetics.104.026658.
- Inparanoid. [http://inparanoid.sbc.su.se]
- O'Brien KP, Remm M, Sonnhammer ELL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 2005, 33: D476-D480. 10.1093/nar/gki107.
- The GBrowse Genome Browser. [http://www.gmod.org/wiki/index.php/Gbrowse]
- Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S: The generic genome browser: a building block for a model organism system database. Genome Res. 2002, 12: 1599-1610. 10.1101/gr.403602.
- The Jmol Structural Viewer. [http://jmol.sourceforge.net]
- FlyBase. [http://www.flybase.org]
- Drysdale RA, Crosby MA, The FlyBase Consortium: FlyBase: genes and gene models. Nucleic Acids Res. 2005, 33: D390-D395. 10.1093/nar/gki046.
- Chintapalli VR, Wang J, Dow JAT: Using FlyAtlas to identify better Drosophila models of human disease. Nat Genet. 2007, 39: 715-720. 10.1038/ng2049.
- Open Office. [http://www.openoffice.org]
- Arbeitman MN, Furlong EEM, Imam F, Johnson E, Null BH, Baker BS, Krasnow MA, Scott MP, Davis RW, White KP: Gene expression during the life cycle of Drosophila melanogaster. Science. 2002, 297: 2270-2275. 10.1126/science.1072152.
- GNU Library General Public License. [http://www.gnu.org/copyleft/library.html]
- The Apache Software Foundation. [http://www.apache.org]
- PostgreSQL. [http://www.postgresql.org]
- FlyChip. [http://www.flychip.org.uk]
- Lehner B, Fraser AG: A first-draft human protein-interaction map. Genome Biol. 2004, 5: R63. 10.1186/gb-2004-5-9-r63.
- Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y, Han JJ, Bertin N, Chung S, Vidal M, Gerstein M: Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res. 2004, 14: 1107-1118. 10.1101/gr.1774904.
- AAA Assembly/Alignment/Annotation of 12 Related Drosophila Species. [http://rana.lbl.gov/drosophila]
- Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28: 27-30. 10.1093/nar/28.1.27.
- Joshi-Tope G, Gillespie M, Vastrik I, D'Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, et al: Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 2005, 33 (Database issue): D428-432.
- FlyMine. [http://www.flymine.org]
- IntAct. [http://www.ebi.ac.uk/intact]
- Giot L, Bader JS, Brouwer A, Chaudhuri A, Kuang B, Li Y, Hao C, Ooi B, Godwin E, Vitols G, et al: A protein interaction map of Drosophila melanogaster. Science. 2003, 302: 1727-1736. 10.1126/science.1090289.
- Stanyon CA, Guozhen L, Mangiola BA, Patel N, Giot L, Kuang B, Zhang H, Zhong J, Finley RL: A Drosophila protein-interaction map centered on cell-cycle regulators. Genome Biol. 2004, 5: R96. 10.1186/gb-2004-5-12-r96.
- Formstecher E, Aresta S, Collura V, Hamburger A, Meil A, Trehin A, Reverdy C, Betin V, Maire S, Brun C, et al: Protein interaction mapping: a Drosophila case study. Genome Res. 2005, 15: 376-384. 10.1101/gr.2659105.
- Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, et al: A map of the interaction network of the metazoan C. elegans. Science. 2004, 303: 540-543. 10.1126/science.1091403.
- Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000, 403: 623-627. 10.1038/35001009.
- Ito T, Tashiro K, Muta S, Ozawa R, Chiba T, Nishizawa M, Yamamoto K, Kuhara S, Sakaki Y: Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc Natl Acad Sci USA. 2000, 97: 1143-1147. 10.1073/pnas.97.3.1143.
- Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA. 2001, 98: 4569-4574. 10.1073/pnas.061034498.
- Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002, 415: 180-183. 10.1038/415180a.
- Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al: Functional organisation of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415: 141-147. 10.1038/415141a.
- WormBase. [http://www.wormbase.org]
- Kamath RS, Fraser AG, Dong Y, Poulin G, Durbin R, Gotta M, Kanapin A, Le Bot N, Moreno S, Sohrmann M, et al: Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature. 2003, 421: 231-237. 10.1038/nature01278.
- Fraser AG, Kamath RS, Zipperlen P, Martinez-Campos M, Sohrmann M, Ahringer J: Functional genomic analysis of C. elegans chromosome 1 by systematic RNA interference. Nature. 2000, 408: 325-330. 10.1038/35042517.
- Simmer F, Moorman C, Van der Linden AM, Kuijk E, van den Berghe PVE, Kamath RS, Fraser AG, Ahringer J, Plasterk RHA: Genome-wide RNAi of C. elegans using the hypersensitive rrf-3 strain reveals novel gene functions. PLoS Biol. 2003, 1: 77-84. 10.1371/journal.pbio.0000012.
- Agaisse H, Burrack LS, Philips JA, Rubin EJ, Perrimon N, Higgins DE: Genome-wide RNAi screen for host factors required for intracellular bacterial infection. Science. 2005, 309: 1248-1251. 10.1126/science.1116008.
- Baeg GH, Zhou R, Perrimon N: Genome-wide RNAi analysis of JAK/STAT signaling components in Drosophila. Genes Dev. 2005, 19: 1861-1870. 10.1101/gad.1320705.
- Boutros M, Kiger AA, Armknecht S, Kerr K, Hild M, Koch B, Haas SA, Paro R, Perrimon N, Heidelberg Fly Array Consortium: Genome-wide RNAi analysis of growth and viability in Drosophila cells. Science. 2004, 303: 832-835. 10.1126/science.1091266.
- DasGupta R, Kaykas A, Moon RT, Perrimon N: Functional genomic analysis of the Wnt-wingless signaling pathway. Science. 2005, 308: 826-833. 10.1126/science.1109374.
- Eggert US, Kiger AA, Richter C, Perlman ZE, Perrimon N, Mitchison TJ, Field CM: Parallel chemical genetic and genome-wide RNAi screens identify cytokinesis inhibitors and targets. PLoS Biol. 2004, 2: e379. 10.1371/journal.pbio.0020379.
- Philips JA, Rubin EJ, Perrimon N: Drosophila RNAi screen reveals CD36 family member required for mycobacterial infection. Science. 2005, 309: 1251-1253. 10.1126/science.1116006.
- Vig M, Peinelt C, Beck A, Koomoa DL, Rabah D, Koblan-Huberson M, Kraft S, Turner H, Fleig A, Penner R, Kinet JP: CRACM1 is a plasma membrane protein essential for store-operated Ca2+ entry. Science. 2006, 312: 1220-1223. 10.1126/science.1127883.
- ArrayExpress. [http://www.ebi.ac.uk/arrayexpress]
- The Gene Ontology Consortium. [http://www.geneontology.org]
- The Gene Ontology Consortium: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- Drosophila DNase I Footprint Database. [http://www.flyreg.org]
- Bergman CM, Carlson JW, Celniker SE: Drosophila DNase I footprint database: a systematic genome annotation of transcription factor binding sites in the fruitfly, Drosophila melanogaster. Bioinformatics. 2005, 21: 1747-1749. 10.1093/bioinformatics/bti173.
- Gallo SM, Li L, Hu Z, Halfon MS: REDfly: a regulatory element database for Drosophila. Bioinformatics. 2006, 22: 381-383. 10.1093/bioinformatics/bti794.
- The International Drosophila Array Consortium (INDAC). [http://www.indac.net]
- Homophila. [http://superfly.ucsd.edu/homophila]
- Chien S, Reiter LT, Bier E, Gribskov M: Homophila: human disease gene cognates in Drosophila. Nucleic Acids Res. 2002, 30: 149-151. 10.1093/nar/30.1.149.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.