DiscoverySpace: an interactive data analysis application
© Robertson et al.; licensee BioMed Central Ltd. 2007
Received: 24 March 2006
Accepted: 8 January 2007
Published: 08 January 2007
DiscoverySpace is a graphical application for bioinformatics data analysis. Users can seamlessly traverse references between biological databases and draw together annotations in an intuitive tabular interface. Datasets can be compared using a suite of novel tools to aid in the identification of significant patterns. DiscoverySpace is of broad utility and its particular strength is in the analysis of serial analysis of gene expression (SAGE) data. The application is freely available online.
Table 1. Discovery data sources and their update frequency
Data source | Update frequency (days)*
CGAP (SAGE)
Ensembl (human and mouse)
Gene Expression Omnibus (SAGE)
Gene Ontology
Taxonomy (NCBI)
DiscoverySpace was developed to support serial analysis of gene expression (SAGE) technologies, and throughout the paper we illustrate the features of the application with scenarios from example SAGE analyses. Other examples are provided to show how DiscoverySpace is applicable to a wider range of bioinformatics use cases.
The paper does not focus on the details of the low-level implementation, but instead describes the approach, the architecture of the application, conceptual underpinning and use of key technologies such as the Resource Description Framework (RDF). We introduce the various user interfaces of DiscoverySpace, explain the functionalities made available, and, where possible, contrast it with other available tools. We show that DiscoverySpace offers an innovative and extensible example of a graphical bioinformatics environment. The application and code are freely available to academic researchers.
Biological database integration
Bioinformatics is a data-driven discipline in which the available data sources dictate the scope of possible research. Biological data are dynamic; new databases are constantly being created, and existing databases are constantly updated and extended. It remains a challenge to integrate the data and analyze them in an effective manner.
The problem of integrating biological databases is well known. Our approach has been to centralize all data into a relational database where they can be shared and readily accessed. A drawback of this 'data warehousing' method is the ongoing need to maintain the database and develop data import tools, though many groups, including this one, have successfully managed to sustain such an effort over time [5, 6].
A key feature of the 'data warehousing' method is that it concentrates all of the data at a single physical location. This allows complex and highly optimized queries to be run at the site of data storage, with resulting gains in efficiency and performance. The alternative, a more distributed 'federated' solution, draws data from a number of remote servers before processing and returning the result [7, 8]. Federated systems amalgamate content from multiple data warehouses, thereby permitting the organizational independence of each data provider. Distributed systems are still an emerging technology, with rapidly evolving standards and best practices. We chose to concentrate our efforts on utilizing the capabilities of one database, leaving the challenge of supporting multiple databases to a later stage of development.
The DiscoveryDB database
The DiscoveryDB database supports 26 biological databases, including Ensembl, Gene Ontology (GO), RefSeq, Entrez, Mammalian Gene Collection (MGC) and UniProt (Table 1). The database also hosts data generated by the Genome Sciences Centre (GSC), such as the results of SAGE experiments.
At present, many biological data providers do not publish their data in a database-compatible tabular format, and require specialized analysis and parsing to prepare them for import into a relational database. Proprietary flat-file formats, such as those used by the UniProt and GenBank databases, centralize all of an entity's data into a single document-like record, and are well suited to access by UNIX command line tools and scripting languages. Unfortunately, such proprietary formats make efficient mass analysis using relational databases much more difficult. Recently, many data providers, such as Entrez, GO and Ensembl, have begun to publish data files in a tabular, tab-separated format. Such files are optimal because they can be directly imported into a database with little, or no, additional processing. Such files are also easily accessible via traditional UNIX tools.
The DiscoveryDB database is housed in a MySQL database server (presently being upgraded to PostgreSQL) that supplies all of the data content for the DiscoverySpace application. Because data sources are frequently updated, we have developed software to automatically download and import data files in a series of regular update cycles. Data files are parsed, if necessary, using dedicated parsing tools and then imported into the central database system.
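To illustrate why tab-separated files need so little preparation before import, the following is a minimal sketch in Python using only the standard csv and sqlite3 modules. SQLite stands in for the MySQL server, and the table name, column names and sample rows are hypothetical; they are not taken from the DiscoveryDB schema or its import tools.

```python
import csv
import io
import sqlite3

# Hypothetical tab-separated export: one gene-to-GO association per line.
# Column names and rows are illustrative only.
TSV_DATA = "gene_id\tgo_term\n1017\tGO:0004672\n1017\tGO:0005524\n7157\tGO:0003677\n"


def import_tsv(conn, tsv_text):
    """Load a tab-separated file with a header row into a table; return the row count."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)  # the first line names the columns
    if header != ["gene_id", "go_term"]:
        raise ValueError("unexpected columns: %r" % (header,))
    conn.execute("CREATE TABLE gene_go (gene_id TEXT, go_term TEXT)")
    rows = [tuple(row) for row in reader if row]  # skip blank lines
    conn.executemany("INSERT INTO gene_go VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the central database
imported = import_tsv(conn, TSV_DATA)
go_terms = [row[0] for row in conn.execute(
    "SELECT go_term FROM gene_go WHERE gene_id = ? ORDER BY go_term", ("1017",))]
```

A flat-file record format would need a dedicated parser in place of the csv reader before the same insert step could run.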
Accessing the data
Once the various data sources have been imported into DiscoveryDB's central relational database, researchers need a means to access the data. While SQL provides a powerful interface to the database, gaining full command of the SQL language can be challenging and time-consuming for those not trained as programmers.
The most rudimentary method to promote data access is to provide a list of documented, 'pre-canned' SQL queries; a researcher can adapt a query to suit their needs and then execute it in a script or database client. The GO database provides such example queries. This solution does require a degree of technical confidence from the researcher, but requires little development. It has the disadvantage that the researcher needs to rework all their queries when the data structure changes.
An alternative is to develop tools that wrap the database query with another interface, such as a web interface or API (application programming interface). Web interfaces typically provide a form to capture parameters, and produce a chart or other report given those parameters; DAVID and FatiGO are examples of web interfaces. For the more programming-literate researcher, some biological databases provide APIs. These APIs wrap SQL calls in programming interfaces and save the researcher from having to analyze the data model and code the SQL themselves; the Ensembl database and GO database provide such APIs. APIs assume a level of comfort with the given programming language.
Most tools are narrowly focused and, depending upon the sophistication of the implementation, restrict the user to a finite number of specific questions: for instance, 'get the RefSeq accessions for these GenBank accessions', or 'get the GO terms for these genes at level 4', and so on. In such instances the interface and underlying query are dedicated to one particular usage, so the researcher does not have free rein over the data but is restricted to those functionalities that the developer exposes. For more complex tasks the researcher will need to learn and integrate multiple interfaces into a single methodology.
Because of the dynamic nature of the available data, and because of the rapidity with which researchers alter their methodologies, it is a challenge for developers to keep tools current and relevant. This is particularly acute in the case of API development where multiple programming languages are supported, as is the case with the SeqHound and Atlas projects. The developer must struggle to anticipate future analyses, as well as maintain the existing functionality.
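The difference between a 'pre-canned' query and an API wrapper can be sketched as follows. This is an illustration under stated assumptions: the two-table schema, the sample rows and the function name go_terms_for_symbol are hypothetical, and do not reproduce the GO or Ensembl data models or their APIs.

```python
import sqlite3

# Hypothetical, simplified schema; not the real GO or Ensembl data model.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id TEXT PRIMARY KEY, symbol TEXT);
    CREATE TABLE gene_go (gene_id TEXT, go_term TEXT);
    INSERT INTO gene VALUES ('1017', 'CDK2'), ('7157', 'TP53');
    INSERT INTO gene_go VALUES ('1017', 'GO:0004672'),
                               ('7157', 'GO:0003677'),
                               ('7157', 'GO:0006915');
""")

# Approach 1: a documented 'pre-canned' query that the researcher adapts by hand.
# It must be reworked whenever the tables or columns change.
CANNED_QUERY = """
    SELECT gg.go_term
    FROM gene g JOIN gene_go gg ON gg.gene_id = g.gene_id
    WHERE g.symbol = ?
    ORDER BY gg.go_term
"""


# Approach 2: an API function that hides the SQL and the table layout.
# Callers depend only on the function signature, not on the schema.
def go_terms_for_symbol(connection, symbol):
    """Return the GO terms annotated to a gene symbol (empty list if none)."""
    return [row[0] for row in connection.execute(CANNED_QUERY, (symbol,))]


tp53_terms = go_terms_for_symbol(conn, "TP53")
```

In both cases the question that can be asked is fixed in advance, which is the limitation discussed in the next paragraph.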
The strategy of the DiscoverySpace project has been to develop a comprehensive graphical interface that supports all possible data models with only minimal configuration on the part of the database administrator. We have aimed to create an application that allows the researcher to explore the available knowledge domain freely with a limited amount of training, to expose the content and power of the underlying database while abstracting away its low-level complexity.
We decided to develop a graphical standalone application rather than a browser-based application. Standalone applications are more difficult to develop, but permit a richer user experience as there is more scope for customization. Standalone applications can also make full use of the features of the client computer, rather than offloading all work to the server (which is a shared resource). Throughout the application we have used familiar interactive devices that enhance user productivity, such as 'drag and drop' functionality. 'Drag and drop' is used to exchange data between DiscoverySpace's various internal tools; throughout the application it is possible to define a dataset in one tool, then drag it out and drop it onto another tool. We have also consistently provided features that promote interoperability with external applications, such as 'cut and paste'.
The DiscoverySpace architecture
Both client and server-side components are written in the Java programming language. The main strengths of Java are that it is object-oriented, platform independent, and offers a wealth of well-designed APIs. The middleware component is a Java servlet and is deployed in the Apache Tomcat reference servlet container. The client is distributed using Java Web Start technology, which integrates with the user's desktop and updates the application automatically as newer versions are released.
The middleware layer decouples the client and the database so that database drivers do not need to be deployed with the standalone client; the underlying database implementation can be changed without needing to re-release the client software. This decoupling is particularly vital when considering that future versions of DiscoverySpace may progress to a federated architecture with many servers per client, each of which might use a database from a different vendor. Future versions would also benefit from a server discovery protocol that would enable the client to find and identify available DiscoverySpace servers.
As each DiscoverySpace client starts up, it contacts its configured server and retrieves a schema describing the available data content. The client then communicates with the server using DiscoverySpace's custom protocol to query and download data. The protocol, which uses RDF/XML in the request and tab-separated data in the response, is designed and optimized specifically for DiscoverySpace interactions. Each request is authenticated using the user's name and password, and the server has the ability to restrict data types and to filter content based upon the user's permissions. This means that confidential or sensitive information can be limited to specific collaborators.
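The response half of such an exchange, together with permission-based filtering, might look like the sketch below. The actual DiscoverySpace protocol is not specified here, so everything in the example is an assumption: the record fields, the permission table and the function names are hypothetical, and the RDF/XML request and password authentication are omitted.

```python
import csv
import io

# Hypothetical records held by the server; 'restricted' marks confidential rows.
RECORDS = [
    {"library": "LIB_A", "tag_count": 51234, "restricted": False},
    {"library": "LIB_B", "tag_count": 48790, "restricted": True},
]
# Hypothetical permission table: which users may see restricted rows.
CAN_SEE_RESTRICTED = {"alice": True, "bob": False}


def build_response(user):
    """Server side: filter rows by the user's permissions, emit tab-separated text."""
    allowed = CAN_SEE_RESTRICTED.get(user, False)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["library", "tag_count"])  # header row
    for rec in RECORDS:
        if rec["restricted"] and not allowed:
            continue  # confidential data is withheld from this user
        writer.writerow([rec["library"], rec["tag_count"]])
    return out.getvalue()


def parse_response(text):
    """Client side: turn the tab-separated response into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


rows_for_bob = parse_response(build_response("bob"))
rows_for_alice = parse_response(build_response("alice"))
```

Because filtering happens before the response is serialized, a client never receives rows that its user is not permitted to see.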
The DiscoverySpace data model
A data model is an abstract framework for data representation that determines how data are conceptualized and understood. A data model acts as a common definition of terms for both the user and the developer, and needs to offer broad descriptive power and extensibility, while remaining simple and intuitive. Like the basic architecture, the data model is fundamental and determines the capabilities of the application; finding the correct model is vital.
Many groups have used ontologies, or controlled vocabularies, to describe biological knowledge domains: for example the GO and Sequence Ontology projects. Models with ontological support are advantageous because they help to describe the semantics of the data rather than merely the syntax. While SQL is extremely good at defining the format of data, it is poor at describing meaning. If data are properly annotated with rich ontological meta-information, in addition to their syntactic constraints, then they are truly self-describing.
Prototypes of DiscoverySpace used an ontological data model provided by the KDOM API. However, in this latest iteration we have adopted the Jena API, which provides full support for the Resource Description Framework (RDF) and its associated ontology languages (DAML+OIL, OWL). RDF is a widely used metadata language and is the foundation of other bioinformatics projects such as BioMOBY. By annotating relational data with RDF metadata, data integration occurs at the semantic level, not the syntactic level.
RDF conceptualizes data as graphs of atomic and compound nodes connected by edges known as predicates, or properties. RDF graphs are formally described using statement-like structures called triples, each of which comprises a subject, a predicate and an object. An example triple would be 'gene NM_032983 translates to protein NP_116765', where the gene and protein are subject and object, respectively, and 'translates to' is the predicate. Compound nodes, termed resources, may be both the subject and object of a triple. Atomic nodes, or literals, can only be the object. RDF mandates that globally accessible resources should have a World Wide Web-friendly uniform resource identifier (URI). DiscoverySpace adopts a specialized form of URI designed for the biological knowledge domain: Life Science Identifiers.
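The triple structure described above can be sketched in a few lines of plain Python. This is a toy model for illustration, not the Jena API; the class and function names are hypothetical, and the identifiers follow the general LSID pattern urn:lsid:authority:namespace:object with a made-up authority and namespace.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Resource:
    """A compound node, identified by a URI."""
    uri: str


@dataclass(frozen=True)
class Literal:
    """An atomic node holding a plain value."""
    value: str


@dataclass(frozen=True)
class Triple:
    subject: Resource
    predicate: str
    obj: Union[Resource, Literal]


def make_triple(subject, predicate, obj):
    """Build a triple, enforcing that a literal can only be the object."""
    if not isinstance(subject, Resource):
        raise TypeError("the subject of a triple must be a resource")
    return Triple(subject, predicate, obj)


# The authority 'example.org' and namespace 'refseq' are assumptions for this sketch.
gene = Resource("urn:lsid:example.org:refseq:NM_032983")
protein = Resource("urn:lsid:example.org:refseq:NP_116765")

graph = [
    make_triple(gene, "translatesTo", protein),               # resource as object
    make_triple(protein, "accession", Literal("NP_116765")),  # literal as object
]


def objects_of(triples, subject, predicate):
    """Follow one kind of edge out of a node."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]
```

Here the protein node is the object of the first triple and the subject of the second, which is what allows annotation paths to be chained.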
Supporting SAGE analysis
The features of DiscoverySpace are illustrated through SAGE analysis use cases; therefore, it is necessary to introduce the pertinent aspects of a SAGE experiment. SAGE is a gene expression profiling technology. The result of a SAGE experiment is a library of SAGE tags, in which a tag is derived from a transcribed RNA sequence. A tag has a quality score (derived from PHRED values) and a sequence, ten or more base pairs in length (depending upon the protocol used), that can be used to identify the corresponding transcript. SAGE libraries can be compared to other libraries to identify common or differential patterns of expression. A typical SAGE analysis scenario is composed of three stages: first, specify tag sequences; second, compare tag sequences and perform statistical analysis; and third, map tag sequences to genes and proteins for interpretation.
This specific use case can be extended to a general bioinformatics scenario: importing and defining datasets; performing quantitative and qualitative analysis on given datasets; and mapping data to available annotations for semantic interpretation.
The capabilities of DiscoverySpace will be illustrated by two example experiments. These examples provide a biological context to showcase the features of the application and its underlying database.
In the first example, we compare the expression of two sets of short SAGE tags: one a set of tags from a library generated from a normal pancreas tissue, the other the combined set of tags from two pancreatic cancer libraries. The sets are compared using the Audic-Claverie significance test and those sequences that are significantly up- and down-regulated (to 95% confidence) are isolated. The isolated sequences are then mapped to RefSeq transcripts, via position one, sense strand virtual tags. The functional qualities of the RefSeq transcripts are analyzed using GO annotations. Functions of particular interest are reviewed and interpreted by the researcher; those genes that are associated with significant functions are then selected and mapped back to the dataset of up- and down-regulated tag sequences.
In the second example, we compare five Cancer Genome Anatomy Project (CGAP) breast long SAGE libraries; four from cancer samples and one from normal tissue. Logical analysis is performed to isolate those non-singleton tag sequences that are present in all of the cancer libraries and not at all in the normal library. Those isolated sequences are then mapped to their counterpart virtual tags, to RefSeq transcripts, to their Entrez genes and to predicted subcellular localizations generated from the translations of the transcripts (using PSORT). With this additional annotation the researcher can identify genes of further interest, for example, those that are predicted to be extracellular. These tag sequences are then compared with other available long SAGE libraries to determine whether the tags are significantly expressed in comparison to a broader range of samples.
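For readers unfamiliar with the Audic-Claverie test, the sketch below evaluates its published closed form, p(y|x) = (N2/N1)^y (x+y)! / (x! y! (1 + N2/N1)^(x+y+1)), where x and y are the counts of one tag in two libraries of total size N1 and N2. The function names, the one-sided tail calculation and the way the 95% threshold is applied are assumptions made for illustration; they are not taken from the DiscoverySpace implementation.

```python
import math


def audic_claverie_pmf(x, y, n1, n2):
    """p(y | x): probability of y counts in library 2 given x counts in library 1.

    n1 and n2 are the total tag counts of the two libraries. Logarithms and
    lgamma are used so that large counts do not overflow.
    """
    ratio = n2 / n1
    log_p = (y * math.log(ratio)
             + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - (x + y + 1) * math.log1p(ratio))
    return math.exp(log_p)


def upper_tail(x, y, n1, n2):
    """P(Y >= y | x), computed as one minus the sum over 0..y-1."""
    return max(0.0, 1.0 - sum(audic_claverie_pmf(x, k, n1, n2) for k in range(y)))


def lower_tail(x, y, n1, n2):
    """P(Y <= y | x), the sum over 0..y."""
    return min(1.0, sum(audic_claverie_pmf(x, k, n1, n2) for k in range(y + 1)))


def classify(x, y, n1, n2, alpha=0.05):
    """Label a tag 'up', 'down' or 'unchanged' in library 2 relative to library 1.

    Applying alpha to each one-sided tail is an assumption of this sketch.
    """
    if upper_tail(x, y, n1, n2) <= alpha:
        return "up"
    if lower_tail(x, y, n1, n2) <= alpha:
        return "down"
    return "unchanged"
```

For example, classify(5, 50, 50000, 50000) labels a tag seen 5 times in the first library and 50 times in an equally sized second library as 'up'.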
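The logical step in this second example amounts to set intersection and difference. The sketch below assumes that 'non-singleton' means a count of at least two in every cancer library, which is one possible reading; the tag labels and counts are placeholders rather than real SAGE data.

```python
# Hypothetical tag counts per library (tag label -> observed count).
cancer_libraries = [
    {"tagA": 5, "tagB": 2, "tagC": 1, "tagD": 4},
    {"tagA": 3, "tagB": 6, "tagC": 2, "tagD": 2},
    {"tagA": 2, "tagB": 4, "tagD": 7},
    {"tagA": 9, "tagB": 2, "tagC": 3, "tagD": 1},
]
normal_library = {"tagB": 8, "tagE": 3}


def cancer_specific_tags(cancer_libs, normal_lib, min_count=2):
    """Tags seen at least min_count times in every cancer library and never in normal."""
    non_singletons = [
        {tag for tag, count in lib.items() if count >= min_count}
        for lib in cancer_libs
    ]
    present_in_all = set.intersection(*non_singletons)
    return present_in_all - set(normal_lib)


candidates = cancer_specific_tags(cancer_libraries, normal_library)
```

With these placeholder counts only tagA survives: tagB is also present in the normal library, while tagC and tagD are absent from, or singletons in, at least one cancer library.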
Importing and defining datasets
Performing quantitative and qualitative analysis on given datasets
DiscoverySpace integrates commonly used tools for performing statistical analysis of SAGE data. Specifically, these tools are the Scatterplot and Venn table.
Data points on the Scatterplot chart can be selected manually or by setting criteria of up- or down-regulated confidence thresholds. Points can also be selected by dropping tag sequences from outside the Scatterplot onto the chart; this allows the user to visualize the relative expression of a given set of tags with regard to the comparison. The tags represented by the selected data points can be dragged out of the chart for further analysis using other DiscoverySpace tools.
Mapping data to available annotations for semantic interpretation
As with the query, the Explorer allows the user to attach constraints to the view to filter any associated sets. This can help to reduce datasets to an informative and manageable size. For example, a constraint can reduce the set of all associated RefSeq genes to only those that are human, non-predicted and located on chromosome 1. Constraints can be attached to any non-literal node.
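Conceptually, a set of constraints behaves like a list of predicates that must all hold for an item to remain in view. The sketch below illustrates the chromosome 1 example; the Gene record, its fields and the function apply_constraints are hypothetical and are not part of the Explorer's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record with just the fields used by the constraints."""
    name: str
    species: str
    predicted: bool
    chromosome: str


def apply_constraints(items, constraints):
    """Keep only the items that satisfy every constraint."""
    return [item for item in items if all(check(item) for check in constraints)]


genes = [
    Gene("gene_1", "human", False, "1"),
    Gene("gene_2", "human", True, "1"),   # predicted, so filtered out
    Gene("gene_3", "mouse", False, "1"),  # wrong species
    Gene("gene_4", "human", False, "X"),  # wrong chromosome
]

constraints = [
    lambda g: g.species == "human",
    lambda g: not g.predicted,
    lambda g: g.chromosome == "1",
]

filtered = apply_constraints(genes, constraints)
```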
Data in the Explorer can be manipulated in many ways, including tag to gene mapping, and assignment of annotations (for example, GO terms, PSORT annotations) to genes.
Tag to gene mapping with the CMOST database
Several quality resources exist to assist investigators in tag assignment, notably the NCBI SAGEmap and SAGE Genie efforts. These resources focus primarily on identifying genes that, in general, have been highly characterized or have significant expressed sequence tag (EST) data. SAGE Genie uses multiple (seven) ranked transcript sources to map tags to genes, focusing on the more abundant tags and ignoring tags with single base variations with respect to the reference sequence or tags that occur only once. SAGEmap also provides mappings to ESTs. For both SAGEmap and SAGE Genie, mappings are predefined by an algorithm.
We have implemented a database that allows the user to choose the data source to which tags are mapped. They may choose to map (concurrently) to one or more of RefSeq, MGC and Ensembl genes. They may also map tags directly to the genome. The results of the mapping are presented in the DiscoverySpace Explorer.
A unique feature of the application is that it allows the user to map 'off-by-one' tags. During the construction and sequencing of SAGE libraries, single base pair errors (insertions, deletions and permutations) may be incorporated into tag sequences to create off-by-one tags. Several groups have developed methods to cluster off-by-one tags with the highly expressed tag from which they are derived [46–49]. Imperfect tag clustering and the presence of a single nucleotide polymorphism in the tag sequence for the individual gene under study mean that some high frequency off-by-one tags will not be mapped by standard methods.
The comprehensive mapping of SAGE tags (CMOST) database allows the user to map tags to RefSeq, MGC and Ensembl genes and to the genome, allowing for the possibility of single base pair insertions, deletions and permutations in tag sequences. This is achieved by pre-populating the CMOST database with the off-by-one mapped location of all experimentally observed tags. All possible off-by-one tags are generated for each experimentally observed tag. Those off-by-one sequences that map exactly to a sequence database (the same set of pre-extracted tags described previously) are stored in the database for later retrieval. As new SAGE libraries are sequenced and additional tag sequences generated, the off-by-one calculations are performed for new tags.
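The pre-population step can be sketched as enumerating every sequence within one edit of an observed tag and keeping those that match a reference tag exactly. This is an illustration only: how CMOST handles the change in tag length caused by an insertion or deletion is not described here, so the sketch simply enumerates all strings at edit distance one, and the function names and the reference dictionary are hypothetical.

```python
BASES = "ACGT"


def off_by_one_variants(tag):
    """All sequences reachable from tag by one substitution, deletion or insertion."""
    variants = set()
    for i in range(len(tag)):
        for base in BASES:
            if base != tag[i]:
                variants.add(tag[:i] + base + tag[i + 1:])  # substitution at i
        variants.add(tag[:i] + tag[i + 1:])                 # deletion of position i
    for i in range(len(tag) + 1):
        for base in BASES:
            variants.add(tag[:i] + base + tag[i:])          # insertion before i
    variants.discard(tag)  # defensive: the tag itself is not a variant
    return variants


def precompute_mappings(observed_tags, reference_tags):
    """Map each observed tag to the (variant, transcript) pairs that match exactly.

    reference_tags is a dict from pre-extracted tag sequence to transcript name.
    Tags with no off-by-one match are left out of the result.
    """
    mappings = {}
    for tag in observed_tags:
        hits = sorted(
            (variant, reference_tags[variant])
            for variant in off_by_one_variants(tag)
            if variant in reference_tags
        )
        if hits:
            mappings[tag] = hits
    return mappings


# Hypothetical reference tag and observed tags (shortened for readability).
reference = {"ACGTACGTAC": "transcript_1"}
stored = precompute_mappings(["ACGTACGTAA", "GGGGGGGGGG"], reference)
```

Because the variants are computed once per observed tag and stored, a later lookup for a tag is a simple retrieval rather than a fresh search.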
The user may elect to utilize the off-by-one mappings or not and has complete control over the entire tag mapping process.
The tag clustering and off-by-one mapping features are only available for LongSAGE libraries (comprising 21 base pair tags). Tags from regular SAGE libraries (14 base pair tags) are too short and map to too many locations for these features to be effective.
Drawing together multiple annotations with the DiscoverySpace Explorer
The DiscoverySpace Explorer enables the researcher to navigate and view multiple annotation paths at once, so that it is possible, for instance, to view both associated RefSeq genes and associated MGC genes, and even the proteins of those genes, concurrently in the same table (Figure 7).
The representation of one-to-many properties is complicated by the fact that sibling, one-to-many properties are 'in competition'. The product of a gene and its synonyms is simple to comprehend because it reflects the hierarchy of the model and the path from gene to synonym. However, the product of a gene's synonyms and the gene's GO terms is slightly obscure and does not reflect a path in the hierarchy. The Explorer protects the user against such situations by dimming expansion points if they are in conflict with already open expansion points (Figure 8). Simultaneous expansions are only possible if the properties are nested and the expansions follow exactly one path down the hierarchy. If a subject resource has an expanded one-to-many property then that property will be collapsed if a competing property is expanded.
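The rule that simultaneous expansions must lie on a single path can be expressed as a prefix test on property paths. The sketch below is one possible reading of the behavior described above, not the Explorer's code; paths are modeled as tuples of hypothetical property names, and the function names are assumptions.

```python
def is_prefix(shorter, longer):
    """True if the path 'shorter' is an initial segment of the path 'longer'."""
    return longer[:len(shorter)] == shorter


def is_compatible(open_paths, candidate):
    """A candidate expansion is allowed only if it nests with every open path."""
    return all(
        is_prefix(path, candidate) or is_prefix(candidate, path)
        for path in open_paths
    )


def expand(open_paths, candidate):
    """Expand candidate, collapsing any open path that competes with it."""
    kept = [
        path for path in open_paths
        if is_prefix(path, candidate) or is_prefix(candidate, path)
    ]
    return kept + [candidate]


# Sibling one-to-many properties of a gene compete with each other...
siblings_ok = is_compatible([("synonym",)], ("go_term",))
# ...whereas a property nested beneath an open one does not.
nested_ok = is_compatible([("protein",)], ("protein", "go_term"))
```

An interface could use is_compatible to decide which expansion points to dim, and expand to collapse a competing property when the user opens another.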
DiscoverySpace is a supportable and extensible software application; the architecture is strong and scalable, and the core functionality has wide utility. The application allows a user to traverse multiple biological databases without requiring detailed knowledge of the source databases and provides useful domain-specific tools. The application presents a consistent, uniform view of the data, simplifying the process of analysis.
Future development will add client-side logic and visualizations for domain-specific functionality. Effort is also required to complete the DiscoverySpace server and release it as a standalone distribution. This will entail upgrading the client application for multi-server support and polymorphic queries.
A particular aim is to strengthen DiscoverySpace for development by third parties. Though we are not yet at the stage of having a stable and publishable API, DiscoverySpace has a well-defined internal structure and strong feature set. Continuing work will develop the core application into a general bioinformatics platform. The application and code are freely available at http://www.bcgsc.ca/discoveryspace.
We wish to thank all of our dedicated users who have persevered with DiscoverySpace throughout its various rounds of development. Particular thanks to Anita Charters, Lisa Lee, Greg Vatcher, Angelique Schnerch and Erin Pleasance of the BCCRC for helpful thoughts and feedback.
- Velculescu VE, Zhang L, Zhou W, Polyak K, Basrai M, Bassett D, Hieter P, Vogelstein B, Kinzler KW: Serial analysis of gene expression (SAGE). Am J Hum Genet. 1997, 61: A36-A36.Google Scholar
- Resource Description Framework (RDF). [http://www.w3.org/RDF/]
- Galperin MY: The Molecular Biology Database Collection: 2005 update. Nucleic Acids Res. 2005, D5-24. 33 DatabaseGoogle Scholar
- Stein LD: Integrating biological databases. Nat Rev Genet. 2003, 4: 337-345. 10.1038/nrg1065.PubMedView ArticleGoogle Scholar
- Michalickova K, Bader GD, Dumontier M, Lieu H, Betel D, Isserlin R, Hogue CW: SeqHound: biological sequence and structure database as a platform for bioinformatics research. BMC Bioinformatics. 2002, 3: 32-10.1186/1471-2105-3-32.PubMedPubMed CentralView ArticleGoogle Scholar
- Shah SP, Huang Y, Xu T, Yuen MMS, Ling J, Ouellette BFF: Atlas - a data warehouse for integrative bioinformatics. BMC Bioinformatics. 2005, 6: 34-10.1186/1471-2105-6-34.PubMedPubMed CentralView ArticleGoogle Scholar
- Haas LM, Rice JE, Schwarz PM, Swope WC, Kodali P, Kotlar E: DiscoveryLink: A system for integrated access to life sciences. IBM Systems J. 2001, 40: 489-511.View ArticleGoogle Scholar
- Goble CA, Paton NW, Stevens R, Baker PG, Ng G, Peim M, Bechhofer S, Brass A: Transparent access to multiple bioinformatics information sources. IBM Systems J. 2001, 40: 532-549.View ArticleGoogle Scholar
- Wilkinson M, Schoof H, Ernst R, Haase D: BioMOBY successfully integrates distributed heterogeneous bioinformatics Web services. The PlaNet exemplar case. Plant Physiol. 2005, 138: 5-17. 10.1104/pp.104.059170.PubMedPubMed CentralView ArticleGoogle Scholar
- Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, et al: The Ensembl genome database project. Nucleic Acids Res. 2002, 30: 38-41. 10.1093/nar/30.1.38.PubMedPubMed CentralView ArticleGoogle Scholar
- Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, et al: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004, D258-261. 32 DatabaseGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005, D501-504. 33 DatabaseGoogle Scholar
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2005, D54-58. 33 DatabaseGoogle Scholar
- Strausberg RL, Feingold EA, Klausner RD, Collins FS: The mammalian gene collection. Science. 1999, 286: 455-457. 10.1126/science.286.5439.455.PubMedView ArticleGoogle Scholar
- Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, et al: The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005, D154-159. 33 DatabaseGoogle Scholar
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res. 2005, D34-38. 33 DatabaseGoogle Scholar
- MySQL Database Server. [http://www.mysql.com/products/mysql/]
- PostgreSQL Database Management System. [http://www.postgresql.org]
- Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003, 4: P3-10.1186/gb-2003-4-5-p3.PubMedView ArticleGoogle Scholar
- Al-Shahrour F, Diaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics. 2004, 20: 578-580. 10.1093/bioinformatics/btg455.PubMedView ArticleGoogle Scholar
- Java Technology. [http://java.sun.com/]
- Java Servlet API. [http://java.sun.com/products/servlet/index.jsp]
- Apache Tomcat. [http://jakarta.apache.org/tomcat/]
- Java Web Start Technology. [http://java.sun.com/products/javawebstart/]
- RDF/XML. [http://www.w3.org/TR/rdf-syntax-grammar/]
- Ashburner M, Ball CA, Blake JA, Butler H, Cherry JM, Corradi J, Dolinski K, Eppig JT, Harris M, Hill DP, et al: Creating the gene ontology resource: design and implementation. Genome Res. 2001, 11: 1425-1433. 10.1101/gr.180801.View ArticleGoogle Scholar
- Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M: The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 2005, 6: R44-10.1186/gb-2005-6-5-r44.PubMedPubMed CentralView ArticleGoogle Scholar
- Zuyderduyn SD, Jones SJ: A knowledge discovery object model API for Java. BMC Bioinformatics. 2003, 4: 51-10.1186/1471-2105-4-51.PubMedPubMed CentralView ArticleGoogle Scholar
- Jena - A Semantic Web Framework for Java. [http://jena.sourceforge.net/]
- DAML+OIL. [http://www.w3.org/TR/daml+oil-reference]
- Web Ontology Language (OWL). [http://www.w3.org/2004/OWL/]
- Wang X, Gorlitsky R, Almeida JS: From XML to RDF: how semantic web technologies will change the design of 'omic' standards. Nat Biotechnol. 2005, 23: 1099-1103. 10.1038/nbt1139.PubMedView ArticleGoogle Scholar
- Life Science Identifiers RFP Response Revised Joint Submission. [http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02]
- Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998, 8: 175-185.PubMedView ArticleGoogle Scholar
- Audic S, Claverie JM: The significance of digital gene expression profiles. Genome Res. 1997, 7: 986-995.PubMedGoogle Scholar
- Nakai K, Horton P: PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem Sci. 1999, 24: 34-36. 10.1016/S0968-0004(98)01336-X.PubMedView ArticleGoogle Scholar
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30: 207-210. 10.1093/nar/30.1.207.PubMedPubMed CentralView ArticleGoogle Scholar
- Strausberg RL, Buetow KH, Emmert-Buck MR, Klausner RD: The cancer genome anatomy project: building an annotated gene index. Trends Genet. 2000, 16: 103-106. 10.1016/S0168-9525(99)01937-X.PubMedView ArticleGoogle Scholar
- Chen H, Centola M, Altschul SF, Metzger H: Characterization of gene expression in resting and activated mast cells. J Exp Med. 1998, 188: 1657-1668. 10.1084/jem.188.9.1657.PubMedPubMed CentralView ArticleGoogle Scholar
- Boon K, Osorio EC, Greenhut SF, Schaefer CF, Shoemaker J, Polyak K, Morin PJ, Buetow KH, Strausberg RL, De Souza SJ, et al: An anatomy of normal and malignant gene expression. Proc Natl Acad Sci USA. 2002, 99: 11287-11292. 10.1073/pnas.152324199.PubMedPubMed CentralView ArticleGoogle Scholar
- Vencio RZ, Brentani H, Patrao DF, Pereira CA: Bayesian model accounting for within-class biological variability in Serial Analysis of Gene Expression (SAGE). BMC Bioinformatics. 2004, 5: 119-10.1186/1471-2105-5-119.PubMedPubMed CentralView ArticleGoogle Scholar
- Pylouster J, Senamaud-Beaufort C, Saison-Behmoaras TE: WEBSAGE: a web tool for visual analysis of differentially expressed human SAGE tags. Nucleic Acids Res. 2005, W693-695. 10.1093/nar/gki444. 33 Web ServerGoogle Scholar
- Lash AE, Tolstoshev CM, Wagner L, Schuler GD, Strausberg RL, Riggins GJ, Altschul SF: SAGEmap: a public gene expression resource. Genome Res. 2000, 10: 1051-1060. 10.1101/gr.10.7.1051.PubMedPubMed CentralView ArticleGoogle Scholar
- Strausberg RL, Feingold EA, Grouse LH, Derge JG, Klausner RD, Collins FS, Wagner L, Shenmen CM, Schuler GD, Altschul SF, et al: Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc Natl Acad Sci USA. 2002, 99: 16899-16903. 10.1073/pnas.242603899.PubMedView ArticleGoogle Scholar
- Birney E, Clamp M, Kraspcyk A, Slater G, Hubbard T, Curwen V, Stabenau A, Stupka E, Huminiecki L, Potter S: Ensembl: A multi-genome computational platform. Am J Hum Genet. 2001, 69: 219-Google Scholar
- Beissbarth T, Hyde L, Smyth GK, Job C, Boon WM, Tan SS, Scott HS, Speed TP: Statistical modeling of sequencing errors in SAGE libraries. Bioinformatics. 2004, 20 (Suppl 1): I31-I39. 10.1093/bioinformatics/bth924.PubMedView ArticleGoogle Scholar
- Akmaev VR, Wang CJ: Correction of sequence-based artifacts in serial analysis of gene expression. Bioinformatics. 2004, 20: 1254-1263. 10.1093/bioinformatics/bth077.PubMedView ArticleGoogle Scholar
- Colinge J, Feger G: Detecting the impact of sequencing errors on SAGE data. Bioinformatics. 2001, 17: 840-842. 10.1093/bioinformatics/17.9.840.PubMedView ArticleGoogle Scholar
- Siddiqui AS, Khattra J, Delaney AD, Zhao Y, Astell C, Asano J, Babakaiff R, Barber S, Beland J, Bohacec S, et al: A mouse atlas of gene expression: large-scale digital gene-expression profiles from precisely defined developing C57BL/6J mouse tissues and cells. Proc Natl Acad Sci USA. 2005, 102: 18485-18490. 10.1073/pnas.0509455102.PubMedPubMed CentralView ArticleGoogle Scholar
- DiscoverySpace. [http://www.bcgsc.ca/discoveryspace]
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al: The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003, 4: 41-10.1186/1471-2105-4-41.PubMedPubMed CentralView ArticleGoogle Scholar
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2006, 34 (Database issue): D173-180. 10.1093/nar/gkj158.
- O'Brien KP, Remm M, Sonnhammer EL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 2005, 33 (Database issue): D476-480.
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28: 27-30. 10.1093/nar/28.1.27.
- Pruitt KD, Katz KS, Sicotte H, Maglott DR: Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends Genet. 2000, 16: 44-47. 10.1016/S0168-9525(99)01882-X.
- Lu P, Szafron D, Greiner R, Wishart DS, Fyshe A, Pearcy B, Poulin B, Eisner R, Ngo D, Lamb N: PA-GOSUB: a searchable database of model organism protein sequences with their predicted Gene Ontology molecular function and subcellular localization. Nucleic Acids Res. 2005, 33 (Database issue): D147-153.
- Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res. 2002, 30: 276-280. 10.1093/nar/30.1.276.
- Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31: 365-370. 10.1093/nar/gkg095.
- Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nat Genet. 2004, 36: 949-951. 10.1038/ng1416.
- Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006, 34 (Database issue): D108-110. 10.1093/nar/gkj143.
- Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, Schacherer F: TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 2000, 28: 316-319. 10.1093/nar/28.1.316.
- Eyre TA, Ducluzeau F, Sneddon TP, Povey S, Bruford EA, Lush MJ: The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Res. 2006, 34 (Database issue): D319-321. 10.1093/nar/gkj147.
- Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005, 33 (Database issue): D514-517.
- Safran M, Solomon I, Shmueli O, Lapidot M, Shen-Orr S, Adato A, Ben-Dor U, Esterman N, Rosen N, Peter I, et al: GeneCards 2002: towards a complete, object-oriented, human gene compendium. Bioinformatics. 2002, 18: 1542-1543. 10.1093/bioinformatics/18.11.1542.
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, et al: InterPro, progress and status in 2005. Nucleic Acids Res. 2005, 33 (Database issue): D201-205.
- Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. 2004, 32 (Database issue): D226-229. 10.1093/nar/gkh039.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.