GeneHopper: a web-based search engine to link gene-expression platforms through GenBank accession numbers
© Svensson et al.; licensee BioMed Central Ltd. 2003
Received: 12 November 2002
Accepted: 19 March 2003
Published: 25 April 2003
Global gene-expression analysis is carried out using different technologies that are either array- or sequence-tag-based. To compare experiments that are performed on these different platforms, array probes and sequence tags need to be linked. An additional challenge is cross-referencing between species, for example to compare human profiles with those obtained in a mouse model. We have developed the web-based search engine GeneHopper to link different expression resources on the basis of the UniGene cluster and HomoloGene ortholog databases of the National Center for Biotechnology Information (NCBI).
Genome-wide analysis of gene expression provides insight into the transcriptional state of a cell or tissue sample, measuring RNA levels for thousands of genes in parallel. Among the most commonly applied technologies are photolithographically synthesized oligonucleotide chips, printed microarrays using cDNA probes or oligonucleotides [2–5], and serial analysis of gene expression (SAGE). Array-based technologies measure hybridization signal intensities in one or two channels for each array feature or probe, resulting in absolute or relative information about the expression levels of the corresponding transcripts in the samples. Array probes are produced by polymerase chain reaction (PCR) amplification of selected cDNAs or designed on the basis of cDNA or gene sequence information. SAGE results in quantitative information connected to 10-14 base-pair (bp) sequence tags derived from the 3' ends of transcripts.
Expressed sequence tag (EST) sequencing projects have generated millions of cDNA sequences for human, mouse and other organisms that are identified by an accession number in databases such as GenBank. To group these individual sequence reads into sets representing one transcript, several efforts to cluster the ESTs were developed [7–9]. UniGene, developed at the National Center for Biotechnology Information (NCBI), automatically clusters ESTs derived from one organism on the basis of sequence homology, generating a nonredundant set of clusters representing (parts of) transcripts. As GenBank is growing, the clustering is carried out regularly, resulting in new so-called UniGene builds. A number of commercial cDNA clone libraries and commercial chip formats are based on UniGene clusters (including [11, 12]). In general, a sequence-specific identifier (GenBank accession number) serves as a reference to the array probe sequence. Likewise, SAGE tags, which are unique for each transcript, are mapped to UniGene clusters and GenBank accession numbers [13, 14].
The existence of several technologies for measuring gene expression makes cross-technology comparison an important issue. Cross-referencing array probes allows comparison of studies carried out using different technologies, and facilitates validation of gene expression. In addition, gene-expression data need to be compared between species, which requires linking genes from different organisms. Because no unique gene nomenclature or sequence identifier is used for all platforms, a database linking the different identifiers belonging to one gene and its orthologs is required.
To address these issues, we have developed GeneHopper, which provides a web-based user interface to link gene-expression platforms within and between species in a batchwise fashion. Currently, GeneHopper contains the most commonly used array resources for human, mouse and rat. In addition, the database can be used to annotate array probes with reference sequences and updated gene descriptions from UniGene. While this work was in progress, a microarray annotation tool with a different focus, RESOURCERER, was presented that allows cross-species and cross-platform comparisons [15, 16].
The GeneHopper database system
Target data sets currently represented in GeneHopper
- Sequence-validated human cDNA library 40K (Research Genetics)
- NIA mouse 15K cDNA clone set (National Institute on Aging)
- Sequence-verified rat clone collection (Research Genetics)
On the GeneHopper search page, a service is selected from a dropdown menu, which contains the represented microarray resources for within-species and cross-species queries. Next, a list of user-defined accession numbers is uploaded to the database by submitting them as a plain-text file. The information retrieved is a tab-delimited file listing input accession number, UniGene cluster and build, gene symbol and title, and probe location (cDNA and oligo libraries) or probe identifier (Affymetrix). For an ortholog gene search, the homologous gene name, UniGene cluster, and the type of homology (calculated or curated) are also indicated.
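The batch workflow described above (a plain-text list of accession numbers in, a tab-delimited table out) can be illustrated with a minimal Python sketch. The column names and example rows are hypothetical and only mirror the fields listed in the text; the actual header and layout of a GeneHopper result file are not specified here and may differ.

```python
import csv
import io

# Hypothetical column names mirroring the fields listed in the text;
# the real GeneHopper result file may use different names or ordering.
COLUMNS = ["accession", "unigene_cluster", "unigene_build",
           "gene_symbol", "gene_title", "probe"]


def write_query(accessions):
    """Build the plain-text upload body: one accession number per line."""
    return "\n".join(accessions) + "\n"


def parse_result(text):
    """Parse a tab-delimited result into a list of dicts keyed by COLUMNS."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS,
                            delimiter="\t")
    return list(reader)


# Made-up rows for illustration only (not real GeneHopper output).
example = (
    "ACC0001\tHs.0001\t1\tGENE1\texample gene 1\tprobe_A\n"
    "ACC0002\tHs.0002\t1\tGENE2\texample gene 2\tprobe_B\n"
)
rows = parse_result(example)
print(rows[0]["unigene_cluster"])
```

Keeping the result as a list of dictionaries makes it straightforward to filter or regroup rows by UniGene cluster in downstream analysis.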
A submitted accession number may yield several hits in the target dataset. The main reason for this redundancy is that array clones or sequences were selected on the basis of earlier builds of UniGene and that the clustering has changed over time. On Affymetrix chips, several transcripts from one gene, including alternative splice forms, can be represented by different probe sets. Failure to identify a linking probe occurs when the input accession number is not used by UniGene, or when the UniGene cluster is not represented by a probe in the target dataset. For cross-species searches, the joined UniGene cluster may not be present in the HomoloGene data table. Special queries, such as finding all the accession numbers in a specific chromosomal region, or annotating all genes in a set with protein identifiers, can currently be carried out on request to the database manager. Automated services will be implemented later if required.
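The outcomes described above (several hits, or no hit for one of two reasons) can be sketched as a lookup through two tables. This is an illustrative reconstruction under stated assumptions, not GeneHopper's implementation: the function name, the dictionary-based tables, the failure labels and all identifiers are hypothetical.

```python
from collections import defaultdict


def link_accessions(accessions, acc_to_cluster, cluster_to_probes):
    """Link accession numbers to target probes via shared UniGene clusters.

    acc_to_cluster:    dict, accession -> UniGene cluster ID (assumed table)
    cluster_to_probes: dict, cluster ID -> list of probe IDs in the target set
    Returns (hits, failures): hits maps accession -> list of probes,
    failures maps a reason label to the accessions it affected.
    """
    hits = {}
    failures = defaultdict(list)
    for acc in accessions:
        cluster = acc_to_cluster.get(acc)
        if cluster is None:
            # The accession number is not used by UniGene.
            failures["not_in_unigene"].append(acc)
            continue
        probes = cluster_to_probes.get(cluster, [])
        if not probes:
            # The cluster has no probe in the target dataset.
            failures["no_probe_in_target"].append(acc)
            continue
        # One accession may map to several probes (redundant hits).
        hits[acc] = list(probes)
    return hits, dict(failures)


# Toy tables with made-up identifiers.
acc_to_cluster = {"ACC1": "Hs.1", "ACC2": "Hs.2"}
cluster_to_probes = {"Hs.1": ["probe_a", "probe_b"]}
hits, failures = link_accessions(["ACC1", "ACC2", "ACC3"],
                                 acc_to_cluster, cluster_to_probes)
print(hits)
print(failures)
```

Reporting the failure reason per accession, rather than silently dropping it, lets a user distinguish identifiers unknown to UniGene from genes that are simply absent from the target platform.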
The GeneHopper database was developed to facilitate microarray research with cross-species and cross-platform questions. Three examples that demonstrate different types of applications are given below.
Within species: human
Between species: human to mouse
Mouse and rat models are used to study human development, physiology and disease, generating the need to compare gene-expression profiles across species. In a study on Duchenne muscular dystrophy (DMD), gene regulation in mouse muscles lacking the DMD gene product had to be compared to that in muscle tissue from human DMD patients (Figure 1b). One hundred and thirty-one genes differentially expressed in patients, identified on the Affymetrix human HuGeneFL chips, were linked to the Affymetrix mouse U74Av2 chips that were used for a mouse model study. For 64 of these 131 human genes, the curated or calculated best reciprocal hit mouse ortholog could be identified; these orthologs were represented by 82 different probe sets on the target chips.
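A cross-species query of this kind adds one step to the within-species lookup: the human UniGene cluster is first mapped to its mouse ortholog cluster, and only then to probe sets on the target chip. The sketch below shows that chain under stated assumptions; the function names, the dictionary layout of the ortholog table and all identifiers are hypothetical and do not reflect the actual HomoloGene or GeneHopper schema.

```python
def link_orthologs(human_clusters, homologene, mouse_cluster_to_probes):
    """Map human UniGene clusters to mouse probe sets via an ortholog table.

    homologene: dict, human cluster -> (mouse cluster, homology type),
                where homology type is "curated" or "calculated" (assumed).
    mouse_cluster_to_probes: dict, mouse cluster -> list of probe set IDs.
    Clusters without an ortholog entry or without a target probe are skipped.
    """
    linked = {}
    for cluster in human_clusters:
        entry = homologene.get(cluster)
        if entry is None:
            continue  # cluster not present in the ortholog table
        mouse_cluster, homology = entry
        probes = mouse_cluster_to_probes.get(mouse_cluster, [])
        if probes:
            linked[cluster] = {
                "mouse_cluster": mouse_cluster,
                "homology": homology,
                "probe_sets": list(probes),
            }
    return linked


def coverage(linked, total):
    """Return (fraction of input genes linked, total number of probe sets)."""
    n_probes = sum(len(v["probe_sets"]) for v in linked.values())
    return len(linked) / total, n_probes


# Toy tables with made-up identifiers.
homologene = {"Hs.1": ("Mm.1", "curated"), "Hs.2": ("Mm.2", "calculated")}
mouse_probes = {"Mm.1": ["p1_at", "p2_at"]}
linked = link_orthologs(["Hs.1", "Hs.2", "Hs.3"], homologene, mouse_probes)
fraction, n_probes = coverage(linked, 3)
print(linked, fraction, n_probes)
```

For the study described above, the same summary gives 64/131, or about 49% of the input genes, linked to 82 probe sets.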
Between technologies: mouse
Several expression-measurement platforms may be used in parallel for the same or similar samples, both in the same and different research groups. Expression analysis of the same samples on different platforms gives information on the comparability of the technologies, thereby cross-validating the qualitative and quantitative measurements [25, 26]. In addition, results from one technology can be verified and extended using another technology. Using the GeneHopper database, we linked accession numbers mapping to SAGE tags to the Affymetrix mouse Mu11KB chip in a study on arteriosclerosis in mice (Figure 1c and [27]). For about half of the SAGE tag accession numbers, a UniGene cluster could be found, and for about half of these clusters a probe set was present on the Mu11KB Affymetrix chip.
The GeneHopper database is a timesaving tool developed for microarray researchers to link and annotate different gene-expression analysis platforms across several species (Figure 1). A user-defined list of accession numbers is processed batchwise against a selected microarray resource, and the corresponding array probes are returned. Linking relies on membership of the same UniGene cluster, which has the advantage that probes can be matched even when their sequences do not overlap. In addition, the database facilitates special searches, for instance to find accession numbers in a specific chromosomal region, or to annotate all genes in a set or array with chromosomal localization, protein identifier, and so on. These approaches may be useful for constructing a chromosomal region-specific or pathway-specific array.
Accessing the GeneHopper database
The database is freely accessible via the Leiden Genome Technology Center website, which includes information pages about the database and online help. The source code is available from the authors on request.
We thank Paul van der Elst for help with the server, Michel Villerius, Rolf Turk, Ellen Sterrenburg, Barbera van Schaik and Antoine van Kampen for fruitful discussions, and Tineke van der Pouw-Kraan for meeting B.A.T.S. on the train. This work was supported in part by a Biomolecular Informatics program grant from the Netherlands Organization for Scientific Research (NWO-BMI; B.A.T.S.) and the Centre of Biomedical Genetics, The Netherlands (J.M.B.).
1. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, et al: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol. 1996, 14: 1675-1680.
2. Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD, Madore SJ: Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res. 2000, 28: 4552-4557. 10.1093/nar/28.22.4552.
3. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270: 467-470.
4. Ramakrishnan R, Dorris D, Lublinsky A, Nguyen A, Domanus M, Prokhorova A, Gieser L, Touma E, Lockner R, Tata M, et al: An assessment of Motorola CodeLink microarray performance for gene expression profiling applications. Nucleic Acids Res. 2002, 30: e30. 10.1093/nar/30.7.e30.
5. Hughes TR, Mao M, Jones AR, Burchard J, Marton MJ, Shannon KW, Lefkowitz SM, Ziman M, Schelter JM, Meyer MR, et al: Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat Biotechnol. 2001, 19: 342-347. 10.1038/86730.
6. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science. 1995, 270: 484-487.
7. Boguski MS, Schuler GD: ESTablishing a human transcript map. Nat Genet. 1995, 10: 369-371.
8. Miller RT, Christoffels AG, Gopalakrishnan C, Burke J, Ptitsyn AA, Broveak TR, Hide WA: A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base. Genome Res. 1999, 9: 1143-1155. 10.1101/gr.9.11.1143.
9. Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J: The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res. 2001, 29: 159-164. 10.1093/nar/29.1.159.
10. NCBI UniGene. [http://www.ncbi.nlm.nih.gov/UniGene]
11. ResGen Invitrogen Corporation. [http://www.resgen.com]
12. Affymetrix. [http://www.affymetrix.com]
13. SAGEmap. [http://www.ncbi.nlm.nih.gov/SAGE]
14. Serial Analysis of Gene Expression. [http://www.sagenet.org]
15. RESOURCERER 6.0. [http://pga.tigr.org/tigr-scripts/magic/r1.pl]
16. Tsai J, Sultana R, Lee Y, Pertea G, Karamycheva S, Antonescu V, Cho J, Parvizi B, Cheung F, Quackenbush J: RESOURCERER: a database for annotating and linking microarray resources within and across species. Genome Biol. 2001, 2: software0002.1-0002.4. 10.1186/gb-2001-2-11-software0002.
17. NCBI HomoloGene. [http://www.ncbi.nlm.nih.gov/HomoloGene]
18. NCBI RefSeq. [http://www.ncbi.nih.gov/RefSeq]
19. NCBI LocusLink. [http://www.ncbi.nlm.nih.gov/locuslink]
20. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
21. Gene Ontology Consortium. [http://www.geneontology.org]
22. SWISS-PROT and TrEMBL. [http://www.expasy.ch/sprot]
23. Chen YW, Zhao P, Borup R, Hoffman EP: Expression profiling in the muscular dystrophies: identification of novel aspects of molecular pathophysiology. J Cell Biol. 2000, 151: 1321-1336. 10.1083/jcb.151.6.1321.
24. Boer J, de Meijer EJ, Mank EM, van Ommen GJB, den Dunnen JT: Expression profiling in stably regenerating skeletal muscle of dystrophin-deficient mdx mice. Neuro Musc Disord. 2002, 12: S118-S124. 10.1016/S0960-8966(02)00092-5.
25. Kuo WP, Jenssen TK, Butte AJ, Ohno-Machado L, Kohane IS: Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics. 2002, 18: 405-412. 10.1093/bioinformatics/18.3.405.
26. Yuen T, Wurmbach E, Pfeffer RL, Ebersole BJ, Sealfon SC: Accuracy and calibration of commercial oligonucleotide and custom cDNA microarrays. Nucleic Acids Res. 2002, 30: e48. 10.1093/nar/30.10.e48.
27. Kreeft AJ, Moen CJA, Hofker MH, Frants RR, Vreugdenhil E, Gijbels MJJ, Havekes LM, Datson NA: Identification of differentially regulated genes in mildly hyperlipidemic ApoE3-Leiden mice by use of serial analysis of gene expression. Arterioscler Thromb Vasc Biol. 2001, 21: 1984-1990.
28. TIGR Gene Indices. [http://www.tigr.org/tdb/tgi]
29. Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai J, Parvizi B, Cheung F, Antonescu V, White J, et al: Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res. 2002, 12: 493-502. 10.1101/gr.212002.
30. GeneHopper. [http://www.lgtc.nl/GeneHopper]
31. Sigma-Genosys. [http://www.sigma-genosys.com]
32. Tanaka TS, Jaradat SA, Lim MK, Kargul GJ, Wang X, Grahovac MJ, Pantano S, Sano Y, Piao Y, Nagaraja R, et al: Genome-wide expression profiling of mid-gestation placenta and embryo using a 15,000 mouse developmental cDNA microarray. Proc Natl Acad Sci USA. 2000, 97: 9127-9132. 10.1073/pnas.97.16.9127.
33. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL