GeneHopper: a web-based search engine to link gene-expression platforms through GenBank accession numbers
Genome Biology volume 4, Article number: R35 (2003)
Global gene-expression analysis is carried out using different technologies that are either array- or sequence-tag-based. To compare experiments that are performed on these different platforms, array probes and sequence tags need to be linked. An additional challenge is cross-referencing between species, to compare human profiles with those obtained in a mouse model, for example. We have developed the web-based search engine GeneHopper to link different expression resources based on UniGene clusters and HomoloGene orthologs databases of the National Center for Biotechnology Information (NCBI).
Genome-wide analysis of gene expression provides insight into the transcriptional state of a cell or tissue sample, measuring RNA levels for thousands of genes in parallel. Among the most commonly applied technologies are photolithographically synthesized oligonucleotide chips , printed microarrays using cDNA probes or oligonucleotides [2–5], and serial analysis of gene expression (SAGE) . Array-based technologies measure hybridization signal intensities in one or two channels for each array feature or probe, resulting in absolute or relative information about the expression levels of the corresponding transcripts in the samples. Array probes are produced by polymerase chain reaction (PCR) amplification of selected cDNAs or designed on the basis of cDNA or gene sequence information. SAGE results in quantitative information connected to 10-14 base-pair (bp) sequence tags derived from the 3' ends of transcripts.
Expressed sequence tag (EST) sequencing projects have generated millions of cDNA sequences for human, mouse and other organisms that are identified by an accession number in databases such as GenBank. To group these individual sequence reads into sets representing one transcript, several efforts to cluster the ESTs were developed [7–9]. UniGene, developed at the National Center for Biotechnology Information (NCBI) , automatically clusters ESTs derived from one organism on the basis of sequence homology, generating a nonredundant set of clusters representing (parts of) transcripts . As GenBank is growing, the clustering is carried out regularly, resulting in new so-called UniGene builds. A number of commercial cDNA clone libraries and commercial chip formats are based on UniGene clusters (including [11, 12]). In general, a sequence-specific identifier (GenBank accession number) serves as a reference to the array probe sequence. Likewise, SAGE tags, which are unique for each transcript, are mapped to UniGene clusters and GenBank accession numbers [13, 14].
The existence of several technologies for measuring gene expression makes cross-technology comparison an important issue. Cross-referencing array probes allows comparison of studies carried out using different technologies, and facilitates validation of gene expression. In addition, gene-expression data need to be compared between species, signifying the need to link genes from different organisms. Because no unique gene nomenclature or sequence identifier is used for all platforms, a database linking the different identifiers belonging to one gene and its orthologs is required.
To address these issues, we have developed GeneHopper, which provides a web-based user interface to link gene-expression platforms within and between species in a batchwise fashion. Currently, GeneHopper contains the most commonly used array resources for human, mouse and rat. In addition, the database can be used to annotate array probes with reference sequences and updated gene descriptions from UniGene. While this work was in progress, a microarray annotation tool with a different focus, RESOURCERER, was presented that allows cross-species and cross-platform comparisons [15, 16].
The GeneHopper database system
The probe identifiers and the associated GenBank accession numbers representing the genes on widely used microarray platforms were parsed and loaded into a database system. These include cDNA clone libraries, Affymetrix GeneChips, and oligonucleotide sets for human, mouse and rat (Table 1). Linking of genes represented on expression platforms from a particular species is based on UniGene, of which new builds are downloaded monthly . A GenBank accession number serves as a unique gene identifier for which the UniGene cluster from the currently used build is retrieved. One or more accession numbers in the target resource belonging to the same cluster are returned. For between-species searches, the UniGene cluster identifiers are subsequently used to retrieve a pair of ortholog clusters based on HomoloGene. . Only UniGene clusters for reciprocal best BLAST hits and curated orthologs in the target species are retrieved and linked to the resource requested. Additional services that are offered by the GeneHopper database are updated annotation of accession numbers with NCBI's RefSeq , UniGene cluster, LocusLink identifier , Gene Ontology functional annotation [20, 21], chromosomal localization, gene title, and so on.
On the GeneHopper search page, a service is selected from a dropdown menu, which contains the represented microarray resources for within-species and cross-species queries. Next, a list of user-defined accession numbers is uploaded to the database by submitting them as a plain-text file. The information retrieved is a tab-delimited file listing input accession number, UniGene cluster and build, gene symbol and title, and probe location (cDNA and oligo libraries) or probe identifier (Affymetrix). For an ortholog gene search, the homologous gene name, UniGene cluster, and the type of homology (calculated or curated) are also indicated.
It is possible that a submitted accession number yields several hits in the target dataset. The main reason for this redundancy is that array clones or sequences were selected on the basis of earlier builds of UniGene and that the clustering has changed over time. On Affymetrix chips several transcripts from one gene, including alternative splice forms, can be represented by different probe sets. Failure to identify a linking probe occurs when the input accession number is not used by UniGene, or when the UniGene cluster is not represented by a probe in the target dataset. For cross-species searches, the joined UniGene cluster may not be present in the HomoloGene data table. Special queries, such as finding all the accession numbers in a specific chromosomal region, or annotation of all genes in a set for protein identification, for example , can currently be carried out upon specific request to the database manager. When required, automated services will be implemented later.
The GeneHopper database was developed to facilitate microarray research with cross-species and cross-platform questions. Three examples that demonstrate different types of applications are given below.
Within species: human
Commercial microarrays for gene-expression studies, such as Affymetrix GeneChips, have the advantage that they address a large number of genes, covering the whole genome for many organisms. Although giving very complete information about the transcriptional status of the cell or tissue, such screens are expensive and therefore not affordable when many samples need to be analyzed. After an initial analysis on genome-wide arrays, a researcher may desire to produce project-specific microarrays. Entering the accession numbers of genes that were selected as differentially expressed from the pilot study will generate the corresponding clones from an available cDNA clone collection, for example, the Human cDNA Library from Research Genetics (Figure 1a). From the test example, the majority of genes were represented as cDNAs in the library. Using the plate, row and column information for clone location in the database output file, a robot can be directed to rearray the desired clones into a sublibrary. PCR-amplified cDNA clones can be generated and used to produce a large number of the desired project-specific arrays at low cost.
Between species: human to mouse
Mouse and rat models are used to study human development, physiology and disease, generating the need to compare gene-expression profiles across species. In a study on Duchenne Muscular Dystrophy (DMD), gene regulation in mouse muscles lacking the DMD gene product had to be compared to that in muscle tissue from human DMD patients (Figure 1b). One hundred and thirty-one differentially expressed genes in patients, identified on the Affymetrix human HuGeneFL chips , were linked to the Affymetrix mouse U74Av2 chips that were used for a mouse model study . For 64 out of these 131 human genes the curated or calculated best reciprocal hit mouse ortholog could be identified, represented by 82 different probe sets on the target chips.
Between technologies: mouse
Several expression-measurement platforms may be used in parallel for the same or similar samples, both in the same and different research groups. Expression analysis of the same samples on different platforms gives information on the comparability of the technologies, thereby cross-validating the qualitative and quantitative measurements [25, 26]. In addition, results from one technology can be verified and extended using another technology. Using the GeneHopper database, we linked accession numbers mapping to SAGE tags to the Affymetrix mouse Mu11KB chip in a study on arteriosclerosis in mice (Figure 1c and ). For about half of the SAGE tag accession numbers, a UniGene cluster could be found, and again for half of these a probe set was present on the Mu11KB Affymetrix chip.
The GeneHopper database is a timesaving tool developed for microarray researchers to link and annotate different gene-expression analysis platforms across several species (Figure 1). A user-defined list of accession numbers is processed batchwise against a selected microarray resource, and the corresponding array probes are returned. The linking is carried out through residence in the same UniGene gene cluster. This approach has the advantage of identifying probes that are nonoverlapping in sequence. In addition, the database facilitates special searches, for instance to find accession numbers in a specific chromosomal region, or to annotate all genes in a set or array for chromosomal localization, protein identifier, and so on. These approaches may be useful for constructing a chromosomal region-specific or pathway-specific array.
Another high-throughput web-based database for the annotation and comparison of commonly available microarray resources, RESOURCERER, was developed at The Institute for Genomic Research (TIGR) [15, 16]. This database allows comparisons between resources from the same species using either the TIGR Gene Indices  or UniGene, and between species using the TIGR EGO database . The GeneHopper Database and RESOURCERER contain partly the same target data sets, but each also reflects the needs of local researchers both in available resources and output formats. GeneHopper is focused on uploading user-defined lists of accession numbers, which generally run faster than uploading a virtual set in RESOURCERER, while the latter very quickly combines predefined large array resources. Because the clustering algorithms underlying the TIGR and NCBI databases are different, the same queries give different, although largely overlapping, results (Figure 2). For the between-species comparison from human to mouse (Figure 2b), 4 of the 52 accession numbers that generated ortholog hits in both databases returned different genes. Although these examples are indicative of inherent differences between the search results in the two databases, the comparisons are too limited to draw general conclusions. However, it could be advisable to run queries against both databases, compare the results, and thereby increase the overall search quality.
Accessing the GeneHopper database
The database is freely accessible via the Leiden Genome Technology Center website , which includes information pages about the database and online help. The source code is available from the authors on request.
Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, et al: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol. 1996, 14: 1675-1680.
Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD, Madore SJ: Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res. 2000, 28: 4552-4557. 10.1093/nar/28.22.4552.
Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270: 467-470.
Ramakrishnan R, Dorris D, Lublinsky A, Nguyen A, Domanus M, Prokhorova A, Gieser L, Touma E, Lockner R, Tata M, et al: An assessment of Motorola CodeLink microarray performance for gene expression profiling applications. Nucleic Acids Res. 2002, 30: e30-10.1093/nar/30.7.e30.
Hughes TR, Mao M, Jones AR, Burchard J, Marton MJ, Shannon KW, Lefkowitz SM, Ziman M, Schelter JM, Meyer MR, et al: Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat Biotechnol. 2001, 19: 342-347. 10.1038/86730.
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science. 1995, 270: 484-487.
Boguski MS, Schuler GD: ESTablishing a human transcript map. Nat Genet. 1995, 10: 369-371.
Miller RT, Christoffels AG, Gopalakrishnan C, Burke J, Ptitsyn AA, Broveak TR, Hide WA: A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base. Genome Res. 1999, 9: 1143-1155. 10.1101/gr.9.11.1143.
Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J: The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res. 2001, 29: 159-164. 10.1093/nar/29.1.159.
NCBI UniGene. [http://www.ncbi.nlm.nih.gov/UniGene]
ResGen Invitrogen Corporation. [http://www.resgen.com]
Serial Analysis of Gene Expression. [http://www.sagenet.org]
RESOURCERER 6.0. [http://pga.tigr.org/tigr-scripts/magic/r1.pl]
Tsai J, Sultana R, Lee Y, Pertea G, Karamycheva S, Antonescu V, Cho J, Parvizi B, Cheung F, Quackenbush J: RESOURCERER: a database for annotating and linking microarray resources within and across species. Genome Biol. 2001, 2: software0002.1-0002.4. 10.1186/gb-2001-2-11-software0002.
NCBI HomoloGene. [http://www.ncbi.nlm.nih.gov/HomoloGene]
NCBI RefSeq. [http://www.ncbi.nih.gov/RefSeq]
NCBI LocusLink. [http://www.ncbi.nlm.nih.gov/locuslink]
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
Gene Ontology Consortium. [http://www.geneontology.org]
SWISS-PROT and TrEMBL. [http://www.expasy.ch/sprot]
Chen YW, Zhao P, Borup R, Hoffman EP: Expression profiling in the muscular dystrophies: identification of novel aspects of molecular pathophysiology. J Cell Biol. 2000, 151: 1321-1336. 10.1083/jcb.151.6.1321.
Boer J, de Meijer EJ, Mank EM, van Ommen GJB, den Dunnen JT: Expression profiling in stably regenerating skeletal muscle of dystrophin-deficient mdx mice. Neuro Musc Disord. 2002, 12: S118-S124. 10.1016/S0960-8966(02)00092-5.
Kuo WP, Jenssen TK, Butte AJ, Ohno-Machado L, Kohane IS: Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics. 2002, 18: 405-412. 10.1093/bioinformatics/18.3.405.
Yuen T, Wurmbach E, Pfeffer RL, Ebersole BJ, Sealfon SC: Accuracy and calibration of commercial oligonucleotide and custom cDNA microarrays. Nucleic Acids Res. 2002, 30: e48-10.1093/nar/30.10.e48.
Kreeft AJ, Moen CJA, Hofker MH, Frants RR, Vreugdenhil E, Gijbels MJJ, Havekes LM, Datson NA: Identification of differentially regulated genes in mildly hyperlipidemic ApoE3-Leiden mice by use of serial analysis of gene expression. Arterioscler Thromb Vasc Biol. 2001, 21: 1984-1990.
TIGR Gene Indices. [http://www.tigr.org/tdb/tgi]
Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai J, Parvizi B, Cheung F, Antonescu V, White J, et al: Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res. 2002, 12: 493-502. 10.1101/gr.212002.
Tanaka TS, Jaradat SA, Lim MK, Kargul GJ, Wang X, Grahovac MJ, Pantano S, Sano Y, Piao Y, Nagaraja R, et al: Genome-wide expression profiling of mid-gestation placenta and embryo using a 15,000 mouse developmental cDNA microarray. Proc Natl Acad Sci USA. 2000, 97: 9127-9132. 10.1073/pnas.97.16.9127.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.
We thank Paul van der Elst for help with the server, Michel Villerius, Rolf Turk, Ellen Sterrenburg, Barbera van Schaik and Antoine van Kampen for fruitful discussions, and Tineke van der Pouw-Kraan for meeting B.A.T.S. on the train. This work was supported in part by a Biomolecular Informatics program grant from the Netherlands Organization for Scientific Research (NWO-BMI; B.A.T.S.) and the Centre of Biomedical Genetics, The Netherlands (J.M.B.).
About this article
Cite this article
Svensson, B.A.T., Kreeft, A.J., van Ommen, GJ.B. et al. GeneHopper: a web-based search engine to link gene-expression platforms through GenBank accession numbers. Genome Biol 4, R35 (2003). https://doi.org/10.1186/gb-2003-4-5-r35