MatchMiner: a tool for batch navigation among gene and gene product identifiers
© Bussey et al.; licensee BioMed Central Ltd. 2003
- Received: 10 October 2002
- Accepted: 28 February 2003
- Published: 25 March 2003
MatchMiner is a freely available program package for batch navigation among gene and gene product identifier types commonly encountered in microarray studies and other forms of 'omic' research. The user inputs a list of gene identifiers and then either uses the Merge function to find the overlap with a second list of identifiers of the same or a different type, or uses the LookUp function to find corresponding identifiers.
- Bacterial Artificial Chromosome
- Bacterial Artificial Chromosome Clone
- Array Comparative Genomic Hybridization
- Gene Identifier
- UniGene Cluster
One of the more painful tasks in 'omic' research [1, 2] is navigating among different gene or gene product identifiers. After a cDNA microarray experiment, for example, one usually must translate from IMAGE clone ids to GenBank accession numbers, HUGO names, common names, or chromosome locations for a list of genes. As we generate more and more data from diverse platforms and species, such translations will become increasingly complex but also more important to the synthesis of a coherent biological picture. Beyond simply looking up additional information about a list of genes, such synthesis will require the ability to find the intersection between two lists of genes that are designated by the same or a different identifier type.
Currently, the basic translations can be done on a gene-by-gene basis using public databases such as UniGene, LocusLink, OMIM (Online Mendelian Inheritance in Man), and the working draft of the human genome (from the University of California Santa Cruz (UCSC) or the National Center for Biotechnology Information) [3–5] or else in batch through Source [6] or GeneLynx [7]. However, no single data source contains all the necessary information about every gene and, to complicate matters further, the relationships among identifiers are often not one-to-one. For example, there may be several GenBank accession numbers and multiple IMAGE clone ids for the same gene, and a single gene symbol may be an alias for multiple different genes. Therefore, any high-throughput solution to the problem must take these challenges into account and respond with an approach that minimizes the need for human intervention. At the same time, those instances when human intervention is necessary must be flagged and enough metadata must be provided for accurate decision-making without extensive further research.
Motivated by many days spent at the computer doing these tedious, time-consuming translations for our own experimental data, we developed MatchMiner [8] as a freely available public resource that automates the process for collections of genes. MatchMiner provides two primary functions. The first, LookUp, translates an input list of gene identifiers into a matching output list of identifiers of a different type; the second, Merge, combines two separate lists of either the same or different types of identifiers into one list that details all one-to-one, one-to-many, and many-to-many relationships between corresponding gene identifiers in the two lists.
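A LookUp-style translation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifier pairs, the `build_index` and `lookup` names, and the flag labels are all hypothetical and are not MatchMiner's data or interface. The sketch shows why each input should map to a list of outputs, so that multiple matches and misses stay visible instead of being silently collapsed.

```python
from collections import defaultdict

# Hypothetical (IMAGE clone id, gene symbol) pairs; not real annotation data.
PAIRS = [
    ("IMAGE:0001", "GENE_A"),
    ("IMAGE:0002", "GENE_A"),
    ("IMAGE:0003", "GENE_B"),
    ("IMAGE:0003", "GENE_C"),  # one clone associated with two symbols
]


def build_index(pairs):
    """Index pairs so each input identifier maps to all of its outputs."""
    index = defaultdict(set)
    for source_id, target_id in pairs:
        index[source_id].add(target_id)
    return index


def lookup(ids, index):
    """Translate each input id, preserving input order.

    Returns (input id, sorted matches, flag) rows, where flag is
    'none', 'unique', or 'multiple' so ambiguous rows can be reviewed.
    """
    rows = []
    for identifier in ids:
        matches = sorted(index.get(identifier, ()))
        if not matches:
            flag = "none"
        elif len(matches) == 1:
            flag = "unique"
        else:
            flag = "multiple"
        rows.append((identifier, matches, flag))
    return rows


INDEX = build_index(PAIRS)
for row in lookup(["IMAGE:0001", "IMAGE:0003", "IMAGE:9999"], INDEX):
    print(row)
```

Rows flagged `multiple` or `none` are the ones that would need human review.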
In one illustrative case that motivated development of MatchMiner, we (X. Lee, K.J.B., F.G. Gwadry, W.C.R., G. Riddick, S. L. Pelletier, S.N., and J. N.W., unpublished data) had to match up as many as possible of 9,706 cDNA microarray clones [9, 10] with HU6,800 Affymetrix chip oligonucleotide sets [11], having run both platforms on the same 60 human cancer cell samples (the NCI-60). To do so, we developed an early form of MatchMiner. The particular task was to identify all relationships between the 9,706 IMAGE clone ids and 7,129 GenBank accession numbers based on UniGene cluster membership. To complete the task manually, one gene at a time at maximum speed (about 30 seconds per gene), would take over 140 hours, even if one could keep accurate track of the results. In contrast, the current version of MatchMiner took 10 minutes on a 750 MHz Pentium III PC with 320 MB RAM to generate the merged list, specifying all possible matches between IMAGE clone ids and GenBank accession numbers. When we compared MatchMiner Merge results with those obtained using the LookUp function for a random sample of the genes, there were no discrepancies. The same task with Source required translating both lists into UniGene clusters and then further processing the data. After identification and reformatting of entries with multiple UniGene cluster associations, the resulting lists were imported into Microsoft Access and queried to create the appropriate matches. The entire procedure gave results similar to those of MatchMiner but took approximately one hour, most of that user time.
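The idea of merging two identifier lists through shared UniGene cluster membership can be sketched as follows. This is a minimal illustration, not MatchMiner's implementation: the clone ids, accession numbers, cluster ids, function names, and relationship labels are all hypothetical.

```python
from collections import defaultdict

# Hypothetical identifier -> UniGene cluster assignments (illustrative only).
CLONE_TO_CLUSTERS = {
    "IMAGE:0001": {"Hs.1"},
    "IMAGE:0002": {"Hs.2"},
    "IMAGE:0003": {"Hs.2"},
    "IMAGE:0004": {"Hs.3", "Hs.4"},
    "IMAGE:0005": {"Hs.9"},  # no partner in the other list
}
ACCESSION_TO_CLUSTERS = {
    "ACC0001": {"Hs.1"},
    "ACC0002": {"Hs.2"},
    "ACC0003": {"Hs.3"},
    "ACC0004": {"Hs.4"},
}


def merge_by_cluster(left, right):
    """Pair identifiers from two lists that share at least one cluster."""
    cluster_to_right = defaultdict(set)
    for right_id, clusters in right.items():
        for cluster in clusters:
            cluster_to_right[cluster].add(right_id)

    pairs = set()
    for left_id, clusters in left.items():
        for cluster in clusters:
            for right_id in cluster_to_right.get(cluster, ()):
                pairs.add((left_id, right_id, cluster))
    return sorted(pairs)


def classify(pairs):
    """Label each pair by how many partners each side has overall."""
    left_partners = defaultdict(set)
    right_partners = defaultdict(set)
    for left_id, right_id, _ in pairs:
        left_partners[left_id].add(right_id)
        right_partners[right_id].add(left_id)

    labelled = []
    for left_id, right_id, cluster in pairs:
        many_right = len(left_partners[left_id]) > 1
        many_left = len(right_partners[right_id]) > 1
        if many_left and many_right:
            kind = "many-to-many"
        elif many_left:
            kind = "many-to-one"
        elif many_right:
            kind = "one-to-many"
        else:
            kind = "one-to-one"
        labelled.append((left_id, right_id, cluster, kind))
    return labelled


MERGED = classify(merge_by_cluster(CLONE_TO_CLUSTERS, ACCESSION_TO_CLUSTERS))
for row in MERGED:
    print(row)
```

Keeping the shared cluster id in each output row makes the translation path traceable, and the relationship label singles out the pairs that are not one-to-one.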
Comparison of the capabilities of gene identifier translation tools
Translation path traceable in interactive (single-gene) mode?
Translation path traceable in batch (gene-list) mode?
Multiple input associations flagged?
Output in form suitable for automated processing?
Command line, Web application
Yes, if "Show all Cluster Ids if Multiple Clusters" option selected
As noted previously, identifiers are not always unique or uniquely assigned. For example, GenBank accession numbers are specific to a sequence, but the assignment of that sequence to a gene may change over time. Even more disconcerting, common gene names or aliases are often used by different investigators for different genes. Therefore, it is important to look in detail at the results of searches to check for correspondences other than one-to-one and to examine the data source tags to get a sense of the strength of the association between identifiers.
One non-obvious advantage of MatchMiner is that it can combine information from more than one of the data sources to show matches that could not be made on the basis of any single source. The gene ACVR2B, which has aliases ACTR-IIB and ACTRIIB, provides an example. LocusLink and OMIM both reference the HUGO symbol ACVR2B, but LocusLink does not reference ACTRIIB, and OMIM does not reference ACTR-IIB. Therefore, if one of the aliases were used as input, the success of any search outside of MatchMiner would be data-source dependent.
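The ACVR2B example can be made concrete with a small Python sketch. The alias coverage of each source follows the description above; the dictionary layout and the `resolve` and `overlap` functions are assumptions made for illustration and do not reflect how MatchMiner stores or queries its data.

```python
# Alias coverage as described in the text: both sources know the HUGO symbol
# ACVR2B, but each knows only one of the two aliases. The dictionary layout
# is an assumption made for illustration.
ALIASES_BY_SOURCE = {
    "LocusLink": {"ACVR2B": "ACVR2B", "ACTR-IIB": "ACVR2B"},
    "OMIM": {"ACVR2B": "ACVR2B", "ACTRIIB": "ACVR2B"},
}


def resolve(name, sources=ALIASES_BY_SOURCE):
    """Return {symbol: [sources supporting it]} for a name or alias."""
    hits = {}
    for source_name, table in sources.items():
        symbol = table.get(name)
        if symbol is not None:
            hits.setdefault(symbol, []).append(source_name)
    return hits


def overlap(list_a, list_b, sources=ALIASES_BY_SOURCE):
    """Symbols reachable from both lists once every source is consulted."""
    def symbols(names):
        found = set()
        for name in names:
            found.update(resolve(name, sources))
        return found
    return symbols(list_a) & symbols(list_b)


print(resolve("ACTR-IIB"))
print(resolve("ACTRIIB"))
print(overlap(["ACTR-IIB"], ["ACTRIIB"]))

# With LocusLink alone, the two aliases cannot be matched to each other.
single = {"LocusLink": ALIASES_BY_SOURCE["LocusLink"]}
print(overlap(["ACTR-IIB"], ["ACTRIIB"], single))
```

Recording which source supported each hit corresponds to the data source tags mentioned earlier, which let the user judge the strength of an association.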
ChainOfResponsibility hierarchies for data sources in MatchMiner
Hierarchy of source reliability
UCSC Known Genes, LocusLink, UniGene, UCSC EST, OMIM
GenBank accession number
UCSC Known Genes, LocusLink, UniGene, UCSC EST, OMIM
HUGO gene symbol
IMAGE clone id
Long gene name
UCSC Known Genes, LocusLink, UniGene, UCSC EST, OMIM
Affymetrix probe id
UniGene cluster id
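A Chain of Responsibility over data sources, in the sense of the design pattern described by Gamma et al. [13], might look like the following Python sketch. The source order is taken from the hierarchy listed above; the class and function names and all table contents are invented for illustration, and how MatchMiner actually implements its hierarchies is not shown in this article.

```python
class SourceHandler:
    """One link in the chain: answer if possible, otherwise defer."""

    def __init__(self, name, table, successor=None):
        self.name = name
        self.table = table          # hypothetical identifier -> value data
        self.successor = successor  # next, less preferred source

    def handle(self, identifier):
        if identifier in self.table:
            return self.table[identifier], self.name
        if self.successor is not None:
            return self.successor.handle(identifier)
        return None, None


def build_chain(ordered_sources):
    """Link handlers so the first entry is consulted first."""
    head = None
    for name, table in reversed(ordered_sources):
        head = SourceHandler(name, table, head)
    return head


# Source order taken from the hierarchy above; table contents are invented.
CHAIN = build_chain([
    ("UCSC Known Genes", {"ID_1": "GENE_A"}),
    ("LocusLink", {"ID_1": "GENE_A_OLD", "ID_2": "GENE_B"}),
    ("UniGene", {}),
    ("UCSC EST", {}),
    ("OMIM", {"ID_3": "GENE_C"}),
])

for identifier in ("ID_1", "ID_2", "ID_3", "ID_4"):
    print(identifier, CHAIN.handle(identifier))
```

In this sketch the most preferred source that knows an identifier answers, and the name of the answering source is returned alongside the value so the result can be tagged with its origin.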
Although currently human-specific, MatchMiner will be expanded in the near future to incorporate data from other species, with emphasis on mouse. Additional features to be implemented include the ability to handle lists of mixed types of identifiers, the ability to request multiple types of identifiers within a single search, and the incorporation of additional public sources for use in making translations. We will continue to enhance and develop MatchMiner under a contract funded by the Center for Cancer Research of the US National Cancer Institute.
MatchMiner is available as a web application or as a command-line jar file at [8]. The MatchMiner database is maintained on our server and updated at approximately 6-month intervals. Detailed documentation for both implementations is available at the site.
In summary, MatchMiner is an efficient application for navigating the complex world of gene and gene product identifiers. It can batch search publicly available databases to convert between identifier types and can determine the intersection of two gene lists with different identifiers. MatchMiner will greatly enhance the ability of the research community to annotate and compare 'omic' datasets.
- Weinstein JN: Fishing expeditions. Science. 1998, 282: 628-629. 10.1126/science.282.5389.627g.
- Weinstein JN: 'Omic' and hypothesis-driven research in the molecular pharmacology of cancer. Curr Opin Pharmacol. 2002, 2: 361-365. 10.1016/S1471-4892(02)00185-6.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL: GenBank. Nucleic Acids Res. 2002, 30: 17-20. 10.1093/nar/30.1.17.
- Wheeler DL, Church DM, Lash AE, Leipe DD, Madden TL, Pontius JU, Schuler GD, Schriml LM, Tatusova TA, Wagner L, et al: Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res. 2002, 30: 13-16. 10.1093/nar/30.1.13.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
- Source. [http://source.stanford.edu]
- GeneLynx: a portal to the human genome. [http://www.genelynx.org]
- MatchMiner. [http://discover.nci.nih.gov/matchminer]
- Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, Kohn KW, Reinhold WC, Myers TG, Andrews DT, et al: A gene expression database for the molecular pharmacology of cancer. Nat Genet. 2000, 24: 236-244. 10.1038/73439.
- Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, Van de RM, Waltham M, et al: Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet. 2000, 24: 227-235. 10.1038/73432.
- Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, et al: Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci USA. 2001, 98: 10787-10792. 10.1073/pnas.191368598.
- Cheung VG, Nowak N, Jang W, Kirsch IR, Zhao S, Chen XN, Furey TS, Kim UJ, Kuo WL, Olivier M, et al: Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature. 2001, 409: 953-958. 10.1038/35057192.
- Gamma E, Helm R, Johnson R, Vlissides J: Design Patterns. 1995, Boston: Addison-Wesley.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.