The success (or not) of HUGO nomenclature
© BioMed Central Ltd 2006
Published: 15 May 2006
Current usage of gene nomenclature is ambiguous and impairs the efficient handling of scientific information. Therefore it is important to propose guidelines to deal with this problem. This study attempts to evaluate the success of HUGO nomenclature for human genes. The results indicate that HUGO guidelines are not supported by the scientific community.
Ambiguous gene names impose a serious hurdle for the analysis of a wide range of high-throughput data, such as microarray experiments or protein-interaction maps. This sort of ambiguity also limits the efficiency of genome analysis and annotation and slows the implementation of automatic text-mining systems for using bibliographic information [1, 2]. While systems for automatic name recognition in other domains (such as business or news reports) perform very well, the best systems in the biological field perform just slightly better than 80%.
Genes are commonly named using functional terms, such as 'insulin' or 'tumor necrosis factor', or symbols consisting of abbreviations such as INS for insulin or TNF for tumor necrosis factor. Functional names are usually unique, in the sense that a given name refers only to one gene family, even if not always to a single gene of the family. Ambiguity exists because more than one functional name is often used to refer to the same gene (synonymy), and because many functional names describe some phenotype of the gene (such as 'deafness' or 'wingless'), a practice that creates many complications. The use of symbols should alleviate some of the problems created by the use of functional names, but in practice it seems to produce even more ambiguities. In addition to extended synonymy (with many symbols describing the same gene), a given symbol can also be used to describe different genes (homonymy). Moreover, the abbreviation used for a gene name can match many unrelated meanings (acronyms). Text-mining systems are severely limited by these factors: ambiguities decrease the precision with which correct articles are retrieved, and synonyms limit the total number of articles retrieved.
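The effect of homonymy on precision and of synonymy on recall can be illustrated with a minimal sketch. The corpus, the symbols SYM1 and SYM2, and the gene labels below are hypothetical and serve only to make the two measures concrete; they are not data from this study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved article ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy corpus: article id -> (symbol used in the text, gene meant).
CORPUS = {
    "a1": ("SYM1", "geneA"),
    "a2": ("SYM1", "geneB"),  # homonym: same symbol, different gene
    "a3": ("SYM2", "geneA"),  # synonym: different symbol, same gene
    "a4": ("SYM1", "geneA"),
}


def search(symbols):
    """Retrieve every article whose text uses one of the given symbols."""
    return [aid for aid, (symbol, _) in CORPUS.items() if symbol in symbols]


# Articles that are really about geneA.
relevant = [aid for aid, (_, gene) in CORPUS.items() if gene == "geneA"]

# Searching one symbol: the homonym lowers precision, the synonym lowers recall.
p_single, r_single = precision_recall(search({"SYM1"}), relevant)

# Adding the synonym restores recall but leaves the homonym in the results.
p_both, r_both = precision_recall(search({"SYM1", "SYM2"}), relevant)
```

In this toy case, searching for SYM1 alone misses the article that uses the synonym and picks up the article about the other gene; adding SYM2 to the query recovers the missed article but cannot remove the homonym.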
These limitations potentially impair the effective application of text mining and natural language processing (NLP) techniques in genomics. For instance, the comparison of microarray data from different sources requires the exact mapping of the names used by different authors. This task can be greatly complicated by ambiguous names such as 'PAP', which can refer to five different human genes, and will therefore be impossible to classify in the absence of additional information. In this type of situation, valuable experimental information could be lost because of nomenclature problems that could be solved by the use of standard names.
Standard nomenclatures, strictly following naming guidelines, are the most obvious solution to the problem. Indeed, considerable community effort has gone into the creation of these standards for gene symbols in organisms such as yeast, mouse, fly, and, of course, human. An illustrative example is the valuable effort of HUGO nomenclature for human genes [5, 6]. A single official symbol is proposed for every gene, and the aliases (alternative symbols, synonyms) for each gene are also listed. The obvious concern is the extent to which scientists follow these nomenclature rules. Other instances of standard nomenclatures, such as enzymatic codes (EC numbers), have been loosely followed.
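The scheme of one official symbol plus a list of aliases can be pictured as a simple lookup table. The sketch below is only an illustration under stated assumptions: the PVR and XCL1 entries use symbols mentioned in this article, while the five `PAP_GENE_n` entries are hypothetical placeholders standing in for the five human genes that share the alias 'PAP'. It does not reproduce the actual HUGO database or its format.

```python
from collections import defaultdict

# Official symbol -> aliases. PVR and XCL1 are taken from this article;
# the PAP_GENE_n keys are hypothetical placeholders, not real symbols.
OFFICIAL_TO_ALIASES = {
    "PVR": ["CD155"],
    "XCL1": ["SCM1", "ATAC", "LTN", "LPTN"],
    **{f"PAP_GENE_{i}": ["PAP"] for i in range(1, 6)},
}


def build_index(table):
    """Map every symbol (official or alias) to the official symbols it may denote."""
    index = defaultdict(set)
    for official, aliases in table.items():
        index[official.upper()].add(official)
        for alias in aliases:
            index[alias.upper()].add(official)
    return index


def resolve(symbol, index):
    """Return the sorted candidate official symbols for a symbol found in text."""
    return sorted(index.get(symbol.upper(), ()))


INDEX = build_index(OFFICIAL_TO_ALIASES)
```

With this table, `resolve("CD155", INDEX)` yields a single candidate, whereas `resolve("PAP", INDEX)` yields five, which is exactly the situation in which additional context is needed to decide which gene is meant.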
A positive observation is that this small increment is in part due to new genes that are named preferentially according to the official standards. The genes mentioned for the first time after the year 2000 have a higher proportion of official symbols and a smaller number of synonyms (Figure 1); however, it can still be argued that it is only a question of time for these genes to acquire new synonyms. Furthermore, highly referenced genes are cited notably more often by unofficial gene names. For example, in 2004, only 38% of genes cited in more than 50 articles were named predominantly by following HUGO, whereas scarcely cited genes more often followed the standards (54% in 2004).
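A statistic of this kind, the share of genes cited predominantly by their official symbol, split by how often the gene is cited, could be computed along the following lines. This is a sketch and not the procedure used in this study: the counts are invented for illustration, 'predominantly' is assumed to mean that official-symbol citations outnumber alias citations, and the number of articles citing a gene is approximated by the sum of the two counts.

```python
def share_official(usage, min_articles=0, max_articles=None):
    """Fraction of genes cited more often by official symbol than by aliases.

    usage maps a gene to (articles using the official symbol,
    articles using an alias). Only genes with more than min_articles and
    at most max_articles total citations are considered.
    Returns None when no gene falls in the requested range.
    """
    selected = [
        (official, alias)
        for official, alias in usage.values()
        if official + alias > min_articles
        and (max_articles is None or official + alias <= max_articles)
    ]
    if not selected:
        return None
    predominant = sum(1 for official, alias in selected if official > alias)
    return predominant / len(selected)


# Hypothetical counts for one year: gene -> (official, alias).
USAGE = {
    "g1": (10, 60),
    "g2": (40, 30),
    "g3": (5, 2),
    "g4": (1, 3),
    "g5": (8, 1),
}

highly_cited = share_official(USAGE, min_articles=50)
scarcely_cited = share_official(USAGE, max_articles=50)
```

The threshold of 50 articles mirrors the one quoted above; any other cut-off can be passed through the same two parameters.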
The tendency to improve the situation by replacing aliases with official HUGO symbols is, unfortunately, weak. Changes in name usage, whether from official symbols to aliases or from aliases to official symbols, are infrequent, and the nomenclature of most genes remains rather stable over time. These findings seem to confirm the intuition that researchers remain attached to their favorite names.
This trend is not species-dependent. For example, in yeast, where there is also a proposed standard nomenclature, there is no tendency to replace aliases with official names (the usage of official names has remained approximately the same in recent years as in the past), even though official names are used more often in this community (85% of the genes are preferentially cited using official names).
A similar case is illustrated in Figure 2b for the gene encoding the poliovirus receptor. In the mid-1990s, the only symbol used was PVR (which is today the official name for the gene). The alternative name CD155 for the protein appeared for the first time in 1997, but gained greater acceptance after the publication in the late 1990s of several articles describing structural aspects of the CD155 protein that are critical to the interaction with the virus (CD nomenclature for cell-surface proteins follows a long-established standard). These articles named the gene CD155, and this has been the preferred name since then. In this case, HUGO nomenclature apparently did not take this fact into account, since the establishment of PVR as the official gene name took place in 2003.
Finally, Figure 2c shows an interesting case of the persistence of several different names for one gene, that for the chemokine lymphotactin. The cloning of this gene was reported almost simultaneously by three independent groups in Japan, Germany and the USA in 1995 [10–12]. The three groups named the gene differently (SCM1, ATAC and LTN, respectively). These names have all been used since then, as well as LPTN and, lately, the official name XCL1. It is interesting to notice that the three groups reporting the discovery kept using their own names for the gene, at least until very recently, a trend that can be observed also in the previous examples.
The problem of linking names in texts with the molecules they refer to can only be solved by a concerted community effort to explicitly mention the official names and/or the corresponding database accession numbers (such as those of UniProt or RefSeq for proteins, and GenBank for genes). The use of accession numbers has the advantage of providing a unique and unambiguous reference that is also a direct link to the real biological object. But it does have some drawbacks. Citing accession numbers instead of gene or protein names would seriously affect the clarity and readability of the text. From this point of view, names and accession numbers must coexist. This could be done, for instance, by citing only names in the main text, and listing the accession numbers for the protein or gene names used in the text in a separate section. Also, our experience is that mapping between different databases is not free of problems. For instance, a single nucleotide sequence often has several different entries, corresponding to splice variants, polymorphisms or regions of the genome. Also, for these references to be really useful, they would have to cover all the mentions of genes, including anaphora (the use of a linguistic unit, such as the pronoun 'it', to refer to a previous mention of the name) and other forms of implicit mention, and to take into account the difference between individual genes and proteins and general protein names referring to, for instance, protein families (for example, 'tubulin beta1 protein' can be assigned to a well-defined molecule, but 'tubulin' cannot, since it can refer to several different molecules). It would be important to develop adequate tools to facilitate the introduction of names and identifiers at the time of writing papers, and to enable their subsequent retrieval by both humans and software tools.
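The suggestion of keeping names in the main text and listing accession numbers in a separate section can be sketched as a small authoring aid. Everything specific in the example is an assumption: the accession strings `ACC-0001` and `ACC-0002` are placeholders rather than real database identifiers, and the matching is a plain whole-word search that ignores anaphora, family names and the database-mapping problems discussed above.

```python
import re

# Hypothetical mapping from gene symbol to accession number.
# The ACC-... values are placeholders, not real database identifiers.
NAME_TO_ACCESSION = {
    "PVR": "ACC-0001",
    "XCL1": "ACC-0002",
}


def accession_section(text, mapping):
    """List 'name: accession' lines for every mapped name found in the text.

    Names are matched as whole words, so 'XCL1' does not match 'XCL1A'.
    """
    lines = []
    for name in sorted(mapping):
        if re.search(rf"\b{re.escape(name)}\b", text):
            lines.append(f"{name}: {mapping[name]}")
    return lines


manuscript = "PVR mediates poliovirus entry. XCL1A is not a symbol we map."
section = accession_section(manuscript, NAME_TO_ACCESSION)
```

Here only PVR is reported, because the whole-word match deliberately refuses to treat the longer token as a mention of XCL1. A real tool would need a curated name table and a more careful recognition step.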
The task of tagging genes and proteins in papers with the corresponding official names and/or database entries will require the collaboration of authors, journals and grant agencies, and could be facilitated by the development of adequate text-mining methods.
J.T. developed the gene name recognition system Text Detective as part of his work at BioAlma SL (Tres Cantos, Madrid, Spain). This work was partly supported by research grants ENFIN LSGH-CT-2005-518254 (VI Framework Programme, European Commission), ESPAÑOL BIO2004-00875 (Spanish Ministry of Education and Science), and Fundación BBVA.