Concept recognition for extracting protein interaction relations from biomedical text

Table 4 GN results: performance impact of the seven heuristics used to normalize gene names on the development data.

Step	Rule	Example	P	R	F
0	None (baseline)		0.783	0.469	0.586
1	Substitution: Roman numerals > Arabic numerals	carbonic anhydrase XI to carbonic anhydrase 11	0.778	0.492	0.603
2	Substitution: Greek letters > single letters	AP-2alpha to AP-2a	0.779	0.497	0.607
3	Normalization of case	CAMK2A to camk2a	0.787	0.619	0.693
4	Removal: parenthesized materials	sialyltransferase 1 (beta-galactoside alpha-2,6-sialyltransferase) to sialyltransferase 1	0.782	0.623	0.694
5	Removal: punctuation	VLA-2 to VLA2	0.768	0.667	0.714
6	Removal: spaces	calcineurin B to calcineurinB	0.784	0.742	0.762
7	Removal: strings < 2 characters	P	0.827	0.727	0.774

The seven heuristics used to normalize gene names, both during lexicon construction and when processing the gene tagger output, are listed together with the performance on the development data after each step was applied. GN, gene normalization; P, precision; R, recall; F, F-measure.
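
The seven heuristics in Table 4 can be sketched as a single normalization function applied in the order listed. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name normalize_gene_name, the Roman numeral table (limited here to I through XX), the Greek letter table (a small subset), and the choice to match Greek letter names as substrings are all hypothetical details that the table does not specify.

```python
import re
import string
from typing import Optional

# Assumption: only Roman numerals I-XX are mapped; the source gives no range.
ROMAN = {
    "I": "1", "II": "2", "III": "3", "IV": "4", "V": "5",
    "VI": "6", "VII": "7", "VIII": "8", "IX": "9", "X": "10",
    "XI": "11", "XII": "12", "XIII": "13", "XIV": "14", "XV": "15",
    "XVI": "16", "XVII": "17", "XVIII": "18", "XIX": "19", "XX": "20",
}

# Assumption: a small, illustrative subset of spelled-out Greek letters.
GREEK = {"alpha": "a", "beta": "b", "gamma": "g",
         "delta": "d", "epsilon": "e", "kappa": "k"}

_ROMAN_RE = re.compile(r"\b[IVX]+\b")            # whole upper-case tokens only
_GREEK_RE = re.compile("|".join(GREEK), re.IGNORECASE)
_PAREN_RE = re.compile(r"\s*\([^)]*\)")          # "(...)" plus leading space
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_gene_name(name: str) -> Optional[str]:
    """Apply the seven heuristics of Table 4 in order.

    Returns None when the result is shorter than two characters (rule 7).
    """
    # 1. Roman numerals -> Arabic numerals (tokens not in the table are kept).
    s = _ROMAN_RE.sub(lambda m: ROMAN.get(m.group(0), m.group(0)), name)
    # 2. Spelled-out Greek letters -> single letters.
    s = _GREEK_RE.sub(lambda m: GREEK[m.group(0).lower()], s)
    # 3. Normalization of case.
    s = s.lower()
    # 4. Removal of parenthesized material.
    s = _PAREN_RE.sub("", s)
    # 5. Removal of punctuation.
    s = s.translate(_PUNCT_TABLE)
    # 6. Removal of spaces.
    s = "".join(s.split())
    # 7. Removal of strings shorter than two characters.
    return s if len(s) >= 2 else None


if __name__ == "__main__":
    for example in ["carbonic anhydrase XI", "AP-2alpha", "VLA-2", "P"]:
        print(example, "->", normalize_gene_name(example))
```

Because the same function would be applied to both the lexicon entries and the gene tagger output, matching reduces to exact string comparison on the normalized forms. Note that the function applies all seven steps cumulatively, so its output for a single example differs from the per-rule illustrations in the Example column (for instance, "calcineurin B" becomes "calcineurinb" rather than "calcineurinB").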

ISSN: 1474-760X