
Table 4 GN results: performance impact of the seven heuristics used to normalize gene names on the development data.

From: Concept recognition for extracting protein interaction relations from biomedical text

 

| Step | Rule | Example | Precision (P) | Recall (R) | F-measure (F) |
|------|------|---------|---------------|------------|---------------|
| 0 | | | 0.783 | 0.469 | 0.586 |
| 1 | Substitution: Roman numerals to Arabic numerals | carbonic anhydrase XI to carbonic anhydrase 11 | 0.778 | 0.492 | 0.603 |
| 2 | Substitution: Greek letters to single letters | AP-2alpha to AP-2a | 0.779 | 0.497 | 0.607 |
| 3 | Normalization of case | CAMK2A to camk2a | 0.787 | 0.619 | 0.693 |
| 4 | Removal: parenthesized material | sialyltransferase 1 (beta-galactoside alpha-2,6-sialyltransferase) to sialyltransferase 1 | 0.782 | 0.623 | 0.694 |
| 5 | Removal: punctuation | VLA-2 to VLA2 | 0.768 | 0.667 | 0.714 |
| 6 | Removal: spaces | calcineurin B to calcineurinB | 0.784 | 0.742 | 0.762 |
| 7 | Removal: strings shorter than 2 characters | P | 0.827 | 0.727 | 0.774 |

  1. The seven heuristics used to normalize gene names, both during lexicon construction and when processing the gene tagger output, with the performance on the development data after each step was applied. GN, gene normalization.
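
The seven heuristics in the table can be sketched as a small chain of string transformations. This is only an illustrative sketch, not the authors' implementation: the function names, the regular expressions, the partial Greek-letter mapping, and the restriction of Roman numerals to standalone upper-case tokens built from I, V and X are all assumptions made here. It also assumes the steps are applied cumulatively in the order listed.

```python
import re
from typing import Optional

# Hypothetical, deliberately partial mapping; the source does not list
# which Greek letter names it handles.
GREEK_TO_LETTER = {
    "alpha": "a", "beta": "b", "gamma": "g",
    "delta": "d", "epsilon": "e", "kappa": "k",
}
GREEK_PATTERN = re.compile("|".join(GREEK_TO_LETTER), re.IGNORECASE)

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}
# Assumption: only standalone upper-case tokens made of I, V and X are
# treated as Roman numerals (so "XI" matches but "VLA" does not).
ROMAN_PATTERN = re.compile(r"\b[IVX]+\b")


def _roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral by scanning right to left."""
    total, previous = 0, 0
    for char in reversed(numeral):
        value = ROMAN_VALUES[char]
        # A smaller value to the left of a larger one is subtracted (IX = 9).
        total += -value if value < previous else value
        previous = value
    return total


def roman_to_arabic(name: str) -> str:      # rule 1
    return ROMAN_PATTERN.sub(lambda m: str(_roman_to_int(m.group())), name)


def greek_to_letter(name: str) -> str:      # rule 2
    return GREEK_PATTERN.sub(lambda m: GREEK_TO_LETTER[m.group().lower()], name)


def lowercase(name: str) -> str:            # rule 3
    return name.lower()


def strip_parenthesized(name: str) -> str:  # rule 4
    return re.sub(r"\s*\([^)]*\)", "", name)


def strip_punctuation(name: str) -> str:    # rule 5
    return re.sub(r"[^\w\s]|_", "", name)


def strip_spaces(name: str) -> str:         # rule 6
    return re.sub(r"\s+", "", name)


STEPS = (roman_to_arabic, greek_to_letter, lowercase,
         strip_parenthesized, strip_punctuation, strip_spaces)


def normalize(name: str) -> Optional[str]:
    """Apply rules 1-6 in order; rule 7 drops results shorter than 2 characters."""
    for step in STEPS:
        name = step(name)
    return name if len(name) >= 2 else None
```

Under these assumptions, `normalize("AP-2alpha")` returns `"ap2a"` and `normalize("P")` returns `None`. Applying the same function to both lexicon entries and tagger output is what allows surface variants such as `VLA-2` and `vla2` to match.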