Table 6 Sources of errors for the gene mention normalization

From: Gene mention normalization and interaction extraction with context models and sentence motifs

| n | Cause | Evidence or examples |
|---|---|---|
| | False negatives | Evidence from abstract/closest lexicon entry |
| 24 | Polluting tokens | spectrin betaIV/spectrin beta non-erythrocytic |
| 35 | Unrecognized variations (orthographic, lexical, structural, morphological) | DCoHm/DCOHM; prothrombin/thrombin |
| 4 | Segmentation of name failed | hOBP (IIb)/hOBPIIb |
| 2 | Syntactically unrelated | polycomblike/PHD finger protein |
| 66 | Removed by filtering step | |
| | False positives | Examples, with EntrezGene ID |
| 30 | Triggered by wrong name boundary | type II IL-1 receptor |
| 30 | Context filtering (reference to cell etc.) | CD4+ |
| 22 | TF*IDF filter | five EGF-like domains; ARC complex |
| 11 | Disambiguation picked wrong gene | Nup358 (440872 instead of 5903) |
| 8 | Abbreviation resolution failed | Wolf-Hirschhorn syndrome (WHS) |
| 4 | Wrong species | Notch1 (...) murine tissues |
| 2 | Overlap of names not recognized | |
| 2 | NER missed correct ID | TR2 (8740 instead of 10587) |
| 26 | Multiple identifiers for one name | |
| 40 | Other | |
  1. Analysis of errors that occurred during gene identification, false negatives and false positives, and examples of errors. Words in italics are the parts recognized in longer compound names. NER, named entity recognition.
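
To compare the error causes at a glance, the counts in the table can be turned into totals and percentage shares. The sketch below is only an illustration: the dictionary names and the `shares` helper are hypothetical, the category labels are shortened, and it assumes the causes within each group do not overlap, which the table itself does not state.

```python
# Counts transcribed from Table 6; labels are shortened for readability.
false_negatives = {
    "Polluting tokens": 24,
    "Unrecognized variations": 35,
    "Segmentation of name failed": 4,
    "Syntactically unrelated": 2,
    "Removed by filtering step": 66,
}
false_positives = {
    "Triggered by wrong name boundary": 30,
    "Context filtering": 30,
    "TF*IDF filter": 22,
    "Disambiguation picked wrong gene": 11,
    "Abbreviation resolution failed": 8,
    "Wrong species": 4,
    "Overlap of names not recognized": 2,
    "NER missed correct ID": 2,
    "Multiple identifiers for one name": 26,
    "Other": 40,
}


def shares(counts):
    """Return (total, {cause: percentage}), assuming causes do not overlap."""
    total = sum(counts.values())
    return total, {cause: 100.0 * n / total for cause, n in counts.items()}


fn_total, fn_share = shares(false_negatives)
fp_total, fp_share = shares(false_positives)

# Print each group, largest cause first.
for label, total, share in (
    ("False negatives", fn_total, fn_share),
    ("False positives", fp_total, fp_share),
):
    print(f"{label}: {total} errors")
    for cause, pct in sorted(share.items(), key=lambda item: -item[1]):
        print(f"  {pct:5.1f}%  {cause}")
```

Under the no-overlap assumption the listed counts sum to 131 false negatives and 175 false positives, with the filtering step accounting for roughly half of the false negatives.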