Gene mention normalization and interaction extraction with context models and sentence motifs

Table 6 Sources of errors for the gene mention normalization

n	Cause	Evidence or examples
	False negatives	Evidence from abstract/closest lexicon entry
24	Polluting tokens	spectrin betaIV/spectrin beta non-erythrocytic
35	Unrecognized variations (orthographic, lexical, structural, morphological)	DCoHm/DCOHM; prothrombin/thrombin
4	Segmentation of name failed	hOBP (IIb)/hOBPIIb
2	Syntactically unrelated	polycomblike/PHD finger protein
66	Removed by filtering step
	False positives	Examples, with EntrezGene ID
30	Triggered by wrong name boundary	type II IL-1 receptor
30	Context filtering (reference to cell etc.)	CD4+
22	TF*IDF filter	five EGF-like domains; ARC complex
11	Disambiguation picked wrong gene	Nup358 (440872 instead of 5903)
8	Abbreviation resolution failed	Wolf-Hirschhorn syndrome (WHS)
4	Wrong species	Notch1 (...) murine tissues
2	Overlap of names not recognized
2	NER missed correct ID	TR2 (8740 instead of 10587)
26	Multiple identifiers for one name
40	Other

Analysis of errors that occurred during gene identification, divided into false negatives and false positives, with examples of each. Words in italics are the parts recognized in longer compound names. NER, named entity recognition.
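The relative weight of each error source is easier to judge as a share of its group than as a raw count. The following is a minimal sketch of that arithmetic, using only the counts transcribed from Table 6. The category labels are shortened, and the names FALSE_NEGATIVES, FALSE_POSITIVES, shares, and summarize are illustrative choices, not part of the original work. It assumes the rows of each group are mutually exclusive, which the table does not state explicitly.

```python
# Counts transcribed from Table 6; labels are shortened for readability.
# Assumption: the rows within each group do not overlap.
FALSE_NEGATIVES = {
    "Polluting tokens": 24,
    "Unrecognized variations": 35,
    "Segmentation of name failed": 4,
    "Syntactically unrelated": 2,
    "Removed by filtering step": 66,
}

FALSE_POSITIVES = {
    "Triggered by wrong name boundary": 30,
    "Context filtering": 30,
    "TF*IDF filter": 22,
    "Disambiguation picked wrong gene": 11,
    "Abbreviation resolution failed": 8,
    "Wrong species": 4,
    "Overlap of names not recognized": 2,
    "NER missed correct ID": 2,
    "Multiple identifiers for one name": 26,
    "Other": 40,
}


def shares(counts):
    """Return each cause's fraction of the group total."""
    total = sum(counts.values())
    return {cause: n / total for cause, n in counts.items()}


def summarize(label, counts):
    """Print the group total and the causes, largest share first."""
    print(f"{label}: {sum(counts.values())} errors")
    ranked = sorted(shares(counts).items(), key=lambda kv: kv[1], reverse=True)
    for cause, fraction in ranked:
        print(f"  {fraction:6.1%}  {cause}")


if __name__ == "__main__":
    summarize("False negatives", FALSE_NEGATIVES)
    summarize("False positives", FALSE_POSITIVES)
```

As transcribed, the false-negative rows sum to 131 and the false-positive rows to 175. On these counts, the filtering step accounts for about half of the false negatives, while the false positives are spread more evenly across causes.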

ISSN: 1474-760X