Text-mining assisted regulatory annotation

Table 2 Efficiency of document recovery, sequence extraction and genome mapping for the source lists of PMIDs with high cis-regulatory content

	TRANSFAC	FlyReg	ORegAnno	Queue	top4,501	All
Number of PMIDs	5,719	202	914	4,145	4,491	11,437
Number of PMIDs with PDF	5,302	187	835	3,710	3,677	9,940
Percent PMIDs with PDF	92.7%	92.6%	91.4%	89.5%	81.9%	86.9%
Number of PMIDs with text >2 Kbytes	5,051	175	793	3,517	3,498	9,440
Percent PMIDs with text >2 Kbytes	88.3%	86.6%	86.8%	84.8%	77.9%	82.5%
Efficiency of text conversion	95.3%	93.6%	95.0%	94.8%	95.1%	95.0%
Number of PMIDs with fasta sequence	4,357	155	660	3,044	3,080	8,066
Percent PMIDs with fasta sequence	76.2%	76.7%	72.2%	73.4%	68.6%	70.5%
Efficiency of sequence extraction	86.3%	88.6%	83.2%	86.6%	88.1%	85.4%
Number of PMIDs with fasta sequence mapped to genome	1,518	75	303	1,279	1,260	2,975
Percent PMIDs with fasta sequence mapped to genome	26.5%	37.1%	33.2%	30.9%	28.1%	26.0%
Efficiency of genome mapping	34.8%	48.4%	45.9%	42.0%	40.9%	36.9%

Note that the totals in the All column are less than the sum of the individual source lists because many PMIDs appear in more than one list.
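The percentage and efficiency rows of Table 2 appear to follow from the count rows. Each "Percent" value matches the count divided by the number of PMIDs in the list, and each "Efficiency" value matches the count divided by the count at the preceding stage. This reading is inferred from the numbers rather than stated in the table, so the following is only a minimal sketch under that assumption. The stage labels and the function name `stage_rates` are hypothetical names chosen for illustration.

```python
# Counts copied from the TRANSFAC column of Table 2.
# The stage labels are names chosen for this sketch, not identifiers from the paper.
TRANSFAC_COUNTS = [
    ("pmids", 5719),   # Number of PMIDs
    ("pdf", 5302),     # Number of PMIDs with PDF
    ("text", 5051),    # Number of PMIDs with text >2 Kbytes
    ("fasta", 4357),   # Number of PMIDs with fasta sequence
    ("mapped", 1518),  # Number of PMIDs with fasta sequence mapped to genome
]


def stage_rates(counts):
    """Return (stage, percent_of_all_pmids, stepwise_efficiency) per stage.

    Assumed definitions (inferred from the table, not stated in it):
      percent    = 100 * count / number of PMIDs in the list
      efficiency = 100 * count / count at the preceding stage
    """
    total = counts[0][1]
    rows = []
    # Pair each stage with the stage before it.
    for (_, prev_n), (name, n) in zip(counts, counts[1:]):
        percent = round(100 * n / total, 1)
        efficiency = round(100 * n / prev_n, 1)
        rows.append((name, percent, efficiency))
    return rows


if __name__ == "__main__":
    for name, percent, efficiency in stage_rates(TRANSFAC_COUNTS):
        print(f"{name:>6}: {percent:5.1f}% of PMIDs, {efficiency:5.1f}% of previous stage")
```

For the TRANSFAC column this reproduces the tabulated values, for example 88.3% of PMIDs with text and a text conversion efficiency of 95.3%. For the PDF stage the two figures coincide, which would explain why the table lists no separate efficiency row for document recovery.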

ISSN: 1474-760X