Automating curation using a natural language processing pipeline
Genome Biology volume 9, Article number: S10 (2008)
The tasks in BioCreative II were designed to approximate some of the laborious work involved in curating biomedical research papers. The approach to these tasks taken by the University of Edinburgh team was to adapt and extend the existing natural language processing (NLP) system that we have developed as part of a commercial curation assistant. Although this paper concentrates on using NLP to assist with curation, the system can be equally employed to extract types of information from the literature that is immediately relevant to biologists in general.
Our system was among the highest performing on the interaction subtasks, and competitive performance on the gene mention task was achieved with minimal development effort. For the gene normalization task, a string matching technique that can be quickly applied to new domains was shown to perform close to average.
The technologies being developed were shown to be readily adapted to the BioCreative II tasks. Although high performance may be obtained on individual tasks such as gene mention recognition and normalization, and document classification, tasks in which a number of components must be combined, such as detection and normalization of interacting protein pairs, are still challenging for NLP systems.
Curating biomedical literature into relational databases is a laborious task, in view of the quantity of biomedical research papers that are published on a daily basis. It is widely argued that text mining could simplify and speed up this task [1–3]. In this report we describe how a text mining system developed for a commercial curation project was adapted for the BioCreative II competition. Our submission (team 6) to this competition is based on research carried out as part of the Text Mining (TXM) program, a 3-year project aimed at producing natural language processing (NLP) tools to assist in the curation of biomedical papers. The principal product of this project is an information extraction (IE) pipeline, designed to extract named entities (NEs) and relations relevant to the biomedical domain, and to normalize the NEs to appropriate ontologies (Figure 1). Although the TXM pipeline is designed to assist specialized users, such as curators, it can equally be employed to extract information from the literature that is immediately relevant to biologists in general. For example, it can be used to automatically create large-scale databases or to generate protein-protein interaction networks.
In our BioCreative II submissions, we used the first release of the TXM pipeline, which identifies proteins, normalizes them to a RefSeq-derived lexicon, and extracts mentions of protein-protein interactions (PPIs). Since then the pipeline has been extended to identify a wider range of NEs, including proteins, protein complexes, fragments and mutants, modifications, experimental methods, and cell lines. The latest pipeline can predict nested as well as non-nested entities; in other words, it can predict entities that contain, or are contained in, other entities. Furthermore, the PPIs have been enriched with additional information of biological interest, for example whether the PPI is direct or indirect, or what experimental method is used to detect the interaction. In order to demonstrate its adaptability, and to satisfy the needs of the commercial partner, the TXM pipeline was also adapted to the tissue expression domain. In this adaptation, the pipeline was further extended to recognize and normalize an appropriate set of NEs for that domain, such as tissue, protein, mRNA/cDNA, and gene, and to extract and enrich relations that indicate which proteins are expressed in which tissue types.
The TXM pipeline includes both rule-based linguistic preprocessing and machine learning (ML)-based IE components, trained on corpora annotated as part of the project. Greater detail regarding the exact implementation of the components is provided in the Materials and methods section (below). The partial reliance on ML is intended to make the pipeline more adaptable, so that it can be easily ported to a different domain if an annotated corpus is available. This adaptability is further enhanced by the use of string distance measures for term normalization, providing a generic method of rapidly comparing the textual form of entities with lexicon entries. Because the pipeline is designed to predict candidate NEs, their normalizations, and PPIs, BioCreative II provided an ideal testing ground to investigate how the pipeline generalizes from its training set. Indeed, one of the largest contributions of BioCreative II is providing training corpora to the research community. These annotated corpora provide common evaluation sets for fair comparison of different text mining algorithms, give researchers the means to develop new ML methods, and encourage researchers in other domains to apply their ML methods to the biological domain.
Our team participated in the following tasks of the competition: gene mention (GM; recognizing gene names); gene normalization (GN; normalizing gene names to EntrezGene identifiers); interaction article subtask (IAS; selecting articles containing curatable PPIs); interaction pair subtask (IPS; extracting curatable PPIs); and interaction sentence subtask (ISS; extracting sentences with evidence for curatable PPIs).
For BioCreative II, and particularly so for the interaction-related tasks, the pipeline could not be used as is, but required certain extensions and modifications. For the IPS subtask, this was because of a fundamental difference between the pipeline's view of a PPI and the PPIs that were to be extracted for BioCreative II. Because the pipeline is intended to be used as a curation assistant, it just attempts to identify the candidate PPI mentions in a document, relying on the human curator to select the curatable PPIs. The definition of a curatable PPI may be somewhat dependent on the curation guidelines in force, but normally refers to PPIs that are experimentally proven in the work described in the paper, as opposed to PPIs that are merely referenced or posited. For the IPS subtask, only curatable PPIs were to be returned, and so additional functionality was implemented on top of the TXM pipeline PPI extraction to remove any extracted but noncuratable PPIs, and to collapse identical PPIs into one.
In the next section we summarize the results of our submissions on each task, and we give some analysis of the performance. This is followed by conclusions drawn from the BioCreative II experience and a description of each of the methods employed. For a comparison of the methods used by all of the participating teams, including our team, see the task overview papers [6–8].
Results and discussion
The aim of the GM task was to identify mentions of genes and gene products in sentences extracted from Medline abstracts. As described in the Materials and methods section (below), the submission for the GM task compared two different ML techniques in the three runs, using the same feature set. Runs 1 and 3 employed conditional random fields (CRFs) with different settings of the Gaussian prior, whereas run 2 used a bidirectional maximum entropy Markov model (BMEMM). (The Gaussian prior is a regularization term applied during learning, to prevent over-fitting. Its value is usually tuned empirically on a held-out set.) The performance of each system, measured by held-out testing on 20% of the training set, and on the test set, is shown in Table 1.
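For reference, the usual way in which a Gaussian prior enters CRF training is as a quadratic penalty on the feature weights. The notation below (weights \lambda_k, prior variance \sigma^2, N training sentences) is ours and is not taken from the submission; it is a sketch of the standard formulation rather than a description of the exact objective optimized by the toolkit used.

```latex
\mathcal{L}(\lambda) \;=\; \sum_{i=1}^{N} \log p_{\lambda}\!\left(y^{(i)} \mid x^{(i)}\right) \;-\; \sum_{k} \frac{\lambda_k^{2}}{2\sigma^{2}}
```

A smaller variance \sigma^2 penalizes large weights more strongly, which is why different settings of the prior can yield different runs from the same feature set.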
The following is an example of the output of the GM system, with the predicted gene mentions highlighted in bold. In this example, the system predicted precisely the same gene mentions as identified by the annotators.
"The STP1 locus is located on chromosome IV, close to at least two other genes involved in RNA splicing: PRP3 and SPP41."
For the GN task, teams were asked to provide a list of EntrezGene identifiers for all of the human genes mentioned in a set of Medline abstracts. We used a string similarity based approximate search algorithm for generating candidate matches for the genes marked up by our GM system. In runs 1 and 2, two variants of an ML-based filter were tested, whereas run 3 used a heuristic filter. The matching and filtering algorithms are described in the Materials and methods section (see below), and Table 2 shows the results obtained on the held-out (20%) training dataset and the test set.
Submissions were made for three of the four PPI subtasks: the IAS, the IPS, and the ISS. All of these tasks were related to the identification of interactions in articles from PubMed. In the IAS, teams were asked to select abstracts that described curatable interactions; in the IPS, teams had to use the full papers to extract pairs of normalized proteins corresponding to the curatable interactions in the paper; and in the ISS, the aim was to identify the sentences in the full texts that described such interactions.
For IAS only one run was submitted, and the performance on the test set is shown in Table 3.
For IPS, the three submitted runs varied both in the original data format of the article (HTML or PDF) and in the algorithm used to generate the UniProt identifier matches (exact or fuzzy). The performance of each configuration, measured using fivefold cross-validation on the training set and on the test set, is shown in Tables 4 and 5. Note that the scoring algorithm used on the training set is stricter in that it includes all gold (annotated) interactions, whereas scoring on the test set only includes interactions where protein identifiers are drawn from SwissProt.
To see examples of correctly predicted interactions (true positives) and incorrectly predicted interactions (false positives), consider the document with PubMed identifier 10713104. The system correctly predicted an interaction between LYN_MOUSE and HCLS1_MOUSE, and incorrectly predicted an interaction between LYN_HUMAN and HCLS1_HUMAN. In the document, there are many sentences in which the pipeline marked an interaction between the two proteins 'Lyn' and 'HS1', for example in the following:
"Here we show that the hemopoietic-specific protein HS1 interacted directly with the SH3 domain of Lyn, via its proline-rich region."
The UniProt lexicon contains three different possible exact matches for each of the proteins 'Lyn' and 'HS1', with different species, and so the system had to try to determine which particular species the protein mentions referred to. Out of the five species mentioned in the text (Escherichia coli, Homo sapiens, Mus musculus, Oryctolagus cuniculus, and Saccharomyces cerevisiae), the system chose M. musculus (correctly) for some of the interaction mentions and H. sapiens (incorrectly) for other interaction mentions.
Finally, for ISS the performance of the one submitted run is shown in Table 6. A sample sentence identified by the system, from PubMed document 14506250, as showing an interaction between MO4L1_HUMAN and RB_HUMAN, is as follows:
"We confirmed the association of MRGX with HDAC1 by immunoprecipitation/Western analysis and determined that MRGX complexes had HDAC activity."
The comparison between this sentence and the one selected by the curators attained a similarity score of 0.9574 (on a scale from 0 to 1).
The main observation to be made regarding the results for the GM task is that CRF outperforms BMEMM when the same feature set is used, whether evaluated on the official test set or cross-validated on the training set. Although the difference in F1 is small (1.2 to 1.4 percentage points), it has been noted that differences of this order can be significant on this dataset. The overall performance of the T6 system on recognizing gene names is competitive with the other submitted systems, although several systems performed significantly better. However, our submission involved a straightforward application of existing technology: many easily used CRF implementations are available, and the feature set could be assembled and optimized rapidly.
The GN system assigns identifiers only to the entity mentions that have been marked up by GM. Therefore, the recall of the GM system sets an upper bound for the recall of GN. In other words, if GM failed to recognize a gene entity, then there was no way that GN could find an identifier for that gene. It is likely that a GM system optimized toward recall would improve the performance of GN. Our GM system achieved a recall of 83% on a set of held-out GM training data (see Table 1), and therefore we would expect that the maximum recall of the GN system should be close to that number.
We applied an improved JaroWinkler measure to the GN training dataset and achieved a recall of 85% and a precision of 15%. The JaroWinkler measure is described in the Materials and methods section (below). To maximize recall, we used a threshold confidence of 0 and took the top two matches. We could not test our GM system on the same dataset for a direct comparison, because gene entities were not marked up in the GN data.
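To make the matching step concrete, the following is a minimal sketch of the standard Jaro-Winkler similarity together with a top-two candidate lookup. It is not the improved variant used in the submission; the function names, the lower-casing step, and the toy lexicon are assumptions made for illustration only.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def top_matches(mention, lexicon, n=2, threshold=0.0):
    """Return the n best (score, name) pairs scoring at or above threshold."""
    scored = [(jaro_winkler(mention.lower(), name.lower()), name)
              for name in lexicon]
    scored = [pair for pair in scored if pair[0] >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


# Toy lexicon (hypothetical); a threshold of 0 keeps every candidate.
candidates = top_matches("HS1", ["Lyn", "HSP1", "HS1"], n=2, threshold=0.0)
```

With a threshold of 0 and the top two matches retained, almost every mention receives candidates, which is consistent with the high-recall, low-precision behaviour reported above and explains why a subsequent filter is needed.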
The filter was ML based, and the features that we used in the submitted system are described in the Materials and methods section (below). We also experimented with other features that were not included in our final system. For example, we obtained 'Google counts' for every name in the supplied gene lexicon, and then assigned a Google count to each identifier by summing the counts of the gene names associated with that identifier. The assumption was that the Google counts might indicate the popularity of the identifiers, and the less popular ones should be filtered out because they probably occurred rarely in the literature. We also tried the nearest 'species word' as a feature, which might help in filtering out the non-human genes. These features, however, did not improve performance of GN and therefore were not integrated into the final system. One reason that the Google count feature was not helpful was that the world-wide web is noisy, and many gene names are also English common words or other types of proper names, and therefore the counts did not accurately reflect the frequency of occurrences of the gene names. Counts obtained from large biomedical corpora, on the other hand, might help, but more experiments are needed to reach conclusions.
For IAS, the primary goal was to improve the results for article selection by extending the traditional bag-of-words model of text categorization to include features based on NLP. Table 7 compares results of a bag-of-words baseline system to the bag-of-NLP system. For the purposes of comparison, the results are presented for the original test set (see Table 3). They differ slightly from those obtained for the official test set, which is still to be released by BioCreative II. The baseline system only used the 'word' and 'bigram' features but is otherwise identical to the bag-of-NLP system. The results, presented both for fivefold cross-validation on the training set and for the test set, indicate that the NLP-based features can provide small performance gains. Thus, in comprehensive curation systems that include both an article selection component and an NLP-based assisted curation component, there can be benefits from preprocessing all documents with NLP before article selection as a means of improving the article selection phase. The downside is that a bag-of-NLP system is significantly slower than a bag-of-words system (in our case it is two orders of magnitude slower), although much of the processing can be done off-line.
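As an illustration of the baseline, the 'word' and 'bigram' features can be sketched as simple counts over a token sequence. The whitespace tokenization, lower-casing, and feature naming below are assumptions made for brevity; the submitted system used the tokens produced by the TXM preprocessing.

```python
from collections import Counter


def bag_of_words_features(text: str) -> Counter:
    """Count unigram ('word') and adjacent-pair ('bigram') features."""
    tokens = text.lower().split()
    features = Counter(f"word={tok}" for tok in tokens)
    features.update(f"bigram={a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return features


example = bag_of_words_features("HS1 interacted directly with Lyn")
```

A bag-of-NLP system would add further keys to the same feature map (for example, features derived from predicted entities or relations), which is where the additional processing cost arises.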
For IPS, several pre-existing TXM pipeline components were used and combined with additional steps to normalize protein names to the UniProt lexicon and to remove noncuratable PPIs. The pipeline is described in detail in the Materials and methods section (see below), but conceptually it can be considered as consisting of the following stages (see Figure 2).
1. Preprocessing: linguistic preprocessing includes tokenization and sentence splitting, lemmatization, chunking, and part-of-speech tagging.
2. Named entity recognition (NER): in this stage all mentions of proteins in the text are identified.
3. Relation extraction (RE): in this stage each pair of proteins occurring in the same sentence is examined to determine whether the sentence refers to an interaction (PPI) between them.
4. Normalization: in this stage a set of possible UniProt identifiers is generated for each protein mention.
5. Disambiguation: in this stage the set of identifiers produced by the normalization stage is ranked, using species information in the text, in order to identify the most likely identifier for each protein.
6. Finally, the curation filter combines the outputs of normalization and RE at a document level to give a list of pairs of UniProt identifiers, representing the PPIs mentioned in the document. The curation filter aims to remove the noncuratable PPIs from this list.
Because the overall system comprises several different stages, it is useful to gain some idea of the performance of each stage to see where improvements could be made.
One way to consider the operation of the pipeline is that the preprocessing, NER, and normalization stages generate a set of possible UniProt identifier pairs, representing curatable interactions, which must then be filtered down by the subsequent three stages. Measuring how many curatable PPIs survive each stage therefore shows where correct answers are being lost. The initial set of UniProt identifier pairs is generated by considering all possible pairs of all possible matches generated for all the proteins found by NER. Consequently, the recall of each component can be estimated by measuring the number of correct interactions lost at each stage. The normalization requirement in IPS complicates any error analysis, because the gold data, in the form of pairs of UniProt identifiers, are not directly linked to surface forms in the text. However, a certain amount of information about the error sources is available.
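The per-stage recall estimate described above amounts to dividing the number of correct interactions that survive a stage by the number that entered it. A minimal sketch follows; the stage names and counts are hypothetical placeholders and are not the figures reported in the tables.

```python
def stage_recalls(correct_counts):
    """Estimate each stage's recall from the correct PPIs remaining after it.

    correct_counts is a list of (stage_name, correct_remaining) tuples in
    pipeline order; the first entry is the starting point for the filtering.
    """
    recalls = {}
    for (_, before), (name, after) in zip(correct_counts, correct_counts[1:]):
        recalls[name] = after / before if before else 0.0
    return recalls


# Hypothetical counts, for illustration only.
example_recalls = stage_recalls([
    ("NER + normalization", 200),
    ("RE", 150),
    ("disambiguation", 120),
])
```

Multiplying the per-stage estimates gives the fraction of the initially reachable correct interactions that survive the whole chain, which makes the effect of error aggregation across stages easy to see.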
In Tables 8 to 11, the results quoted use a version of the IPS training set with all papers with more than 30 interactions removed, which contains 2,039 gold (human curated) interactions. It is expected that similar error patterns would be observed when testing on the test set. Each of the tables shows the number of correctly predicted interactions, together with the total number of predicted interactions, so that the filtering process may be observed as it reduces the number of predictions by removing incorrect interactions, and as a side-effect removes some correct interactions. It is felt that these measures illustrate the filtering process better than the traditional true and false positive and false negative counts, although these counts can easily be derived from the information in Tables 8 to 11.
Table 8 shows the percentage of gold interactions for which NER and normalization successfully predicted the identifiers of both participants. Note that the total number of predicted interactions at this point would be equivalent to the count of all pairs of predicted normalizations, and hence is too large to show in the table.
The fuzzy match normalizer generates a much larger number of correct matches than the exact matcher, resulting in increased recall at this stage, although it also generates around ten times more false positives, making the filtering task much harder for the later stages. It is not possible to calculate separate recall figures for the NER and normalizer, because this would require linking each of the gold PPIs to the text, in order to determine whether the NER component had successfully recognized the proteins. Testing of the NER component on the held-out proportion of the TXM corpus gives a recall of about 80% on protein mentions, but the NER task within IPS is different because it only requires the identification of proteins involved in curatable interactions.
The next stage in the pipeline is RE, which takes the output of NER and normalization, examines each pair of proteins, and decides whether the text states that the two proteins interact. Table 9 shows the proportion of gold PPIs that are still extracted after RE, and the total number of proposed PPIs, considering all matches generated by normalization. Furthermore, the estimated recall of RE is given by comparing the number of correct interactions before and after RE. The number of proposed PPIs is large, especially in the fuzzy match configuration, because all possible UniProt matches for each protein have been retained. This means that, for example, if a pair of proteins each has two possible UniProt identifiers, then a total of four different candidate interactions will be generated between them.
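The growth in candidate interactions is a Cartesian product over the identifier sets of the two proteins. The sketch below reproduces the two-by-two example from the text, using identifiers mentioned earlier for 'Lyn' and 'HS1'; restricting each protein to two candidates and the function name are assumptions made for illustration.

```python
from itertools import product


def candidate_interactions(ids_a, ids_b):
    """All identifier pairs for one textual protein pair."""
    return list(product(ids_a, ids_b))


lyn_ids = ["LYN_MOUSE", "LYN_HUMAN"]
hs1_ids = ["HCLS1_MOUSE", "HCLS1_HUMAN"]
pairs = candidate_interactions(lyn_ids, hs1_ids)
# Two identifiers on each side give 2 x 2 = 4 candidate interactions.
```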
In the next stage, the disambiguator chooses the single most likely identifier for each protein mention, using the species information in the text. Table 10 shows the number of proposed PPIs, the number of correct PPIs and the percentage of the gold interactions that they represent, and an estimate of the recall for the disambiguator. It can be seen that the recall of the disambiguator is worse in the fuzzy match configuration than in the exact match configuration; in other words, it throws away more of the correct answers in this configuration. However, it should be remembered that the disambiguator has a much harder task in this case because the number of false positives is much higher, by nearly an order of magnitude. At this point, the difference between the TXM pipeline, which extracts all PPIs, and the task of the BioCreative II challenge of identifying curatable interactions becomes apparent.
The final stage in the pipeline is the curation filter, which is designed to remove noncuratable PPIs from the set of proposed PPIs. Because the curation filter is an ML component trained on the BioCreative II data, fivefold cross-validation was used in the experiments. Its performance is shown in Table 11.
The preceding analysis illustrates one of the issues with the pipeline architecture. Although it provides modularity, which eases development, errors produced by early stages of the pipeline are passed down the pipeline and not corrected by later stages. For example, the disambiguator guesses the species associated with each protein and uses this species to choose the most likely UniProt identifier for the protein from the list proposed by the normalizer. However, if the disambiguator's choices result in a proposed PPI where there is a mismatch between the species of the participating proteins, then that proposed PPI is likely to be discarded by the curation filter. Ideally, the curation filter should be able to feed back to the disambiguator to ask it for alternative identifiers with compatible species. Another example is the interplay between NER and RE. If NER does not predict proteins in a particular sentence, then RE cannot predict a PPI, even if the sentence provides strong linguistic evidence of one. If RE could feed back to NER, then NER would be able to reconsider its decision. However, the possible downside of introducing such feedback between components is that it tends to make the system less modular, and therefore less flexible and maintainable.
In general, the performances of the systems submitted for IPS were low, with no team scoring above 0.3 on macro-averaged F1. No equivalent human score, such as an inter-curator agreement, is reported in the literature for comparison. Nevertheless, the level of performance appears to be too low to be usable for unassisted automatic curation. So the question arises, why is the extraction of curatable PPIs so difficult? The above analysis does not single out any component as being especially weak, but suggests that it is the aggregation of errors across the different components that is the problem. The IPS performances should be contrasted to those reported on evaluations that focus on a single task, often making simplifying assumptions, such as only considering human proteins in GN, where performance levels of around 80 to 90% of human performance are often reported.
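For readers unfamiliar with the measure, macro-averaged F1 can be sketched as the mean of per-document F1 scores over sets of predicted and gold identifier pairs. This is a generic formulation under our own assumptions (unordered pairs, a score of zero for documents with no correct prediction); the official scoring script may differ in its details, and the identifiers below are placeholders.

```python
def pair(a: str, b: str) -> frozenset:
    """Represent an interaction as an unordered pair of identifiers."""
    return frozenset((a, b))


def precision_recall_f1(predicted: set, gold: set):
    """Precision, recall and F1 for one document."""
    true_positives = len(predicted & gold)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_f1(documents) -> float:
    """Mean of per-document F1 over (predicted, gold) set pairs."""
    scores = [precision_recall_f1(pred, gold)[2] for pred, gold in documents]
    return sum(scores) / len(scores) if scores else 0.0


# Placeholder identifiers, for illustration only.
docs = [
    ({pair("A", "B"), pair("A", "C")}, {pair("B", "A")}),  # F1 = 2/3
    ({pair("D", "E")}, {pair("D", "F")}),                  # F1 = 0
]
score = macro_f1(docs)
```

Because every document contributes equally, a few documents with no correct predictions pull the macro average down sharply, regardless of how many interactions each document contains.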
For ISS the T6 results were quite low, with only 5% of sentences identified agreeing with those selected by the curators. However, it should be noted that the scoring criteria in this subtask are quite strict, in that credit is only given when the system chooses the same evidence sentence as the curator, when it is possible that other sentences from the document would also be appropriate. In order to accurately assess the ISS performance of the submitted systems, it would be necessary to perform an expensive manual examination of all the sentences provided.
For the PPI subtasks (IPS, ISS, and IAS), the IE pipeline developed for the TXM program proved effective because it addressed related problems (identification of proteins and their interactions) and was trained on similar data to those used in BioCreative II. For IPS the pipeline architecture was easily extended with two extra components (normalization and curation filtering) specific to the requirements of the subtask, showing the flexibility of this architecture. The extension also required a change of emphasis, from a system that aims to assist curators by indicating possible interactions, to a system that attempts to populate a curated database.
Our approach to normalization, based on a string distance measure and ML disambiguation, has the advantage of being more easily adaptable to other types of entities (for example, tissues and cell lines) than the approaches based on manually created matching rules. Given that it is very hard to predict automatically the single correct identifier for a biomedical named entity, it would be interesting to explore the relative merits of approaches that generate a ranked list of candidate identifiers, and also provide the users with fuzzy matching tools to help in searching ontologies more intelligently.
Our submission for IPS involved trying to reconstruct curated information from interactions mentioned explicitly in the text. However, it is not known what proportion of curated data can be obtained this way. In other words, are all or most curatable interactions mentioned explicitly in the text as an interaction between two named proteins? Recent work by Stevenson showed that a significant proportion of facts in the Message Understanding Conference (MUC) evaluations are distributed across several sentences, and similar results appear likely to apply in the biomedical domain. Although the low overall scores in IPS show that NLP techniques are not yet ready to replace manual curation, they may nevertheless be able to aid curators in their work. Alternatively, they may be used to produce large-volume, noisy data, which may be of benefit to biologists, as evidenced by databases such as TrEMBL, a computer-annotated database that supplements the manually curated SwissProt database.
Materials and methods
The TXM pipeline
The Team 6 system for BioCreative II made use of an IE pipeline developed for the TXM project. The TXM pipeline consists of a series of NLP tools, integrated within the LT-XML2 architecture. The development of the pipeline used a corpus of 151 full texts and 749 abstracts selected from PubMed and PubMedCentral as containing experimentally determined protein-protein interactions. The corpus was annotated by trained biologists for proteins and related entities, protein normalizations (to an in-house word list derived from RefSeq), and protein-protein interactions. Around 80% of the documents were used for training and optimizing the pipeline, whereas the other 20% were held back for testing.
The pipeline consists of the following components (see Figure 1).
The preprocessing component comprises tokenization, sentence boundary detection, lemmatization, part-of-speech tagging, species word identification, abbreviation detection, and chunking. The part-of-speech tagging uses the Curran and Clark maximum entropy Markov model tagger trained on MedPost data, whereas the other preprocessing stages are all rule-based. The tokenization, sentence boundary detection, species word identification, and chunking components were implemented with the LT-XML2 tools. The Schwartz and Hearst abbreviation extractor was used for abbreviation detection and morpha for lemmatization.
Named entity recognition
In the pipeline, NER of proteins is performed using the Curran and Clark classifier, augmented with extra features tailored to the biomedical domain. The pipeline NER component was not used in the GM submission, because it is trained to detect proteins, whereas the GM task was concerned with genes and gene products.
The term normalization task in the pipeline involves choosing the correct identifier for each protein mention in the text, where the identifiers are drawn from a lexicon based on RefSeq. A set of candidate identifiers is generated using hand-written fuzzy matching rules, from which a single best identifier is chosen using an ML-based species tagger and a set of heuristics to break ties. The term normalization component of the pipeline was not used directly in BioCreative II, because the pipeline and the BioCreative II tasks employ different protein lexicons.
To find the PPI mentions in the text, a maximum entropy relation extractor was trained using shallow linguistic features. The features include context words, parts-of-speech, chunk information, interaction words, and interaction patterns culled from the literature. The relation extractor examines each pair of proteins mentioned in the text and occurring fewer than a configurable number of sentences apart, and assigns a confidence value that indicates the degree to which the mention is an interaction. All mentions with a confidence value above a given threshold are considered interactions, whereas those below the threshold are not. Although the relation extractor can theoretically recognize both inter-sentential and intra-sentential relations, because both types of candidate relations are considered, in practice very few inter-sentential relations are correctly recognized. Only around 5% of annotated relations are inter-sentential, and it is likely that using exactly the same techniques as on the intra-sentential relations is not optimal, especially because many of the inter-sentential relations use co-references. The detection of inter-sentential relations is the subject of ongoing research.
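The candidate generation and thresholding steps described above can be sketched as follows. The maximum entropy classifier itself is omitted; the function names, the data layout, the confidence values, and the default threshold of 0.5 are illustrative assumptions rather than details of the TXM implementation.

```python
from itertools import combinations


def candidate_pairs(mentions, max_sentence_gap=1):
    """Pair protein mentions fewer than max_sentence_gap sentences apart.

    mentions is a list of (protein_name, sentence_index) tuples; a gap
    of 1 restricts the candidates to intra-sentential pairs.
    """
    return [
        (a, b)
        for a, b in combinations(mentions, 2)
        if abs(a[1] - b[1]) < max_sentence_gap
    ]


def select_interactions(scored_pairs, threshold=0.5):
    """Keep the pairs whose classifier confidence exceeds the threshold."""
    return [pair for pair, confidence in scored_pairs if confidence > threshold]


mentions = [("HS1", 0), ("Lyn", 0), ("HDAC1", 2)]
same_sentence = candidate_pairs(mentions, max_sentence_gap=1)
wider = candidate_pairs(mentions, max_sentence_gap=3)

# Hypothetical confidence values standing in for the classifier's output.
scored = [(wider[0], 0.9), (wider[1], 0.2), (wider[2], 0.4)]
interactions = select_interactions(scored, threshold=0.5)
```

Raising the sentence gap admits inter-sentential candidates, at the cost of many more pairs for the classifier to reject.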
The remainder of this section describes how this pipeline was extended and adapted for BioCreative II (see Figure 2), resulting in the best performance per task. Although some time was spent on optimizing parameters and features, the overall infrastructure of the individual TXM pipeline components was applied directly, without significant changes.
To address the GM task, our team employed two different ML methods using similar feature sets. Runs 1 and 3 used CRFs, whereas run 2 used a BMEMM. Both CRF and BMEMM are methods for labeling sequences of words that model conditional probabilities, so that a wide variety of possibly inter-dependent features can be used. The named entity recognition problem is represented as a sequential word tagging problem using the BIO encoding, as in CoNLL (Conference on Computational Natural Language Learning) 2003. In BMEMM, a log-linear feature-based model represents the conditional probability of each tag, given the word and the preceding and succeeding tags. In CRF, however, the conditional probability of the whole sequence of tags (in one sentence), given the words, is represented using a log-linear model. Both methods have been shown to give state-of-the-art performance in sequential labeling tasks such as chunking, part-of-speech-tagging, and named entity recognition [10, 21–23]. The CRF tagger was implemented with CRF++ and the BMEMM tagger was based on Zhang Le's MaxEnt Toolkit.
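The BIO encoding mentioned above turns entity spans into one tag per word: B for the first word of a mention, I for subsequent words inside it, and O for words outside any mention. A minimal sketch, in which the function name, the span format (token offsets, end exclusive), and the example sentence are assumptions made for illustration:

```python
def bio_encode(tokens, entity_spans):
    """Convert (start, end) token spans, end exclusive, into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


# Hypothetical sentence with one three-word and one single-word mention.
tokens = ["tumor", "necrosis", "factor", "binds", "TNFR1"]
tags = bio_encode(tokens, [(0, 3), (4, 5)])
```

The sequence labeler is then trained to predict one such tag per word, and mentions are recovered by reading off each B tag together with the I tags that follow it.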
Gene mention preprocessing
Before training or tagging the documents with the machine learner, they were passed through the preprocessing stages of the TXM pipeline (as described above).
Gene mention features
For the machine learners, the following features were extracted for each word.
1. Word: the word itself is added as a feature, plus the four preceding words and four succeeding words, with their positions marked.
2. Headword: the headwords of noun and verb phrases are determined by the chunker, and, for all words contained in noun phrases, the head noun is added as a feature.
3. Affix: the affix feature includes all character n-grams with lengths between two and four (inclusive), and either starting at the first character, or ending at the last character of the word.
4. Gazetteer: the gazetteer features are calculated using an in-house list of protein synonyms derived from RefSeq. To add the gazetteer features to each word in a given sentence, the gazetteer is first used to generate a set of matched terms for the sentence, where each word is only allowed to be in one matched term and earlier starting, longer terms take precedence. The unigram gazetteer feature for each word has value B, I, or O, depending on whether the word is at the beginning, inside, or outside of a gazetteer matched term. The bigram gazetteer feature is also added, and this is the concatenation of the previous and current word's gazetteer feature.
5. Character: for each of the regular expressions listed in Table 12, the character feature indicates whether the word matches the regular expression. These regular expressions were derived from lists published in previous work on biomedical and newswire NER [15, 26]. The length of the word is also included as a character feature.
6. Postag: this feature includes the current word's part-of-speech (POS) tag and the POS tags for the two preceding and succeeding words. Also added are the bigram of the current and previous word's POS tag, and the trigram of the current and previous two words' POS tags.
7. Wordshape: the word shape feature consists of the word type feature, a variant of this feature that only collapses runs of greater than two characters in a word, and bigrams of the word type feature.
8. Abbreviation: the abbreviation feature is applied to all abbreviations whose antecedent is found in the gazetteer.
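Two of the features listed above, the affix feature and the gazetteer feature, can be sketched as follows. The function names, the feature string format, the `max_len` limit on term length, and the `<s>` sentence-start marker are assumptions for illustration. The gazetteer here is a plain set of lower-cased token tuples rather than the in-house RefSeq-derived list.

```python
def affix_features(word, min_n=2, max_n=4):
    """Character n-grams of length 2 to 4 anchored at the start or end of the word."""
    feats = []
    for n in range(min_n, max_n + 1):
        if n <= len(word):
            feats.append("prefix=" + word[:n])
            feats.append("suffix=" + word[-n:])
    return feats


def gazetteer_tags(tokens, gazetteer, max_len=5):
    """Tag each token B, I, or O against a gazetteer of token tuples.

    Matching is greedy from left to right, so earlier-starting and longer
    terms take precedence and each token belongs to at most one match.
    """
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest possible term starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in gazetteer:
                match_len = n
                break
        if match_len:
            tags[i] = "B"
            for j in range(i + 1, i + match_len):
                tags[j] = "I"
            i += match_len
        else:
            i += 1
    return tags


def gazetteer_bigrams(tags):
    """Concatenate the previous and current gazetteer tag for each token."""
    previous = ["<s>"] + tags[:-1]
    return [p + "_" + c for p, c in zip(previous, tags)]
```

For example, with a gazetteer containing both "tumor necrosis factor" and "tumor necrosis factor alpha", the longer term wins and all four tokens fall into a single match.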
The GN system was developed with genericity in mind. In other words, it can be ported to normalize other biological entities (for example, disease types, experimental methods, and so on) relatively easily, without requiring extensive knowledge of the new domain. The approach that was adopted combined a string similarity measure with ML techniques for disambiguation.
For GN, the system first preprocesses the documents using the preprocessing modules in the TXM pipeline, and then uses the gene mention NER component to mark up gene and gene product entities in the documents. A fuzzy matcher then searches the gene lexicon provided and calculates scores of string similarity between the mentions and the entries in the lexicon using a measure similar to JaroWinkler [27–29].
The Jaro string similarity measure [27, 28] is based on the number and order of characters that are common to two strings. Given strings $s = a_1 \ldots a_k$ and $t = b_1 \ldots b_l$, define a character $a_i$ in $s$ to be shared with $t$ if there is a $b_j$ in $t$ such that $b_j = a_i$ with $i - H \le j \le i + H$, where $H = \min(|s|, |t|)/2$. Let $s' = a'_1 \ldots a'_{k'}$ be the characters in $s$ that are shared with $t$ (in the same order as they appear in $s$) and let $t' = b'_1 \ldots b'_{l'}$ be analogous. Now define a transposition for $s'$ and $t'$ to be a position $i$ such that $a'_i \ne b'_i$. Let $T_{s',t'}$ be half the number of transpositions for $s'$ and $t'$. The Jaro similarity metric for $s$ and $t$ is shown in Equation 1:

$$\mathrm{Jaro}(s,t) = \frac{1}{3}\left(\frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{|s'|}\right) \qquad (1)$$
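A variant of the Jaro measure proposed by Winkler also uses the length $P$ of the longest common prefix of $s$ and $t$. It rewards strings that have a common prefix. Letting $P' = \min(P, 4)$, so that at most the first four characters of the prefix count, it is defined as shown in Equation 2:

$$\mathrm{JaroWinkler}(s,t) = \mathrm{Jaro}(s,t) + \frac{P'}{10}\,\bigl(1 - \mathrm{Jaro}(s,t)\bigr) \qquad (2)$$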
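The Jaro and Jaro-Winkler measures can be sketched as below. Note the assumptions: this follows the widely used formulation in which each character may be matched at most once and the search window is half the longer string's length minus one, which differs slightly from the window H defined in the text. The prefix length is capped at four characters and the prefix weight is fixed at 0.1. It is not the modified measure used for the GN task.

```python
def jaro(s, t):
    """Jaro similarity using greedy one-to-one character matching."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Shared characters of each string, in their original order.
    s_shared = [c for c, m in zip(s, s_matched) if m]
    t_shared = [c for c, m in zip(t, t_matched) if m]
    # T is half the number of positions where the shared sequences differ.
    half_transpositions = sum(a != b for a, b in zip(s_shared, t_shared)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - half_transpositions) / matches) / 3


def jaro_winkler(s, t, max_prefix=4, weight=0.1):
    """Jaro score boosted by the common prefix length (capped at max_prefix)."""
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    base = jaro(s, t)
    return base + prefix * weight * (1 - base)
```

For the classic pair "MARTHA" and "MARHTA", all six characters are shared with one transposition, giving a Jaro score of 17/18, which the three-character common prefix then raises further.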
For the GN task, a variant of the JaroWinkler measure was employed, as shown in Equation 3, which uses different weighting parameters and takes into account the suffixes of the strings.
Here, θ = (# CommonSuffix - # DifferentSuffix)/lengthOfString. The idea is to look not only at the common prefixes but also at commonality and difference in string suffixes. A set of equivalent suffix pairs was defined; for example, the Arabic number 1 is defined as equivalent to the Roman number I. The number of common suffixes and the number of different suffixes (1 and 2 or 1 and II would count as different suffixes) is counted, and strings with common suffixes are rewarded whereas those with different ones are penalized.
At the end of the fuzzy matching stage, each mention recognized by NER is associated with the single highest scoring match from the gene lexicon, where the score indicates the string similarity. Note that each match is associated with one or more identifiers (in cases where ambiguity occurs) from the gene lexicon.
The GN system then collects all of the gene identifiers generated by the fuzzy matcher, and pairs each gene identifier with a set of features in order to use an ML-based disambiguator. These identifier-feature set pairs are used as training data to learn a model that predicts the most probable identifier out of a pool of candidates returned by the fuzzy matcher. The feature set consists of both simple features such as the contextual text properties surrounding the gene mentions (for example, their part-of-speech tags and so on), and complex features such as the distance scores between the mentions in text and the matches returned by the fuzzy matcher. It was found that the complex features are particularly helpful in terms of increasing the F1 score.
In more detail, all the identifiers in a document found by the fuzzy matcher were collected, then the ones that are correct according to the answer file were used as positive examples and the others were used as negative ones. In summary, each identifier was associated with a set of features as follows.
1. Fuzzy-confidence: confidence scores from the fuzzy matcher. (Only those matches with confidence scores higher than 0.80 were considered.)
2. Synonym-similarity: the averaged confidence score of the similarity between all synonyms linked to the gene identifier and the match.
3. Context-similarity: the similarity between descriptions (synonyms) associated with a gene identifier and all gene entities in the current document recognized by the NER. The similarity is calculated by two measures: Dice coefficient and tf·idf. The Dice coefficient is defined as twice the number of common terms in the two sets of tokens to compare, divided by the total number of tokens in both sets: $\mathrm{Dice}(X, Y) = 2|X \cap Y| / (|X| + |Y|)$. tf·idf is defined as the product of term frequency (tf) and inverse document frequency (idf): $tf_i = n_i / \sum_k n_k$, where $n_i$ is the number of occurrences of the considered term and the denominator is the number of occurrences of all terms; and $idf_i = \log(|D| / |\{d : t_i \in d\}|)$, where $|D|$ is the total number of documents and the denominator is the number of documents where the term appears.
4. NER confidence: confidence score generated by the NER tagger.
5. Context: local features, including contextual words (± 10), lemmas (± 4), POS tags (± 2), species words (± 10) and bigrams (± 5). (The numbers in parentheses denote the size of the context window.)
6. Length: length of the gene mention and length of the match.
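The two measures behind the context-similarity feature can be sketched as follows. The function names and the token-list inputs are assumptions for illustration. How the tf·idf weights are combined into a single similarity score is not specified in the text, so only the weight for one term is computed here.

```python
import math
from collections import Counter


def dice(tokens_a, tokens_b):
    """Dice coefficient between two token collections, treated as sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def tf_idf(term, document, corpus):
    """tf-idf weight of a term in one document, given a corpus of documents.

    document is a list of tokens; corpus is a list of token lists.
    """
    if not document:
        return 0.0
    tf = Counter(document)[term] / len(document)
    doc_freq = sum(1 for doc in corpus if term in doc)
    if doc_freq == 0:
        return 0.0
    return tf * math.log(len(corpus) / doc_freq)
```

For instance, `dice(["il", "2", "receptor"], ["il", "2"])` shares two tokens out of five in total, giving 0.8.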
With the positive and negative examples extracted, determining the correct normalizations becomes a standard ML task. We trained a support vector machine (SVM) classifier, using SVM light , on the examples extracted from the BioCreative II GN training data, and used it as a disambiguator to filter out false-positive identifiers.
Interaction articles subtask
The IAS was treated as a standard document classification problem [31, 32], where abstracts were classified as curatable if they contained curatable protein interaction information and noncuratable otherwise. Document classification techniques typically use a bag-of-words approach, which ignores the word order in the document. This approach was extended by using a 'bag-of-NLP' approach, where, in addition to words, a variety of features derived from the output of the TXM pipeline were added to the bag. The classification was performed with SVM light using the linear kernel with the default parameters. The documents were ordered based on the output from the SVM classifier.
Before the documents were passed to the machine learner for training or classification, they were first passed through the TXM pipeline. In addition, each of the named entities and compound nouns in the document was marked as a phrase.
The features extracted for each document are described below. Only features that occurred at least twice in the training data were used and each feature was given a binary weight. Each feature was converted to lower case and words found in a custom stop-word list were ignored. For each word a backoff version was also calculated by converting all numbers to a single '#' symbol and removing all punctuation, and a backoff-stemmed version was calculated by first lemmatizing and then performing the same substitutions.
1. Word: the word itself.
2. Word-backoff: the backoff version of the word.
3. Bigram: the bigrams of the backoff feature. The bigrams were not allowed to cross sentence boundaries.
4. Chunk: the concatenation of the backoff-stemmed versions of each word in a chunk up to a maximum of seven words.
5. Phrase: the concatenation of the backoff-stemmed versions of each word in a phrase (one-word phrases were included).
6. Phrase-bigram: the bigrams of the phrase feature. All proteins were converted to the token 'nerprotein'. The bigrams were not allowed to cross sentence boundaries.
7. Chunk-headword-bigram: the bigrams of the backoff-stemmed version of each headword of successive chunks. Chunks containing negative phrases (for instance, does not interact) were indicated by prefixing the bigram with 'neg'.
8. Chunk-headword-trigram: the trigrams of the backoff-stemmed version of each headword of successive chunks. All proteins were converted to the token 'nerprotein'. Chunks containing negative phrases were indicated by prefixing the trigram with 'neg'.
9. Protein: added if the document contained at least one protein.
10. Two-proteins: added if the document contained at least two unique proteins.
11. No-proteins: added if the document did not contain any proteins.
12. Title-proteins: added if the document contained two unique proteins in the title.
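The backoff normalization, the sentence-bounded bigrams, and the minimum-frequency cut-off used for the features above can be sketched as follows. Several details are assumptions: each run of digits is replaced by one '#', frequency is counted over documents because the features are binary, and stop-word filtering and lemmatization are omitted.

```python
import re
import string
from collections import Counter


def backoff(word):
    """Lower-case, replace each run of digits with '#', and drop punctuation."""
    word = re.sub(r"\d+", "#", word.lower())
    return "".join(ch for ch in word if ch == "#" or ch not in string.punctuation)


def backoff_bigrams(sentences):
    """Bigrams of backoff tokens; bigrams never cross sentence boundaries."""
    feats = set()
    for sentence in sentences:
        toks = [backoff(w) for w in sentence]
        toks = [t for t in toks if t]  # drop tokens that were pure punctuation
        feats.update(a + "_" + b for a, b in zip(toks, toks[1:]))
    return feats


def select_features(doc_feature_sets, min_count=2):
    """Keep features that occur in at least min_count training documents."""
    counts = Counter(f for feats in doc_feature_sets for f in feats)
    return {f for f, c in counts.items() if c >= min_count}
```

Each document is then represented by the binary presence or absence of the selected features before being passed to the SVM.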
Interaction pairs subtask
The T6 IPS system made use of the TXM IE pipeline to identify mentions of PPIs, together with additional components to normalize proteins to UniProt and to identify the curatable interactions from among the interaction mentions.
Two methods of data preparation were used. In runs 1 and 3, the supplied pdftotext converted files were converted to the XML input format required by the pipeline, essentially by just wrapping the text in <text> and <document> elements and removing illegal characters. (These were ASCII control characters inserted by pdftotext, which are not legal in XML. They were all removed except for ASCII 0x0C, which was converted to a double new line.) In run 2, however, the supplied HTML files were used, having been first run through an in-house HTML to XML converter.
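The clean-up applied to the pdftotext output can be sketched as follows. This is a minimal illustration, assuming that `<text>` is nested inside `<document>` and that XML special characters are escaped; the function name is hypothetical.

```python
import re
from xml.sax.saxutils import escape

# ASCII control characters that are not legal in XML 1.0.
# Tab (0x09), line feed (0x0A), and carriage return (0x0D) are legal and kept.
_ILLEGAL_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def to_pipeline_xml(raw_text):
    """Wrap pdftotext output in <document> and <text> elements."""
    # Form feed (0x0C) marks a page break; turn it into a double new line.
    text = raw_text.replace("\x0c", "\n\n")
    # Remove all remaining illegal control characters.
    text = _ILLEGAL_CONTROL.sub("", text)
    return "<document><text>" + escape(text) + "</text></document>"
```

The form feed is replaced before the general removal step so that page breaks still separate paragraphs in the converted text.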
The NER and RE stages of the TXM pipeline were used to identify mentions of PPIs.
Neither the pipeline normalizer nor the GN system could be used directly for normalization: the former because it normalizes to RefSeq, and the latter because it was concerned with genes rather than proteins, and because the IPS required species disambiguation, which was not required for GN. Two approaches were used to assign UniProt identifiers to protein mentions: exact matching (in runs 1 and 2) and fuzzy matching (in run 3). In exact matching, the protein name in the text is compared against each protein synonym in the UniProt lexicon using a case-insensitive match, in order to obtain a list of possible identifiers. If no possible identifiers are found, and the protein name is the long or short form of an abbreviation identified by the abbreviation extractor, then the corresponding (short or long) form is also looked up in the lexicon. In order to filter the list of identifiers, each identifier is weighted according to how often its corresponding species name is mentioned in the text, with species name mentions closer to the protein mention receiving higher weights than those farther away. The identifier with the highest weight is then chosen.
The fuzzy match protein normalizer uses a string distance measure (as described in the GN method description) to find the set of protein names in the lexicon that are closest to the protein mention in the text. These distances are then weighted according to the species word mentions, as for exact matching, and the highest weighted identifier chosen.
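The species-based filtering of candidate identifiers can be sketched as follows. The text does not give the exact weighting function, so the inverse-distance weight used here is an assumption, as are the data formats and the placeholder identifiers in the usage example.

```python
def species_weight(species, protein_pos, species_mentions):
    """Sum a proximity weight over every mention of the given species.

    species_mentions is a list of (species_name, token_position) tuples.
    Mentions closer to the protein mention contribute more; the
    1 / (1 + distance) form is a hypothetical choice.
    """
    return sum(1.0 / (1 + abs(pos - protein_pos))
               for name, pos in species_mentions if name == species)


def choose_identifier(candidates, protein_pos, species_mentions):
    """Pick the candidate identifier whose species has the highest weight.

    candidates is a list of (identifier, species_name) tuples returned by
    exact or fuzzy lexicon lookup. Ties go to the first candidate.
    """
    if not candidates:
        return None
    best = max(candidates,
               key=lambda c: species_weight(c[1], protein_pos, species_mentions))
    return best[0]


candidates = [("ID_HUMAN", "human"), ("ID_MOUSE", "mouse")]
mentions = [("human", 50), ("mouse", 12), ("human", 90)]
print(choose_identifier(candidates, 10, mentions))
```

In this example the single nearby mention of "mouse" outweighs two distant mentions of "human", so the mouse identifier is selected. For fuzzy matching, the string similarity score would additionally be combined with this weight.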
The curation filter takes as its input the set of UniProt identifier pairs representing the interactions found in the text by the pipeline, with their UniProt normalizations, and outputs the set of normalized, curatable interactions. The filter was implemented with an SVM classifier (using  with an RBF kernel), trained on the supplied training data, using the following set of features.
1. Relation count: this feature counts the number of times that the interaction is mentioned in the document.
2. Inter-sentential: this indicates whether the majority of the mentions of the interaction are inter-sentential relations between proteins, or intra-sentential. As noted in the TXM pipeline description, the relation extractor does not perform well on inter-sentential relations, and so very few of these are predicted (only 15 in the training corpus).
3. Relation confidence: each interaction mention found by the pipeline has an associated confidence. The value of this feature is the maximum confidence assigned to an interaction's mentions.
4. Position: this feature specifies the relative position within the document of the first and last mentions of the interaction. In addition, the mean relative position of the interaction mentions is included, for each interaction.
5. Species: the species feature indicates whether the proteins in the proposed interaction have different species.
6. Title: this feature indicates whether the interaction is mentioned in the title.
7. Normalization confidence: when using the fuzzy-matched normalizations, this feature indicates how close a match has been found during normalization of the protein mention.
As recommended in the IPS task instructions, any documents containing more than 30 interactions were excluded from the training set.
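Some of the curation filter features listed above can be computed from the mentions of a single candidate interaction, as in the following sketch. The mention format (a dict with `offset`, `confidence`, and `inter_sentential` keys) and the feature names are assumptions for illustration; the species, title, and normalization confidence features are omitted.

```python
def curation_features(mentions, doc_length):
    """Compute document-level features for one candidate interaction.

    mentions is a non-empty list of dicts with keys 'offset' (character
    offset of the mention), 'confidence' (relation extractor output), and
    'inter_sentential' (bool). doc_length is the document length in characters.
    """
    positions = [m["offset"] / doc_length for m in mentions]
    inter_count = sum(1 for m in mentions if m["inter_sentential"])
    return {
        "relation_count": len(mentions),
        # True if the majority of mentions are inter-sentential.
        "inter_sentential": inter_count > len(mentions) / 2,
        "relation_confidence": max(m["confidence"] for m in mentions),
        "first_position": min(positions),
        "last_position": max(positions),
        "mean_position": sum(positions) / len(positions),
    }
```

The resulting feature dictionary would then be converted to a numeric vector for the SVM classifier.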
Interaction sentences subtask
To identify the interaction sentences, the T6 system used the same sequence of steps as for IPS. For each interaction pair predicted, the top five corresponding PPI mentions were returned, where PPI mentions were ranked according to the confidence output by the relation extractor. In order to be able to track back the sentences to the original document, the HTML converted data were used (as in IPS run 2), because the HTML to XML converter provided a mapping between the original and converted versions.
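Selecting the highest-confidence mentions for each predicted pair can be sketched as follows. The tuple format of the input is an assumption; `k=5` corresponds to the five mentions returned per pair.

```python
from collections import defaultdict


def top_mentions(mentions, k=5):
    """Return the k highest-confidence sentences for each interaction pair.

    mentions is a list of (pair, sentence, confidence) tuples, where pair
    is a hashable identifier pair for the interaction.
    """
    by_pair = defaultdict(list)
    for pair, sentence, confidence in mentions:
        by_pair[pair].append((confidence, sentence))
    return {
        pair: [s for _, s in sorted(items, key=lambda x: x[0], reverse=True)[:k]]
        for pair, items in by_pair.items()
    }
```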
Abbreviations
BMEMM: bidirectional maximum entropy Markov model
CRF: conditional random field
IAS: interaction article subtask
IPS: interaction pair subtask
ISS: interaction sentence subtask
NER: named entity recognition
NLP: natural language processing
SVM: support vector machine.
Yeh AS, Hirschman L, Morgan A: Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics. 2003, 19 (suppl 1): i331-i339. 10.1093/bioinformatics/btg1046.
Rebholz-Schuhmann D, Kirsch H, Couto F: Facts from text: is text mining ready to deliver? PLoS Biology. 2005, 3: e65. 10.1371/journal.pbio.0030065.
Xu H, Krupke D, Blake J, Friedman C: A natural language processing (NLP) tool to assist in the curation of the laboratory mouse tumor biology database. AMIA Annu Symp Proc. 2006, 1150. [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1839428]
Alex B, Haddow B, Grover C: Recognising nested named entities in biomedical text. Proceedings of BioNLP; Prague, Czech Republic. 2007, [http://acl.ldc.upenn.edu/W/W07/W07-1009.pdf]
Haddow B, Matthews M: The extraction of enriched protein-protein interactions from biomedical text. Proceedings of BioNLP, Prague, Czech Republic. 2007, [http://acl.ldc.upenn.edu/W/W07/W07-1019.pdf]
Smith L, Tanabe LK, Ando R, Kuo CJ, Chung IF, Hsu CN, Lin YS, Klinger R, Friedrich CM, Ganchev K, Torii M, Liu H, Haddow B, Struble CA, Povinelli RJ, Vlachos A, Baumgartner WA, Hunter L, Carpenter B, Tsai RTH, Dai HJ, Liu F, Chen Y, Sun C, Katrenko S, Adriaans P, Blaschke C, Torres R, Neves M, Nakov P, Divoli A, Maña-López M, Mata-Vázquez J, Wilbur WJ: Overview of BioCreative II gene mention recognition. Genome Biol. 2008, 9 (Suppl 2): S2.
Morgan AA, Lu Z, Wang X, Cohen AM, Fluck J, Ruch P, Divoli A, Fundel K, Leaman R, Hakenberg J, Sun C, Liu H, Torres R, Krauthammer M, Lau WW, Liu H, Hsu CN, Schuemie M, Cohen KB, Hirschman L: Overview of BioCreative II gene normalization. Genome Biol. 2008, 9 (Suppl 2): S3.
Krallinger M, Leitner F, Rodriguez-Penagos C, Valencia A: Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol. 2008, 9 (Suppl 2): S4.
Lafferty J, McCallum A, Pereira F: Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proceedings of ICML. 2001, [http://www.cis.upenn.edu/~pereira/papers/crf.pdf]
Tsuruoka Y, Tsujii J: Bidirectional Inference with the easiest-first strategy for tagging sequence data. Proceedings of HLT/EMNLP. 2005, [http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/emnlp05bidir.pdf]
Wilbur J, Smith L, Tanabe L: BioCreative 2 gene mention task. Proceedings of the BioCreAtIvE II Workshop; Madrid, Spain. 2007, 7-16.
Stevenson M: Fact distribution in information extraction. Lang Resources Eval. 2006, 40: 183-201. 10.1007/s10579-006-9014-4.
Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL. Nucleic Acids Res. 2000, 28: 45-48. 10.1093/nar/28.1.45.
Language Technology Group Software. [http://www.ltg.ed.ac.uk/software/xml/]
Curran J, Clark S: Language independent NER using a maximum entropy tagger. Proceedings of CoNLL03; Edmonton, Canada. 2003, [http://aclweb.org/anthology-new/W/W03/W03-0424.pdf]
Smith L, Rindflesch T, Wilbur WJ: MedPost: a part-of-speech tagger for biomedical text. Bioinformatics. 2004, 20: 2320-2321. 10.1093/bioinformatics/bth227.
Schwartz A, Hearst M: A simple algorithm for identifying abbreviation definitions in biomedical text. Proceedings of PSB. 2003, [http://biotext.berkeley.edu/papers/psb03.pdf]
Minnen G, Carroll J, Pearce D: Robust, applied morphological generation. Proceedings of INLG. 2000, [http://www.aclweb.org/anthology-new/W/W00/W00-1427.pdf]
Nielsen LA: Extracting protein-protein interactions using simple contextual features. Proceedings of BioNLP; New York, USA. 2006, [http://acl.ldc.upenn.edu/W/W06/W06-3322.pdf]
Tjong Kim Sang EF, De Meulder F: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. Proceedings of CoNLL. 2003, [http://aclweb.org/anthology-new/W/W03/W03-0419.pdf]
McCallum A, Li W: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. Proceedings of CoNLL. 2003, [http://www.cs.umass.edu/~mccallum/papers/mccallum-conll2003.pdf]
McDonald R, Pereira F: Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics. 2005, 6 (suppl 1): S6. 10.1186/1471-2105-6-S1-S6.
Sha F, Pereira F: Shallow parsing with conditional random fields. Proceedings of HLT-NAACL. 2003, [http://www.cis.upenn.edu/~pereira/papers/shallow.pdf]
Maximum Entropy Modeling Toolkit for Python and C++. [http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html]
Collier N, Takeuchi K: Comparison of character-level and part of speech features for name recognition in biomedical texts. J Biomed Informatics. 2004, 37: 423-435. 10.1016/j.jbi.2004.08.008.
Jaro MA: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc. 1989, 84: 414-420. 10.2307/2289924.
Jaro MA: Probabilistic linkage of large public health data files. Stat Med. 1995, 14: 491-498. 10.1002/sim.4780140510.
Winkler WE: The state of record linkage and current research problems. Tech rep, Statistics of Income Division, Internal Revenue Service Publication R99/04. 1999, [http://www.census.gov/srd/papers/pdf/rr99-04.pdf]
Joachims T: Making large-scale support vector machine learning practical. Advances in Kernel Methods: Support Vector Machines. 1999, Cambridge, MA: MIT Press
Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, Zhang S, Baskin B, Bader G, Michalickova K, Pawson T, Hogue C: PreBIND and Textomy: mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics. 2003, 4: 11. 10.1186/1471-2105-4-11.
Polavarapu N, Navathe SB, Ramnarayanan R, ul Haque A, Sahay S, Liu Y: Investigation into biomedical literature classification using support vector machines. Proc IEEE Comput Syst Bioinform Conf. 2005, 366-374.
ITI Life Sciences. [http://www.itilifesciences.com]
The TXM pipeline on which this system is based was developed as part of a joint project with Cognia EU, supported by the Text Mining Programme of ITI Life Sciences Scotland. The authors would also like to thank Stuart Roebuck for all his assistance during the development of the system and the preparation of this manuscript.
This article has been published as part of Genome Biology Volume 9 Supplement 2, 2008: The BioCreative II - Critical Assessment for Information Extraction in Biology Challenge. The full contents of the supplement are available online at http://genomebiology.com/supplements/9/S2.
This work was funded by ITI Life Sciences, Scotland, whose mission is to explore commercialization of promising technologies in the life sciences.
The TXM pipeline was created by all authors. Within BioCreative II, BH worked on GM and IPS; MM worked on IAS, IPS and ISS; XW worked on GN and IPS; and CG worked on IPS and ISS. All authors contributed to the preparation of the article.
About this article
Cite this article
Alex, B., Grover, C., Haddow, B. et al. Automating curation using a natural language processing pipeline. Genome Biol 9, S10 (2008) doi:10.1186/gb-2008-9-s2-s10
- Natural Language Processing
- Name Entity Recognition
- Relation Extraction
- Gene Mention
- Gene Lexicon