Toward knowledge support for analysis and interpretation of complex traits
© BioMed Central Ltd 2013
Published: 30 September 2013
The systematic description of complex traits, from the organism to the cellular level, is important for hypothesis generation about underlying disease mechanisms. We discuss how intelligent algorithms might provide support, leading to faster throughput.
The systematic description of variation has gained increasing importance since the discovery of the causal relationship between a genotype placed in a certain environment and a phenotype. The triumvirate of a phenotype, the underlying genotype and the environment in which the genotype is placed plays an important role in enhancing our knowledge. Phenotypes can be applied to clinical questions, for example, the genetic origins of diseases [2–4], as well as to biological problems, such as the evolution of species over time. For example, PhenomeNET compares phenotypes recorded in mutagenesis experiments in eight different species with the signs and symptoms of human diseases and uses orthology to determine viable gene candidates. Another example of the application of phenotypes is the PhenoScape knowledge base, which records phenotypes to answer questions such as ‘How were limbs formed from fins?’ Effective use of phenotype information, and an eventual facilitation of translational research, requires researchers to achieve a common mindset and build a shared conceptual view on the definition, representation and interoperability of phenotypes. While this need has been recognized previously, reaching such a consensus has proven to be a challenging process, even for biological data corresponding to a single species. The intrinsic complexity of phenotypes is the most important obstacle to reaching consensus and a common understanding. In general, phenotypes are considered to be observable characteristics, spanning from the molecular to the environmental level.
Finally, with the increasing amount of available phenotype data, initial steps have been taken to process and achieve interoperability of data from a range of resources using semantic layers [3, 4, 16, 17]. Interoperability through semantic layers means that ontologies are aligned to each other, for example, through lexical or ontological features, and the aligned ontologies enable the comparison of data annotated with different ontologies. The integrated data can then be processed to facilitate biological discoveries. For example, PhenoDigm aligns phenotypes from mutagenesis experiments in several species with the signs and symptoms of human diseases through ontological as well as lexical features. Once the phenotypes are aligned, the mutated genes are ranked according to their phenotype similarity with the disease, and the mutated genes exhibiting the highest similarity constitute candidate genes for that disease. However, interoperability and processing currently cover only a very small subset of all available data, and further projects are required to address these aspects.
In this review, we assess the current status of phenotype information technologies, with a focus on the perception of phenotypes in different domains and the influence of this perception on the above-mentioned four dimensions. We highlight the progress made towards extant goals as well as provide a visionary perspective on the next steps required to bridge the existing solutions to facilitate seamless cross-domain research.
Phenotype annotation resources
The formalization of phenotypes raises several challenges, also common to other domain ontologies such as the Gene Ontology (GO). Most of the existing representations define phenotypes as pre-composed/pre-coordinated entities - that is, concepts that externalize as a whole the intrinsic duality of the underlying localization and the defined trait (for example, MP:0008572 - Abnormal Purkinje cell dendrite morphology; or HP:0008905 - Rhizomelic limb shortening). This implicit duality is sometimes made explicit via the structure of the ontology, using multiple inheritance (that is, one concept with multiple parents); for example, HP:0008905 is a descendant of HP:0001507 - Growth abnormality (denoting the focus on the trait) and of HP:0002813 - Abnormality of the limb bone morphology (denoting the focus on the localization - limb bone).
Glossary of terms
Descriptions that are added to data such as text
Representation of data (for example, biological data sets) through annotations (for example, through ontologies)
A conceptual model of a term according to (a) an entity part that denotes an anatomical or process part and (b) a quality part that characterizes how the entity is affected
To establish the specific reference of a term according to an ontology
A sequence of words in a text that denotes a term according to some external reference system
Pre-composed (pre-coordinated) term
A term that has been affirmed and defined as a whole without division into its constituent parts
Post-composed (post-coordinated) term
A term that is defined according to the decomposition of its constituent parts and the grounding of those parts in one or more external ontologies
A specification of a conceptualization
The Web Ontology Language is a family of formal languages intended to aid machine understanding of resources on the World Wide Web
Resource Description Framework is a family of specifications for describing resources on the World Wide Web. It is a World Wide Web Consortium standard
A collaborative movement to promote common data formats for data re-use by machines
Several ontological approaches have been proposed to implement EQ statements [26, 27] and, subsequently, tools have been developed to manually or automatically [26, 29] construct them. These tools rely on the existence of ontologies that define localization and trait concepts, such as the Foundational Model of Anatomy (FMA) for human, or the Mouse Adult Gross Anatomy Ontology (MA) for mouse, and the species-agnostic Phenotype and Trait Ontology (PATO) for traits.
The formal relations constructing the logical EQ statement are, in this case, inheres_in and has_qualifier (as defined in ), while the rest are the concepts introduced by the external ontologies described above. This example also provides a glimpse of the complex logical formalisms that may emerge from post-composed entities, such as nested definitions of terms (dendrite part_of some Purkinje_cell). From an analysis and exploratory perspective, the EQ formalism provides clear advantages. However, it also brings its own challenges; two of the most important are the formalization of complex entities and of single-term phenotypes. Representative examples of the former are the definition and representation of phenotypes that involve relationships between several anatomical elements, traits of specific parts of anatomical elements (for example, fingertips or interdigital folds), and traits of spatial, functional and non-functional properties of anatomical elements (for example, mineral density, movement, angles). Single-term phenotype expressions, on the other hand, do not externalize the localization-trait duality in an explicit manner (for example, HP:0010884 - Acromelia). Their semantics can still be encoded using the EQ formalism; however, this requires significant human input and comprehension, because in most cases the localization aspect is vaguely defined (for example, in the case of Acromelia: shortness of the <distal part> of a limb). Even though standardization efforts are ongoing, the representation of phenotypes varies across the different resources, from narrative descriptions through vocabularies and terminologies to ontologies. In order to derive novel genotype-phenotype associations or links between genes and drugs/diseases, this diversity of data needs to be integrated in a coherent manner.
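To illustrate, the post-composed form of ‘Abnormal Purkinje cell dendrite morphology’ can be sketched as a nested data structure. This is a minimal Python sketch; the PATO, GO and CL identifiers and the Manchester-style rendering are illustrative assumptions, not authoritative mappings:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Term:
    ident: str   # ontology ID, e.g. "PATO:0000051" (illustrative)
    label: str

@dataclass(frozen=True)
class Entity:
    term: Term
    part_of: Optional["Entity"] = None  # nested localization, e.g. dendrite part_of Purkinje cell

@dataclass(frozen=True)
class EQStatement:
    quality: Term       # the trait (Q, drawn from PATO)
    inheres_in: Entity  # the localization the quality inheres in (E)

    def to_manchester(self) -> str:
        """Render the statement as a Manchester-syntax-like class expression."""
        def render(e: Entity) -> str:
            if e.part_of is None:
                return e.term.ident
            return f"({e.term.ident} and part_of some {render(e.part_of)})"
        return f"{self.quality.ident} and inheres_in some {render(self.inheres_in)}"

# 'Abnormal Purkinje cell dendrite morphology', post-composed (IDs are illustrative):
eq = EQStatement(
    quality=Term("PATO:0000051", "morphology"),
    inheres_in=Entity(Term("GO:0030425", "dendrite"),
                      part_of=Entity(Term("CL:0000121", "Purkinje cell"))),
)
print(eq.to_manchester())
# -> PATO:0000051 and inheres_in some (GO:0030425 and part_of some CL:0000121)
```

The nested part_of in the rendered expression corresponds to the dendrite part_of some Purkinje_cell construct discussed above.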
Efforts ranging from overarching databases to semantic integration via ontologies are underway, but, currently, none of the existing tools are capable of catering for all phenotype-relevant use cases.
Given the variety of phenotype descriptions and resources [19, 35, 36], and the diversity of domains to which phenotypes are relevant, existing tools serve diverse purposes. In the area of medicine, phenotypes are applied to: (i) screening, predicting or prioritizing genes that are potentially relevant to human genetic disorders [3, 4, 16] (for example, PhenomeNET showed a potential connection between Tetralogy of Fallot (OMIM:187500) and the mouse gene Adam19 (MGI:3028702) that is supported by other published studies); (ii) analyzing patients with unidentified medical conditions (the authors suggested 431 potential causes, all novel, for 27 CNV disorders); or (iii) finding new ways of treating diseases with existing drugs (for example, PhenomeDrug suggests that tretinoin could be used as therapy for cystic fibrosis (OMIM:219700); this is also reported in the scientific literature). However, all these tools rely on the public availability of phenotype data represented with semantic annotations, and on diverse semantic similarity metrics to derive associations between phenotypes and genes, diseases or drugs.
Phenotypes have also been used to support clinical diagnosis. Phenomizer, for example, uses a semantic scoring mechanism that calculates the similarity of a phenotype with the signs and symptoms of a disease. This procedure is particularly helpful for patients in whom diagnosis is difficult due to conflicting phenotype information, for example, patients included in the Database of Chromosomal Imbalance and Phenotype (DECIPHER). A similar approach has also been followed by Paul et al. with a focus on skeletal dysplasias. Finally, the same mechanism has been applied to study the extent to which existing disorder classifications (for example, Orphanet) are grounded in the publicly available phenotype-disorder associations. Koehler et al. have shown that by combining OMIM and Orphanet phenotype data it is possible to largely re-create human-made classifications, thus demonstrating the validity of the classifications as well as the value provided by existing disorder characterizations.
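Semantic scoring mechanisms of the kind used by Phenomizer are commonly based on the information content of the ancestor concepts two phenotype terms share. A minimal sketch, assuming a toy is-a hierarchy and invented annotation counts (neither drawn from real HPO or disease data):

```python
import math

# Toy is-a hierarchy: child -> set of parents (all names invented for illustration).
PARENTS = {
    "rhizomelia": {"short_limb"},
    "short_limb": {"limb_abnormality"},
    "limb_abnormality": {"phenotype"},
    "phenotype": set(),
}
# How often each concept (or any descendant) annotates a disease in the toy corpus.
ANNOTATION_COUNT = {"rhizomelia": 2, "short_limb": 5, "limb_abnormality": 20, "phenotype": 40}
TOTAL = ANNOTATION_COUNT["phenotype"]

def ancestors(term):
    """Return the term plus all of its transitive is-a ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def ic(term):
    # Information content: rarer (more specific) concepts are more informative.
    return -math.log(ANNOTATION_COUNT[term] / TOTAL)

def resnik(t1, t2):
    # Similarity = information content of the most informative common ancestor.
    return max(ic(t) for t in ancestors(t1) & ancestors(t2))

print(resnik("rhizomelia", "short_limb"))
```

Profile-level scores, as used for ranking candidate diagnoses, are then typically aggregates (for example, best-match averages) of such pairwise term similarities.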
Based on the assumption that species possess orthologous genes and that these genes exhibit similar phenotypes, the systematic assessment of phenotypes and their corresponding genes may reveal new functions when assessed across species. PhenomicDB is a database that not only holds textual phenotype descriptions for a number of species but also enables the comparison of phenotypes across species through text mining, thus enabling the discovery of novel gene functions. PhenoGO also applies text mining but, instead of directly identifying gene-phenotype associations, it lists connections between phenotypes and GO annotations. Provided a gene has been phenotypically described, a GO profile can be derived from the assigned phenotypes.
Most of the content in existing biological databases is populated through manual curation of the scientific literature; for example, MGD, OMIM or the Zebrafish Information Network. The process of manual curation is, however, time-consuming and labor-intensive, resulting in substantial costs for creating and maintaining the databases. To reduce time, labor and costs, semi-automated solutions are gaining importance in supporting biocuration. PharmGKB, for example, is a database holding information about entities relevant to pharmacogenetics that have been automatically extracted from published literature with text mining. Only parts of the PharmGKB data have been validated through curation efforts.
Pharmspresso, another database generated from the scientific literature, focuses on the extraction of relations between entities relevant to pharmacogenetics. When assessing phenotypes mentioned in OMIM records, van Driel and colleagues were able to derive meaningful phenotype clusters that are consistent with GO annotations and protein-protein interactions. The phenotype information obtained from this study was made available via the MimMiner web interface. PhenoHM, similar to PhenomicDB, allows the comparison of phenotype information at the textual level across species and provides access to orthologous genes via their phenotypes.
In conclusion, even though initial steps have been made towards the integration of phenotype data and the text mining of phenotype information, no exhaustive solutions have yet been developed to address the challenges arising from the analysis of phenotype data in medical, biological and translational contexts. Challenges include differences in the understanding of what constitutes a phenotype in different domains (for example, the synonymous use of disease, syndrome, trait and phenotype), gaps in the terminologies, vocabularies and ontologies used to represent phenotypes (for example, HPO, with its 10,000 concepts, covers skeletal phenotypes well but is sparse in other areas), missing annotations in databases (for example, diseases in OMIM are under-annotated), and domain-specific jargon (for example, automatically generating clusters from phenotypes of different species leads mostly to species-specific clusters instead of the mixed clusters that a shared terminology would produce). Most likely, similar to the generic Web environment, there will never be a ‘one size fits all’ solution; however, clearly defining biological and medical solutions will help to identify potential domain-specific breakthroughs, as well as highlight where improvements are required in order to keep pace with all the ongoing phenotype efforts.
Despite the fact that existing text mining systems still need improvement, systems based on text mining of news articles are already being used to support analysts in detecting infectious disease outbreaks such as pandemic influenza. In experimental biology, groups of researchers have come together to propose shared tasks such as the Natural Language Processing of Biology Text (BioNLP) and BioCreative challenges, which support database curators and accelerate the flow of results from the literature back to the scientific community. BioCreative, for example, has led to developments in gene normalization, chemical and drug name recognition, as well as the assignment of evidence codes to gene function. In the clinical domain, initiatives such as the i2b2 challenge are aimed at helping translate the findings of genomics research into the design of targeted therapies for heritable diseases such as rheumatoid arthritis, hypertension and multiple sclerosis. One common factor linking all of these fields is the heterogeneous conceptual class of phenotypes.
Two necessary research objectives for intelligent tools are (a) recognizing in text the phrases that form phenotypes and (b) linking them to established pre-composed or post-composed concepts in ontologies. As an example, consider the pre-composed term ‘Abnormal Purkinje cell dendrite morphology’ from Figure 2. This might appear in various forms in free text, such as ‘The mice have abnormalities in their Purkinje cell dendritic tree resulting in abnormal morphology’ and ‘Abnormal morphology of dendrites in Purkinje cells’. Success is likely to require a fusion of technologies: prior domain knowledge, natural language processing algorithms and reasoning. In this section we briefly survey the technical issues surrounding these goals and ask whether a robust technical solution is on the near horizon.
Constructing full phenotype vocabularies manually is a daunting task. Despite the success of dedicated phenotype ontologies such as HPO, MP, FYPO and others, the situation regarding pre-composed terminological resources - those in which the term appears without a division into its constituents - is still far from ideal. Such resources have been designed with a focus on classical centralized model databases, such as OMIM or MGD, and with specific user communities in mind. However, as Thorisson et al. argue, the centralized database structure sometimes has difficulty in handling complex relationships. In the case of phenotypes, both the concepts and the disciplines that use them are heterogeneous. This makes standards of scope, granularity and compositionality difficult to establish. Moreover, the generation of one pre-composed ontology covering an entire domain (for example, all phenotypes within one particular species) would not be maintainable due to the sheer number of existing phenotypes.
In time, algorithmic techniques may be developed to fill the gap between pre-composed ontologies and free-text variations. One approach to bridging (also called linking, normalization and grounding) from text to ontology is to develop automated mapping algorithms. Examples of such applications, currently used on a large scale by the biomedical community, are MetaMap, which bridges text and the Unified Medical Language System (UMLS), and the NCBO Annotator, which maps textual entries to entities defined by ontologies stored in the NCBO BioPortal. In general, these applications identify term candidates using shallow parsing, generate plausible alternative forms (synonyms) and then match them to the entities forming the knowledge base (for example, the UMLS). Many options and configurations exist, including the ability to include/exclude particular ontologies or semantic groups, or to control the degree to which variant candidates may differ from the original textual form. However, from a user perspective, it is not apparent what weighting to attach to different forms of evidence. Furthermore, one of the major shortcomings of these algorithms is that they match only single constituent phrases, missing coverage of more complex grammatical structures such as ‘striking upslanting of the palpebral fissures, small nose with broad root or short neck with loose skin’ noted by Schofield et al. in OMIM:211750. They may also fail to find associations between closely related but superficially different surface forms, such as high blood pressure and hypertension. Finally, a related challenge is identifying semantic equivalence across ontologies, for example, in cross-species analysis where equivalent phenotypes need to be identified in model organisms (for example, enlarged hind paws in mouse and enlarged feet in human).
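The candidate-generation and matching strategy outlined above can be caricatured with a small dictionary-based annotator. This is a sketch only: the lexicon, its synonym entry and the HP identifiers are illustrative assumptions, not the actual behavior of MetaMap or the NCBO Annotator:

```python
import re

# Toy lexicon of ontology labels and synonyms mapped to concept IDs (illustrative).
LEXICON = {
    "hypertension": "HP:0000822",
    "high blood pressure": "HP:0000822",  # curated synonym, same concept
    "short neck": "HP:0000470",
}

def normalize(phrase: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", phrase.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def annotate(text: str, max_len: int = 4):
    """Greedy longest-match of token windows against the lexicon."""
    tokens = normalize(text).split()
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in LEXICON:
                hits.append((candidate, LEXICON[candidate]))
                i += n
                break
        else:
            i += 1  # no match starting here; advance one token
    return hits

print(annotate("Patient presents with high blood pressure and a short neck."))
```

The synonym entry is what lets high blood pressure resolve to the same concept as hypertension; without such curated synonyms, surface matching alone fails, as noted above, and complex coordinated phrases escape this single-phrase matcher entirely.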
Specific, phenotype- and/or domain-oriented approaches have also been proposed based on data-driven learning, that is, machine learning from labeled collections of texts and dictionaries (for example, [14, 15, 67]). In these examples, a software program learns from a small manually annotated data set whether a text span represents a phenotype or not and can, after training, be applied to more text to identify phenotype mentions. However, these are pioneering efforts and require additional work in order to become reliable.
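As a deliberately tiny sketch of this data-driven route, the following naive Bayes classifier learns from a handful of labeled spans whether a phrase reads like a phenotype mention. Real systems train on far larger annotated corpora with much richer features; every example span here is invented for illustration:

```python
import math
from collections import Counter

# Toy training data: (text span, label), 1 = phenotype mention, 0 = not.
TRAIN = [
    ("abnormal dendrite morphology", 1),
    ("short neck with loose skin", 1),
    ("enlarged hind paws", 1),
    ("rhizomelic limb shortening", 1),
    ("the mice were housed in cages", 0),
    ("samples were stored at room temperature", 0),
    ("dna was extracted from tail clips", 0),
]

def train(data):
    """Collect per-class token counts for a bag-of-words naive Bayes model."""
    counts = {0: Counter(), 1: Counter()}
    totals = {0: 0, 1: 0}
    for text, label in data:
        for tok in text.split():
            counts[label][tok] += 1
            totals[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, totals, vocab

def is_phenotype(span, model):
    """Classify a span by comparing Laplace-smoothed log-likelihoods."""
    counts, totals, vocab = model
    score = {}
    for label in (0, 1):
        score[label] = sum(
            math.log((counts[label][tok] + 1) / (totals[label] + len(vocab)))
            for tok in span.split()
        )
    return score[1] > score[0]

model = train(TRAIN)
print(is_phenotype("abnormal limb morphology", model))
```

Even this toy model generalizes slightly (it accepts a span it never saw verbatim because the tokens are phenotype-like), which hints at why learned recognizers can outperform pure dictionary lookup, and also why they need substantial annotated data to become reliable.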
Another requirement of intelligent tools is support for extensions and for the generation of mappings between ontological resources. As discussed earlier, phenotypes can be considered broadly as compositional entities. For example, HP:0000365 - High frequency hearing loss consists of a process entity, GO:0007605 - Sensory perception of sound, and a quality, PATO:0002018 - Decreased magnitude, indicating an abnormality of the entity. Given the diverse nature of phenotypes, several researchers have suggested providing post-composed terms in which the constituent parts are provided in a federated fashion by reference to external vocabulary systems. So far, production of post-composed terms has mainly been carried out by manual curation [28, 68]. Lately, however, several automated approaches have been proposed, each of which relies on natural language processing techniques to convert terms from the pre-composed to the post-composed form [26, 29, 69].
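A rough illustration of such automated pre- to post-composed conversion is to split a pre-composed label into quality tokens grounded in a quality lexicon and an entity remainder. The lexicon and the PATO identifiers below are simplified assumptions for illustration, not a real mapping:

```python
# Tiny quality lexicon mapping trait words to PATO-style IDs (illustrative values).
QUALITY_LEXICON = {
    "abnormal": "PATO:0000460",
    "decreased": "PATO:0001997",
    "increased": "PATO:0000470",
    "morphology": "PATO:0000051",
}

def decompose(label: str):
    """Split a pre-composed label into quality tokens and an entity remainder."""
    tokens = label.lower().split()
    qualities = [(t, QUALITY_LEXICON[t]) for t in tokens if t in QUALITY_LEXICON]
    entity = " ".join(t for t in tokens if t not in QUALITY_LEXICON)
    return {"entity": entity, "qualities": qualities}

print(decompose("Abnormal Purkinje cell dendrite morphology"))
```

Real converters [26, 29, 69] additionally ground the entity remainder in anatomy or process ontologies and handle word order, multi-word qualities and nesting, all of which this sketch deliberately ignores.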
Current studies on free-text phenotype recognition and normalization appear hampered by a lack of gold-standard data for training and evaluation, and there is a danger that inferences about the best methods may be impaired. Developing accurate systems depends crucially both on open communication across domains, so that a common understanding about phenotypes and the research needs surrounding them can be achieved, and on the development of annotation standards. Furthermore, high-quality, large-scale data sets are needed both for trainable systems and for benchmark evaluation. The process of collecting and publishing such data sets is time-consuming and costly. Several projects, such as the IMPC, aim to address this challenge but will require time to reach maturity. The lack of open data is also apparent in the clinical domain, where the desire to develop new patient treatments has to be balanced against ethical concerns about patient privacy. Steady progress is being made alongside the development of de-identification algorithms, as well as collaborative initiatives, such as i2b2, which bring together patient data providers and technologists.
Even though initial work has been done on phenotype representation, acquisition and application, further steps are required in order to unlock the full potential of phenotype information, which in turn will drive the knowledge discovery process. Phenotype representations have to be harmonized across different species, and a balance has to be found between the terminologies used within communities and the benefits across research domains. The complexity of phenotype information still hinders the development of a consistent formalization and prevents seamless integration of, and data mining across, diverse resources. The ongoing Linked Open Data efforts (for example, Bio2RDF) provide access to increasing amounts of phenotype data that require a unified representation, which would then facilitate the creation of a broader picture surrounding hypothesis derivation from the data.
On a different note, promising first steps have been achieved in the domain of cross-species hypothesis generation. However, the benefits are limited by both representation and acquisition. With the ever-growing amount of data, manual assessments are at this point infeasible, and automated methods to analyze the data are urgently required. Due to the lack of a uniform representation of phenotypes across different domains, integration, and consequently knowledge propagation, are interrupted. The greatest benefits can be achieved with complete and consistent coverage of up-to-date knowledge about phenotypes and their influencing factors, enabling hypothesis generation and the derivation of novel findings. From a different perspective, the acquisition of phenotype data could also be tremendously improved by resolving mismatched expectations. While a small set of specific solutions and a large set of generalized solutions exist, cross-community and cross-domain efforts are required to enable a better fit of generalized solutions to existing problems, and to allow specific solutions to be repurposed for other problems. A clear and common understanding of existing problems and possible solutions is required, and this can only be achieved through open communication. Open communication will allow us to advance research in the field and to derive future solutions that target well-specified, real issues.
Furthermore, automated and supported acquisition of data is only possible with reliable methods. In recent years, significant and welcome progress has been made in the systematic evaluation of data-driven techniques through shared tasks such as BioCreative and BioNLP. On the other hand, text mining progress has sometimes lagged behind the expectations of user communities due to inaccuracies in system output. This is largely because the language being processed is inherently ambiguous and requires new techniques and resources; for example, cross-domain event extraction, grounding, term decomposition, and harmonized understanding at a document-wide level. Phenotype concept recognition in text is a key non-trivial task that now needs to be addressed. Complex event extraction and normalization involving phenotypes are foundation tasks that need attention from the technical community to deliver working solutions into the hands of users.
Common representation formats for mark-up in text are also important, in particular for phenotype data, and the efforts made over the years by BioCreative and BioNLP should be closely followed. This should be aided by closer dialogue between the text mining, curator and biology communities. Developments in community dialogue on gold standards and system critiques could follow the encouraging model of the User Advisor Group in BioCreative 2011, leading to new approaches for enhancing the user experience.
In conclusion, we believe that improved communication would enable a common understanding across the different research domains and speed up the development of solutions for most of the existing technical issues. Additional workshops are needed to allow researchers to gather and exchange phenotype resources, including their interpretation, representation, mining and integration. Once a shared mindset has been achieved, all four steps mentioned in this paper will reach a streamlining phase and will hence help translational research attain its true potential.
Natural Language Processing of Biology Text
Copy number variation
Deciphering Developmental Disorders
Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources
Foundational Model of Anatomy
Fission Yeast Phenotype Ontology
Human Phenotype Ontology
International Mouse Phenotyping Consortium
Mouse Adult Gross Anatomy Ontology
Mouse Genome Database
Mammalian Phenotype Ontology
National Center for Biomedical Ontology
Open Biological and Biomedical Ontologies
Online Mendelian Inheritance in Man database
Phenotype and Trait Ontology
Unified Medical Language System
Worm Phenotype Ontology
The authors would like to thank Damian Smedley for providing the general idea of Figure 3. Nigel Collier’s research is supported by the European Commission through the Marie Curie International Incoming Fellowship (IIF) programme (Project: Phenominer, Ref: 301806). Tudor Groza’s research is funded by the Australian Research Council (ARC) Discovery Early Career Researcher Award (DECRA) - DE120100508.
This article is published under license to BioMed Central Ltd.