Machine learning and genome annotation: a match meant to be?
© BioMed Central Ltd 2013
Published: 29 May 2013
Skip to main content
© BioMed Central Ltd 2013
Published: 29 May 2013
By its very nature, genomics produces large, high-dimensional datasets that are well suited to analysis by machine learning approaches. Here, we explain some key aspects of machine learning that make it useful for genome annotation, with illustrative examples from ENCODE.
The complete sequencing of the human genome marked an important milestone in modern biology [1, 2], but it also produced a whole new set of challenges in elucidating the functions and interactions of different parts of the genome. A natural first step to tackling these formidable tasks is to construct an annotation of the genome, which is to (1) identify all functional elements in the genome, (2) group them into element classes such as coding genes, non-coding genes and regulatory modules, and (3) characterize the classes by some concrete features such as sequence patterns. Over the years, many experimental and computational methods have been invented to accelerate this annotation process. Among the popular computational methods are those based on the concept of machine learning (Box 1). Originally a branch of artificial intelligence, machine learning has been fruitfully applied to a variety of domains. The basic idea of machine learning is to construct a mathematical model for a particular concept (an element class in the case of genome annotation) based on its features in some observed data. The model can then be applied to identify new instances of the concept in other data [3–5].
Reviews on machine learning methods for identifying some major classes of genomic elements
One major reason for the popularity of machine learning methods is its ability to automatically identify patterns in data. This is particularly important when the expert knowledge is incomplete or inaccurate, when the amount of available data is too large to be handled manually, or when there are exceptions to the general cases. We use protein binding motifs as an example for this part of the discussion.
Many DNA binding proteins, including transcription factors (TFs), recognize their target DNA regions by some short sequence motifs . The motifs are usually not exact, in that a TF can bind DNA sequences with some differences, albeit with different affinity. When the number of experimentally known binding sites of each TF was limited, it was common for human experts to abstract the binding motifs by some prominent features common to the observed binding sites, such as the most conserved locations within the motifs. The resulting expert knowledge was summarized by simple representations such as consensus sequences.
As high-throughput methods for probing TF binding sites, such as protein binding microarrays  and chromatin immunoprecipitation followed by microarrays (ChIP-chip) [39, 40] or high-throughput sequencing (ChIP-seq) [27, 28] have become popular, it has become easier to collect a large number of sequences that contain binding sites of a TF. Machine learning methods can automatically identify patterns common in these sequences but rare in the genomic background . Due to the large amount of examples available, the resulting models can have richer probabilistic representations with more parameters than what a human expert can easily handle, such as position weight matrices  and profile hidden Markov models .
In many cases, the exact binding locations of the TF in the input sequences are unknown. One needs to try different combinations of binding locations on these sequences and compare the resulting models. This computationally expensive task can be handled by standard methods such as Gibbs sampling  and expectation maximization . There could also be errors in the input data such as false positives, in the form of sequences that do not really contain a binding site of the TF. A human expert could be misled by the false positives and try to form over-fitted models (Box 1) that fit these error cases. A machine learning method with well-controlled complexity, on the other hand, may prefer a model that does not classify the error cases as positives. More generally, each input sequence may contain zero, one, or more occurrences of a motif , the input sequences may contain multiple motifs (for example, due to indirect binding ), and motif finding can be confounded by repeat sequences. All these complications are more easily handled by automatic methods.
Machine learning methods are also good at integrating multiple, heterogeneous features. This property allows the methods to detect subtle interactions and redundancies among features, as well as to average out random noise and errors in individual features. We use the identification of cis-regulatory modules (CRMs) as an example to illustrate this property.
A CRM is a DNA regulatory region, usually containing the binding sites of multiple TFs, that regulate a common gene nearby , such as cis-acting promoters, enhancers, silencers and insulators. Many types of features have been individually used by previous methods to identify CRMs, including the density and statistical over-representation of TF binding motifs, evolutionary conservation, direct binding signals from ChIP-seq or ChIP-chip data, and biochemical marks such as histone modifications . In general, information related to binding patterns is useful for distinguishing between CRMs and genomic regions with fewer binding sites such as exons; information related to evolutionary constraints is more useful in distinguishing between CRMs and other less conserved regions, such as introns and some intergenic regions; information about histone modifications is useful in distinguishing between different types of regulatory regions and between the active and inactive ones. It was found that no single type of features could perfectly separate CRMs from negative examples . As a result, some recent approaches have started to integrate different types of features by using a machine learning framework . Depending on the mathematical form of the model (Box 1), the different features can be integrated in ways from linear combinations to highly nonlinear ones.
Three aspects of data integration by machine learning deserve more discussions. First, the input features could contain redundant information. For example, ChIP-seq signals from TF binding and histone modification experiments can be highly correlated with open chromatin signals . Different machine learning methods handle redundant features in drastically different ways. At one extreme, methods such as the Naïve Bayes classifier  assume input features to be independent of each other for each class. If the features are in fact not conditionally independent, the redundant features could be unfavorably granted stronger influence on the predictions than the non-redundant ones, which affect the accuracy of the resulting models for some problems. On the other hand, methods such as decision trees and logistic regression could have one feature masking out the effects of all other similar features. In general it is good to carefully select a set of non-redundant input features based on biological knowledge, perform dimension reduction to remove dependency between features (by methods such as principal components analysis ) before the learning process, or test the stability of predictions using different subsets of input features.
Second, if a large number of features are integrated but the amount of training examples is limited (a phenomenon quite common in genome annotation), multiple issues could come up. The training examples may not be sufficient to capture the combination of feature values characteristic of the classes to be modeled. If some features irrelevant to the target concepts are included, they could mislead the modeling process, especially in unsupervised settings. There is also a high risk of over-fitting. Feature selection, dimension reduction, regularization and semi-supervised learning (Box 1) are all practical ways to alleviate the problem.
Third, it could be difficult to combine features of different data types. For example, conservation of a potential CRM region is represented by a numeric score (such as PhastCons  and phyloP ), the raw sequence of it is represented by a text string, while peaks of binding signals of a particular TF could be represented by a binary variable. One systematic approach to handling mixed data types is to turn each type of data into a numerical similarity matrix between the input regions before integrating them. Kernel methods  are one particular branch of machine learning methods that work on similarity (kernel) matrices with some simple mathematical requirements. They have been widely used in integrating different types of data for genome annotation. For example, the kernel between two sequences can be defined by their alignment, BLAST scores or k-mer composition [54, 55].
Another strength of machine learning is its ability to construct highly complex models needed by some genomic element classes. We use gene finding as an example here.
Eukaryotic genes have a complex structure with exons, introns and splice sites at the transcriptional level, and coding sequences and untranslated regions at the translational level. An early approach to computational gene finding involves homology search using tools such as BLASTX  to look for regions of a genome with similar sequences in a database of annotated genes or expressed sequence tags. This approach is similar to the standard machine learning method of predicting the label of an object as the one of its nearest neighbor among the labeled examples, but with a maximum dissimilarity cutoff between them. It suffers from not being able to identify genes with no annotated homologs, and not reporting the detailed sub-elements (exons, introns, and so on) of the genes.
Both issues suggest the need for ab initio methods for finding genes directly from sequences. Some of these methods derived sequence-based features of known genes called content statistics (such as codon usage), and defined rules for classifying genes based on these features . It was found that when the features were combined using non-linear artificial neural network classifiers, the prediction performance was much better than some simple combinations of the features , which highlights the need for complex models.
In order to model the detailed structures of eukaryotic genes instead of simply predicting if a region contains a gene or not, machine learning methods based on hidden Markov models [58–60] and generalized hidden Markov models [61–63] have become some of the most popular choices for computational gene finding. These methods consider the observed genomic sequence as the output of some hidden states (the sub-element types or their sub-classes). A complete model is composed of the set of states, and the probabilities of starting a sequence at each state, transition between states and outputting a base/sequence at each state as model parameters. Standard algorithms exist for learning the parameter values of such complex models.
With the advent of RNA-seq [64, 65] and other high-throughput experimental methods for identifying RNA transcripts, ab initio gene finding has become less popular. In the current post-transcriptomic era, machine learning has taken on some new roles in gene finding. First, specialized methods that take into account a large number of features and their complex interactions have been designed to model some biological mechanisms not yet fully understood, such as recognizing transcription start sites and determining the splicing events [66–68]. A related problem is to predict complete isoforms and their relative abundance of a gene in a certain sample, using single-end or paired-end short sequencing reads . Second, methods developed for identifying protein-coding genes are now adopted to identifying long non-coding RNAs (lncRNAs) , which share some common features with protein-coding genes (such as the presence of introns) but the annotations of which are much less complete and thus there are limited training examples available.
An ultimate goal of genome annotation is to identify all types of functional elements and all their occurrences in a genome. How far are we from this goal? Currently there are still likely undiscovered genomic element classes given the rapid discovery of new classes (such as many non-coding RNAs (ncRNAs)) in recent years. Some element classes also have so far very few discovered instances. In terms of machine learning, these two facts imply that currently it is impossible to perform purely supervised learning for all element classes. As a result, in a recent work by the Encyclopedia of DNA Elements (ENCODE) Project Consortium, which aims at delineating all functional elements encoded in the human genome , several different approaches have been adopted to confront this grand challenge.
ENCODE has recently produced about 1,600 sets of whole-genome experimental data that cover many types of molecular states and activities, including transcription, long-range DNA interactions, and chromatin features such as histone modifications, protein-DNA binding and open chromatin signals . In one approach to whole-genome annotation, the experimental data were used to perform unsupervised segmentation of the human genome [6, 71, 72], so that each genomic location was assigned to a segment. The segments were then grouped into clusters in an unsupervised manner. Each resulting cluster was described by the characteristic features of its members. Surprisingly, although the clusters were discovered by an unsupervised procedure, many of them have simple interpretations corresponding to known genomic element classes such as promoters, transcribed regions and enhancers. The segmentation was also able to provide sub-classes of particular element classes, such as enhancers with strong and weak activities in particular cell types, respectively. In general, this method can reveal groups of sequence elements according to the observed data alone without defining the target element classes a priori.
One difficulty in performing this unsupervised clustering was to determine the number of clusters to produce. Having too few clusters would merge elements from different genomic element classes together, while having too many clusters would make the results difficult to interpret. In order to avoid manually defining the number of clusters, in another approach the segments were put onto a two-dimensional toroidal map, where similar segments were put close to each other using the unsupervised Self-Organizing Map method . The resulting map provides a way to study the relationships between different segments and the meanings of each local region on the map without defining the number of clusters and the cluster boundaries . It also provides information about the similarity between different clusters identified by the segmentation method.
The whole-genome segmentation approach has the advantage of requiring no a priori definition of element classes, so that the discovery process is directly based on the observed data. On the other hand, when there is a specific type of genomic elements of interest, customized methods could potentially include more information specific to it. As an example, one important effort of ENCODE was to experimentally validate computationally predicted enhancers using different types of reporter assays . A number of methods had previously been proposed for identifying enhancers in a genome, including both supervised [74, 75] and unsupervised [76, 77] methods. These methods were constrained by a lack of whole-genome experimental data, and had thus relied on either a relatively small set of experimental features or static information such as genomic sequence and evolutionary conservation. Correspondingly, a specialized pipeline was designed by ENCODE to identify enhancers at the genome scale, utilizing the large amount of experimental data produced [6, 78]. Both the predictions from the segmentation approach and the enhancer prediction pipeline were found to achieve reasonable levels of accuracy .
Based on the ENCODE experience, one could imagine a potential hybrid approach that combines the benefits of both the unsupervised and supervised approaches described above. First, the segmentation approach is applied to systematically discover genomic element classes from large datasets. Specialized supervised methods can then be designed to provide detailed modeling of each element class using extra domain knowledge and auxiliary data available.
We conclude by discussing some current challenges in applying machine learning to genome annotation and the corresponding outstanding key research problems.
As mentioned above, very complex models have been proposed for some difficult genome annotation tasks. For example, a machine learning method involving hundreds of features has been reported to achieve high accuracy in predicting tissue-specific alternative splicing . There are also machine learning methods that make use of the concept of ensemble learning, which combines the predictions of multiple (possibly very complex) models to achieve better performance. Examples include the classical bagging  and boosting  methods and Random Forests , which build multiple models using different subsets of examples or features. For instance, Random Forests were reported to outperform some other machine methods in identifying ncRNAs . In fact, ensemble methods have become a popular choice in public machine learning challenges that involve big datasets, such as the well-known Netflix Prize . They outperformed methods that produced simpler models, which were unable to provide the required 10% accuracy improvement in recommending films when compared with the original method used by Netflix.
These complex models, achieving high prediction accuracy notwithstanding, are in general difficult to interpret. Whether one should use them in genome annotation depends on the exact goal of the project. If the goal is to produce a list of genomic elements as accurately as possible, it would be fine to use complex models as 'black boxes' as long as they can provide the required accuracy. On the other hand, if the goal is to use machine learning as a means to understand the underlying biological mechanisms, one may want to construct models that are more easily interpretable. For example, if one hopes to understand the major features that can help identify 80% of the elements of a certain class, a simple model may suffice, sacrificing the prediction accuracy of the remaining 20% as a tradeoff. It is rarely possible to achieve high accuracy and good interpretability at the same time; thus, it is important to define the goal clearly and select the machine learning method accordingly.
Large-scale genomic projects, such as ENCODE , modENCODE [83, 84], 1000 Genomes  and Roadmap Epigenomics , have produced a huge amount of valuable data that cover many aspects of genomes. These datasets offer an unprecedented opportunity to model genomic element classes and the effects of genetic mutations on them. However, a lot of these data are associated with properties specific to the corresponding experiments, such as cell or tissue types, experimental conditions, developmental stages of the animals and the population backgrounds of the sequenced individuals. Care should be taken when using these data to model the active genomic elements in other types of data or to construct general, non-context-dependent models.
It would be useful for machine learning methods to provide multiple levels of abstractions for the static and context-specific information. For example, when direct binding data of a certain TF × from ChIP-seq experiments are available for one cell type, a model can be constructed to describe the relationships between the ChIP-seq signals and the actual binding sites of TF × in this cell type. However, if in a second cell type ChIP-seq experiments have only been performed for some other TFs but not TF X, the model from the first cell type cannot be directly applied to predict the binding sites of TF × in this second cell type as the feature required by the model is not available. In this situation, the ChIP-seq data for the TFs available in the second cell type could be used to construct a higher-level model that describes some features common to the binding sites of different TFs, such as DNA accessibility. Combining it with non-context-specific static information such as sequence motifs of TF X, it is still possible to construct an accurate model for predicting the binding sites of TF × without ChIP-seq data in the second cell type .
A key to providing different levels of abstraction from the same input data is a careful selection of negative examples. In the above example, when constructing the general model for identifying binding sites of any TF, the negative set should contain regions not bound by any TF, including those with no direct ChIP-seq signals and those likely to be depleted of TF binding, such as coding exons. In contrast, when constructing the model for identifying binding sites of a particular target TF based on ChIP-seq data alone, the negative examples should also include binding sites of other TFs in addition to non-TF-binding regions, so that the learned model is specific to the target TF.
For some classes of genomic elements, there are insufficient known examples for supervised machine learning methods to capture the general patterns of the classes. For example, there are few validated enhancers cataloged in databases relative to the expected total number . Many prediction methods have thus relied on a combination of unsupervised learning and manually defined rules [6, 76–78]. In the case of ncRNAs, a large portion of the most functionally characterized ones are the short, strongly structured RNAs, which could bias models for identifying ncRNAs towards this subset and render them less able to detect ncRNAs with few known examples and novel ncRNA classes. Moreover, confirmed negative examples are seldom available, but are crucial to most machine learning methods. A related issue is that most genomic element classes occupy only a small portion of the genome, and therefore the ratio of positive to negative regions is very small. Even a highly accurate classifier could have a lot of false positives among its top predictions.
Second, systematic methods for selecting negative examples for genomic annotation should be developed, taking into account the accuracy of the examples and their influence on the models. For instance, extreme cases that are 'very negative' are likely accurate but not too informative. Relevant discussions for the problem of predicting protein physical interactions provide some good references on this topic [89–91]. There is a relatively small set of verified protein physical interactions, a large number of putative interactions from high-throughput experiments such as yeast-two-hybrid and co-immunoprecipitation, and no protein pairs that are confirmed to never interact. The way to choose negative examples could have profound effects on the resulting models.
When confirmed negative examples are scarce or unavailable, certain features indicative of the class label could be intentionally left out from the model training process and used to evaluate the performance of the model learned from the other features. For example, in a recent study for identifying lncRNAs, information useful for predicting protein-coding genes, including sequence conservation, homology to known genes, codon usage and coding potential, was not used in the lncRNA detection pipeline . An a posteriori check of the coding potential of the predicted lncRNAs could serve as an indirect evidence of the prediction accuracy.
Third, when constructing a model for a particular genomic element class, it is generally good to test for the existence of sub-classes, by means of either a model that allows for multiple clusters per class, pre-clustering of training examples and construction of separate models for different clusters, or post-clustering of predictions.
Finally, if experimental validations are performed to confirm the computational predictions, an active learning  strategy can be adopted to select predictions that maximize the expected information gain or similar measures . Ideally, the computational prediction and experimental validation phases should be repeated for multiple iterations, in order to facilitate the selection of the most informative examples for validation.
We first consider a basic setting of machine learning for binary classification, and later describe variations of it commonly encountered in genome annotation. Suppose we want to identify enhancers in a genome. We divide up the genome into a list of genomic regions X = (x 1, x 2, ..., x N ). Each region x i has a corresponding binary label y i , where y i = 1 if x i is an enhancer, and y i = 0 if not. Each region is described by a set of features x i = (x i 1, x i 2, ..., x im ). For example, x i 1 could be the evolutionary conservation of x i among several close species, x i 2 could be the average ChIP-seq [27, 28] signal of the active enhancer mark H3K27ac (histone 3 lysine 27 acetylation) among the bases within the region from a certain experiment, and so on. The goal of machine learning is to find a function f (called a model) such that f(x i ) is close to y i , that is, to tell if a region is an enhancer from some observed features alone.
To find a suitable f, we need to (1) decide on a mathematical form of f; (2) find known positive and negative examples that can help estimate the parameters of f; and (3) actually estimate the parameters of f, in a way that it likely predicts the labels of regions accurately, even for regions for which the corresponding labels are unknown.
For task 1, many families of f and their corresponding algorithms for learning the parameters have been studied. The popular ones include artificial neural networks , Bayesian networks , decision trees , k-nearest neighbors , random forests  and support vector machines . They differ in the form and complexity of their models. Some examples are shown in Figure 1. Predictions are made based on the mathematical form of f and the parameters learned from the examples, such as the location of orientation of the decision surface of a SVM (Figure 1a).
Task 2 could be quite tricky for some element classes (see the corresponding discussions in the main text). Task 3 can be further sub-divided into two sub-tasks, that of finding a model to fit the training examples, and of ensuring the model to be able to predict the labels of unseen regions correctly. The first sub-task can be achieved by finding parameter values of f that minimize a loss function, such as the sum of squared errors of the n examples, . Since the parameter values are determined according to the observed data, the process is described as 'learning' a model from the data. The second sub-task is achievable only if one makes certain assumptions about the models and examples. It is usually assumed that the observed examples and unobserved instances of each type of functional elements share the same distribution of feature values, and that when two models can fit the observed examples equally well, the simpler one (for example, one with a smaller number of parameters or a smaller total magnitude of the parameter values) is likely to generalize better to unobserved instances. A model too specific to the observed data, usually characterized by a high complexity of the model, may over-fit the data; that is, may capture patterns that are only true for the observed examples. To avoid over-fitting, some machine learning methods control the complexity of the models by model pruning  or regularization , with the observed examples fitting less well to the model as a tradeoff. Some other methods produce multiple models on different subsets of data to identify reliable patterns that appear frequently in these models (see main text for more discussions). Procedure-wise, over-fitting is detected by building a model based on a subset of the examples (the training set), and evaluating its accuracy based on another subset not involved in training (the testing set). An over-fitted model would have good training accuracy but poor testing accuracy. The process is usually repeated with different portions of data treated as the training set in turn to compute the average accuracy in a cross-validation procedure.
When we have a pre-defined set of discrete values for the labels, we have a classification problem with each value corresponding to a class and f is called a classifier. The simplest case of which, when there are only two classes, is called a binary classification problem. A more complex example of classification is to distinguish enhancers (y i = 1) from promoters (y i = 2) and other regions (y i = 0). There are also situations in which the labels can take on continuous values. The corresponding machine learning problem is called a regression problem and f is called an estimator or a regressor. In this review we focus on classification problems as the goal of genome annotation is to identify DNA sequences belonging to each element class. However, it should be noted that, in practice, many classifiers output a continuous value f j (x i ) that indicates how much a region x i appears to belong to the class j. For instance, probabilistic methods formally define f j (x i ) as the data likelihood Pr(x i |y i = j) or posterior probability Pr(y i = j|x i ). Classification can be performed by assigning each region x i to the class j with the largest value of f j (x i ) among all classes.
We have been implicitly assuming that the label of each genomic region can be determined by its own set of features alone. In genome annotation, this is often unrealistic for two reasons. First, it is usually hard to define the exact span of a region. Biologically it could be fuzzy to define exactly where a functional element starts and ends (as in the case of an enhancer), and even if the span could be formally defined (as in the case of an RNA transcript), it is usually not known prior to machine learning. One may therefore consider each base separately and predict whether it overlaps a functional element or not. Second, the class labels for neighboring genomic regions/bases are not independent. For example, if a base is within an intron, the next base should be either within an intron or a splice site. In this situation, the label of a base should be predicted according its own features as well as other bases. There are standard methods for this kind of learning tasks, such as hidden Markov models.
chromatin immunoprecipitation followed by microarrays
chromatin immunoprecipitation followed by high-throughput sequencing
long non-coding RNA
high-throughput cDNA sequencing
KYY is partially supported by the Hong Kong Research Grants Council General Research Fund CUHK418511.