Fig. 2 | Genome Biology

Fig. 2

From: MBE: model-based enrichment estimation and prediction for differential sequencing data


Overview of MBE at train time and prediction time. a At train time, next-generation sequencing reads from each condition, A and B, are used to train a probabilistic classifier. Without loss of generality, we encode condition A using the class label \(-1\) and condition B using the class label \(+1\). More specifically, when applying MBE to the simulated data, we generate N training reads for condition A and N training reads for condition B (i.e., equal class sizes). We then train a classifier on the \(N+N=2N\) data points, where each data point is a one-hot encoding of one read sequence with a corresponding label (\(-1\) for class A or \(+1\) for class B). (For all neural network models, we used a one-hot encoding; for linear models, we used other feature sets, as described in the “Methods” section. One could add any other features as desired, such as a mapped genomic position.) The same read can appear more than once and may appear with discordant labels (e.g., if it appears in both conditions). The only difference between the implementation on the real data and on the simulated data is that the total number of reads may differ substantially between conditions; to account for this, we re-weight the classifier loss function by the total number of reads in each condition to equalize the impact of each condition on the classifier (see the “Methods” section). b At prediction time, a single sequence is given to the trained classifier, which produces a predicted probability for each condition (class). For a two-condition model (A vs. B), we then compute the logarithm of the ratio of the two class probabilities to obtain the log-enrichment (LE). When more than two conditions (classes) are used, an LE can be computed for any pair of conditions using the same calculation. For example, for a three-condition (three-class) model, there are \(\binom{3}{2} = 3\) possible LEs that can be computed, each between two of the conditions.
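The steps in the caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. Its assumptions: a four-letter DNA alphabet, a balancing scheme that weights each condition inversely to its read count (one common way to equalize class impact; the exact weighting is in the “Methods” section), and an LE oriented as the log of the numerator-class probability over the denominator-class probability. All function names are hypothetical, and the class probabilities are taken as given rather than produced by a trained classifier.

```python
import math
from itertools import combinations

ALPHABET = "ACGT"  # assumption: DNA reads with no ambiguous bases


def one_hot(read):
    """Encode a read as a flat 0/1 vector, one block of len(ALPHABET) per position."""
    vec = []
    for base in read:
        vec.extend(1.0 if base == symbol else 0.0 for symbol in ALPHABET)
    return vec


def class_weights(read_counts):
    """Per-condition loss weights, inversely proportional to read count.

    Assumed scheme: weight * count is the same for every condition, so each
    condition contributes equally to the total loss.
    """
    total = sum(read_counts.values())
    n_conditions = len(read_counts)
    return {cond: total / (n_conditions * n) for cond, n in read_counts.items()}


def log_enrichment(probs, numerator, denominator):
    """Log ratio of two predicted class probabilities for a single sequence."""
    return math.log(probs[numerator]) - math.log(probs[denominator])


def all_pairwise_le(probs):
    """One LE per unordered pair of conditions (second over first, by sorted name)."""
    return {
        (a, b): log_enrichment(probs, b, a)
        for a, b in combinations(sorted(probs), 2)
    }
```

For instance, `all_pairwise_le({"A": 0.2, "B": 0.3, "C": 0.5})` returns three entries, one per pair of conditions, matching the count given in the caption for a three-class model.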
