DGEclust: differential expression analysis of clustered count data

We present a statistical methodology, DGEclust, for differential expression analysis of digital expression data. Our method treats differential expression as a form of clustering, thus unifying these two concepts. Furthermore, it simultaneously addresses the problem of how many clusters are supported by the data and uncertainty in parameter estimation. DGEclust successfully identifies differentially expressed genes under a number of different scenarios, maintaining a low error rate and an excellent control of its false discovery rate with reasonable computational requirements. It is formulated to perform particularly well on low-replicated data and be applicable to multi-group data. DGEclust is available at http://dvav.github.io/dgeclust/. Electronic supplementary material The online version of this article (doi:10.1186/s13059-015-0604-6) contains supplementary material, which is available to authorized users.


Background
Next-generation (NGS) or high-throughput sequencing (HTS) technologies provide a revolutionary tool for the study of the genome, epigenome and transcriptome in a multitude of organisms (including humans) by allowing the relatively rapid production of millions of short sequence tags, which mirror particular aspects of the molecular state of the biological system of interest.A common application of NGS technologies is the study of the transcriptome, which involves a family of methodologies, such as RNA-seq [1], CAGE (Cap Analysis of Gene Expression; [2]), SAGE (Serial Analysis of Gene Expression; [3]) and others.Most published studies on the statistical analysis of count data generated by NGS technologies have focused on the themes of experimental design [4], normalisation [5,6] and the development of tests for differential expression [7][8][9].Surprisingly, not much attention has been paid on cluster analysis.
Clustering is considered an important tool in the study of genomic data and it has been used extensively in the analysis of microarrays [10][11][12] (see [13] for a review of different clustering methods).It involves grouping together the expression profiles of different genes across different points in time, treatments and tissues, such that expression profiles in the same group are more similar in some way to each other than to members of other groups.Genes which are clustered together across samples exhibit co-related expression patterns, which might indicate co-regulation and involvement of these genes in the same cellular processes [14].Moreover, whole samples of gene expression profiles can be clustered together, indicating a particular macroscopic phenotype, such as cancer [15].
A large class of clustering methods relies on the definition of a distance metric, which quantifies the "similarity" between any two gene expression data points.Subsequently, clusters are formed, such that the distance between any two data points in the same cluster is minimised.Typical methods in this category are K-means clustering and self-organising maps (SOMs) [13].Another important category includes modelbased clustering algorithms.In this case, the whole gene expression dataset is modeled as a random sample from a finite mixture of probability distributions, where each component of the mixture corresponds to a distinct cluster.The parameters of each component in the mixture (e.g.mean and variance) are usually estimated using an Expectation-Maximization algorithm [13].Hierarchical clustering is yet a third type of clustering methodology, which is particularly suited for modelling genomic (often hierarchically organized) data.It generates a hierarchical series of nested clusters, which can be represented graphically as a dendrogram.This stands in contrast to partition-based methods (e.g.K-means or SOMs), which decompose the data directly into a finite number of non-overlapping clusters [13].
In this paper, we present a model-based statistical approach and associated software (DGEclust) for clustering digital expression data.The proposed methodology is novel, because it simultaneously addresses the problem of model selection (i.e.how many clusters are supported by the data) and uncertainty (i.e. the error associated with estimating the number of clusters and the parameters of each cluster).This is possible by exploiting a Hierarchical Dirichlet Process Mixture Model or HDPMM [16], a statistical framework, which has been applied in the past on multi-population haplotype inference [17] and for modelling multiple text corpora [18].In our version of the HDPMM, individual expression profiles are drawn from the Negative Bi-nomial distribution (as, for example, in [19][20][21]) and parameter estimation is achieved using a novel blocked Gibbs sampler, which permits efficiently processing large datasets (including more than 20K features).We show how the output of our clustering algorithm can be utilised in differential expression analysis and, using simulated data, we demonstrate its superior performance -in terms of Receiver Operating characteristic (ROC) and False Discovery Rate (FDR) curves -when compared to popular alternative methods.When applied on CAGE data from human brains, our methodology manages to detect a significantly larger number of differentially expressed transcripts than alternative methods.An early version of the proposed methodology has been presented previously in poster format and in [22].

Description of the model
Formally, the production of count data using next-generation sequencing assays can be thought of as random sampling of an underlying population of cDNA fragments.Thus, the counts for each tag describing a class of cDNA fragments can, in principle, be modelled using the Poisson distribution, whose variance is, by definition, equal to its mean.However, it has been shown that, in real count data of gene expression, the variance can be larger than what is predicted by the Poisson distribution [23][24][25][26].An approach that accounts for the so-called "over-dispersion" in the data is to adopt quasi-likelihood methods, which augment the variance of the Poisson distribution with a scaling factor, thus dropping the assumption of equality between the mean and variance [27][28][29][30].An alternative approach is to use the Negative Binomial distribution, which is derived from the Poisson, assuming a Gamma-distributed rate parameter.The Negative Binomial distribution incorporates both a mean and a variance parameter, thus modelling over-dispersion in a natural way [19][20][21].For this reason, in this paper we use the Negative Binomial distribution for modelling count data.
We indicate the number of reads for the i th feature (e.g.gene) at the s th sample/library of the j th class of samples (e.g.tissue or experimental condition) with the variable y jsi .In a normalised dataset, we assume that y jsi is distributed according to a Negative Binomial distribution with gene-and class-specific parameters θ ji = {α ji , p ji }: where p ji = α ji / (α ji + µ ji ) is a probability measure, µ ji is the (always positive) mean of the distribution and α ji is a dispersion parameter.Since, the variance σ 2 ji = µ ji + α −1 ji µ 2 ji is always larger than the mean by the quantity α −1 ji µ 2 ji , the Negative Binomial distribution can be thought of as a generalisation of the Poisson distribution, which accounts for over-dispersion.

Information sharing within samples
A common limitation in experiments using NGS technologies is the low number or even absence of biological replicates, which complicates the statistical analysis of digital expression data.One way to compensate for small sample sizes is to assume that all features share the same variance [24].A less restrictive approach is to implement some type of information sharing between features, which permits the improved estimation of the feature-specific over-dispersion parameters by pooling together features with similar expression profiles [19][20][21].In this paper, information sharing between features and between samples is introduced in a natural way due to the use of Dirichlet Processes as priors for the Negative Binomial distribution parameters and due to the hierarchical structure of the mixture model, as explained in this and the following section.
Specifically, within each sample class j, we assume that the set of gene-specific parameters {θ ji } are random and distributed according to a prior distribution G j , i.e.
Furthermore, G j is itself randomly sampled from a Dirichlet process with positive scaling parameter γ j and base probability distribution G 0 [16]: Dirichlet process priors are distributions over distributions and they have become a popular choice in Bayesian inference studies, since they provide an elegant and, in many ways, more realistic solution to the problem of determining the "correct" number of components in mixture models.Standard theoretical results [31] state that a sample G j from Eq. 3 is a discrete distribution with probability one over a countably infinite set of θs.Large values of γ j lead to a large number of similarly likely values of θ, while small values of this parameter imply a small number of highly probable values of θ.This and Eq. 2 imply that the gene-specific parameters θ ji are not all distinct.Different genes within the same class of libraries may share the same value of θ or, in other words, genes in class j are grouped in a (not known in advance) number of clusters, based on the value of θ they share.Equivalently, the expression profiles of different groups of genes in a given class of samples are drawn from different Negative Binomial distributions, each characterised by its own unique value of θ.This clustering effect within each sample class is illustarted in Fig. 1.

Information sharing between samples
Up to this point, we have considered clustering of features (e.g.genes) within the same class of samples, but not across classes of samples (e.g.tissue or conditions).However, in a given dataset, each cluster might include gene expression profiles from the same, as well as from different sample classes.In other words, clusters are likely shared between samples that belong to different classes.This sharing of information between sample classes can be expressed naturally in the context of Hierarchical Dirichlet Process Mixture Models [16].Following directly from the previous section, we assume that the base distribution G 0 is itself random and sampled from a Dirichlet Process with a global scaling parameter δ and a global base distribution This implies that G 0 is (similarly to each G j ) discrete over a countably infinite set of atoms θ k , which are sampled from H, i.e. θ k ∼ H. Since G 0 is the common base distribution of all G j , the atoms θ k are shared among all samples, yielding the desired information sharing across samples (see Fig. 1).

Generative model
In summary, we have the following hierarchical model for the generation of a digital gene expression dataset (see also Fig. 1): where the base distribution H provides the global prior for sampling the atoms θ k = (α k , p k ) and it takes the form of the following joint distribution: where φ is the set of hyperparameters, which H depends on.According to the above formula, the inverse of the dispersion parameter α k is sampled from a LogNormal distribution with mean µ α and variance σ 2 α , while the probability parameter p k follows a Uniform distribution in the interval [0, 1].Given the above formulation, α k is always positive, as it oughts to and, since the LogNormal distribution has a well known conjugate prior, the above particular form for H greatly facilitates the posterior inference of the hyperparameters φ (see below).

Inference
The definition of the HDPMM in Eqs. 5 is implicit.In order to facilitate posterior inference, an equivalent constructive representation of the above model has been introduced in [18] utilising Sethuraman's stickbreaking representation of a draw from a Dirichlet process [31].This representation introduces a matrix of indicator variables Z = {z ji }, where each element of the matrix, z ji , indicates which cluster the i th expression measurement in the j th class of samples belongs to.Two different features belong to the same cluster if and only if their indicator variables, e.g.z ji and z j i , are equal.A major aim of Bayesian inference in the above model, is to calculate the posterior distribution of matrix Z given the dataset Y , i.e. p(Z|Y ).
One approach to estimate this distribution is by utilizing Markov Chain Monte Carlo methods, which generate a chain of random samples as a numerical approximation to the desired distribution.We have developed a blocked Gibbs sampler in the software package DGEclust, which efficiently generates new samples from the posterior p(Z|Y ).The algorithm is an extension of the method presented in [22,32] for inference in non-hierarchical Dirichlet Process mixture models and its advantage is that it samples each element of Z independently of all others.This not only results in very fast convergence, but it also allows implementing the algorithm in vectorized form, which takes advantage of the parallel architecture of modern multicore processors and potentially permits application of the algorithm on very large datasets.Alternative MCMC methods, which are developed on the basis of the popular Chinese Restaurant Franchise representation of the HDP [16,33], do not enjoy the same advantage since they are restricted by the fact that sampling each indicator variable is conditioned on the remaining ones, thus all of them must be updated in a serial fashion.
Details of the algorithm are given in Vavoulis & Gough, 2014 (in preparation).

Testing for differential expression
Assuming that the above algorithm has been applied on a digital expression dataset Y and a sufficiently large chain of samples Z (T0+1) , Z (T0+2) , . . ., Z (T0+T ) -which approximates the posterior p(Z|Y ) -has been generated, we show how these samples can be utilised in a differential expression analysis scenario.We consider two classes of samples, 1 and 2, which might represent, for example, two different tissues or experimental conditions.
A particular feature (gene or transcript) is said to be not differentially expressed, if its expression mea-surements in classes 1 and 2 belong to the same cluster.In more formal language, we say that the expected conditional probability π i = p(nDE i |Y ) that feature i is not differentially expressed given data Y is equal to the expected probability p(z 1i = z 2i |Y ) that the indicator variables of feature i in sample classes 1 and 2 have the same value.This probability can be approximated as a simple average over the previously generated MCMC samples {Z T0+t } t=1,...,T : where 1 1(•) is equal to 1 if the expression inside the parentheses is true and 0 otherwise.Given a threshold π, we can generate a set D of potentially differentially expressed features with probabilities less than this threshold, i.e.D = {i : π i ≤ π}, where π i is calculated as in Eq. 7 for all i.
As observed in [34], the quantity π i measures the conditional probability that including the i th feature in list D is a Type I error, i.e. a false discovery.This useful property makes possible the calculation of the conditional False Discovery Rate (FDR) as follows: From Eq. 8, it can be seen that D always has an FDR at most equal to π.Alternatively, one can first set a target FDR, say tFDR, and then find the maximum possible value of π, such that F DR(π) ≤ tF DR.
Notice that, unlike alternative approaches, which make use of gene-specific p-values, this methodology does not require any correction for multiple hypothesis testing, such as the Benjamini-Hochberg procedure.
Although the computation of FDR using Eq. 8 is approximate (since it depends on the accuracy of the calculation of π i using Eq.7), it is reasonable to assume that the error associated with this approximation is minimised, if sufficient care is taken when postprocessing the MCMC samples generated by the Gibbs sampler.

Application on clustered simulated data
In order to assess the performance of our methodology, we applied it on simulated and actual count data and we compared our results to those obtained by popular software packages, namely DESeq, edgeR and baySeq.
First, we applied our algorithm on simulated clustered data, which were modelled after RNA-seq data from yeast (Saccharomyces cerevisiae) cultures [26].This data were generated using two different library preparation protocols.Samples for each protocol included two biological (i.e. from different cultures) and one technical (i.e. from the same culture) replicates, giving a total of six libraries.
The simulated data were generated as follows: first, we fitted the yeast RNA-seq data with a Negative Binomial mixture model, where each component in the mixture was characterised by its own α and p parameters.We found that 10 mixture components fitted the data sufficiently well.At this stage, it was not necessary to take any replication information into account.The outcome of this fitting process was an optimal estimation of the parameters of each component in the mixture, α k and p k , and of the mixture proportions, w k , where k = 1, . . ., 10.
In a second stage, we generated simulated data using the fitted mixture model as a template.For each simulated dataset, we assumed 2 different classes of samples (i.e.experimental conditions or tissues) and 2, 4 or 8 samples (i.e.biological replicates) per class.For gene i in class j, we generated an indicator variable z ji taking values from 1 to 10 with probabilities w 1 to w 10 .Subsequently, for gene i in sample s in class j, we sampled expression profile (i.e.counts) y jsi from a Negative Binomial distribution with parameters α zji and p zji .The process was repeated for all genes in all samples in both classes resulting in a matrix of simulated count data.Mimicking the actual data, the depth of each library was randomly selected between 1.7 × 10 6 to 3 × 10 6 reads.
Gene i was considered differentially expressed, if indicator variables z 1i and z 2i were different, i.e. if the count data for gene i in the two different classes belonged to different clusters.By further setting z 1i equal to z 2i for arbitrary values of i, it was possible to generate datasets with different proportions of differentially expressed genes.Since the proportion of differentially expressed genes may affect the ability of a method to identify these genes [9], we examined datasets with 10% and 50% of their genes being differentially expressed.
In our comparison of different methodologies, we computed the average Receiver Operating Characteristic (ROC) and False Discovery Rate (FDR) curves over a set of 10 different simulated datasets.In order to keep execution times to a reasonable minimum, we considered datasets with 1000 features.All examined methods allow to rank each gene by providing nominal p-values (edgeR, DESeq) or posterior probabilities (DGEclust, baySeq).Given a threshold score, genes on opposite sides of the threshold are tagged as differentially expressed or non-differentially expressed, accordingly.In an artificial dataset, the genes that were simulated to be differentially expressed are considered to be the true positive group, while the remaining genes are considered the true negative group.By computing the False Positive Rate (FPR) and the True Positive Rate (TPR) for all possible score thresholds, we can construct ROC and FDR curves for each examined method.
The area under a ROC curve is a measure of the overall discriminative ability of a method (i.e. its ability to correctly classify transcripts as differentially or non-differentially expressed).Similarly, the area under an FDR curve is inversely related to the discriminatory performance of a classification method.
Our results are summarised in Fig. 2. When datasets with 10% DE transcripts and a small number of samples is considered (Fig. 2Ai

Application on unclustered simulated data
Since our algorithm is designed to take advantage of the cluster structure that may exist in the data, testing different methods on clustered simulated data might give an unfair advantage to DGEclust.For this reason, we further tested the above methodologies on unclustered simulated data (or, equivalently, on simulated data, where each gene is its own cluster).As in the case of clustered simulated data, the unclustered data were modelled after yeast RNA-seq data [26], following a procedure similar to [9].In a first stage, we used DESeq to estimate unique α i and p i parameters for each gene i in the yeast data.In a second stage, for each gene i in each class j in the simulated data, we sampled randomly a unique index z ji ranging from 1 to N , where N was the total number of genes in the simulated data.Subsequently, for each gene i in each sample s in each class j, we sampled counts y jsi from a Negative Binomial distribution with parameters α zji and p zji .As for the clustered data, gene i was simulated to be differentially expressed by making sure that the randomly sampled indicator variables z i1 and z i2 had different values.The above procedure for simulating unclustered count data makes minimal assumptions about the expression profiles of differentially and non-differentially expressed genes and it is a simple extension of the procedure we adopted for simulating clustered count data.
As above, we considered datasets with 1000 genes, 2 sample classes and 2, 4 or 8 samples per class and we randomly selected the library depths between 1.7 × 10 6 and 3 × 10 6 reads.Also, datasets with either 10% or 50% of their genes being differentially expressed were considered.
Our results from testing DGEcust, edgeR, DESeq and baySeq on unclustered simulated data are presented in Fig. 3.We may observe that all methods perform similarly for both low (Fig. 3A,B) and high (Fig. 3C,D) proportions of DE genes.In particular, despite the absence of a clear cluster structure in the data, DGEclust is not worse than competitive methods.This is indicative of the fact that our algorithm is applicable on a more general category of count datasets, which includes either clustered or unclustered data.As in the case of clustered data, increasing the number of samples improves the overall performance of the various methodologies (Figs.3A) and reduces the number of false positives among the top-ranking discoveries (Figs.3B).The same trends are observed, when a high proportion of DE genes is considered (Figs.3C,D).

Application on CAGE human data
In addition to simulated data, we also tested our method on CAGE libraries, which were prepared according to the standard Illumina protocol described in [45].The dataset consisted of twenty-five libraries isolated from five brain regions (caudate nucleus, frontal lobe, hippocampus, putamen and temporal lobe) from five human donors and it included 23448 features, i.e. tag clusters representing promoter regions (see Materials and Methods for more details).
As illustrated in Fig. 4A, DGEclust was left to process the data for 10K iterations.Equilibrium was attained rapidly, with the estimated number of clusters fluctuating around a mean value of approximatelly 81 clusters (Figs.4A,B) and the normalised autocorrelation of the Markov chain dropping quickly below 0.1 after a lag of only around 10 iterations (Fig. 4C).A snapshot of the fitted model at the end of the simulation illustrates how samples from each brain region are tightly approximated as mixtures of Negative Binomial distributions (i.e.clusters; Fig. 5).After the end of the simulation, we used the procedure outlined in a previous section in order to identify differentially expressed transcripts using a nominal burn-in period of T 0 = 1000 iterations.We also applied edgeR, DESeq and baySeq on the same dataset in order to find differentially expressed transcripts between all possible pairs of brain regions.Transcripts selected at an FDR threshold of 0.01 were considered differentially expressed.
In a first stage, we compared the number of DE transcripts found by different methods.It can be observed (Fig. 6, upper triangle) that, for all pairs of brain regions, DGEclust returned the largest number of DE transcripts, followed by edgeR and, then, DESeq and baySeq which, for all cases, discovered a similar number of DE transcripts.In particular, DGEclust was the only method that found a significant number of DE transcripts (520) between the frontal and temporal lobes, whereas edgeR, DESeq and baySeq found only 7, 3 and 4 DE genes, respectively.By checking the overlap between transcripts classified as DE by different methods (Fig. 6, lower triangle), we conclude that the DE transcripts found by edgeR, DESeq and baySeq are in all cases essentially a subset of the DE transcripts discovered by DGEclust.DGEclust appears to identify a large number of "unique" transcripts, in addition to those discovered by other methods, followed in this respect by edgeR, which also found a small number of "unique" transcripts in each case.
In a second stage, we compared the number of DE genes identified by DGEclust between different brain regions and we constructed a (symmetric) similarity matrix, which can be used as input to hierarchical clustering routines for the generation of dendrograms and heatmaps.Specifically, each element s j1j2 of this matrix measuring the similarity between brain regions j 1 and j 2 is defined as follows: where N is the number of features in the dataset and π i is the probability that transcript i is differentially expressed between regions j 1 and j 2 , as computed by Eq. 7. The similarity matrix calculated as above was used to construct the dendrogram/heatmap in Fig. 7, after employing a cosine distance metric and complete linkage.
It may be observed that the resulting hierarchical clustering reflects the evolutionary relations between different regions.For example, the temporal and frontal lobe samples, which are both located in the cerebral cortex, are clustered together and they are maximally distant from subcortical regions, such as the hippocampus, caudate nucleus and putamen.The last two are also clustered together and they form the dorsal striatum of the basal ganglia.

Conclusions
Despite the availability of several protocols (e.g.single vs. paired-end) and sequencing equipment (e.g. Solexa's Illumina Genome Analyzer, ABI Solid Sequencing by Life Technologies and Roche's 454 Sequencing), all NGS technologies follow a common set of experimental steps (see [7] for a review) and, eventually, generate data, which essentially constitutes a discrete, or digital measure of gene expression.This data is fundamentally different in nature (and, in general terms, superior in quality) from the continuous fluorescence intensity measurements obtained from the application of microarray technologies.In comparison, NGS methods offer several advantages, including detection of a wider level of expression levels and independence on prior knowledge of the biological system, which is required by the hybridisation-based microarrays [7].
Due to their better quality, next-generation sequence assays tend to replace microarrays, despite their higher cost [35].
In this paper, we have addressed the important issue of clustering digital expression data, a subject which is surprisingly lacking in methodological approaches, when compared to micro-arrays.Most proposals for clustering RNA-seq and similar types of data have focused on clustering variables (i.e.biological samples), instead of features (e.g.genes) and they employ distance-based or hierarchical clustering methodologies on appropriatelly transformed datasets, e.g.[19,36,37].For example, the authors in [19] calculate a common variance function for all samples in a Tag-seq dataset of glioblastoma-derived and non-cancerous neural stem cells using a variance-stabilizing transformation, followed by hierarchical clustering using a Euclidean distance matrix.In [36], a Pearson correlation dissimilarity metric was used for the hierarchical clustering of RNA-seq profiles in 14 different tissues of soybean after these were normalised using a variation of the RPKM method [5,6].
The above approaches, although fast and relatively easy to implement, do not always take into account the discrete nature of digital gene expression data.For this reason, various authors have developed distance metrics based on different parameterizations of the log-linear Poisson model for modelling count data, e.g.[38][39][40].A more recent class of methods follows a model-based approach, where the digital dataset is modeled as a random sample from a finite mixture of discrete probability distributions, usually Poisson or Negative Binomial [41][42][43].Utilising a full statistical framework for describing the observed count data, these model-based approaches often perform better than distance-based algorithms, such as K-means [41].
Although computationally efficient and attractive due to their relative conceptual simplicity, the utility of both distance-and finite model-based clustering methods has been critisised [33,44].One particular feature of these methodologies, which compromises their applicability, is that the number of clusters in the data must be known a priori.For example, both the K-means and the SOM alogrithms require the number of clusters as input.Similarly, methods which model the data as a finite mixture of Poisson or Negative Binomial distributions [41][42][43] require prior knowledge of the number of mixture components.Estimating the number of clusters usually makes use of an optimality criterion, such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC), which requires repeated application of the algorithm on the same dataset with different initial choices of the number of clusters.Thus, the number of clusters and the parameters for each individual cluster are estimated separately, making the algorithm sensitive to the initial model choice.Similarly, hierarchical clustering methods often rely on some arbitrary distance metric (e.g. Euclidian or Pearson correlation) to distinguish between members of different clusters, without providing a criterion for choosing the "correct" number of clusters or a measure of the uncertainty of a particular clustering, which would serve to assess its quality.
In this paper, we have developed a statistical methodology and associated software (DGEclust) for clustering digital gene expression data, which (unlike previously published approaches [36][37][38][39][40]) does not require any prior knowledge on the number of clusters, but it rather estimates this parameter and its uncertainty simultaneously with the parameters (e.g.location and shape) of each individual cluster.This is achieved by embedding the Negative Binomial distribution for modelling count data in a Hierarchical Dirichlet Process Mixtures framework.This formulation implies that individual expression measurements in the same sample or in different samples may be drawn from the same distribution, i.e. they can be clustered together.This is a form of information sharing within and between samples, which is made possible by the particular hierarchical structure of the proposed model.At each level of the hierachy, the number of mixture components, i.e. the number of clusters, is assumed infinite.This representes a substantial departure from previously proposed finite mixture models and avoids the need for arbitrary prior choices regarding the number of clusters in the data.
Despite the infinite dimension of the mixture model, only the finite number of clusters supported by the data and the associated parameters are estimated.This is achieved by introducing a blocked Gibbs sampler, which permits efficiently processing large datasets, containing more than 10K genes.Unlike MCMC inference methods for HDPMM based on the popular Chinese Restaurant Franchise metaphor [16], our algorithm permits updating all gene-specific parameters in each sample simultaneously and independently from other samples.This allows rapid convergence of the algorithm and permits developing parallelised implementations of the Gibbs sampler, which enjoy the increased performance offered by modern multicore processors.
Subsequently, we show how the fitted model can be utilised in a differential expression analysis scenario.
Through comparison to popular alternatives on both simulated and actual experimental data, we demonstrate the applicability of our method on both clustered and unclustered data and its superior performance in the former case.In addition, we show how our approach can be utilised for constructing library-similarity matrices, which can be used as input to hierarchical clustering routines.A slight modification of Eq. 9 can be used for constructing gene-similarity matrices (see Vavoulis & Gough 2014, in preparation).Thus, our methodology can be used to perform both gene-and sample-wise hierarchical clustering, in contrast to existing approaches, which are appropriate for clustering either samples [36,37] or genes [38][39][40] only.
In conclusion, we have developed a hierarchical, non-parametric Bayesian clustering method for digital expression data.The novelty of the method is simultaneously addressing the problems of model selection and estimation uncertainty and it can be utilised in testing for differential expression and for sample-wise (and gene-wise) hierarchical grouping.We expect our work to inspire and support further theoretical research on modelling digital expression data and we believe that our software, DGEclust, will prove to be a useful addition to the existing tools for the statistical analysis of RNA-seq and similar types of data.

Materials and methods
We implemented the methodology presented in this paper as the software package DGEclust, which is written in Python and uses the SciPy stack.DGEclust consists of three command-line tools: clust, which expects as input and clusters a matrix of unnormalised count data along with replication information, if this is available; pvals, which takes as input the output of clust and returns a ranking of features, based on their posterior probabilities of being differential expressed; simmat, which also takes as input the output of clust and generates a feature-or library-wise similarity matrix, which can be used as input to hierarchical clustering routines for the generation of heatmaps and dendrograms.All three programs take advantage of multi-core processors in order to accelerate computations.Typical calculations take from a few minutes (as for the simulated data used in this study) to several hours (as for the CAGE data), depending on the size of the dataset and total number of simulation iterations.All analyses in this paper were performed using DGEclust and standard Python/SciPy tools, as well as DESeq, edgeR and baySeq for comparison purposes.

URL
The most recent version of DGEclust is freely available at the following location: https://bitbucket.org/DimitrisVavoulis/dgeclust  In all cases where 10% of the transcripts were differentially expressed (A,B), DGEclust was the best method, followed closely by baySeq.edgeR and baySeq perform similarly to each other and occupy the third place.The discriminatory ability of all methods increases with the available number of samples.In datasets where 50% of the transcripts were differentially expressed (C,D), DGEclust shows the best discriminatory ability, followed by baySeq and edgeR in the second place and DESeq in the third place, similarly to A and B. The larger proportion of differentially expressed transcripts results in worse perofrmance for all methods, with the exception of DGEclust (compare to Ai-Aiii and Bi-Biii).Notice that when 8 samples are available, DGEclust does not return any false positives among the first 500 discoveries, which is the true number of differentially expressed transcripts in the datasets (Diii).The number of differentially expressed transcripts discovered by different methods for each pair of brain regions is illustrated in the upper triangle of the plot matrix, while the overlap of these discoveries is given in the lower triangle.For all pairs of brain regions, DGEclust finds the largest number of differentially expressed transcripts, followed by edgeR (upper triangle of the plot matrix).In addition, the set of all differentially expressed transcripts discovered by edgeR, DESeq and baySeq is essentially a subset of those discovered by DGEclust (lower triangle of the plot matrix).Among all methods, DGEclust finds the largest number of "unique" DE genes, followed by edgeR.
), DGEclust performs better than the similarly performing baySeq, edgeR and DESeq.By increasing the number of samples to 4 and 8 (Figs.2Aii and Aiii), we can increase the discriminatory ability of all methods.Again, DGEclust is the top performer, with baySeq following closely.While ROC curves provide an overall measure of the discriminatory ability of differrent classification methods, they do not immediately indicate whether the deviation from a perfect classification is mainly due to false positives or false negatives.For this reason, we also constructed FDR curves, which illustrate the number of false positives as a function of the total number of positives (i.e. as the decision threshold changes).Mean FDR curves for datasets with 10% DE transcripts are illustrated in Figs.2Bi-Biii.Notice that we measure the false positives only among the first 100 discoveries, which is the true number of DE transcripts in the simulated datasets.We may observe that DGEclust keeps the number of false positives smaller than the corresponding number of the examined competitive methods.This is particularly true at a small number of samples (Figs.2Bi and Bii).For a large number of samples (Fig.2Biii), DGEclust and baySeq perform similarly in terms of keeping false positives to a minimum among the top ranking transcripts.The same trends are observed when we considered datasets with 50% DE transcripts.In this case, the difference in performance between DGEclust and the competitive methods is even more prominent, as indicated by the average ROC curves (Figs.2Ci-Ciii).This is mainly due to a drop in the performance of DESeq, baySeq and edgeR and not to an increase in the performance of DGEclust, which remains largely unaffected.This is particularly true when a larger number of samples is considered (Figs.2Cii,Ciii).In terms of keeping the false positives to a minimum among the top-ranking trascripts, DGEclust is again the top performer, with baySeq in the second place (Figs.2D).Notice that when a large number of samples is available, DGEclust does not return any false positives among the first 500 discoveries (Fig.2Diii), which is the true number of DE transcripts in the simulated datasets.

Figure 1 :
Figure1: Information sharing within and between sample classes.Within each sample class, the count expression profiles for each feature (e.g.gene) across all samples (replicates) follow a Negative Binomial distribution.Different genes within the same class share the same distribution parameters, which are randomly sampled from discrete, class-specific random distributions (G 1 and G 2 in the figure).This imposes a clustering effect on genes in each sample class; genes in the same cluster have the same color in the figure, while the probability of each cluster is proportional to the length of the vertical lines in distributions G 1 and G 2 .The discreteness of G 1 and G 2 stems from the fact that they are random samples themselves from a Dirichlet Process with global base distribution G 0 , which is also discrete.Since G 0 is shared among all sample classes, the clustering effect extends between classes, i.e. a particular cluster may include genes from the same and/or different sample classes.Finally, G 0 is discrete, because it too is sampled from a Dirichlet Process with base distribution H, similarly to G 1 and G 2 .If the expression profiles of a particular gene belong to two different clusters across two classes, then this gene is considered differentially expressed (see rows marked with stars in the figure).

Figure 2 :
Figure2: Comparison of different methods on clustered simulated data.We examined datasets where 10% (A,B) or 50% (C,D) of the transcripts were differentially expressed.The Receiver Operating Characteristic (ROC; Ai-Aiii, Ci-Ciii) and False Discovery Rate (FDR; Bi-Biii, Di-Diii) curves are averages over 10 distinct simulated datasets.The dashed lines in figures Ai-Aiii and Ci-Ciii indicate the performance of a completely random classifier.In all cases where 10% of the transcripts were differentially expressed (A,B), DGEclust was the best method, followed closely by baySeq.edgeR and baySeq perform similarly to each other and occupy the third place.The discriminatory ability of all methods increases with the available number of samples.In datasets where 50% of the transcripts were differentially expressed (C,D), DGEclust shows the best discriminatory ability, followed by baySeq and edgeR in the second place and DESeq in the third place, similarly to A and B. The larger proportion of differentially expressed transcripts results in worse perofrmance for all methods, with the exception of DGEclust (compare to Ai-Aiii and Bi-Biii).Notice that when 8 samples are available, DGEclust does not return any false positives among the first 500 discoveries, which is the true number of differentially expressed transcripts in the datasets (Diii).

Figure 3 :
Figure 3: Comparison of different methods on unclustered simulated data.As in Fig.2, datasets with 10% or 50% differentially expressed transcripts were examined.Average ROC (A,C) and FDR (B,D) curves and dashed lines in A and C are as in Fig.2.All methods, including DGEclust, perform similarly and their overall performance improves as an increasing number of samples(2, 4 and 8)  is considered.This is true regardless of the proportion of the differentially expressed transcripts in the data.

Figure 4 :
Figure 4: Number of clusters in the macaque CAGE data estimated by DGEclust.The Markov chain generated by the Gibbs sampler converged rapidly (after less than 500 iterations) and remained stable around a mean value of 81 clusters until the end of the simulation after 10K simulation steps (A).The histogram constructed from the random samples in A provides an approximation of the posterior probability mass function of the number of clusters in the macaque CAGE data (B).We may observe that the data supports between 75 and 90 clusters.The auto-correlation coefficient of the Markov chain in A drops rapidly below a value of 0.1 at a lag of only around 10 iterations (C).

Figure 5 :
Figure 5: A snapshot of the fitted model after 10K simulation steps.Each panel illustrates the histogram of the log-transformed counts of a random sample from each brain region.Each sample includes 23448 features (i.e.tag clusters corresponding to promoter regions) and it is modelled as a mixture of Negative Binomial distributions (i.e.clusters; the solid black lines in each panel).The overall fitted model for each sample (the red line in each panel) is the weighted sum of the Negative Binomial components estimated for this sample by DGEclust.

Figure 6 :
Figure 6: Comparison of differentially expressed transcripts in the macaque CAGE data discovered by different methods.The number of differentially expressed transcripts discovered by different methods for each pair of brain regions is illustrated in the upper triangle of the plot matrix, while the overlap of these discoveries is given in the lower triangle.For all pairs of brain regions, DGEclust finds the largest number of differentially expressed transcripts, followed by edgeR (upper triangle of the plot matrix).In addition, the set of all differentially expressed transcripts discovered by edgeR, DESeq and baySeq is essentially a subset of those discovered by DGEclust (lower triangle of the plot matrix).Among all methods, DGEclust finds the largest number of "unique" DE genes, followed by edgeR.

Figure 7 :
Figure7: Hierarchical clustering of brain regions in the macaque CAGE data.We constructed a similarity matrix based on the number of differentially expressed transcripts discovered by DGEclust between all possible pairs of brain regions.This similarity matrix was then used as input to a hierarchical clustering routine using cosine similarity as a distance metric and a complete linkage criterion resulting in the illustrated heatmap and dendrograms.Cortical regions (frontal and temporal lobe) are clustered together and are maximally distant from subcortical regions, i.e. the hipocampus and the dorsal striatum (putamen and caudate nucleus) of the basal ganglia.