Problem definition, aim, and approach
Gene expression datasets (RNA-seq and microarray) contain quantitative estimates (observations) of mRNA abundance for a set of genes at multiple experimentally, spatially, or temporally discrete conditions. Across these conditions, the mRNA abundance of transcriptionally co-regulated genes is expected to exhibit coordinate behavior. These co-regulated cohorts of genes include those that are inherent modules of the system being studied, as well as those that may be conditional on applied experimental perturbations. The observations also include transcript abundance estimates for genes that behave independently in the experimental series. Furthermore, even for genes that are transcriptionally co-regulated, variation in RNA processing and mRNA half-life causes fluctuations in transcript abundance, such that abundance estimates are inherently noisy. Thus, the goal of gene expression clustering is to identify and extract the discrete cohorts of genes whose transcripts behave coordinately (albeit with biological noise) across the observations under consideration.
Figure 1 presents simulated gene expression data to illustrate the problem of extracting distinct cohorts of co-expressed genes. Each simulated dataset contains 500 genes, with 100 genes in each of three distinct clusters and 200 genes that do not belong to any cluster. A detailed description of these datasets is provided in the “Methods” section, and their values are provided in Additional file 1: Table S1. Figure 1a shows the same datasets simulated with increasing levels of biological noise (D1 to D4), and Fig. 1b shows the desired result: the extraction of three distinct clusters of genes (C1 to C3) while discarding the genes that behave independently. In conflict with this goal, data partitioning methods require all genes to be included in one of the clusters. For example, application of k-means (the most commonly used method for analyzing gene expression datasets) recovers the three simulated profiles (Fig. 1c). However, each cluster also contains a large cohort of genes that do not share the same expression profile (Fig. 1d). This inclusion results in clusters with high levels of dispersion (differences in expression levels between genes within a cluster) and high levels of inter-cluster similarity, violating the expectations of co-expressed gene clusters and producing clusters whose gene assignment is unreliable. Clust is designed to address this problem by extracting the largest and least dispersed set of clusters whose profiles are distinct, while excluding those genes that do not belong to these clusters. That is, it aims to identify and extract the complete set of genes that exhibit coordinate behavior in the experimental series under consideration. The results of applying clust to these demonstrative datasets are included in Additional file 2: Figure S1.
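As an illustration of how such demonstrative data can be constructed, the following is a minimal sketch of simulating a dataset with three co-expressed clusters plus independently behaving genes. The profile shapes, number of conditions, and noise level are assumptions for illustration only; the exact generative model used for D1 to D4 is described in the “Methods” section.

```python
import numpy as np

rng = np.random.default_rng(0)
n_conditions = 8          # assumed number of conditions
genes_per_cluster = 100   # three co-expressed clusters of 100 genes each
n_unclustered = 200       # genes behaving independently
noise_sd = 0.5            # increase to mimic the noisier datasets (D1 -> D4)

# Three distinct "true" expression profiles (assumed shapes)
x = np.linspace(0, 1, n_conditions)
profiles = [np.sin(np.pi * x), 2 * x - 1, np.cos(2 * np.pi * x)]

# Co-expressed genes: a shared profile plus gene-specific biological noise
clustered = np.vstack([p + rng.normal(0, noise_sd, (genes_per_cluster, n_conditions))
                       for p in profiles])
# Independently behaving genes: no shared profile
unclustered = rng.normal(0, 1, (n_unclustered, n_conditions))

data = np.vstack([clustered, unclustered])   # 500 genes x n_conditions
```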
The clust method
Figure 2 shows an overview of the pipeline of steps that compose the clust method. The method takes one or more datasets as input. The first step is pre-processing the datasets by summarizing replicates, filtering out genes with low expression, and normalizing gene expression values as required. The user may choose their preferred pre-processing options, but the best-practice options are indicated in the description of the publicly available clust Python package online. After that, clust produces a pool of “seed clusters” by applying k-means clustering multiple times to these data with different K values. If the input includes more than one dataset, consensus clusters over these datasets are calculated using the binarization of consensus partition matrices (Bi-CoPaM) method [13]. These seed clusters are then evaluated by the M-N scatter plots technique [14], and elite seed clusters are selected. This technique favours clusters of larger sizes that maintain low dispersion values and guarantees that clusters are distinct. Finally, the elite seed clusters are analyzed to learn the distributions of within-cluster dispersion in the selected datasets; this information is used to remove outliers from clusters and to identify genes that fit within clusters but were missed by the previous steps. A full description of the clust algorithm is provided in Additional file 3: Text S1. A standalone Python implementation of clust is available at https://github.com/BaselAbujamous/clust [15].
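The seed-cluster generation step can be pictured with a minimal sketch such as the one below, assuming a pre-processed genes x conditions matrix. The K values, the dispersion measure, and the function names are illustrative assumptions rather than the exact parameters or code of the clust package; the Bi-CoPaM consensus and M-N scatter plot steps that follow are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_seed_clusters(data, k_values=(4, 8, 12, 16)):
    """Run k-means repeatedly with different K values and pool every
    resulting cluster as a candidate 'seed cluster' (a boolean gene mask)."""
    seeds = []
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        seeds.extend(labels == c for c in range(k))
    return seeds

def within_cluster_dispersion(data, mask):
    """Mean squared deviation of member gene profiles from the cluster mean,
    used downstream to favour large clusters with low dispersion."""
    members = data[mask]
    return float(np.mean((members - members.mean(axis=0)) ** 2))
```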
Data sources and comparative methods
To demonstrate the performance characteristics of clust on real biological datasets, the method was applied to 100 different gene expression datasets (Additional file 4: Table S2). These datasets comprised ten microarray datasets and ten RNA-seq datasets from each of five different model organisms: Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana, and Saccharomyces cerevisiae. To put these performance characteristics in context, seven of the most commonly used co-expression clustering methods (Cross Clustering (CC) [12], k-means [6], self-organizing maps (SOMs) [8], Markov clustering (MCL) [16], hierarchical clustering (HC) [7], Click [10], and WGCNA [17]) were also applied to these datasets. For each of these comparative methods, the best-practice operating procedures were followed as described in the “Methods” section. In all cases, the data pre-processing procedures were the same for each method and were applied as described in the “Methods” section.
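For orientation, a minimal sketch of applying two of the comparative partitioning methods to a single pre-processed dataset is shown below. The choice of K, linkage, and other settings here is illustrative only; the best-practice operating procedures actually followed are described in the “Methods” section.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering

def run_comparators(data, k):
    """Apply k-means and hierarchical clustering (two of the comparative
    methods) to one genes x conditions matrix, returning label vectors."""
    return {
        "k-means": KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data),
        "HC": AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(data),
    }
```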
Clust robustly extracts tight and non-overlapping clusters
Different clustering methods produce very different results when applied to the same dataset (Fig. 3a; Additional file 5: Table S3). For instance, when any two methods are applied to the same dataset, the results are, on average, only 37% identical (i.e., an adjusted Rand index similarity score [18] of 0.37). Therefore, clustering results strongly depend on the method applied, which raises the question of which method performs best.
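Pairwise agreement of this kind can be quantified as sketched below, assuming each method's output is expressed as a vector of cluster labels over the same set of genes (with, for example, a label of -1 for genes left unclustered); the exact comparison procedure used in the study is described in the “Methods” section.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import adjusted_rand_score

def mean_pairwise_ari(labelings):
    """Mean adjusted Rand index over all pairs of clustering results for
    one dataset; each element is a label vector over the same genes."""
    scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
    return float(np.mean(scores))
```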
As clust is a cluster extraction method, it does not necessarily assign all genes to clusters. On average across the 100 test datasets, clust assigned 50% of the input genes to clusters (Fig. 3b). The CC, MCL, and Click methods also extract clusters without forcing all input genes into clusters (Fig. 3b). Importantly, clust produced sets of clusters with significantly lower dispersion than those produced by CC (p value 1.5 × 10−39), k-means (p value 3.2 × 10−10), SOMs (p value 4.8 × 10−27), MCL (p value 8.4 × 10−35), HC (p value 8.0 × 10−16), Click (p value 2.5 × 10−26), or WGCNA (p value 3.9 × 10−19) (Fig. 3c; p values obtained from paired t tests, Additional file 6: Table S4). Clusters produced by clust are discrete, such that genes assigned to one cluster do not fit within the profile boundaries of any other cluster (JI = 0 for all clusters, Fig. 3d; see the “Methods” section for the definition of cluster boundaries). This is not the case for the other methods, where 10% to 50% of the genes included in a given cluster also fit within the boundaries of at least one other cluster (Fig. 3d). Thus, application of these methods to gene expression data produces clusters that are not discrete and that contain between 10% and 50% unreliably assigned genes (Additional file 6: Table S4). Analogous results were obtained when the comparative methods were re-run while optimising the MSE and JI metrics, where applicable (Additional file 2: Figure S2). Notably, datasets with more than 50 samples were excluded from this analysis for logistical reasons. However, a total of 19 datasets would have been included had the number of samples not been limited (Additional file 7: Table S5). Replicating the clustering method comparison over these datasets also yields results analogous to those presented in Fig. 3 (Additional file 2: Figure S3). Thus, the improved performance of clust relative to other clustering methods is independent of dataset size and of the criteria used to optimize the competitor methods.
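The two headline metrics in this comparison, within-cluster dispersion (MSE) and the fraction of genes overlapping other clusters' boundaries, can be computed roughly as follows. This is a minimal sketch in which cluster boundaries are assumed to be the per-condition minimum and maximum of member gene profiles; the precise definitions used in the study are given in the “Methods” section.

```python
import numpy as np

def mean_cluster_mse(data, masks):
    """Average within-cluster dispersion: mean squared deviation of member
    gene profiles from their cluster's mean profile, averaged over clusters."""
    mses = [np.mean((data[m] - data[m].mean(axis=0)) ** 2) for m in masks]
    return float(np.mean(mses))

def overlap_fraction(data, masks):
    """Fraction of clustered genes that also fall inside the profile
    boundaries of at least one other cluster (boundaries assumed to be the
    per-condition min/max of that cluster's member profiles)."""
    lows = [data[m].min(axis=0) for m in masks]
    highs = [data[m].max(axis=0) for m in masks]
    overlapping = total = 0
    for i, m in enumerate(masks):
        for g in np.where(m)[0]:
            total += 1
            if any(np.all((data[g] >= lows[j]) & (data[g] <= highs[j]))
                   for j in range(len(masks)) if j != i):
                overlapping += 1
    return overlapping / total if total else 0.0
```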
The properties of clust’s clusters are independent of their size, that is, the number of genes contained in a given cluster. In contrast, the properties of clusters returned by the majority of the other methods display a significant dependency on cluster size (Additional file 2: Figures S4 and S5; Additional file 8: Table S6). None of the eight methods, including clust, behaves differently on datasets from different species (Additional file 2: Figures S6 and S7) or with different numbers of expressed genes (Additional file 2: Figures S8 and S9). Therefore, neither the number of expressed genes nor the species from which the data were produced affects clustering performance. However, the dispersion of clusters produced by all methods depends on the number of conditions under consideration, such that the more conditions being considered, the worse the clustering results (Additional file 2: Figure S10). This is particularly problematic for cluster overlap (Additional file 2: Figure S11), where the clusters produced by all methods except clust become less distinct as the number of conditions increases.
Clust outperforms other methods on seven different cluster validation indices
To provide an independent assessment of the performance of clust, the clusters produced by all eight methods across all 100 datasets were assessed using seven additional cluster validation/separation indices. The indices comprise the Davies-Bouldin (DB) index [19], the Bayesian information criterion (BIC) [20], the silhouette index [21], the Calinski-Harabasz (CH) index [22], the Ball and Hall (BH) index [23], the Xu index [24], and the within-between (WB) index [25]. None of these seven validation indices is a direct target of optimization by clust. In contrast, some of the other tested clustering methods directly optimize some of these metrics (e.g., CC is designed to optimize the silhouette index [12]). To compare the methods on these metrics (whose scores differ in location, shape, and scale), a non-parametric rank-based comparison was performed. This revealed that clust significantly outperformed all of the other tested clustering methods (Fig. 3e; Additional file 9: Table S7). For example, clust shows significantly lower (better) rank scores than its closest competitor, CC, with a paired Wilcoxon test p value of 3.0 × 10−15.
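The rank-based comparison can be sketched roughly as below, assuming a table of index values with one row per dataset and one column per method. The three indices computed here stand in for the seven used in the study, and the per-dataset ranking and paired Wilcoxon test follow the same spirit rather than the exact procedure described in the “Methods” section.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def validation_indices(data, labels):
    """Three of the seven validation indices, computed on clustered genes
    only (a label of -1 is assumed to mark unclustered genes)."""
    kept = labels >= 0
    X, y = data[kept], labels[kept]
    return {"silhouette": silhouette_score(X, y),      # higher is better
            "DB": davies_bouldin_score(X, y),          # lower is better
            "CH": calinski_harabasz_score(X, y)}       # higher is better

def per_dataset_ranks(score_table, larger_is_better=True):
    """Rank methods within each dataset (1 = best) for one index;
    score_table is a datasets x methods array."""
    t = np.asarray(score_table, dtype=float)
    if larger_is_better:
        t = -t
    return np.apply_along_axis(rankdata, 1, t)

# Example: compare two methods (columns i and j) over all datasets
# ranks = per_dataset_ranks(scores)
# stat, p = wilcoxon(ranks[:, i], ranks[:, j])
```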
Figure 4 shows a comparative example of the clusters produced by each of the eight clustering methods when applied to one of the 100 datasets, namely D83 (Additional files 4 and 6: Tables S2 and S4). This dataset was chosen as it is the time-series dataset with the most similar number of clusters across all the tested methods. A similar figure showing (up to) 14 clusters produced by each method for all 100 datasets is available for download from the Zenodo repository at https://doi.org/10.5281/zenodo.1298541 [26]. The reduced MSE and JI of clust in comparison to the other methods are readily apparent from visual inspection of the gene expression profiles of the genes assigned to each cluster in Fig. 4.
Clust extracts clusters with significantly enriched biological terms
One of the most commonly applied tests of co-expressed gene clusters is functional term enrichment, as a cluster of co-expressed genes is expected to be enriched with genes that have related biological roles. As clust assigns on average 50% of genes to clusters, it was investigated whether this reduction in gene number affects enrichment with functional terms. To do this, each of the methods was evaluated for its ability to detect enrichment of GO terms in 10 datasets from the multicellular organism Arabidopsis thaliana and 10 datasets from the unicellular organism Saccharomyces cerevisiae.
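Enrichment of a functional term within a cluster is conventionally assessed with a hypergeometric over-representation test, of which a minimal sketch is given below; the exact enrichment procedure and multiple-testing correction used in this study are described in the “Methods” section.

```python
from scipy.stats import hypergeom

def go_enrichment_p(cluster_genes, term_genes, background_genes):
    """One-sided hypergeometric p value for over-representation of a GO
    term's genes within a cluster, relative to the expressed background."""
    background = set(background_genes)
    cluster = set(cluster_genes) & background
    term = set(term_genes) & background
    k = len(cluster & term)      # term genes found inside the cluster
    M = len(background)          # background (population) size
    n = len(term)                # term genes in the background
    N = len(cluster)             # cluster size
    return hypergeom.sf(k - 1, M, n, N)   # P(X >= k)
```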
Across the clustering results for these 20 datasets, the different methods detected different numbers of enriched GO terms, ranging from 1530 terms for CC to 4317 terms for WGCNA (Fig. 5a). Clust’s results include 2988 enriched GO terms (Fig. 5a). In total, 7404 GO terms were detected by at least one method, of which 4531 (61%) were detected by two or more methods and only 503 (7%) were detected by all methods (Fig. 5b; Additional file 10: Table S8). A similar observation was made when this analysis was replicated using functional annotation terms from the REACTOME database [27] instead of GO terms (Additional file 2: Figure S12; Additional file 11: Table S9).
Given the disparity in the detection of enriched functional terms between methods, and given that the ground truth is unknown, a test was devised to assess which method best recovered those GO terms that were most likely to be true. Here, the set of GO terms most likely to be true was defined as those identified as significantly over-represented within a given dataset by all methods (n = 503). The distribution of these 503 unanimously agreed terms across the constituent datasets is shown in Fig. 5c. It should be noted that no unanimous terms were identified in four of these datasets (Fig. 5c), and thus only 16 of the 20 datasets contributed GO terms to this analysis. Although there was variation in performance between methods on different datasets (Fig. 5d), overall, the clusters produced by clust achieved significantly better p values for these unanimous GO terms than the clusters produced by CC, k-means, SOMs, HC, Click, or WGCNA, while not being significantly different from those produced by MCL (Fig. 5d; Additional file 10: Table S8; see the “Methods” section for a description of the statistical tests). Thus, in addition to improved clustering performance as defined by multiple cluster validation indices, clust also performs as well as or better than the other tested clustering methods in terms of GO term detection.
Clust extracts clusters of co-expressed genes from multiple datasets simultaneously
The quantity of gene expression data deposited in public repositories is increasing rapidly, primarily due to a reduction in the costs of acquiring such datasets. These datasets come from a multitude of different species, have been generated using different technologies (microarrays and RNA-seq), and have different properties such as numbers of conditions, replicates, and missing values. Clust is designed to enable simultaneous cluster extraction from multiple such heterogeneous gene expression datasets (Additional file 3: Text S1). Irrespective of data type or source species, clust extracts clusters of genes that are consistently co-expressed with each other in all of the given datasets.
To evaluate this feature of clust, ten combinations of d datasets for each d in {2, 3, 4, 5, 6, 7, 8, 9, 10} were selected at random from the ten yeast RNA-seq datasets (D91 to D100; Additional file 4: Table S2). The same experiment was performed over the Arabidopsis datasets. To provide a comparison, the other methods were also applied to these combinations of datasets. However, as these methods are only applicable to a single dataset at a time, the only way to enable their simultaneous analysis was to concatenate the datasets prior to clustering (Additional files 12, 13, 14, 15: Tables S10–S13). As before, clust produces tighter clusters with lower within-cluster dispersion (lower MSE) (Additional file 2: Figure S13a & b) and guarantees that no cluster profiles overlap (JI = 0, Additional file 2: Figure S13c & d). Moreover, and as expected from a biological point of view, both the percentage of input genes included in the extracted clusters (PAG) and the number of generated clusters (K) decrease as more datasets are included as input to clust (Additional file 2: Figure S13e–h). This behavior is expected because, as the number of conditions increases, it becomes less likely that a group of genes will be co-expressed under all of them. For example, when all ten yeast RNA-seq datasets are provided as input to clust, only a single cluster of 52 genes is identified. Of these, 44 are components of the ribosome or participate in ribosome biogenesis (Additional file 16: Table S14). An analogous cluster of 19 genes, all of which are involved in translation (protein synthesis), was obtained when all 10 Arabidopsis RNA-seq datasets were provided to clust (Additional file 16: Table S14).
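The concatenation used to feed multiple datasets into the single-dataset methods can be pictured with the minimal sketch below, assuming each pre-processed dataset is a genes x conditions DataFrame and that per-gene z-score normalization is applied before joining (an assumed choice to make conditions from different experiments comparable); clust itself takes the datasets separately and does not require this step.

```python
import pandas as pd

def concatenate_for_comparators(datasets):
    """Join multiple pre-processed datasets (genes x conditions DataFrames)
    on their shared genes so that single-dataset clustering methods can be
    applied once to the combined matrix."""
    shared = sorted(set.intersection(*(set(d.index) for d in datasets)))
    parts = []
    for d in datasets:
        d = d.loc[shared]
        # per-gene z-score so conditions from different experiments are comparable
        parts.append(d.sub(d.mean(axis=1), axis=0).div(d.std(axis=1), axis=0))
    return pd.concat(parts, axis=1)
```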
Of the other methods, MCL maintains relatively low MSE values as the number of datasets (d) increases. In contrast, the MSE values of the other methods increase as d increases (Additional file 2: Figure S13a–d). Moreover, MCL is the only method other than clust that shows the biologically expected trend of decreasing PAG and K values at higher d values (Additional file 2: Figure S13e–h). Nonetheless, the performance of clust is significantly better than that of MCL in terms of MSE and JI values at all d values (Additional file 2: Figure S13a–d).