Integration of a comprehensive resource of protein complexes
To systematically examine the co-expression of protein complex members, we first assembled an extensive dataset of mammalian protein complexes by integrating various large-scale resources. In order to gain sufficient statistical power in the normalization procedure, our analysis was carried out on protein complexes having at least five members. Initially, a manually curated set of 57 large protein complexes was obtained from our previous publication [10]. These complexes were further revised and with the inclusion of additional complexes, the in-house dataset was increased to 64 manually curated complexes. Next, we acquired 365 manually annotated protein complexes from the core non-redundant set of the CORUM database (downloaded from http://mips.helmholtz-muenchen.de/genre/proj/corum/) [14]. Last, the COMPLEAT protein complex resource (http://www.flyrnai.org/compleat/) was included in this study [15]. The latter resource contains 9703 human protein complexes that were either derived from literature or predicted from protein-protein interaction networks. Here, we only retained 332 reliable large complexes (> = 5 members) based on literature evidences derived from “PINdb” [37] and “CYC2008” (except “predicted”) [38] while discarding the rest. To eliminate redundant complexes in the combined dataset, we employed an iterative procedure as similarly described [15] (Additional file 1: Figures S1A and B). At the first stage, the complexes were ranked according to their source in the following order: manually curated complexes from Ori et al., COMPLEAT, and CORUM. Then, the complexes were sorted within each group according to the number of their members from largest to smallest. In a sequential order starting from first to last, we selected the highest-ranked complex as the representative and removed all complexes that shared 50 % and more of their members with this representative complex. This procedure was iteratively run till the end of the list. In total, 279 non-redundant protein complexes, having 2010 distinct members, were obtained for further analysis (Additional file 1: Figures S1C, S1D and Additional file 2). The filtering procedure used here did not take into account the proteomic data analyzed. We decided to define protein complexes a priori in order to be able to directly compare the co-expression of protein complex members across the two datasets analyzed (see below).
Large-scale proteomic dataset
Two large-scale shotgun proteomic datasets were used in this study. The first dataset was taken from Hansson et al., a time series proteomic experiment (referred to as “reprogramming” dataset) that profiles the proteomic changes occurring through the reprogramming of mouse embryonic fibroblasts to iPSCs [12]. The reprogramming dataset contains expression changes for 5451 proteins measured between six consecutive time points (day 0 – fibroblast – to day 15 – iPSCs, profiled at 3-day intervals) in two replicates. For our analysis, we used the expression changes that were reported as protein ratios between two consecutive time points in the original publication [12]. The second dataset consists of the proteomic profiles of 11 human cell lines generated by Geiger et al. [13] (referred to as “11 cell lines” dataset). From this dataset, we retained only the 3250 proteins that were quantified in at least two out of three replicates for all the 11 cell lines and the rest was discarded. For all the cell lines we retained all the three replicates with the exception of A549 and K562, for which single replicates were identified as outlier by hierarchical clustering and excluded. For our analysis, we used the estimated protein abundances that were reported as intensity Based Absolute Quantification (iBAQ) scores in the original publication [13]. In the next step, we checked which protein complexes were represented in either of the proteomic datasets by having at least five quantified members. For reprogramming datasets, protein complexes were mapped to mouse orthologs using the Ensembl orthology data [39, 40] using the R biomart package [41]. In the end, the analysis was performed on 175 complexes, comprising 1129 proteins from reprogramming dataset, and on 123 complexes, comprising 824 proteins from 11 cell lines dataset (Fig. 2a).
Gene expression dataset
The cell line annotation from Gene Expression Atlas [42] was used to select three replicate microarrays (except Jurkat with two replicates) for 10 cell lines (as used in Geiger et al. [13] but missing GAMG cell line). In addition, manual annotation was performed to include data for the cell lines with no available microarray experiment [43]. Randomly selected microarray experiments (listed in Additional file 3) were pre-processed using RMA normalization [44].
Identification of differentially expressed protein complex members
To investigate compositional rearrangements of protein complexes rather than changes in overall complex abundance, we adapted a two-step normalization method that we described previously [10]. For both the reprogramming and 11 cell lines datasets, the same analysis was separately carried out as follows: individual proteome-wide profiles were median-centered, followed by outlier removal as detailed above. Subsequently, the proteomic profiles were restricted to the proteins annotated to be part of protein complexes. In agreement with our previous work [10], we found that complex members were globally co-expressed across samples (Fig. 1b). Therefore, in case of a general change in complex abundance, the comparison between samples would reveal all members to be differentially expressed (e.g. as described in [12]). In order to reveal compositional changes rather than overall complex variations, we performed an additional complex-wise normalization procedure [10]. First, relative abundances of proteins were calculated with respect to their trimmed mean across all conditions. As a next step, the abundance value of each protein was corrected by subtracting the mean relative abundance of the rest of the complex members. In case of proteins involved in multiple complexes, the average value from all the corresponding complexes was taken into consideration. After complex-wise normalization, each condition (reprogramming time point or cancer cell line) was compared with the rest of experimental conditions to identify differentially expressed complex members by LIMMA (LInear Models for Microarray data Analysis) [45]. p values were adjusted for each experimental condition using false discovery rate (FDR) as described by Benjamini and Hochberg [46] and members were considered as differentially expressed if the adjusted p value was less than 0.05 in at least one of the conditions tested. Protein complexes were considered as “variable” or “stable” depending on the fraction of members that was observed as differentially expressed relatively to the other members. To avoid an inflation of variable complexes by experimental noise, we employed a stringent threshold that requires a complex to contain at least 20 % of differentially expressed members in order to be considered as “variable complex.” Fisher’s exact test was used to assess the significance of the overlap of both variable complex members and complexes between the datasets of reprogramming and 11 cell lines (Fig. 2b).
Analysis of protein-protein interfaces
Retrieval of protein-protein interfaces from the Protein Data Bank
The Protein Data Bank (PDB) structures of 281 protein-protein interfaces involving members of protein complexes included in our resource were derived using the UniProt annotation of the complexes members. All the interfaces structures were checked for quality controls: (i) the interacting proteins must be part of the biological assembly associated to the protein complex in the PDB structure; (ii) the interaction surface must be larger than 400 Å (buried surface area), this value was selected since it represents a valid lower-threshold for the association of biologically meaningful protein assemblies [47] (see below for the method used to calculate buried surface area); (iii) the proteins in the structures must be long, at least 20 amino acids. Since the protein-protein interaction can be represented by multiple PDB structures, the representative PDB entry was chosen as the one with the highest buried surface area. The final dataset composed of 184 protein-protein interfaces was analyzed as described below.
Buried surface area calculations
NACCESS 2.1.1 was used for accessibility calculations (http://www.bioinf.manchester.ac.uk/naccess/). In detail, we calculated the atomic accessible surface defined by rolling a probe of 1.4 Å size around the van der Waals surface of the binary protein complex [48]. We also applied the same to the separate components and then calculated differences in accessibility from the unbound to the bound state. The surface was defined by default van der Waals radii [49]. We calculated the apolar and polar buried surface areas, defined by the sum of surface accessibilities from N, O and C, S atoms, respectively. We then defined core and variable interfaces as follows:
-
Variable are interfaces in which at least one partner has been shown to be differentially expressed in at least one of the condition tested;
-
Core are interfaces in which both partners have been shown to be stably expressed.
Given the aforementioned conditions, we concluded that 184 complexes are suitable for subsequent energy calculations and proper analysis of the core and variable classes.
Energy calculations
In order to ensure that all potentially missing side-chains were properly built and the interface optimized, the HADDOCK webserver refinement protocol was used [50], first described by Kastritis and Bonvin [51]. We used the OPLS force field [52]. Non-bonded interactions were calculated using a cutoff of 8.5 Å. A shift function was applied for calculating Electrostatic energy (Eelec), while a switching function (between 6.5 and 8.5 Å) was applied for the calculation of van der Waals interaction energy (Evdw). Implementation of empirical atomic solvation parameters were used for Desolvation energy calculation (Edesolv) using parameters from Fernandez-Recio et al. [53]). This procedure generated 50 refined protein-protein interfaces per complex, starting from different random velocities. As is default in the HADDOCK protocol, the average score of the top four models was evaluated. All calculations were performed with HADDOCK version 2.1/CNS version 1.2 [54] through the refinement interface of the HADDOCK web server (http://haddock.science.uu.nl/). Details on the protocol have been previously described and can be found in [51] and [55].
Regulation of protein complex stoichiometry
For the reprogramming dataset, mRNA and miRNA expression profiles performed on the same time course experiment as the proteomics data were retrieved from Polo et al. [22]. These datasets were downloaded from the GEO database with the accession number GSE42478. Relevant gene expression profiles were normalized with RMA procedure and LIMMA analysis was used for the comparison of consecutive time points in order to identify differentially expressed probe sets (FDR adjusted p value <0.05 and absolute log2 fold change >0.5). The comparison between “day12” (GEO accession: GSM1038611) and “day9” (GEO accession: GSM1038607) could not be undertaken because both these time points had only single replicates (in order to generate the graphs displayed in Fig. 4c we therefore assigned a fold change with value 0 to this time point). In total, 9183 out of 22,716 probe sets were found to be significant in at least one of the time points. Only the probe set with highest variance was selected to avoid bias towards genes represented by multiple probe sets. Next, we compared the protein abundance profiles to changes in transcript expression across the time course experiment for differentially expressed complex members (Additional file 6). For 71 out of 223 analyzed cases for which we had complete data, we found significant and consistent (same sign) changes at both the protein and mRNA level that were co-occurring at the same time point during reprogramming (indicated as “TRUE” in Additional file 6). For additional 13 cases, the protein change was accompanied by a consistent trend at the mRNA level (absolute log2 fold change >0.5, FDR adjusted p value >0.05, indicated as “TREND” in Additional file 6). We interpreted both such cases as evidence that the abundance of complex member is regulated at the transcriptional level (Fig. 3a and b). Additionally, we retrieved the predicted mRNA targets of significantly regulated miRNAs (LIMMA, FDR adjusted p value <0.01) from targetscan database [56], as similarly done in Polo et al. [22]. Finally, differentially co-expressed mRNA/proteins were linked with inversely expressed miRNAs and these cases were indicated as potentially mediated by miRNAs (Fig. 3b).
Analysis of NuRD composition in HeLa and HEK293 nuclear extracts
Nuclei were isolated from HeLa and HEK293 cells as described in [57]. All the following steps were performed at 4 °C, unless otherwise stated. Nuclei were resuspended at concentration between 1.5e8/mL and 3.0e8/mL in digestion buffer A (0.1 mM MgCl2, 1 mM DTT, 10 μg/mL aprotinin, 5 μg/mL leupeptin) supplemented with DNaseI (Roche, cat.n: 104145) and RNAseA (Sigma, R4642), and immediately diluted with 4 volumes of digestion buffer B (5 % (v/v) glycerol, 20 mM Tris–HCl pH 8.5, 0.1 mM MgCl2, 1 mM DTT, 10 μg/mL aprotinin, 5 μg/mL leupeptin). DNA and RNA digestion was allowed to proceed for 15 min at room temperature. Nuclei were then diluted by addition of 2 volumes of lysis buffer (5 % (v/v) glycerol, 40 mM Tris–HCl pH 7.5, 300 mM KCl, 0.4 mM MgCl2, 2 mM DTT, 4 mM Na3VO4, 10 μg/mL aprotinin, 5 μg/mL leupeptin) and sonicated 4× 30 s; each sonication cycle was followed by 30 s incubation on ice. Lysate was clarified by centrifugation at 14,000× g for 10 min, and the resulting supernatant was further centrifuged at 100,000× g for 30 min. High molecular weight protein complexes were concentrated using a spin filter concentrator (100,000 MWCO) to reach a protein concentration of approximately 20 mg/mL. A total of 80 μL of this solution was separated using by size-exclusion chromatography (SEC) using a 600 × 7.8 mm BioSep4000 column (Phenomenex, Inc.) operated at 250 μL/min in SEC buffer (5 % (v/v) glycerol, 30 mM Tris–HCl pH 8, 200 mM KCl, 0.3 mM MgCl2, 1.7 mM DTT) on a ÄKTA Micro FPLC system (GE). Forty-three fractions (250 μL each) were collect across the column separation range, estimated to be in the range of 2–200 kDa. Urea was added to each fraction to a final concentration of 4 M, and protein were digested by addition of LysC (Wako) (1:100, 4 h at 37 °C) and trypsin (Promega) (1:50, 16 h at 37 °C), following dilution of urea to 2 M. Digestion was stopped by adding TFA to a final concentration of 0.5 % (v/v). Digested peptides were desalted using OASIS C18 96-well plates (Waters) according to manufacturer’s instructions.
Targeted proteomics assays for NuRD members (MBD2, MBD3, MTA1/2/3, and CHD3/4) were developed as described in [57] (Additional file 8). Isotopically labeled peptides corresponding to the selected endogenous peptides (Spike Tides L, JPT) were spiked into each SEC fraction and used as internal standard for quantification. For each fraction, the light-to-heavy ratio of each peptide was normalized to the median ratio of all the NuRD members’ peptides. Normalized ratios were then averaged for each complex member to derive normalized protein intensities that were used for comparison across cell lines (Fig. 5b).
Induction of an artificial paralog switch in the NuRD complex by silencing MBD3
Generation of MBD3 knockdown cell line
Modified human embryonic kidney cells 293 (HEK Flp-In™ T-REx™ 293 cell line, Life Technologies) were grown in Delbecco’s modified Eagle medium (DMEM) containing 5 g/L glucose supplemented with 10 % heat inactivated fetal bovine serum (FBS), blasticidin (15 μg/mL), and zeocin (100 μg/mL). Cell were grown in 37 °C in 5 % CO2. HEK Flp-In™ T-REx™ 293 cells encoding micro-RNA against MBD3 gene were genetically engineered using miRNA BLOCK-iT system from Life Technologies (target: AGATGCTGATGAGCAAGATGA). For stable transfection 200,000 cells were seeded in DMEM with no antibiotics. After 24 h, 100 μL of DMEM (without antibiotics and FBS) with 3 μL X-tremeGENE9 DNA Transfection Reagent (Roche), 100 ng of miRNA containing vector and pOG44 plasmid (Life Technologies) were mixed, incubated 15 min at room temperature and added to cells. Transfected cells were selected by addition of blasticidin (15 μg/mL) and hygromycinB (100 μg/mL). Expression of miRNA was induced for 96 h with 1 mg/mL tetracycline.
Quantification of transcript levels by qPCR analysis
Total RNA was isolate with RNAEasy Mini Kit (Qiagen). A total of 500 ng of RNA was reversely transcribed using QuantiTect Reverse Transcription Kit (Qiagen) following the manufacturer protocol. cDNA was diluted 10-fold in water and used as a template for qPCR with Sybr Green PCR Mater Mix. qPCR reaction was performed according to the following protocol: 1× 95 °C – 10 min (DNA denaturation and polymerase activation); 40× 95 °C −15 s (melting), 60 °C – 1 min (annealing/extension). Gene expression was normalized to a glyceraldehyde 3-phosphate dehydrogenase (GAPDH) gene. Selected primers: MBD2 For: AGCCTCAGTTGGCAAGGTAC Rev: GAGGATCGTTTCGCAGTCTC; MBD3 For: CAGCCGGTGACCAAGATTAC Rev: CATGGTCTTGACCAGCTCCT; GAPDH For:GGTCTCCTCTGACTTCAACA Rev: AGCCAAATTCGTTGTCATAC.
Quantification of protein abundance changes by targeted proteomics
Changes in NuRD member protein abundances were assessed by targeted proteomics. Nuclei were isolated and processed as described in [57]. Isotopically labeled peptides were spiked-in and used as internal standard for relative quantification between cell lines transfected with miRNA against MBD3 and a scrambled miRNA control (Life Technologies), as described in [10]. For this experiment, additional assays for lamin A/C, lamin B1, and lamin B2 were included in the panel and used for normalization (Additional file 8).
Classification of normal and colorectal cancer tissues using variable complex members
We obtained large-scale proteomic dataset from tissue samples of normal mucosa, primary colorectal carcinoma, and nodal metastases from Wiśniewski et al. [30]. From the provided MaxQuant output table, we extracted protein intensities used for label-free quantification (LFQ intensities) and retained proteins that were identified by at least two unique peptides. The original dataset contained eight, eight, and seven samples for normal, carcinoma, and metastasis tissue, respectively. We filtered out proteins that were quantified in less than four samples per group. The intensities from the remaining 6148 protein groups were log2 transformed and normalized by quantile normalization. We used the nearest-centroid approach to classify cancer versus normal tissues using their proteomic profiles [58]. For the classification purpose, protein features were pre-selected from the list of 53 “variable complex members” that were found to be variable in both reprogramming and 11 cell lines dataset (Additional file 6). Using the leave-one-out method, we evaluated the performance of variable complex members in comparison to random proteins. Variable complex members and random proteins were sampled to generate feature sets while the number of features was in the range of 4–28 in increments of 4. For each size, average accuracy was calculated from 100 sampled features. On average, 20 features from variable complex members were sufficient to classify all the cancer versus normal samples correctly. To highlight the discriminative power of variable complex members in comparison to random (n = 20 features), selected examples were visualized as dendrograms using average linkage hierarchical clustering with Euclidean distance as the similarity measure (Fig. 6b).
Availability of data and materials
The source codes used are available at http://www.bork.embl.de/Docu/variable_complexes/ under the GNU General Public License v3.0.
The list of the protein complex interfaces with calculated buried surface area accessibilities and HADDOCK/CNS energies and full simulations are available at http://www.bork.embl.de/Docu/variable_complexes/.
The targeted proteomics data for the analysis of NuRD composition in HeLa and HEK293 nuclear extracts and upon MBD3 knockdown are available at http://www.peptideatlas.org/PASS/PASS00792 and http://www.peptideatlas.org/PASS/PASS00793, respectively.