The workflow of MAnorm is summarized in Figure 1
. First, four bed files that describe the coordinates of all predefined peaks and aligned sequence reads of two ChIP-Seq samples are used as input. Second, MAnorm calculates the number of reads in a window of the same length centered at the summit of each peak. Here the window size should be comparable to the median length of ChIP-enriched regions; we recommend 2,000 bp window size for histone modifications and 1,000 bp for transcription factor binding sites. The (M, A
) value of each peak is then defined as:
is the read density at this peak region in ChIP-Seq sample 1 and R
is the corresponding read density in sample 2. To avoid log2
0, we added a value of 1 to the real number of reads for all peaks. Thus, the value of M
describes the log2
fold change of the read density at a peak region between two samples, while A
represents the average signal intensity in terms of log2
-transformed read density. To build the normalization model, each peak of the two samples being compared was further classified as a common or a unique peak, depending on whether or not it overlapped (by at least one nucleotide, as implemented in our analysis in this study) with any peak in the other sample. The downloadable MATLAB MAnorm package (Additional file 1
) also provides a parameter for users to select common peaks based on a cutoff of peak summit-to-summit distance. By default, this value is set to 500 bp for histone modifications and 250 bp for transcription factors. In addition, when a peak overlaps with multiple peaks in the other sample, MAnorm selects the peak with the smallest summit-to-summit distance to avoid potential bias in building the normalization model. Next, robust regression was applied to the M
values of common peaks using iterative re-weighted least squares with a bi-square weighting function [33
] and a linear model was derived to fit the global dependence between the M-A
values of these peaks:
To normalize the (M, A
) values of all peaks, MAnorm performed coordinate transformation to make the A
axis overlap with the linear model derived from regression. The corresponding (M, A
) value under the new coordinate system was then taken as the normalized (M, A
) value of each peak. Finally, a P
-value associated with each peak was calculated to quantify the significance of differential binding at this locus using a Bayesian model developed by Audic and Claverie [16
in which x and y specify the normalized read count at this peak in sample 1 and sample 2, respectively. Additional file 3 provides further details on P-value calculations. When the read densities at most peak regions are high, most peaks associated with absolute M values > 1 are associated with significant P-values. Then, the M value can be used to rank peaks and select differential binding regions, as was done in analyzing ENCODE ChIP-Seq data (Supplementary Table 1 in Additional file 4). When read densities at most peak regions are relatively low, some of the peaks associated with absolute M values > 1 may still fail to obtain significant P-values. In such a case, we suggest ranking peaks by P-values and defining differential binding regions using combined cutoffs of both M value and P-value, as we did when analyzing the ChIP-seq data from Taslim et. al.  (Supplementary Table 2 in Additional file 4).
The output of MAnorm includes the normalized (M, A) value and the corresponding P-value of each peak. To illustrate the normalization process, the (M, A) values of all peaks before and after normalization are plotted together with the linear model derived from common peaks. The MAnorm package will also generate three bed files presenting the genome coordinates for the non-differential binding region and two differential binding regions based on user-specified cutoffs, together with two wig files (corresponding to the two peak lists under comparison) that can be uploaded to a genome browser for visualization of the M value for each peak (Supplementary Figure 9). MATLAB and R versions of the MAnorm package are available for downloading in Additional file 1.
Application of MAnorm to ENCODE ChIP-Seq data
The performance of MAnorm was tested using ENCODE ChIP-Seq data describing histone modifications (H3K4me3 and H3K27ac)  and transcription factor binding (c-Myc and Pol II)  across three human cell lines: H1 ES cells, HeLaS3 cells, and K562 cells . Since these data were generated and processed by different laboratories associated with the ENCODE project, the data sets were reanalyzed and the ChIP-Seq peaks in each sample were redefined using MACS  using a P-value cutoff of 1e-10 for histone modifications and a P-value cutoff of 1e-6 for transcription factor binding. The peaks of histone modifications were further filtered by the false discovery rate (FDR) values modeled by MACS. The target genes of each group of peaks were defined as those RefSeq genes that have a given peak(s) in the promoter region, defined as the region from 8 kb upstream to 2 kb downstream of the transcription start site.
Gene expression data for all three cell types were collected from the Gene Expression Omnibus (GEO) database using accession numbers [GEO:GSE26312] (for H1 ES cells) , [GEO:GSE2735] (for HeLaS3 cells)  and [GEO:GSE12056] (for K562 cells) , and the raw data were reprocessed by dChip . The differentially expressed genes were subsequently identified by Significance Analysis of Microarrays (SAM)  using a combined cutoff of fold change > 2 and FDR < 0.01. In total, 3,465 genes more highly expressed in H1 ES cells and 2,224 genes more highly expressed in K562 cells were identified from the H1 ES to K562 comparison; 5,815 genes more highly expressed in H1 ES cells and 1,649 genes more highly expressed in HeLaS3 cells were identified from the H1 ES cell to HeLaS3 cell comparison; and 3,555 genes more highly expressed in HeLaS3 cells and 5,916 genes more highly expressed in K562 cells were identified from the HeLaS3 cell to K562 cell comparison. To study the relationship between binding differences in peak regions and the expression change of the corresponding target genes, we used the M values of peaks to divide the targeted genes into different groups separated by integer M values from -4 to 4, and then calculated the enrichment score of the overlap between each gene group and those differentially expressed genes. To avoid extreme enrichment scores, groups composed of < 50 genes were merged with the larger of the adjacent two gene groups.
Motif scan and hierarchical clustering of motif scores with peak Mvalue
To detect the potential binding of transcription factors in defined peak regions, we downloaded the position weight matrixes of 130 core vertebrate motifs from the JASPAR database [41
] and performed motif scan [42
] applied to a 1,000 bp window centered at the peak summit. For each motif F
, the raw motif matching score at each peak P
was calculated as:
in which S is a sequence fragment of the same length as the motif and B is the background frequency of different nucleotides estimated from 10,000 random 1,000 bp sequences sampled from the genome. The motif score of motif M in peak P was defined as the raw motif matching score divided by the maximum possible score, that is, the raw motif score obtained by the consensus sequence of the motif.
To identify transcription factors associated with cell type-specific binding of the ChIP'd proteins, we applied hierarchical clustering with Ward's linkage to cluster the M value with the motif matching score of JASPAR motifs in all peaks of cell type 1, and separately the -M value was clustered with the motif scores in all peaks of cell type 2, using '1 - ρ' as the distance metric, where ρ is the Pearson correlation coefficient. Only motifs with an enrichment score > 1.2 and Bonferroni-corrected P-value < 1.0E-5 by Fisher exact test are shown in the clustering plots.
Comparing the performance of MAnorm and other methods
For total read normalization, we divided the read intensity of each peak region by the total number of mapped sequence reads. For quantile normalization, we first divided the whole genome into non-overlapping bins of the same size as the window used in MAnorm (2,000 bp for H3K27ac), and then calculated the read count in each bin. Finally, the distribution of bin read counts was normalized to be the same by matching all quantiles between samples. For normalization by genome-wide MA plots, we first divided the whole genome into non-overlapping bins of the same size as the window used in MAnorm (2,000 bp for H3K27ac), and then calculated the M-A value of each bin. The dependence between M-A value was then removed by subtracting M values with local linear model fitted by LOWESS regression from the genome-wide M-A values.
To compare the performance of MAnorm with the model developed by Taslim et al. , we used MACS to identify peaks from the same Pol II ChIP-Seq datasets used by , and then applied MAnorm to compare Pol II binding profiles between estradiol (E2)-stimulated MCF7 cells and unstimulated MCF7 cells. The gene expression data of unstimulated and E2-stimulated MCF7 cells was obtained from the GEO database, accession number [GEO:GSE11352] . We identified 59 genes showing higher expression in unstimulated MCF7 cells and 130 genes showing higher expression in E2-stimulated (12 h) MCF7 cells using SAM with fold change > 2 and FDR < 0.1. Finally, the performance of MAnorm was evaluated by comparing the difference of Pol II binding determined by both models with the differential expression of target genes.