Multiplatform genome-wide identification and modeling of functional human estrogen receptor binding sites

Refinement of the functional human estrogen receptor binding site model using a multi-platform genome-wide approach reveals extended binding specificity signal.


Background
Estrogen receptors (ERs) are members of the nuclear receptor superfamily of transcription factors, which plays key roles in human development, physiology, and endocrine-related diseases [1]. Two ER subtypes, namely ERα (ESR1) and ERβ (ESR2), mediate cellular responses to hormone exposure in target tissues, and receptors are directed at cis-regulatory sites of target genes via interactions between the zinc finger motifs in their DNA-binding domains and specific nucleotide sequence motifs termed estrogen response elements (EREs). Specificity protein (Sp)-1 and activator protein (AP)-1 transcription factors are also known to tether with ER and regulate a smaller subset of target genes through Sp1 and AP1 binding sites. The importance of these sites to the overall ER biologic response remains unclear.
The consensus ERE sequence (5'-GGTCAnnnTGACC-3') was derived from conserved regulatory elements found in Xenopus and chicken vitellogenin genes and consists of palindromic repeats separated by a three-base spacer to accommodate interactions with receptor dimers [2,3]. Subsequent characterizations of EREs in additional target genes, however, indicate that the majority of response elements deviate from the described consensus sequence [4]. Furthermore, ERE-like sequences are ubiquitous in the human genome, and evidence for ER binding among the majority of ERE-like sites in estrogen response gene expression studies is apparently absent; these factors suggest that additional sequence motifs and/or chromatin features may contribute to the specificity of ER binding and transcriptional response. Recent efforts to model better the ERE by using position weight matrices (PWMs [5]) in order to describe all previously published EREs have resulted in more complete models but with a limited ability to predict bona fide ER binding [6,7]. We posited that the current major challenge with construction of ERE models is the limited datasets available, both for experimentally determined ER-bound sites and for ERE-like sites that do not bind ER.
In addition to compiling the known sites reported in the literature, we pursued a combined experimental and informatics approach to identify additional ER binding sites and their associated direct target genes. This information was analyzed to develop a more faithful model of the ER binding site motifs. To accomplish this, we applied three experimental strategies for ER-binding sites discovery. First, we predicted putative EREs in the promoter regions of direct target genes discovered by microarray analysis [8] and then tested for ER binding at predicted sites of responsive genes by chromatin immunoprecipitation (ChIP) assays [9]. Second, we surveyed ER-binding sites in promoter regions of the human genome by hybridizing fluorescently-labeled ChIP DNA fragments to high-density oligonucleotide arrays ('ChIP-on-chip') with probes against about 30,000 proximal promoters (-1 kilobase [kb] to +0.2 kb relative to the transcription start sites [TSSs]). Third, we detected ER-binding sites across the genome by ChIP, followed by cloning and sequencing of bound fragments ('ChIP-and-clone'). ERE-like sites that have been validated, for binding and nonbinding, by conventional ChIP followed by quantitative polymerase chain reaction (qPCR) using site-specific primers were then used to train and test a model for functional EREs (summarized in Figure 1). In the present study, we focused on functional human EREs to minimize potential noise introduced by species-specific variation, which we have previously observed [8].

Functional estrogen receptor binding sites
We used a combination of literature search and direct experimentation to generate a list of qualified ER-binding sites. In this study we constrained ourselves to using only sites that have been validated for the modeling of functional EREs. We first extracted human ERE sequences that have been experimentally validated in the literature to either bind or not to bind ER. Klinge [4] and Bourdeau and coworkers [10] each described EREs that have been validated by electrophoretic mobility shift assays, transient transfection with reporter gene constructs, or ChIP assays.
Supplementing the list of confirmed EREs gleaned from the literature, we experimentally identified functional ER-binding sites using two whole-genome experimental strategies. The first strategy was to extract candidate ER-binding sites computationally from a list of putative direct ER target genes. Eighty-nine putative direct target genes were identified as genes expressed in MCF-7 cells that were responsive to estradiol treatment, sensitive to inhibition by Faslodex (ICI 182,780), and insensitive to cycloheximide [8]. We then computationally surveyed 3.5 kb regions flanking the TSSs (-3 kb Schematics of ERE discovery and validation for model training and testing    to +0.5 kb) of these 89 genes to identify proximate consensus EREs (allowing for deviations in up to two conserved positions of the consensus motif). Each site was then tested by ChIP assays and qPCR with site-specific primers to determine the true nature of ER binding. Eight EREs were found to be bound by ER, whereas 41 others were not found to be bound by ER.
In our second approach, we performed ChIP assays on estradiol-treated breast tumor cells and detected ER-binding sites using high-density oligonucleotide microarrays (NimbleGen, Madison, WI, USA) containing probes against proximal promoter regions (-1 kb to +0.2 kb from TSS; 12 probes per promoter) of over 30,000 human known gene and RefSeq transcripts annotated in the human genome sequence hg16 (July 2003), NCBI build 34 annotation of the UCSC genome browser. The ChIP-on-chip studies were performed using duplicate array experiments on the ChIP samples and on input control DNA. The promoters that appeared among the top 5% of the binding ratio range (ER antibody versus control) for both replicates, that had at least a 15% increase, and that were supported by consistent binding ratio enrichment across more than four probes or additional evidence of ER regulation from the microarray data were selected. Putative EREs (allowing for up to two mismatches from the consensus) were then identified in the selected promoters, and some were further validated by additional ChIP and qPCR (see Materials and methods, below, for more detail). Out of the total 28 sites tested, 13 were found to bind ER whereas 15 were not. From the literature sources and experiments described above, a total of 45 validated ER-binding sites and 58 validated non-ER-binding were identified, all of which bore close resemblance to the consensus ERE ( Table 1). Each of the 45 binders and 58 non-binders was associated with a gene and most were located in the genes' upstream regulatory regions. This list of 103 genes were used as the training set to assess the significance of ancillary sequence signals beyond the core ERE that might better predict ER binding.

Ancillary signals for ER binding around the core ERE
ER is known to interact with the 10 base pair (bp) long consensus ERE (hereafter referred to as the 'core ERE'). Presence of the consensus site (or its acceptable variants) is required for the direct binding of the ER dimer to the DNA. However, it is still unclear whether the core site alone is sufficient to signal activated ER for such binding or whether additional ERbinding signals in the sequences flanking the core can be used to distinguish binders from nonbinders. An in silico supervised learning experiment was devised to explore these possibilities.
We modeled the problem of finding additional signals for ER binding among the sequences surrounding the core ERE as a binary classification problem (binders versus nonbinders). The features were position-specific motifs surrounding the core ERE. In other words, we asked whether there is any motif (m) within a definitive distance (p) to the core ERE that could help distinguish the binders from nonbinders. The robust and versatile naïve Bayesian classification approach was employed, with binary tuple <m,p> as features, where m is a k-bp long motif and p is the distance between motif m and the core ERE. Two sets of experiments were set up. The first consisted of the core plus its flanking regions, whereas the second considered only the flanking regions of core ERE. The

Sequence logos
motif length k and the size of flanking regions were similarly varied in both setups. The goal was to learn whether motifs of certain length at particular distances from the core could contribute to the discrimination of binders from nonbinders. Although the results indicated that window size (k) of 1 bp generally outperformed the rest (Additional data file 1), the span of flanking regions did not appear to affect significantly the outcome of the two experiments.
These observations suggested that additional signal for ERbinding might lie in the distribution of single nucleotides adjacent to the core ERE. This hypothesis was initially investigated by visually inspecting the sequence logo [11]  To determine the functionality for the conserved cytosine and guanine three bases upstream of the first ERE half-site and downstream of the second ERE half-site, respectively, we examined the interactions between ER and wild-type and mutant binding sites using surface plasmon resonance (SPR) spectroscopy. Purified ER was incubated with either the previously validated ERE (wild-type) adjacent to the GREB1 gene or mutants containing substitutions in the conserved guanine (mutant 1), the canonical half-sites (mutant 2), in the conserved guanine and the cytosine in the symmetrical position upstream of the first ERE half-site (mutant 3; see Figure 3a), and at the sixth bases upstream of the core ERE (mutant 4; see Figure 3a) as the negative control. Substitution of the conserved guanine (mutant 1) disrupted ER binding by about 40%, and, as expected, mutations in the consensus half-sites reduced binding significantly (see Figure 3b). Interestingly, substitution of the cytosine three bases upstream of the first half-site with an adenine (Figure 3b, mutant 3), in addition to the substitution in the conserved guanine adjacent to the second half-site, further diminished binding. As was also expected, the substitution outside the three bases flanking the ERE did not perturb the binding significantly. These results indicate that the conserved guanine outside of the canonical ERE, discovered by modeling novel ER binding site, is involved in mediating ER binding to the ERE.

Modeling functional EREs
The model we propose, h-ERE, exploits the above observation and consists of two PWMs representing the models for bind-ers and nonbinders. The model relies on a decision tree for classifying sites into binders or nonbinders, based on the scores obtained from the individual PWMs. Two sets of 19 bp sequences, one for binders and the other for nonbinders, were formed from the core sites plus three adjacent nucleotides. We further optimized the binding EREs by minimizing the total entropy of the aligned sites (see Materials and methods), while augmenting the nonbinding EREs by taking both strands of the validated nonbinding loci when constructing the weight matrix.
With this information we constructed a decision tree for the selection of high-likelihood binding EREs versus nonbinding EREs. Each matrix was used to calculate the log-likelihood of a given 19 bp site to be a binder or a non-binder. For each site two scores can be calculated, the binding score (S B ) and nonbinding score (S NB ). Complementing the matrices, a decision tree for distinguishing binders and nonbinders based on S B and S NB was constructed from all of the training dataset using the CART algorithm [12] implemented in R, with 100 crossvalidation runs. Figure 4 depicts the resultant tree. Putative binders are further subcategorized into three groups, from weak binding (group 1) to strong binding (group 3). Apart from these groupings, sites whose raw log-likelihood binding score (S B ) is greater than its nonbinding (S NB ) scores are potentially functional sites. Additionally, to reflect the nature Substitution of the conserved guanine outside of the canonical ERE disrupts ER binding of the validated sites, the model considers sequences whose core EREs have more than 4 bp mismatches with the consensus ERE, GGTCAnnnTGACC, to be non-binding.
In all, given a 19 bp sequence, the proposed h-ERE first checks whether the core 13 bp nucleotides contains at most four mismatches to the consensus ERE. Next, based on the computed PWM scores, predictions can be made based on four stringency levels: stringent (considers only sites in group 3 to be binders), medium (predicts sites in group 3 and group 2 to be as binders), relaxed (considers sites of groups 1-3 to be binders), and loose (defines sites whose S B > S NB as binders).

Unbiased mapping of EREs
In previously described studies conducted to identify EREs, the analyses have largely focused on the 5' cis-regulatory regions of direct target genes. However, ChIP analysis of predicted EREs in the extended promoters of 89 putative direct target genes defined by hormone and inhibitor treatments and microarray expression data [8] indicated ER binding in only 9% of the promoter regions from genes apparently directly regulated by ER. These results suggest that ER may target binding sites outside of the canonical 5' promoter regions. Therefore, to discover additional EREs in an unbiased manner and to generate a dataset for testing model performance, we employed the 'ChIP-and-clone' strategy of cloning precipitated DNA fragments into a bacterial plasmid vector, followed by direct sequencing of the inserts to identify ER binding sites. This approach has the potential to sample any region of the genome, as opposed to PCR-based or microarray-based directed strategies, which target specific sites or functional regions, respectively. Anti-ER ChIP was performed on nuclear lysates from estradiol-treated MCF-7 cells, followed by cloning of precipitated binding sites into the pCR-Blunt (Invitrogen, Carlsbard, CA, USA) vector. From the ChIP library, a total of 1006 clones were successfully sequenced and specifically mapped to the human genome. Based on the presence of ERE-like sequences or supporting microarray expression data for ER regulation of the adjacent transcript, 33 clones were selected for subsequent validation by ChIP and site-specific qPCR. An additional 75 clones were randomly selected from those that have neither EREs nor adjacent transcript expression data for further validation (data not shown). Thus, a total of 108 clones were validated (five contained EREs and are supported by microarray expression data, 23 with only EREs and no supporting expression data, five supported by microarray but no EREs, and 75 with neither EREs nor expression data).
The validation results indicate that ERE-like sequences remain the predominant feature of functional ER-binding sites. In the five clones with EREs and supporting microarray expression data for ER regulation, the validation rate was 100%; for the 23 clones that encode EREs but lack supporting expression data, the validation rate was 57% (13/23). In contrast, clones for which no ERE-like sequences were detected, the validation rates were 40% (2/5) and 9% (7/75), respectively, for those with and without supporting expression data for the adjacent gene. A total of 19 EREs were found in the 18 empirically verified ER-bound clones. Interestingly, the five validated clones that contain EREs and are adjacent to genes that were shown to be hormone regulated map to intronic regions of the target genes. This is consistent with our hypothesis that ER may bind outside of the 5' cis-regulatory regions of target genes. Moreover, when we tested ERE-like sequences in the promoter region of one of the target genes, SIAH2, we did not detect ER binding, suggesting that the intronic ERE is the functional ER binding site (data not shown) for this particular target gene. From this analysis, all EREs that bind ER and did not bind ER in the validation experiments were then used to test model performance (Table 2).
Currently, three other models have been widely used to predict functional EREs: consensus sequence search (allowing for certain mismatches), TRANSFAC matrices using MATCH [13] search algorithm, and Dragon ERE finder [6]. The performance of these models (under different settings) is compared with h-ERE in Table 3. Although h-ERE was not the most sensitive or the most specific, it offered the best balance between the two criteria. With the interest of having a single performance measure that captures the balance between sensitivity and specificity, harmonic means of the two were computed (see van Rijsbergen [14] and Materials and methods). By this measure, h-ERE offers the best balance in performance, even under different stringency settings.

Whole-genome predictions of ER-binding sites
In order to assign specific ERE predictions, we constructed a decision tree using binding and nonbinding scores from the PWMs (see Materials and methods). The parameters were selected to minimize error on the classification of the training set. We scanned the human genome (UCSC hg17) using the h-ERE decision tree and detected 38,024 putative sites under the 'stringent' criteria, including 3607 EREs encoded by Alu Decision tree for ERE prediction Figure 4 Decision tree for ERE prediction. Group 3 EREs would be predicted to be the highest likelihood binders of ER. ER, estrogen receptor; ERE, estrogen response element; S B , binding score; S NB , nonbinding score. repeats. To assess further the performance of our predictive algorithm, we randomly selected 60 sites predicted to be ER binders by h-ERE (group 3 sites) and 60 nonbinders (group 0 sites) for further experimental validation by ChIP and qPCR. Of the 120 sites, specific primers for qPCR could be designed for only 64 sites, 44 of which are binders whereas 20 are nonbinders. Fourteen per cent (6/44) of the predicted binding sites were shown to bind ER (more than twofold enrichment over control) whereas no binding was detected in any of the sites classified as nonbinders (0/20), suggesting that the false-negative rate is less than 5%. The low rate of false negatives allows us to demarcate in the human genome the global set of EREs that contain the universe of putative true binding motifs. This suggests that, taking into account the 14% validation rate, there would be 5363 validated ER-binding sites within the global optimized ERE set for the MCF-7 cells, under conditions similar to our experimental setup.
We then considered how much of the predictions could be attributed to random occurrences simply by chance alone. A series of Monte Carlo simulations were carried out to estimate the false positive rate of h-ERE. One thousand nucleotide sequences 1 megabase (Mbp) long were generated randomly, governed by the empirical single nucleotide distribution of the human genome (UCSC hg17), and were run through h-ERE. The numbers of predicted binders divided by 1 Mbp was reported as the h-ERE false discovery rate per base pair. Taking a conservative estimate of the noise and extrapolating it, for the human genome (about 3 gigabases [Gbp]) about 33,000 (approximately 86%) were estimated to be false positives, and hence approximately 5000 ER-binding sites are present in the human genome.
Taken together, the convergence of these two analyses suggest that binding site motifs will be subject to statistical noise

Discussion
In this report we describe a combinatorial experimental approach for transcription factor binding site discovery and demonstrate superior performance of the resultant computational model. The experimental strategies presented here address the major problem in binding site modeling, namely the small size of experimental datasets for model training and testing. The unique use of validated nonbinding EREs and examining flanking sequences allowed us to identify a novel feature of the ERE.
Previous efforts to characterize the ERE have included mutagenesis studies and electrophoretic mobility shift assays or DNase footprinting experiments. For example, Driscoll and colleagues [15,16] demonstrated that single mutations in the core ERE can greatly disrupt ER binding. Furthermore, they found that changes in the flanking sequences can also either enhance or disrupt binding, depending on corresponding changes in the core ERE. Their experiments examined up to two bases flanking the core ERE, and they found that an A or T in the position immediately flanking the core ERE is important for optimal ER binding. Their observation is supported by the model we present here (Figure 2). In our study we found additional single nucleotide features flanking the consensus ERE that are associated with binding site functionality. In particular, there is a prevalence of guanines in the third position downstream (or equivalently cytosines in the third position upstream) of the core ERE motif in binders but not in the nonbinders. The functional significance of these newly discovered conserved bases were verified by SPR analysis of ER interaction with wild-type and mutant binding sites (Figure 3). These additional features were included in the h-ERE decision tree and probably contributed to improved model performance. Having both the binding and the nonbinding ERE sequences enabled us to assess the sensitivity and specificity of the h-ERE model as compared with the consensus sequence, TRANSFAC database ERE PWM [7], or the previously published Dragon ERE model [6]. Under the four stringency parameters tested, the h-ERE model exhibited the optimal combination of sensitivity and specificity, as measured using the harmonic means of these two factors, with 44-68% improvements over the other models.
A genome-wide scan for putative functional EREs using the h-ERE models yielded more than 38,000 predicted highprobability ER binding sites (group 3), which we have shown should represent the set of all high-likelihood ER-binding EREs. Experimental validation of randomly selected predicted sites indicated that 14% of the sites bound ER under the conditions tested, which agreed with the conservative estimate of an approximate 86% false discovery rate for ERElike sequences in the human genome. From the two approaches, we project there to be approximately 5000 functional ER-binding sites in the MCF-7 genome. That only one out of seven of the high-likelihood binding EREs are functionally used may be attributed to several possibilities. First is that flanking sequences more distal than where assessed in the present study may contribute to the selection of a functional ERE. For example, the nature of the chromatin around the ERE, the relative location of basal transcriptional complexes, and the density of adjacent binding of other transcription factors are candidate modulators of ER-binding site selection. Second, we only tested for ER binding using one standard condition and in a single breast tumor cell line. It is probably the case that certain tissue-specific and conditionspecific binding events are modulated by the presence or absence of ER co-regulators and epigenetic modifications.
The MCF-7 cell line is known to have high levels of ER and to over-express of AIB1 (amplified in breast cancer 1), which is a specific co-regulator of ER [17]. Moreover, cancer cell lines have accumulated many genetic rearrangements and point Table 3 Performance comparison of various prediction algorithms for ER binding using the independent dataset shown in Table 2 Prediction mutations in their passages, which would further confound the results by rendering good binding sites inactive.
In our strategy, the approximately 38,000 high-likelihood ER-binding sites were identified using a training set biased to the 5' cis-regulatory regions of genes. However, when we mapped these approximately 38,000 candidate sites to the genome, only 1821 (about 4.78%) resided within 5 kb upstream and 500 bp downstream of the TSS. The majority (about 36.5%) fell inside genes, about 21.4% were within 100 kb upstream of the TSSs, whereas about 21.3% were located up to 100 kb downstream of the 3' terminus. Approximately 20% were mapped to pure intergenic regions. These findings suggest that the standard mode of identifying transcription factor binding by concentrating on immediate cis-regulatory elements will be unrewarding. In addition, these data collectively question the assignment of physiologic functionality to an ERE site using only gel shift and transient transfection assays with the extracted element, because these in vitro approaches ignore many of the relevant physiologic conditions.
Previously, we found that many functional ERE binding sites around responsive genes are poorly conserved between human and mouse [8]. Moreover, both evolutionarily conserved and nonconserved ERE sites appeared to be equally functional for ER binding in ChIP assays; therefore, there appears to be little advantage in using evolutionary history to identify functional EREs. For this reason, we did not take ERE conservation across species into consideration, as was introduced by Jin and colleagues [18] in their recent report. Instead, we focused on the rules governing functional ER binding in the human genome.
Our observations raise the intriguing possibility that evolution of estrogen response relies on having a large pool of highquality candidate EREs widely scattered in the genome, some of which are potentially generated by transposable elements (about 9% of high-likelihood EREs were within Alu elements). With mutational drift and under evolutionary pressures, different binding sites around the same genes could be alternatively used and would not have detrimental effects on overall survival. If these alternative binding cassettes prove beneficial to the organism, then these secondary sites will undergo further positive mutations to enhance the ER interaction. Conservation of mechanisms and functions across species may be a reasonable assumption for highly conserved biologic processes. However, in the case of EREs and estrogen functions in development and physiology, phenotypic and experimental analysis suggest species-specific mechanisms and hormone responses, including binding site usage. Therefore, using conservation as a filter for function is likely to introduce a significant number of false-negative findings in ERE predictions. This view is further supported by two recent studies [19,20] that found that many functional transcription factor binding sites are not conserved in evolution but there is no apparent functional divergence of the cognate regulated genes. With the binding site database that we present here, such hypotheses can now be computationally examined with increased confidence.

Conclusion
The availability of larger experimentally validated binding site sets allows the construction of more robust binding site prediction algorithms. The proposed h-ERE algorithm employed genome-wide binding site data collected from various types of experiments. It outperformed other existing algorithms for predicting ER binding. That only 14% of the predicted optimal binding sites were utilized under the experimental conditions suggests that there are other selective criteria not related to ERE. Overall, although h-ERE is able to demarcate better the universe of ERE-like sequences that are potential ER binders, factors other than primary nucleotide sequence will ultimately determine binding site selection.

Identification of additional functional EREs
To enlarge the set of validated EREs, we employed a twopronged approach: ChIP-qPCR validation of putative ERE in the promoters of putative direct target genes; and ChIP-qPCR validation of putative ERE found in promoters identified from ChIP-chip experiment (GEO series ID: GSE5405).
For the first approach, we took the 89 putative direct target genes identified earlier in a gene expression microarray study [8], extracted their 3.5 kb extended promoter regions, and scanned the sequences for ERE-like motifs, allowing for up to two-base variation from the consensus ERE. Only those with specific PCR primers flanking the EREs were included in ChIP validations by qPCR. There were 49 EREs from 35 promoters hat met the above criteria. Of these, eight EREs from seven putative direct target genes were validated to bind ER and the remaining 41 EREs did not bind ER under the experimental conditions tested in this study.
In the second experiment, the ChIP-chip experiments, only promoters appearing among the top 5% of both replicate experiment were selected, amounting to 196 promoters (binomial P value = 1.42 × e -33 ). We further increased the stringency by requiring at least a 15% increase of the IP (immunoprecipitation) over the input control in two consecutive probes to further filter out potential noise in the system. This resulted in 111 promoters that met the selection criteria.
Out of the total 111 promoters, we performed ChIP and qPCR validation on 28 promoters that bore putative EREs and had either microarray data supporting their regulation by ER or had consistent binding across consecutive probes (more than four). Of these, 13 were validated to bind ER and 15 did not bind ER.
Altogether, the study validated 21 EREs to be bound by ER and 56 EREs not to be bound by ER. They are indicated by the words 'this study' or by the citation of reference 8, respectively, in the references (right-most) column of Table 1.

Chromatin immunoprecipitation assays
MCF7 cells were estrogen deprived for 24 hours and treated with 10 nmol/l of estradiol for 45 min prior to 1% formaldehyde treatment to crosslink the transcription machinery and the chromatin. Immunoprecipitations were carried out overnight with anti-ERα (HC-20) or irrelevant control antiglutathione S-transferase (GST) antibodies (Santa Cruz Biotechnology, Santa Cruz, CA, USA) and protein A-sepharose beads (Zymed, San Francisco, CA, USA). Washing and extraction protocols were modified from methods described previously [9], and PCR reactions were carried out in an ABI Prism 7900 sequence detection system (Applied Biosystems, Foster City, CA, USA). Forty cycles of PCR were carried out on precipitated DNA and control input DNA. Amplification products were also assayed for specificity by melting curve analysis at the end of each run. Relative quantifications were carried out by building standard curves for each primer set and using genomic DNA, similar to the input, as the template. Enrichment of ER binding was determined by comparing the relative quantities of anti-ER and control anti-GST products. Sites with more than twofold enrichment over control were considered to be bound by ER (or 'binders').

SPR analysis of ER-ERE binding
Biotinylated ERE strands (5'-end labeling) and the antistrands were annealed to form DNA duplexes. The DNA duplexes were then immobilized on the SPR disk (gold-coated glass) using biotin-streptavidin-biotin bridge chemistry. Protein was then applied to bind to the immobilized DNA. The end attachment of DNA ensured sufficient strand flexibility. For DNA immobilization, the gold disks were first cleaned in a ultraviolet/ozone chamber for 5 min, followed by immersing in hot piranha solution (a 3:1 mixture of H 2 SO 4 and H 2 O 2 ) for 2 min. After rinsing with de-ionized water and drying using nitrogen, the disks were immersed overnight in a binary biotin-containing thiol mixture (10% biotin-thiol and 90% ethylene glycol-thiol at a net concentration of 1 mmol/l in ethanol). After rinsing with ethanol followed by a drying step using nitrogen, the disks were ready for streptavidin (Sigma, St. Louis, MO, USA) immobilization (0.1 mg/ml in phosphate-buffered saline) and subsequent biotinylated DNA assembly (1 µmol/l in phosphate-buffered saline). ERα (58-708 nmol/l in 40 mmol/l HEPES-KOH binding buffer [pH 7.4], containing 10 mmol/l MgCl 2 , 200 mmol/l KCl, 0.2% Triton X-100, 2 mmol/l DTT) was then applied to bind to the immobilized DNA. After one cycle of protein binding (about 25 min), 0.1% sodium dodecyl sulfate solution was applied to disassociate the protein-DNA complex and to expose the immobilized DNA for new cycles of ER binding.
The SPR measurements were conducted using a double channel, AutoLab ESPR (Eco Chemie, Utrecht, The Netherlands). In a kinetic measurement mode, molecular adsorption on the gold disks was detected as SPR angle shifts (∆θ in mDeg) over time. The measured ∆θ was proportionally related to the amount of adsorbed material, with a mass sensitivity of 120 mDeg = 100 ng/cm 2 for protein and DNA. The AutoLab SPR equipment was equipped with a two-channel cuvette, with the sensor disk forming the base of the cuvette. Two DNA strands (50 µl) were then immobilized in different channels, allowing protein binding to two different DNA sequences to be monitored in parallel. The measurements were conducted at room temperature and the noise level was 0.2 mDeg. It is worth noting that the SPR experiment were done under varying concentrations of ERα, and the reported relative binding affinities of Figure 3 were averages obtained from the varied concentration of ERα. The same overall relative profiles were observed for the different mutants.

Probing for auxiliary signals around core ERE
To detect and identify whether additional ER-binding signals were flanking the core ERE site, an in silico experiment was devised. We considered whether the surrounding sequences of core ERE sites can be used to distinguish binders from nonbinders. A naïve Bayesian classification [12] approach was employed, with binary tuples <m,p> as the feature, where m is a motif that is k bp long and p is the location of motif m relative to the core. One can imagine constructing, for each sequence S, a binary matrix M, with m as the row index and p as the column index, and the value M m,p indicates whether motif m is present at position p. Such a matrix can be built by running a fixed window of size k over the sequence S and noting down the location of each motif. The class of sequence S, whether it is a binder (B) or nonbinder (NB), can be predicted through the equation C(S) below. In our set of experiments, k was varied from 1 to 5 bp. During the training, the probability distribution was constructed from the raw motif frequency counts with Laplacian smoothing of adding pseudocount L to the raw count.
A cross-validation like supervised classification experiment was performed upon a sequence set, grouped into two or more distinct classes, by randomly splitting the sequences into 4:1 training and test sets, training the classifier using the training set, and reporting the accuracy of the trained classifier over the test set. One hundred runs of such training/testing were carried out and the accuracy was averaged out. The figure in Additional data file 1 summarizes the outcomes under varied parameter settings.

Assessing the significance of flanking nucleotide positions
Outcomes of the previous experiments indicated that single nucleotides surrounding the core ERE motif might carry additional discriminating power between binding and nonbinding EREs. In particular, complementary distinguishing nucleotides appeared to be present at the third base pairs after the core ERE, under visual analysis using sequence logos (Additional data file 2). We designed and carried out a Monte Carlo experiment to quantify its significance. Employing information entropy, or roughly the degree of randomness, as the statistics for each nucleotide position, we asked whether the surrounding nucleotide positions of arbitrary ERE-like sites (up to two mismatches to the consensus) in the human genome contains more information (less random or lower entropy) than those observed surrounding the 45 binding EREs. The exact formula for information entropy is as follows: Where s is the set of nucleotides and f x is the frequency of nucleotide x in the set s. Let s n be the set of nucleotides found n bp from the core ERE motif. For each flanking nucleotide up to 5 bp upstream and downstream, denoted as n = '-5' to n = '+5', we took 45 random loci flanking the ERE-like sites in the genome and computed its entropy. This was done 1000 times and the fraction of times it was lower than the observed entropy for the corresponding position of the 45 binding EREs was reported as the estimated P value (Additional data file 3).

Optimizing the sequence set for model building
With the assumption that bindings most probably occur only on one of the strands and that nonbindings mean that none of the strands were bound, we opted to optimize the binder and nonbinder PWMs by minimizing the total information entropy of the binders while augmenting the nonbinder sequence set by taking both strands of the validated nonbinding loci. The total information entropy (TE) can be calculated as follows: Where F is a 4 × N matrix of relative frequency of each nucleotide at each position, which can be derived from the PWM of the aligned sites. The overall entropy of the binders was minimized through selectively reverse complementing some of the binders using a greedy hill-climbing approach, which resulted in 15 binder sequences being reverse complemented.

A balanced measure based on sensitivity and specificity
The trade off between achieving high sensitivity and high specificity for a prediction system is well appreciated. Here, we propose the use of a single measure to evaluate the balanced performance between specificity and sensitivity. One simple option is to calculate the arithmetic mean of specificity and sensitivity. Arithmetic means, however, might be misleading when rates are being averaged. Inspired by the usage of the Fmeasure [14] in the field of information retrieval, calculated as the harmonic mean of precision and recall, we opted for using harmonic mean to quantify the balance between sensitivity (sn) and specificity (sp), which can be easily calculated using the following equation:

Estimating the amount of false positives
A pertinent question any high-throughput in silico prediction scheme is the degree of false positivity. The number of falsely predicted binding sites, from among the 38,024 predicted sites in the human genome (about 3 Gbp), we devised a Monte Carlo simulation to estimate the distribution of false-discovery rate. Nucleotide sequences of 1 million base pairs long were generated by drawing random nucleotides from the same nucleotide distribution as the human genome (UCSC hg17). The sequences were then run through the h-ERE. Sites on the random sequences predicted to be binders by the h-ERE represent the false positives. As the single nucleotide based random sequence generation may not faithfully reflect all the inherent properties of the human genome, conservative estimation the false-positive rate was made, by reporting the 99th percentile. The above simulation was iterated 1000 times to approximate the rate of false positives per million base pairs. At the 99th percentile, the false positive rate was 11 false binders per million base pairs. This is roughly 33,000 false positive sites for the approximate 3 Gbp human genome, or about 86% of the total approximately 38,000 predicted binders.

Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a document summarizing the results of preliminary Naïve Bayesian analysis on the validated binding sites and their immediate surrounding sequences. Additional data file 2 is a document showing the sequence logos for the final binder and nonbinder sets. Additional data file 3 is a document tabulating the Monte Carlo P values for the information entropy significance of each base pair location immediately flanking the core ERE.
Additional data file 1 Accuracy of Naïve Bayesian classification using position specific motifs as features to classify sequences bound and not bound by ER The motif width (window size), size of the flanking regions, and the Laplacian pseudocount (L) were varied. Two sets of sequences were used: (A) with the core ERE intact, and (B) without the core ERE.
Overall, single nucleotides appeared to hold certain discriminative power for separating ER binding from ER non-binding sequences.