Computational analysis of core promoters in the Drosophila genome
© Ohler et al., licensee BioMed Central Ltd 2002
Received: 7 October 2002
Accepted: 27 November 2002
Published: 20 December 2002
The core promoter, a region of about 100 base-pairs flanking the transcription start site (TSS), serves as the recognition site for the basal transcription apparatus. Drosophila TSSs have generally been mapped by individual experiments; the low number of accurately mapped TSSs has limited analysis of promoter sequence motifs and the training of computational prediction tools.
We identified TSS candidates for about 2,000 Drosophila genes by aligning 5' expressed sequence tags (ESTs) from cap-trapped cDNA libraries to the genome, while applying stringent criteria concerning coverage and 5'-end distribution. Examination of the sequences flanking these TSSs revealed the presence of well-known core promoter motifs such as the TATA box, the initiator and the downstream promoter element (DPE). We also define, and assess the distribution of, several new motifs prevalent in core promoters, including what appears to be a variant DPE motif. Among the prevalent motifs is the DNA-replication-related element DRE, recently shown to be part of the recognition site for the TBP-related factor TRF2. Our TSS set was then used to retrain the computational promoter predictor McPromoter, allowing us to improve the recognition performance to over 50% sensitivity and 40% specificity. We compare these computational results to promoter prediction in vertebrates.
There are relatively few recognizable binding sites for previously known general transcription factors in Drosophila core promoters. However, we identified several new motifs enriched in promoter regions. We were also able to significantly improve the performance of computational TSS prediction in Drosophila.
Transcription initiation is one of the most important control points in regulating gene expression [1,2]. Recent observations have emphasized the importance of the core promoter, a region of about 100 base-pairs (bp) flanking the transcription start site (TSS), in regulating transcription [3,4]. The core promoter serves as the recognition site for the basal transcription apparatus, which comprises the multisubunit RNA polymerase II and several auxiliary factors. Core promoters show specificity both in their interactions with enhancers and with sets of general transcription factors that control distinct subsets of genes. Although there are no known DNA sequence motifs that are shared by all core promoters, a number of motifs have been identified that are present in a substantial fraction. The most familiar of these motifs is the TATA box, which has been reported to be part of 30-40% of core promoters .
Prediction and analysis of core promoters have been active areas of research in computational biology , with several recent publications on prediction of human promoters [7,8,9,10]. In contrast, prediction of invertebrate promoters has received much less attention and has focused almost exclusively on Drosophila. Reese  described the application of time-delay neural networks, and in our previous work  we used a combination of a generalized hidden Markov model for sequence features and Gaussian distributions for the predicted structural features of DNA. Structural features were also examined by Levitsky and Katokhin , but they did not present results for promoter prediction in genomic sequences.
As with computational methods for predicting the intron-exon structure of genes , the computational prediction of promoters has been greatly aided by cDNA sequence information. However, promoter prediction is complicated by the fact that most cDNA clones do not extend to the TSS. Recent advances in cDNA library construction methods that utilize the 5'-cap structure of mRNAs have allowed the generation of so-called 'cap-trapped' libraries with an increased percentage of full-length cDNAs [15,16]. Such libraries have been used to map TSSs in vertebrates by aligning the 5'-end sequences of individual cDNAs to genomic DNA [17,18]. However, it is estimated that even in the best libraries only 50-80% of cDNAs extend to the TSSs [16,19], making it unreliable to base conclusions on individual cDNA alignments.
We describe here a more cautious approach for identifying TSSs that requires the 5' ends of the alignments of multiple, independent cap-selected cDNAs to lie in close proximity. We then examine the regions flanking these putative TSSs, the putative core promoter regions, for conserved DNA sequence motifs. We also use this new set of putative TSSs to retrain and significantly improve our previously described probabilistic promoter prediction method. Finally, we report the results of promoter prediction on whole Drosophila melanogaster chromosomes, and discuss the different challenges of computational promoter recognition in invertebrate and vertebrate genomes.
Results and discussion
Selection of expressed sequence tag (EST) clusters to determine transcription start sites
Stapleton et al.  report the results of aligning 237,471 5' EST sequences, including 115,169 obtained from cap-trapped libraries, on the annotated Release 2 sequence of the D. melanogaster genome. They examined these alignments for alternative splice forms and grouped them into 16,744 clusters with consistent splice sites, overlapping 9,644 known protein-encoding genes. We applied the following set of criteria to select those 5' EST clusters most likely to identify TSSs. Clusters were required to either overlap a known protein-coding gene or have evidence of splicing. One of the three most 5' ESTs in the cluster had to be derived from a cap-trapped library. In some cases, disjoint clusters overlap the annotation of a single gene; here, we only considered the most 5' cluster. We required the distance to the next upstream cluster to be greater than 1 kb. This requirement, together with the selection of only the most 5' cluster, leads to the selection of only one start site per gene. By doing so, we minimize the erroneous inclusion of ESTs which are not full-length, but also exclude alternative start sites or start sites of genes with overlapping transcript. Because the 5' ends of ESTs derived from full-length cDNAs are expected to lie in a narrow window at the TSS, we required that the 5' ends of at least three ESTs fall within an 11-bp window of genomic sequence, and that the number of ESTs whose 5' ends fall within this window comprise at least 30% of the ESTs in the cluster. With a single EST we cannot be sure to have reached the true start site, even if it was generated by a method selecting for the cap site of the mRNA [17,19]; with a cluster of ESTs within a small range, we can be more confident that we have defined the actual TSS. By requiring selected clusters to have at least three ESTs we are, however, introducing a bias against genes with low expression levels. The requirement that 30% or more of the 5' ESTs in a cluster terminate within the 11-bp window was introduced because, for large EST clusters, a simple numerical requirement is insufficiently stringent.
Influence of the parameter values on the number of selected clusters
Number of clusters
Minimum distance to next upstream cluster (bp)
Window size (bp)
Percentage of 5' ends in window
We defined the start of the most 5' EST in each of the 1,941 clusters as the predicted TSS and refer to this as position +1 in the analyses reported below. We extracted the genomic sequences from 250 bp upstream to 50 bp downstream of each of these sites as a set of putative proximal promoter regions to compare with previous collections of promoters, to identify possible core promoter motifs, and to use as training set for computational promoter prediction. To study the motifs in core promoters with more sensitivity, we also carried out analyses on subsequences from -60 to +40.
Comparison with previous collections of core promoters
Two small collections of curated Drosophila TSSs have been assembled previously on the basis of information carefully extracted from the literature. The Drosophila Promoter Database (DPD) was the set of 247 TSSs used to train earlier computational promoter finding systems such as NNPP  and McPromoter . This DPD was assembled by combining Drosophila promoters in the Eukaryotic Promoter Database release 63 , and a set of promoters extracted according to similar criteria . The second set was the Drosophila Core Promoter Database (CPD ) with 205 start sites.
To assess the quality of our inferred TSSs, we aligned the 1,941 300 bp sequences against sequences flanking the TSSs in the DPD and CPD using BLAST . The derivation of our TSS set, which corresponds to just over 14% of all Drosophila genes, did not depend on the scientific literature and thus we expect it to be largely non-overlapping with the DPD and CPD sets. It was therefore not surprising that of the 247 core promoter regions in the DPD, only 44 (18%) could be aligned to those in our set. The positions of the TSSs in 28 of these alignments differed by less than 10 bp and are considered identical for our purposes; in five cases, the DPD entries lie more than 10 bp upstream, and in 11 cases, a newly derived putative TSS was more than 10 bp 5' of the corresponding TSS in the DPD. Of the 205 core promoter regions in the CPD, 32 sites (16%) belonging to 30 genes could be aligned successfully; in 21 out of 30 cases, the difference was again smaller than 10 bp, in six cases, a CPD entry was more 5', in three cases a newly derived TSS. This simple assessment suggests that our new set of putative TSSs is of similar accuracy to the DPD and CPD. However, our set is eight times larger, containing the predicted TSS for one in seven Drosophila genes.
Identification of over-represented sequence motifs in core promoters
Core promoters are known to contain binding sites for proteins important for transcription initiation, and our first analysis of the sequence content of our set of 1,941 core promoters was to assess the representation of two well-established core promoter sequence motifs, the initiator (Inr) and the TATA box. We used the CPD consensus strings for the Drosophila Inr and TATA box, TCA(G/T)T(C/T) and TATAAA, respectively , permitting up to one mismatch. A search with these consensus strings in random sequences of equal length would result in a frequency of 29.3% for the initiator and 11.6% for the TATA box. Of the CPD promoters, 67.3% have a match to the Inr consensus in the region from -10 to +10 and 42.4% have a TATA box in the region from -45 to -15. We observed that 62.8% of our core promoters had a match to the Inr consensus in the -10 to +10 interval, an almost identical fraction as observed for the CPD. However, we observed a frequency for the TATA box consensus of 28.3%, only about two-thirds of the frequency observed in the CPD; extending the region over which we allowed matches to -60 to -15 only increased the frequency to 33.9%.
The ten most significant motifs in the core promoter sequences from -60 to +40, as identified by the MEME algorithm
From the location distribution it is apparent that motif 1 is preferentially found close to the TSS, although not as tightly localized as the Inr motif. Motif 2, which shows a broad spatial distribution within the core promoter regions, corresponds to the target of the DNA replication-related-element binding factor (DREF). At the same time our study was being carried out, DREF was interestingly found to be part of a complex with Drosophila TBP-related factor (TRF) 2 . TRF2 replaces the TATA-box-binding element TBP in a distinct subset of promoters, and our data suggest that it is used in a larger fraction of promoters than previously thought.
Because different algorithms for detecting overrepresented motifs have different properties (see Materials and methods), we compared the motifs identified using a Gibbs sampling algorithm [27,28] with those identified by MEME. Gibbs sampling is non-deterministic and generally delivers different results each time it is run; therefore, the motifs that are most significant are those that are reported multiple times with high likelihood scores. We performed 100 iterations of the algorithm; each iteration was stopped after a maximum of ten motifs were reported. We sorted the resulting 986 motifs by their log-likelihood score (see ), which ranged from negative infinity (the logarithm of zero) to 1,353. Among the 354 motifs with scores higher than 700, we observed no motifs that were unrelated to the ten most significant motifs found by MEME (Table 2). Instead, these 354 motifs were comprised of several variants each of motifs 1-6 and 8 of Table 2; the best variant of motif 1 scored 1,353 and the best variant of the TATA box scored 742. We still consider it likely that the three motifs in Table 2 that were not also identified by Gibbs sampling are biologically meaningful. For example, motif 9, is similar in both sequence and positional restriction to the well-established DPE motif.
Frequency of occurrence of pairs of the ten most significant motifs in the same core promoter
Percent of promoters with motif
Percent of promoters with each motif that also containing the indicated second motif
To determine whether any of the motifs were enriched in promoters for genes associated with a specific function, process or cellular component, we retrieved the gene ontology (GO) terms  associated with the group of genes whose core promoter contained a particular motif. Even though some differences can be seen, it is too early to say whether any are biologically significant. DREF/TRF2 promoters were reported to control genes involved in DNA replication and cell proliferation, but we see no such clear restriction of promoters containing DREs. The most frequent GO terms associated with each motif are given in Supplementary Table 3 of the additional data files.
The most significant motifs in the extended promoter sequences spanning from -250 to +50, as identified by the MEME algorithm
Number of sequences
Using the core promoter set to retrain the McPromoter TSS prediction tool
McPromoter is a probabilistic promoter prediction system that identifies likely TSSs in large genomic sequences [12,32]. To a certain extent, the performance of such probabilistic systems can be improved by increasing the size of the training set used to estimate the system parameters. The dataset we originally used to train McPromoter consisted of only 247 promoter, 240 non-coding and 711 coding sequences. We took advantage of our new large TSS set to retrain McPromoter (see Materials and methods). We used a slightly smaller set of 1,841 promoter sequences that eliminated instances of related promoters in the 1,941 promoter set, along with a newly extracted representative set of non-promoter sequences taken from Drosophila genes (2,635 coding and 1,755 non-coding sequences). The tests we describe below document a markedly improved performance. The largest part of the improvement was due to the increase in the size of the promoter set. Cross-validation experiments in which only a subset of promoters was used suggest that additional sequences similar to those in the current set of 1,841 would not further improve the results significantly (data not shown). However, it is possible that assembling a more representative set of promoters that includes promoters from genes expressed at low levels would improve performance. It might also be useful to make training sets consisting of subsets of promoters that share combinations of motifs and use a collection of McPromoter variants each trained on a different class of core promoter.
Evaluating the performance of the retrained McPromoter
Analysis of the 2.9 Mb Adhregion
McPromoter results on the Adh test set
By comparison with another predictor for Drosophila promoters, NNPP , the retrained McPromoter system has 10-15 times fewer false positives, but some of this improvement simply results from the larger training set and not from differences in the underlying algorithm. When McPromoter was trained on the same, smaller, dataset as NNPP, the reduction in false positives was only about three- to sixfold (data not shown; for details see ). The McPromoter system output ranges from 0 to 1, and Table 5 gives the results for a number of thresholds between 0.98 and 0.8; additional predictions below 0.7 are no longer distinguishable from randomly spaced predictions (see  for discussion). Consider, for example, the predictions obtained at a threshold of 0.9; 21 out of 48 true positives were located within +/-50 bases of the annotated 5' end, and the average distance of all 48 true positives was 109 bases. As the real TSS is likely to be further upstream than the annotated 5' end, we believe this result indicates that McPromoter performs well in predicting the precise location of a TSS.
The complete genome: a case study of chromosome arm 2R
We therefore counted a hit as positive if it fell within -1,000/+100 of the annotated 5' end, a region twice as large as for the Adh set. Because of the large number of genes with more than one annotated 5' end, which often lie closely together, we did not evaluate success on the level of individual TSSs, but rather on the level of genes; that is, if at least one prediction falls within the positive region of any of its annotated start sites, it is considered as positive hit. At threshold 0.8, 1,232 TSSs referring to 1,176 (55.2%) genes are correctly identified, with one additional prediction every 5,663 bases across the whole chromosome. This is comparable to the performance on the Adh region (see Table 5). We did not pre-filter for low-complexity regions, and 101 genes were left out of the test set because of missing UTR information, so the real false-positive rate is most likely to be lower.
Although it is desirable for a promoter recognition system to be as good as possible when used with no additional information on naked genomic sequence, tools are not used in isolation in a genome annotation project. For example, we used information about the position of genes and considered only the first hit upstream of the translation start codon of each gene, but less than 5 kb upstream of the annotated 5' UTR end. We therefore make at most one prediction per gene - zero if there is no hit above the predefined threshold within the scanned region. Leaving the threshold at 0.8, McPromoter predicts TSSs for 1,017 genes (47.8%) within the -1,000/+100 region, makes no prediction for 392 genes (18.4%), and delivers predictions outside the -1,000/+100 region with respect to the annotated TSSs in 721 cases (33.8%).
As more experimental data from mapping TSSs are obtained, it will be interesting to see if the percentage of successful prediction increases. One particularly promising approach will be to use the large amount of additional information we have from 5' ESTs. As described above, we applied very strict criteria for identifying the 1,941 TSSs used in our training set. We have, however, at least one 5' EST for an additional 8,000 genes . Looking for coincidences between the TSSs predicted by McPromoter and the genomic positions of the 5' ends of these ESTs is likely to be a powerful approach. As an initial test of this idea, we ran McPromoter with a very low threshold of 0.75 on the whole genome, and retained all the hits as long as they were more than 100 bp apart. As above, no filtering for repeats and low-complexity regions was carried out. Because of the very low threshold, we generated an average of one predicted TSS every 3,000 nucleotides in the genome. We found that 11,160 - approximately one in seven - of these predictions is on the appropriate strand and within 500 bp of an EST 5' end. If these predictions are valid, we would expect them to be closer than 500 bp to an EST end. Indeed, we found that 56% of the 11,160 predictions in fact lie within 50 bp of an EST, much higher than would be expected by chance, and in accordance with the results in the much smaller Adh region set. Even when we require co-localization within 10 bp, we still retain 25% of the predictions originally selected to be within 500 bp. The next step is to optimize both the threshold we use for predicting the TSSs and the window size we use for assessing their correlation with 5' EST ends. In the end, we will still need to conduct experiments such as primer extension to distinguish real from false-positive predictions.
A comparison with vertebrate promoter finding
Results of McPromoter on human data
The good results obtained on human data with even a simple sequence model are apparently largely due to the strong correlation of vertebrate promoters with CpG islands. 60% of the 5' ends of known genes on chromosome 22 are located within CpG islands, regions in the genome that are not depleted of CG dinucleotides and are associated with an open chromatin structure . At 64% sensitivity, 82% of the true predictions by McPromoter are located within CpG islands, and the correlation gets stronger as the specificity increases. This was also reported for other, more recently developed promoter-finding systems [9,10]. Therefore, promoter-finding methods based on sequence information successfully identify almost the exact same subset of promoters, owing to the high correlation with the subset of promoters located within CpG islands . Vertebrate promoter recognition thus appears to be reaching its limit when the models use the proximal promoter sequence as the only information source.
In the only system so far with a significantly better performance and smaller correlation with CpG islands, promoter recognition is guided by a simultaneous recognition of first exons . When we used the version of McPromoter that includes analysis of the physical properties of the genomic DNA, as we do with Drosophila, the true positives are much less correlated with CpG islands (61%), and therefore constitute a broader subset of vertebrate promoters. Unfortunately, the false-positive rate increased roughly fivefold, and the predictions tend to be located farther away from the 5' end of genes . The promoters of vertebrate and invertebrate organisms differ in that invertebrate genomes do not contain CpG islands, a feature of more than half of vertebrate genes . This makes computational recognition of invertebrate promoters, and those vertebrate promoters not found in CpG islands, more difficult.
In this paper, we present a strategy for annotating core promoters in the complete Drosophila genome by a two-step process of 5' EST cluster selection and computational prediction. With the help of a larger training set, we were able to significantly improve the performance of McPromoter, our computational TSS prediction tool. Probably for the first time in invertebrate promoter prediction, the results are sensitive and specific enough to guide verification by subsequent wet lab experiments such as primer extension.
A first analysis of motifs prevalent in core promoters revealed that less than one-third of Drosophila promoters have a consensus TATA box. In contrast, the DRE motif that is part of the recognition site for an alternative transcription initiation complex that utilizes TRF2  is more frequent than previously thought. One surprising result of our work is that there are relatively few recognizable binding sites for known general transcription factors in Drosophila core promoters. Our analysis did, however, reveal previously undescribed or underappreciated motifs, an encouraging sign that there are distinct features in Drosophila core promoter regions.
McPromoter is accessible at . The training sets are available from the Drosophila Genome Project website . McPromoter predictions are part of the analysis results in the Genome Annotation Database Gadfly .
Materials and methods
Alignment and clustering of ESTs on the complete Drosophila sequence are described in detail in .
We used the MEME system, version 3.0 [24,25] to look for core promoter motifs using a motif width range of 5-15 bp and zero or one occurrence per sequence. That is, MEME decides during the motif identification whether or not to include a sequence; the numbers of sequences that MEME used for the alignments are given in Tables 2 and 4. We observed that MEME is rather restrictive in deciding whether to use a particular sequence or not; thus, the number of sequences in any given motif alignment is rather small, and searching for binding sites as described below resulted in larger numbers of sequences with a particular motif. We used the standard background model of MEME, which corresponds to the single-nucleotide frequencies in the set. For the analysis of the extended sequences from -250 to +50, we used a third-order Markov chain as background model, which should prevent the algorithm from reporting ubiquitous short repeat motifs. The background parameters were estimated on the -250 to +50 sequences.
We also applied the Gibbs sampling algorithm for motif identification . Our data set was too large to be submitted to the available web-based sites. We therefore used an implementation adapted to DNA sequence analysis that we could install locally  to analyze sequences from -60 to +40 in the same set of promoters used with MEME. As with other implementations of Gibbs sampling, this implementation  does not automatically estimate the width of motifs, which we fixed at 10 bases. Also, there is no equivalent of the 'zero or one occurrence per sequence' statistics of MEME; instead, we used a prior probability of 0.075 for motif occurrence. This property makes the identification of rarely occurring motifs intrinsically harder.
Searching for binding sites
To look for binding sites of the motifs, we used the weight matrices as computed by MEME with the program Patser v3d . As the searched regions are rather small, we set a rather low threshold corresponding to a p-value of 1e-3; that is, once every 1,000 nucleotides in a random sequence we can expect to see a score equal or better than the threshold, assuming that the sequence has the same base composition as our core promoter set.
To retrain the McPromoter system, we used the newly determined TSSs, and extracted 250 bp upstream and 50 bp downstream of each TSS as training sequences. We constructed a representative set starting from these 1,941 sequences similar to the guidelines of the Eukaryotic Promoter Database : Sequences are only included if they share less than 50% identity in the core promoter region (in our case, -60 to +40) to each of the promoters in the growing set. This resulted in a data set of 1,864 sequences. We aligned this set to the Adh region that we use as test sequence, and 23 promoters within the Adh region were removed from that set.
For non-promoter sequences, we started with a representative set of genes compiled in 1999 that comprised 416 single-and multi-exon genes . In addition, we used the genes contained in the Drosophila Gene Collection as of 31 December, 2001. In a first step, we eliminated genes that showed evidence of alternative splicing. The remaining set was aligned to the SWISS-PROT database  using TBLASTX , and we retained all cDNA entries where the protein completely aligned with the cDNA, and the similarity was above 90% for Drosophila proteins and 40% for non-Drosophila entries. 5' and 3' UTRs were truncated, and the remaining sequences were aligned to the genomic sequence using Sim4 , to map the exon-intron structure. This resulted in a set of 264 entries with hits to Drosophila SWISS-PROT entries, and 29 against proteins of other organisms. These sequences were then added one by one to the existing initial set of 416 genes, provided that the similarity of the coding sequence was less than 80% to genes already in the set. A total of 229 sequences were added in this way, resulting in a final set of 645 single- and multi-exon genes. Exons and introns of these genes were then divided into 300-bp long non-overlapping stretches to generate 2,635 coding and 1,755 intronic sequences of 300 bp each.
Computational promoter prediction
The McPromoter system consists of a generalized hidden Markov model of a promoter sequence, with six interpolated Markov chain submodels representing different segments of the promoter sequence from -250 to +50: upstream 1 and 2, TATA box, spacer, initiator and downstream. It is augmented by a set of Gaussian densities describing profiles of DNA physicochemical properties  in the six segments. The likelihoods of sequence and profile segments are finally combined in a neural network classifier, which is trained to output values of 1 for promoter and 0 for non-promoter sequences. Cross-validation tests of 14 different properties such as DNA bendability and nucleosome positioning (compiled in ) showed that stacking energy delivered the best classification results for both human and Drosophila sequences when combined with the sequence probabilities. On our previous small Drosophila dataset, a combination of all features by principal-component analysis (PCA, see for example ) led to an improvement in classification; however, on the current larger dataset, no additional improvement is observed when using PCA.
We carried out five cross-validation experiments on the new dataset, setting aside one-fifth as a test set and taking the remaining four-fifths for training. The best average cross-correlation value (CC) of promoter versus non-promoter classification was 0.79, with an equal recognition rate of 89.9% and a receiver operating characteristics integral of 9,575.
To localize promoters in genomic sequences, an input window of 300 bp is shifted along both sides of the sequence, and the neural network score as well as the position of the initiator segment is stored for each window. After smoothing the score graph along the sequence, the best local maximum within a certain range is reported if it exceeds a preset threshold on the score. A detailed description can be found in [12,34].
For classification into promoter/non-promoter, we used the equal recognition rate (ERR) and the integral over the receiver operating characteristics (ROC) as measures. A ROC curve gives the recognition rate (true positives) for preselected values of false positives, which was in our case 0 to 100% in 1% steps. Then, the trapezoid rule is used to compute the integral numerically. The highest achievable value is 10,000 (100 × 100, the perfect recognition for all rates of false positives); a random classification results in a value of 5,000. ERR gives the recognition performance at the point where the rate of true positives equals the rate of true negatives; the ROC integral judges the performance in a more global manner. We also give the correlation coefficient (CC), defined as:
where TP stands for true positives, TN for true negatives, FP for false positives, and FN for false negatives; these numbers denote the absolute numbers of correctly and incorrectly classified sequences. CC values should be considered with caution, as they depend on the relative size of the dataset for each class and can therefore generally not be compared across different datasets.
Additional data files
Additional data files containing the following supplementary tables are available with the online version of this paper: Supplementary table 1, showing alignment positions of the most 5' ESTs of the 1,941 selected EST clusters, for Release 2 and the realignment to Release 3; Supplementary table 2, listing the weight matrixes for the 10 motifs shown in Table 2; Supplementary table 3, listing the most frequent GO terms associated with each of the 10 motifs shown in Table 2; and Supplementary table 4 with the raw data for the positional distribution of motif hits (Figure 1).
U.O. thanks the Berkeley Drosophila Genome Project informatics group, especially Chris Mungall and ShengQiang Shu for help with the Sim4 alignments and gene sets, the FlyBase curators for sneak peeks at the chromosomes as they were annotated, and many others for helpful discussions. A big thank-you also to Chris Burge and his lab at MIT for support, and for use of the pictogram web server . Georg Stemmer at the University of Erlangen provided the programs for principal-component analysis. We thank Jim Kadonaga and former and current members of his lab at University of California San Diego for detailed and helpful comments on the core promoter analysis and an early draft of this paper. We also thank Robert Tjian and his lab at University of California Berkeley for pointing out the DRE motif and their work on DREF/TRF2 before publication, as well as Audrey Huang, Suzanna Lewis and Mike Eisen for comments on the manuscript. G.M.R. is supported by the Howard Hughes Medical Institute. This work was supported by NIH grant HG00750 to G.M.R.
- Latchman DS: Gene Regulation - A Eukaryotic Perspective. 2002, Cheltenham, UK: Nelson Thornes, 4Google Scholar
- Lemon B, Tjian R: Orchestrated response: a symphony of transcription factors for gene control. Genes Dev. 2000, 14: 2551-2569. 10.1101/gad.831000.PubMedView ArticleGoogle Scholar
- Smale ST: Core promoters: active contributors to combinatorial gene expression. Genes Dev. 2001, 15: 2503-2508. 10.1101/gad.937701.PubMedView ArticleGoogle Scholar
- Butler JEF, Kadonaga JT: The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 2002, 16: 2583-2592. 10.1101/gad.1026202.PubMedView ArticleGoogle Scholar
- Kutach AK, Kadonaga JT: The downstream promoter element DPE appears to be as widely used as the TATA box in Drosophila core promoters. Mol Cell Biol. 2000, 20: 4754-4764. 10.1128/MCB.20.13.4754-4764.2000.PubMedPubMed CentralView ArticleGoogle Scholar
- Fickett JW, Hatzigeorgiou AG: Eukaryotic promoter recognition. Genome Res. 1997, 7: 861-878.PubMedGoogle Scholar
- Davuluri RV, Grosse I, Zhang MQ: Computational identification of promoters and first exons in the human genome. Nat Genet. 2001, 29: 412-417. 10.1038/ng780.PubMedView ArticleGoogle Scholar
- Down TA, Hubbard TJ: Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res. 2002, 12: 458-461. 10.1101/gr.216102.PubMedPubMed CentralView ArticleGoogle Scholar
- Hannenhalli S, Levy S: Promoter prediction in the human genome. Bioinformatics. 2001, 17: S90-S96.PubMedView ArticleGoogle Scholar
- Scherf M, Klingenhoff A, Frech K, Quandt K, Schneider R, Grote K, Frisch M, Gailus-Durner V, Seidel A, Brack-Werner R, et al: First pass annotation of promoters on human chromosome 22. Genome Res. 2001, 11: 333-340. 10.1101/gr.154601.PubMedPubMed CentralView ArticleGoogle Scholar
- Reese MG: Application of a time-delay neural network to the annotation of the Drosophila melanogaster genome. Comput Chem. 2001, 26: 51-56. 10.1016/S0097-8485(01)00099-7.PubMedView ArticleGoogle Scholar
- Ohler U, Niemann H, Liao GC, Rubin GM: Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics. 2001, 17: S199-S206.PubMedView ArticleGoogle Scholar
- Levitsky VG, Katokhin AV: Computational analysis and recognition of Drosophila melanogaster gene promoters. Mol Biol. 2001, 35: 826-832. 10.1023/A:1013278000196.View ArticleGoogle Scholar
- Haas BJ, Volfovsky N, Town CD, Troukhan M, Alexandrov N, Feldmann KA, Flavell RB, White O, Salzberg SL: Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol. 2002, 3 (6): research0029.1-0029.12. 10.1186/gb-2002-3-6-research0029.View ArticleGoogle Scholar
- Carninci P, Shibata Y, Hayatsu N, Sugahara Y, Shibata K, Itoh M, Konno H, Okazaki Y, Muramatsu M, Hayashizaki Y: Normalization and subtraction of cap-trapper-selected cDNAs to prepare full-length cDNA libraries for rapid discovery of new genes. Genome Res. 2000, 10: 1617-1630. 10.1101/gr.145100.PubMedPubMed CentralView ArticleGoogle Scholar
- Suzuki Y, Yoshitomo-Nakagawa K, Maruyama K, Suyama A, Sugano S: Construction and characterization of a full length-enriched and a 5'-end-enriched cDNA library. Gene. 1997, 200: 149-156. 10.1016/S0378-1119(97)00411-3.PubMedView ArticleGoogle Scholar
- Suzuki Y, Taira H, Tsunoda T, Mizushima-Sugano J, Sese J, Hata H, Ota T, Isogai T, Tanaka T, Morishita S, et al: Diverse transcriptional initiation revealed by fine, large-scale mapping of mRNA start sites. EMBO Rep. 2001, 2: 388-393.PubMedPubMed CentralView ArticleGoogle Scholar
- Suzuki Y, Tsunoda T, Sese J, Taira H, Mizushima-Sugano J, Hata H, Ota T, Isogai T, Tanaka T, Nakamura Y, et al: Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res. 2001, 11: 677-684. 10.1101/gr.GR-1640R.PubMedPubMed CentralView ArticleGoogle Scholar
- Sugahara Y, Carninci P, Itoh M, Shibata K, Konno H, Endo T, Muramatsu M, Hayashizaki Y: Comparative evaluation of 5'-end-sequence quality of clones in CAP trapper and other full-length-cDNA libraries. Gene. 2001, 263: 93-102. 10.1016/S0378-1119(00)00557-6.PubMedView ArticleGoogle Scholar
- Stapleton M, Liao GC, Brokstein P, Hong L, Carninci P, Shiraki T, Hayashizaki Y, Champe M, Pacleb J, Wan K, et al: The Drosophila Gene Collection: identification of putative full-length cDNAs for 70% of D. melanogaster genes. Genome Res. 2002, 12: 1294-1300. 10.1101/gr.269102.PubMedPubMed CentralView ArticleGoogle Scholar
- Cavin Perier R, Praz V, Junier T, Bonnard C, Bucher P: The Eukaryotic Promoter Database (EPD). Nucleic Acids Res. 2000, 28: 302-303. 10.1093/nar/28.1.302.View ArticleGoogle Scholar
- Arkhipova I: Promoter elements in Drosophila melanogaster revealed by sequence analysis. Genetics. 1995, 139: 1359-1369.PubMedPubMed CentralGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1006/jmbi.1990.9999.PubMedView ArticleGoogle Scholar
- Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol. 1995, 3: 21-29.PubMedGoogle Scholar
- The MEME/MAST system. [http://meme.sdsc.edu]
- Hochheimer A, Zhou S, Zheng S, Holmes MC, Tjian R: TRF2 associates with DREF and directs promoter-selective gene expression in Drosophila. Nature. 2002, 420: 439-444. 10.1038/nature01167.PubMedView ArticleGoogle Scholar
- Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993, 262: 208-214. 10.1126/science.8211139.PubMedView ArticleGoogle Scholar
- Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouze P, Moreau Y: A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. J Comput Biol. 2002, 9: 447-464. 10.1089/10665270252935566.PubMedView ArticleGoogle Scholar
- Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999, 15: 563-577. 10.1093/bioinformatics/15.7.563.PubMedView ArticleGoogle Scholar
- The Gene Ontology Consortium: Creating the gene ontology resource: design and implementation. Genome Res. 2001, 11: 1425-1433. 10.1101/gr.180801.View ArticleGoogle Scholar
- Katsani KR, Hajibagheri MA, Verrijzer CP: Co-operative DNA binding by GAGA transcription factor requires the conserved BTB/POZ domain and reorganizes promoter topology. EMBO J. 1999, 18: 698-708. 10.1093/emboj/18.3.698.PubMedPubMed CentralView ArticleGoogle Scholar
- McPromoter MM:II. [http://genes.mit.edu/McPromoter.html]
- Reese MG, Hartzell G, Harris NL, Ohler U, Abril JF, Lewis SE: Genome annotation assessment in Drosophila melanogaster. Genome Res. 2000, 10: 483-501. 10.1101/gr.10.4.483.PubMedPubMed CentralView ArticleGoogle Scholar
- Ohler U: Computational Promoter Recognition in Eukaryotic Genomic DNA. 2001, PhD thesis, University of Erlangen-NurembergGoogle Scholar
- Misra S, Crosby MA, Mungall CJ, Matthews BB, Campbell KS, Hradecky P, Huang Y, Kaminker JS, Millburn GH, Prochnik SE, et al: Annotation of the Drosophila melanogaster euchromatic genome: a systematic review. Genome Biol. 2002, 3: research0083.1-0083.22. 10.1186/gb-2002-3-12-research0083.View ArticleGoogle Scholar
- Dunham I, Shimizu N, Roe BA, Chissoe S, Hunt AR, Collins JE, Bruskiewich R, Beare DM, Clamp M, Smink LJ, et al: The DNA sequence of human chromosome 22. Nature. 1999, 402: 489-495. 10.1038/990031.PubMedView ArticleGoogle Scholar
- Ohler U, Harbeck S, Niemann H, Noth E, Reese MG: Interpolated Markov chains for eukaryotic promoter recognition. Bioinformatics. 1999, 15: 362-369. 10.1093/bioinformatics/15.5.362.PubMedView ArticleGoogle Scholar
- Bird A: DNA methylation patterns and epigenetic memory. Genes Dev. 2002, 16: 6-21. 10.1101/gad.947102.PubMedView ArticleGoogle Scholar
- Ponger L, Duret L, Mouchiroud D: Determinants of CpG islands: Expression in early embryo and isochore structure. Genome Res. 2001, 11: 1854-1860.PubMedPubMed CentralGoogle Scholar
- Representative benchmark data sets of D. melanogaster DNA sequences. [http://www.fruitfly.org/sequence/drosophila-datasets.html]
- GadFly: genome annotation database of Drosophila. [http://www.fruitfly.org/annot]
- Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28: 45-48. 10.1093/nar/28.1.45.PubMedPubMed CentralView ArticleGoogle Scholar
- Gish W, States D: Identification of protein encoding regions by database similarity search. Nature Genet. 1993, 3: 266-272. 10.1038/ng0393-266.PubMedView ArticleGoogle Scholar
- Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic sequence. Genome Res. 1998, 8: 967-974.PubMedPubMed CentralGoogle Scholar
- Pedersen AG, Baldi P, Chauvin Y, Brunak S: DNA structure in human RNA polymerase II promoters. J Mol Biol. 1998, 281: 663-673. 10.1006/jmbi.1998.1972.PubMedView ArticleGoogle Scholar
- Liao GC, Rehm EJ, Rubin GM: Insertion site preferences of the P transposable element in Drosophila melanogaster. Proc Natl Acad Sci USA. 2000, 97: 3347-3351. 10.1073/pnas.050017397.PubMedPubMed CentralView ArticleGoogle Scholar
- Duda RO, Hart PE, Stork DG: Pattern Classification. 2000, New York: John Wiley, 2Google Scholar
- Pictogram. [http://genes.mit.edu/pictogram.html]