Genome-wide promoter extraction and analysis in human, mouse, and rat
© Xuan et al.; licensee BioMed Central Ltd. 2005
Received: 29 March 2005
Accepted: 11 July 2005
Published: 1 August 2005
Large-scale and high-throughput genomics research needs reliable and comprehensive genome-wide promoter annotation resources. We have conducted a systematic investigation on how to improve mammalian promoter prediction by incorporating both transcript and conservation information. This enabled us to build a better multispecies promoter annotation pipeline and hence to create CSHLmpd (Cold Spring Harbor Laboratory Mammalian Promoter Database) for the biomedical research community, which can act as a starting reference system for more refined functional annotations.
Number of genes and transcripts of different types in the three mammalian genomes
We have integrated sequence conservation with our promoter prediction program FirstEF to improve the accuracy of prediction. FirstEF was developed as an ab initio human first-exon prediction program, which is capable of predicting noncoding first exons together with the corresponding promoters. It has been used in conjunction with mRNA/expressed sequence tag (EST) transcript information to produce an initial human promoter annotation pipeline (R. Davuluri and I. Grosse, personal communication) because gene transcripts and models can be used to identify promoters with high confidence. At the same time, TWINSCAN and other studies have shown that integrating genomic homology information can increase gene-prediction accuracy by about 10% compared with the use of ab initio methods alone, and conserved features in promoters have also been used to improve promoter identification in a small dataset. Here, we set out to test whether, and to what degree, integrating homology information from the mouse and rat genomes can help to further improve human promoter prediction. We found that homologous sequence comparison can substantially increase the prediction accuracy. This enabled us to build an improved multispecies promoter annotation pipeline by extracting known and predicted promoters, and to create a comprehensive mammalian promoter database (CSHLmpd) with on-the-fly analysis tools as a public resource to facilitate future studies of mammalian gene-regulatory networks. As a convenient operational definition, 'promoter' in this paper refers to the genomic region (-700, +300) bp with respect to the transcription start site (TSS).
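The operational definition can be illustrated with a short Python sketch that converts a TSS position and strand into the coordinates of the 1 kb promoter window. The function name and the coordinate convention (1-based inclusive positions, the TSS counted as +1, no position 0) are assumptions made for illustration and are not taken from the pipeline itself.

```python
def promoter_window(tss, strand, upstream=700, downstream=300):
    """Return (start, end) of the promoter region in 1-based inclusive coordinates.

    Assumed convention: the TSS is position +1 and there is no position 0,
    so (-700, +300) covers exactly 700 + 300 = 1,000 bp.
    """
    if strand == "+":
        start = tss - upstream
        end = tss + downstream - 1
    elif strand == "-":
        # On the minus strand, "upstream" means larger genome coordinates.
        start = tss - downstream + 1
        end = tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip at the chromosome start; clipping at the chromosome end is omitted here.
    return max(1, start), end


plus = promoter_window(10000, "+")
minus = promoter_window(10000, "-")
```

Both windows in the usage lines span 1,000 bp, with the longer arm pointing away from the direction of transcription.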
We used orthologous genes to detect sequence conservation in promoter regions. To do this, we first identified all genic regions in the genomes on the basis of known and predicted transcripts, then collected all known promoters from present promoter annotations in the public databases and all predicted promoters produced by the original FirstEF. These promoters were then linked to downstream genes (see below). We took known promoters from the human-rodent orthologous genes and observed significant conservation in promoter sequences. We then used this conservation signal to improve de novo promoter prediction, and in the end constructed a reference promoter database for each of the three mammalian genomes.
Human, mouse and rat genes and orthologous gene sets
By aligning all known and predicted transcripts to the latest human, mouse and rat genomes we obtained 34,949, 35,073 and 30,679 genes, respectively (see Materials and methods), which include 29,360, 25,571 and 22,643 canonical genes (based on RefSeq mRNA and Ensembl prediction). The orthologous relationship of these canonical genes is defined using EnsMart, which is based on similarity analysis of Ensembl transcripts and genes. We obtained 19,179 human-mouse-rat three-species orthologous gene triplets, and 1,967, 1,420 and 2,268 human-mouse, human-rat and mouse-rat two-species orthologous gene pairs, respectively. Promoter conservation was studied in these orthologous genes.
Known promoter collection and promoter prediction in human, mouse and rat genomes
For each species we collected known promoters from EPD and DBTSS. We also collected known promoters from GenBank by keyword search (see Materials and methods), and the promoter regions identified by luciferase assay and ChIP of TAF250 and RNA polymerase II in the Encyclopedia of DNA Elements (ENCODE) regions. These known promoter sequences were aligned with the genome by BLAT to obtain the locations of TSSs. The total numbers of unique known TSSs in human, mouse and rat are 14,314, 8,141 and 943, respectively. We also predicted 608,057, 449,132 and 427,130 promoters in these genomes, respectively, using FirstEF with default parameter settings. Repeats in the genome were not masked. TSS locations of all known and predicted promoters were compared with the identified gene regions. A TSS is assigned to a gene when it is located in the genic region or upstream of the 5' end of the gene by no more than 5 kb (for RefSeq genes) or 20 kb (for other genes). In this way we obtained 'gene-related' TSSs/promoters for further analysis. Predicted 'gene-related' promoters are also defined as 'transcript-supported promoters' if they overlap the 5' end of any transcript in a gene. Other predicted TSSs that were not gene-related were potential 'novel TSSs' and were not analyzed further. We used known promoters as training data to detect the promoter conservation signal and then compared it with the signal in predicted promoters to reduce false-positive promoter predictions.
Statistical similarity among known promoters of orthologous genes
Pairwise comparison of known promoters
We then defined a promoter pair as a homologous promoter pair, and the promoters as homologous promoters, if the conservation score is higher than 1PET (the pairwise cutoff rule). Using these cutoffs, we found 2,841 of 4,140 human known promoters in those 3,649 human-rodent orthologous gene pairs, and 152 of 229 mouse known promoters in those 214 mouse-rat orthologous gene pairs. In total, around 66-68% of known promoters can match highly conserved counterparts in the orthologous genes. The average conservation score is around 55% between human-rodent homologous promoter pairs, and 85% between mouse-rat homologous promoter pairs (Figure 1c).
Three-species promoter comparisons
We also analyzed known promoter conservation in 158 human-mouse-rat three-way orthologous gene triplets, which have 249 all-species promoter triplets. Using ClustalW to align randomly selected 1 kb sequences from the human, mouse and rat genomes, we found that only 1% of the 1 kb triplets had a conservation score higher than 21.8%. Here, the conservation score is defined as the percentage of identical base pairs in the multiple alignment of the 1 kb sequences. Using this cutoff, we identified 76 known promoter triplets; the distribution of their conservation scores is shown in Figure 1d.
In the genome, functional regions (such as coding regions) are usually conserved under selection pressure during evolution. Hence the significantly higher conservation of homologous promoter pairs and triplets encouraged us to test whether it could be used to improve promoter prediction.
Improving promoter prediction by incorporating both mRNA annotation and promoter conservation information
We collected 8,949 well annotated human genes, each of which has at least one known TSS and has at least one orthologous gene in mouse or rat, to do the test. There are in total 13,313 unique known TSSs for these human genes, with 9,806 being at least 500 bp apart (see Materials and methods). In both sets, we shortened each gene by 5 kb (or half of the gene length if the gene is shorter than 5 kb) from its 5' end to simulate 5' incomplete genes that are most common in the current gene annotations.
Sensitivity and specificity of promoter prediction with different methods
Panels: (a) 13,313 unique TSSs in 8,949 human genes; (b) 9,806 TSSs at least 500 bp apart in 8,949 human genes; (c) 6,356 TSSs at least 500 bp apart in 5,893 human genes with homologous promoters.
Rows in panels (b) and (c): Method 1 + script, Method 2 + script, Method 3 + script.
Incorporation of cross-species conservation in whole-genome promoter/TSS prediction
Statistics of promoters and genes in CSHLmpd
Gene rows: Known genes (RefSeq and mRNA); Canonical genes (RefSeq, mRNA, and Ensembl); Genes with promoters; Genes with homologous promoters; Predicted genes with promoters.
Promoter rows: FirstEF predicted promoters; Transcript-supported FirstEF predicted promoters; RefSeq END promoters; Bidirectional gene promoters; Homologous known promoters.
CpG-island related rows: RefSeq genes; other mRNA genes; canonical genes; promoters; known promoters; predicted promoters; RefSeq END promoters; bidirectional gene promoters; homologous promoters.
From the above promoter/TSS sets, we found 21,594, 21,501 and 17,257 homologous promoters for 13,432, 14,626 and 12,302 genes in human, mouse and rat, respectively. Of the mammalian canonical genes with orthologous genes, 60% to 70% have homologous promoters. However, our methods can assign promoters to only a small portion of the TWINSCAN and GenomeScan predicted genes (42%), compared with 82% of the canonical genes (data not shown). This may be due either to the sensitivity of FirstEF, or to the fact that most predicted genes start from putative translational initiation sites (ATG) and the missing 5' exons and intron regions can extend beyond our promoter search limit (20 kb upstream of the predicted gene boundary). The lack of complete 5' ends in non-RefSeq genes can also explain why they were less likely to be CpG-island related.
Cold Spring Harbor Laboratory Mammalian Promoter Database
To store the information about all the genes and promoters we annotated, we have constructed the Cold Spring Harbor Laboratory Mammalian Promoter Database (CSHLmpd). It consists of three species-specific promoter sub-databases for human (HSPD), mouse (MMPD) and rat (RNPD). They are linked by homologous promoters wherever orthologous gene information is available. Each is currently equipped with two basic front-end components: a genome-wide browser, Gbrowse, to display information graphically; and a query-fetch system to query and extract promoters based on a gene identifier (such as GenBank accession number, UniGene cluster ID, LocusLink ID or gene name). In CSHLmpd, users can either search for promoters of their genes of interest in one species or get homologous promoters from other species. To make the database both a data resource and an analysis platform, we provide two sequence-alignment tools for homologous promoter analysis. ClustalW is for global multiple sequence alignment in the regions of user-selected promoters, and PromoterWise, a local alignment tool, is embedded to align each pair of promoter regions (E. Birney, unpublished data). We have also used MLAGAN to do global multiple sequence alignment in the regions that include genes and their 5,000-bp upstream sequences to show the conservation at a larger scale. More promoter-analysis tools will be added in the future.
Facilitating large-scale gene regulation studies and promoter array construction
Expression microarray and ChIP-chip (ChIP followed by microarray analysis of DNA) technologies have become important and widely used approaches to study gene expression and regulation at large scales. Being able to extract a large set of mammalian promoter sequences is a critical step for such studies.
To demonstrate the use of CSHLmpd, we have extracted a promoter sequence dataset for the Affymetrix human array HG-U133A. Out of the total of 22,283 probe sets for most known human genes on this array, from the annotation we were able to obtain promoters from CSHLmpd for 20,903 of them. Because multiple probe sets can belong to the same gene, 13,014 promoters were retrieved. These include 6,052 known promoters and 4,550 predicted homologous promoters. No promoter could be assigned for only 1,380 probe sets. Among these, 448 were mapped to 353 genes without promoter information in our database, and 932 were created from poorly aligned mRNAs and ESTs, which were not used to construct the genes in the first place, or from other ESTs that do not overlap with any gene in our database (see Materials and methods). This HG-U133A Affymetrix promoter set can be freely downloaded from our FTP server, where one can also find separately prepared promoter sequence sets for all human, mouse and rat RefSeq genes. These RefSeq gene promoter sets include all DBTSS-defined promoters and RefSeq END TSSs. Users can also create other customized promoter sequence sets for different arrays (or gene indices) using the CSHLmpd query tools. We also plan to provide more customized promoter sequence sets for making promoter chips that can be used for large-scale ChIP-chip studies or epigenetic mapping projects (such as for DNA methylation).
Our method first collected known and predicted promoters in the whole genome. Transcript and conservation information were then used to filter false positives from the predictions. The test presented in this paper shows that using both transcript and conservation information, together with FirstEF, improves the accuracy of promoter prediction compared with the use of transcript information alone (for example, PromoSer, SOURCE). To our knowledge, this is the first attempt to integrate conservation information with de novo first-exon prediction on a genome-wide scale.
In collaboration with an experimental group (L. Stubbs, personal communication), we previously tested our FirstEF prediction on 48 human genes in chromosome 19 using reporter assays. Among these, 26 genes had promoters correctly predicted, and eight did not. This gave a sensitivity and specificity of 54% and 65%, respectively, at the gene level. However, there were a total of 105 predicted promoters around these genes, which led to a specificity of only 25% at the promoter level (data not shown). Therefore, while the experimental evaluation proves that de novo FirstEF performs well in predicting promoters for novel genes, it also shows its limitations on prediction specificity. A more systematic experimental test of 300 mouse promoters will be found in . Our work presented here shows that both mRNA information and cross-species conservation can significantly improve the specificity of promoter prediction.
We have also demonstrated that conservation signal can be integrated with promoter models to improve the accuracy of promoter prediction. Our method uses conservation signal in the potential promoter regions, which can greatly reduce false positives compared with using mRNA or conservation information alone, especially when known mRNAs only have partial coding regions. Furthermore, without mRNA information, homologous information by itself cannot produce better overall prediction (data not shown), partly because of a higher degree of conservation in exons. To decrease false predictions caused by exon conservation as much as possible, we used not only the information from known genes, but also genes predicted by some well known gene-finding methods. In this way, we can reduce the promoter search regions for known genes, and may obtain additional theoretical evidence for predicted genes when their promoters are predicted. These potential novel genes with predicted promoters, especially when the promoters are evolutionarily conserved, could be valuable candidates for experimental validation. In our recent experiments, we have shown that about 25% of those novel genes have spliced transcripts.
To detect the conservation in promoter regions, we tested four different promoter definitions: the 200 bp upstream of the TSS, -400 to +100 bp, -700 to +300 bp, and -1,500 to +500 bp around the TSS. We found that the peak of the conservation score is closer to that of the control sequence set when promoter regions are too short or too long. Among these four promoter definitions, -700 to +300 bp around the TSS gave the best discrimination between the known promoter training set and the control set. This indicates that many conserved TFBSs tend to cluster in the approximately 1 kb region near the TSS.
In our studies, we have observed that, if lower thresholds of the original FirstEF (such as Pexon = 0.3, Ppromoter = 0.25, Pdonor = 0.25) are used, the prediction sensitivity can be increased at the expense of specificity. In this case, however, even though mRNA and conservation information could help regain some specificity, the overall accuracy would actually be worse than that with default FirstEF thresholds (data not shown).
We cannot identify conservation signal for 27% of known human promoters and 17% of known rodent promoters (see our FTP site). This may be due to the faster promoter divergence in the corresponding genes. The percentage of predicted promoters without detectable homology was higher than that of known promoters because of the bias of existing known promoter data and false positives of promoter prediction. We hope to develop more sensitive methods for promoter-specific conservation detection in order to improve promoter prediction in the future.
Materials and methods
Human, mouse and rat genome releases
Human NCBI build 35 (May 2004), mouse mm5 (May 2004), and rat assembly rn3 (June 2003) were downloaded from the University of California at Santa Cruz (UCSC) website.
Genic region identification in the genomes
mRNAs from RefSeq and GenBank (mRNA), and transcripts predicted by Ensembl, TWINSCAN and GenomeScan (RefSeq XM) in the annotation of UCSC genome assemblies were obtained. They were aligned to the genomes by the BLAT and Sim4 programs. Transcripts with more than 10% of nucleotides unaligned or with less than 95% identity in the aligned regions were excluded. Transcripts were regarded as overlapping if their exons shared at least 1 bp, and a genic region was defined as a continuous genomic DNA region that covers all overlapping transcripts. Gene type was based on the most reliable transcript for the gene, and the order of transcript reliability is: RefSeq > mRNA > Ensembl > RefSeq XM > TWINSCAN. All ESTs were also mapped to the genomes in the same way. ESTs that overlap an identified genic region were included as transcripts of this gene without changing the genic region boundary. The UniGene ID was linked to the gene on the basis of its transcripts. For genes with an Ensembl transcript ID, we used the information from Ensembl's EnsMart to mark the orthologous gene sets in our identified genes.
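The filtering and clustering steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Transcript` record, its field names, and the union-find grouping are hypothetical, all transcripts are assumed to lie on the same chromosome and strand, and the alignment statistics are assumed to have been computed beforehand (for example from BLAT or Sim4 output).

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    name: str
    exons: list              # [(start, end), ...], inclusive genome coordinates
    aligned_fraction: float  # fraction of transcript nucleotides that aligned
    identity: float          # identity within the aligned regions


def passes_alignment_filter(t):
    # Exclude transcripts with >10% unaligned nucleotides or <95% identity.
    return t.aligned_fraction >= 0.90 and t.identity >= 0.95


def exons_overlap(a, b):
    # Two transcripts overlap if any pair of their exons shares at least 1 bp.
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a.exons for s2, e2 in b.exons)


def genic_regions(transcripts):
    """Group overlapping transcripts and return (start, end, names) per genic region."""
    kept = [t for t in transcripts if passes_alignment_filter(t)]
    parent = list(range(len(kept)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            if exons_overlap(kept[i], kept[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, t in enumerate(kept):
        groups.setdefault(find(i), []).append(t)

    regions = []
    for members in groups.values():
        start = min(s for t in members for s, _ in t.exons)
        end = max(e for t in members for _, e in t.exons)
        regions.append((start, end, sorted(t.name for t in members)))
    return sorted(regions)


example = [
    Transcript("t1", [(100, 200), (500, 600)], 0.99, 0.99),
    Transcript("t2", [(550, 700)], 0.95, 0.98),
    Transcript("t3", [(250, 400)], 0.97, 0.97),  # lies in an intron of t1
    Transcript("t4", [(100, 200)], 0.99, 0.90),  # fails the identity filter
]
regions = genic_regions(example)
```

In the usage example, `t3` falls inside an intron of `t1` without sharing any exon base, so under the exon-overlap rule it forms its own genic region.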
Known promoter collection
All promoter sequences in EPD (release 74) and DBTSS (release 2.0) were extracted. Promoter information and sequences were also retrieved from GenBank (dated 21 February 2003) using 'exon number = 1', 'prim_transcript', 'precursor_mRNA', and 'promoter' as keywords. The promoter regions identified by luciferase assay and ChIP of TAF250 and RNA polymerase II in the ENCODE regions were obtained from the UCSC genome browser and included. All sequences were mapped to the genomes by BLAT to obtain their locations of TSSs. Two identical TSSs were regarded as one unique TSS.
Whole-genome promoter prediction
With default thresholds (Pexon = 0.5, Ppromoter = 0.4, Pdonor = 0.4), original FirstEF was run on each chromosome of the three genomes without repeat masking, and the output was filtered by different methods described below. Predicted and known TSSs were linked to the closest gene if they were located either in the gene region or in the 20 kb upstream of the gene (if the gene has RefSeq mRNA, the distance was limited to 5 kb), and these promoters/TSSs were collected as 'gene-related promoters/TSSs'. Predicted promoters overlapping the 5' end of any transcript in a gene are defined as 'transcript-supported promoters'.
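The linking rule can be illustrated with a small Python sketch. The dictionary layout, the function names, and the reading of 'closest gene' as the gene with the smallest upstream distance (zero when the TSS lies inside the genic region) are assumptions of this sketch, and all genes are assumed to lie on the same chromosome.

```python
def upstream_distance(tss, gene_start, gene_end, strand):
    """Return 0 if the TSS is inside the gene, the distance upstream of the
    gene's 5' end otherwise, or None if the TSS lies beyond the 3' end."""
    if gene_start <= tss <= gene_end:
        return 0
    if strand == "+":
        return gene_start - tss if tss < gene_start else None
    return tss - gene_end if tss > gene_end else None


def assign_tss(tss, genes):
    """Link a TSS to the closest eligible gene, or return None (candidate novel TSS).

    genes: list of dicts with keys name, start, end, strand, has_refseq.
    """
    best = None
    for g in genes:
        d = upstream_distance(tss, g["start"], g["end"], g["strand"])
        if d is None:
            continue
        # 5 kb search limit for genes with RefSeq mRNA, 20 kb otherwise.
        limit = 5000 if g["has_refseq"] else 20000
        if d <= limit and (best is None or d < best[0]):
            best = (d, g["name"])
    return best[1] if best else None


genes = [
    {"name": "A", "start": 10000, "end": 20000, "strand": "+", "has_refseq": True},
    {"name": "B", "start": 40000, "end": 50000, "strand": "-", "has_refseq": False},
]
```

With these two hypothetical genes, a TSS at 4,000 is 6 kb upstream of the RefSeq gene `A` and is therefore left unassigned, whereas a TSS at 60,000 is 10 kb upstream of the minus-strand gene `B` and is linked to it.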
Conservation in control sets
Regions of 1,000 bp were randomly extracted from the genome of each species to make sequence pairs or triplets. Control set I included 1 million such sequence pairs for every two species, and 1 million triplets for the three species. We also selected genes from different species that are not orthologs, and randomly picked promoters belonging to these genes to make 1 million promoter pairs and 1 million triplets for control set II. One million high-GC content (>65%) pseudo promoter pairs were also selected. ClustalW was used to carry out global multiple sequence alignment for each pair or triplet, with the conservation score defined as the number of identical base pairs divided by 1,000.
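A minimal Python sketch of the score and of an empirical cutoff derived from a control set is given below. It assumes that the gapped, aligned sequences have already been produced by an aligner such as ClustalW; the function names and the way the cutoff is read off the sorted control scores are illustrative assumptions.

```python
def conservation_score(aligned_seqs, length=1000):
    """Count alignment columns where all sequences carry the same base
    (gaps never count) and divide by the promoter length."""
    seqs = [s.upper() for s in aligned_seqs]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must have equal length")
    identical = sum(
        1 for col in zip(*seqs)
        if col[0] != "-" and all(c == col[0] for c in col)
    )
    return identical / length


def empirical_cutoff(control_scores, tail=0.01):
    """Return a score that at most a `tail` fraction of control scores exceed."""
    ordered = sorted(control_scores)
    n_above = int(len(ordered) * tail)
    return ordered[len(ordered) - n_above - 1]


pair_score = conservation_score(["ACGT-A", "ACCTTA"], length=5)
triplet_score = conservation_score(["ACGT", "ACGA", "ACGT"], length=4)
cutoff = empirical_cutoff([i / 100 for i in range(100)])
```

The same scoring function serves pairs and triplets, because a column counts as conserved only when every sequence in the alignment agrees.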
Calculation of conservation for known promoters in orthologous genes
For genes with known TSSs, we extracted (-700, +300) bp regions with respect to the TSSs from the genomes as promoter sequences. We aligned each promoter of a gene in one species with each of the known promoters of its orthologous genes by ClustalW and calculated the conservation scores. The maximum score of all these promoter pairs or triplets was used to describe the conservation of this promoter.
CpG island relationship
We used the new CpG-island definition to search the genomes of the three species and collect CpG islands. A gene is considered CpG-island-related only if at least one CpG island overlaps the region from -2,000 to about +500 bp around its 5' end. A TSS/promoter is considered CpG-island-related if at least one CpG island overlaps the region (-2,000, +500) bp with respect to the TSS.
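The overlap test for a TSS can be sketched in Python as follows. The function names and the representation of CpG islands as (start, end) tuples on the same chromosome are assumptions of this sketch; the CpG islands themselves are taken as given.

```python
def cpg_window(tss, strand, upstream=2000, downstream=500):
    """Genome interval corresponding to (-2,000, +500) bp around a TSS."""
    if strand == "+":
        return tss - upstream, tss + downstream
    # On the minus strand the upstream side has larger coordinates.
    return tss - downstream, tss + upstream


def is_cpg_related(tss, strand, islands):
    """True if any CpG island (start, end) overlaps the window by at least 1 bp."""
    lo, hi = cpg_window(tss, strand)
    return any(start <= hi and end >= lo for start, end in islands)


islands = [(8200, 8600)]
plus_related = is_cpg_related(10000, "+", islands)
minus_related = is_cpg_related(10000, "-", islands)
```

The example shows why strand matters: the same island lies upstream of a plus-strand TSS at 10,000 but downstream of, and outside the window for, a minus-strand TSS at the same position.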
Post-clustering script for selecting promoters at least 500 bp apart
For all the gene-related promoters, we first ordered the known ones by the distance from the TSS defined in each promoter to the gene 5' end defined by mapped transcripts. The promoters with shorter distances were selected first, and the rest were compared with the selected ones. Only those that were separated by at least 500 bp from every selected promoter were kept. The same selection procedure was then used for homologous promoters, transcript-supported promoters and other promoters. As a result of such post-clustering, all the selected promoters of a gene were separated by at least 500 bp.
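One way to read this procedure is as a greedy selection, sketched below in Python. The category labels, the tuple layout, and the tie-breaking by genome position are hypothetical; the sketch is not the original script.

```python
CATEGORY_ORDER = ["known", "homologous", "transcript_supported", "other"]


def post_cluster(promoters, gene_5prime, min_gap=500):
    """Greedy selection of promoters that are at least `min_gap` bp apart.

    promoters: list of (tss, category) tuples for one gene.
    Categories are visited in CATEGORY_ORDER; within a category, TSSs closer
    to the gene 5' end are considered first.
    """
    rank = {c: i for i, c in enumerate(CATEGORY_ORDER)}
    ordered = sorted(
        promoters,
        key=lambda p: (rank[p[1]], abs(p[0] - gene_5prime), p[0]),
    )
    selected = []
    for tss, category in ordered:
        if all(abs(tss - s) >= min_gap for s, _ in selected):
            selected.append((tss, category))
    return selected


candidates = [
    (1000, "known"),
    (1300, "known"),
    (1600, "homologous"),
    (1200, "other"),
    (3000, "other"),
]
kept = post_cluster(candidates, gene_5prime=1000)
```

In the example, the known TSS at 1,300 and the predicted TSS at 1,200 are dropped because they fall within 500 bp of the already selected known TSS at 1,000.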
Evaluation of promoter prediction by simulation
The test set comprised 8,949 genes with 13,313 known TSSs. To simulate the 'partial genes' that often exist in the databases, we truncated each identified genic region by 5 kb (or half of the gene length if the gene is shorter than 5 kb) at the 5' end, including the parts of cDNAs that extend into this region. On the basis of such new gene boundaries, we reselected all gene-related promoters from the predictions by original FirstEF (Method 0). Each promoter was compared with promoters of the orthologous genes (if available) by ClustalW to calculate the conservation score, and they were defined as the homologous promoters if the conservation score obeyed the pairwise or three-way cutoff rules.
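The truncation step can be illustrated with the following Python sketch. The function name and the inclusive coordinate convention are assumptions, and a gene exactly 5 kb long is treated like a shorter gene here so that the truncated region never becomes empty, which is a choice made for this sketch only.

```python
def truncate_5prime(start, end, strand, cut=5000):
    """Remove `cut` bp (or half the gene length for short genes) from the
    5' end of a genic region given in inclusive coordinates."""
    length = end - start + 1
    # Assumption: genes of exactly `cut` bp are halved rather than removed.
    trim = cut if length > cut else length // 2
    if strand == "+":
        return start + trim, end
    return start, end - trim


long_plus = truncate_5prime(10000, 29999, "+")
long_minus = truncate_5prime(10000, 29999, "-")
short_plus = truncate_5prime(100, 2099, "+")
```

Gene-related promoters would then be reselected against the truncated boundaries, which mimics annotation in which the true 5' end of a gene is missing.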
De novo FirstEF (Method 1) selected the best-predicted promoters (with the highest probability in the promoter region) from the original FirstEF predictions in a 1,000 bp region. Method 2 compared RNAs or predicted transcripts with the gene-related original FirstEF predictions to filter out predicted promoters that were neither located upstream of the genic region nor transcript-supported. Method 3 first used Method 2 to select promoters; then, for a gene with homologous promoters, only those homologous promoters were selected as output for the gene (see also Figure 2). Post-clustering was used in promoter selection from the output of Method 1, Method 2 and Method 3 for tests on the 9,806 known TSSs at least 500 bp apart, and the combined methods were called Method 1s, Method 2s and Method 3s, respectively. A predicted TSS was regarded as a 'correct TSS' if its distance to a known TSS was shorter than 500 bp, and this known TSS was then regarded as 'correctly predicted'. The sensitivity of prediction (Sn) was defined as the number of correctly predicted known TSSs divided by the number of known TSSs used in the validation. Specificity (Sp) was the number of correct TSSs divided by the total number of predicted promoters.
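The two measures can be computed as in the following Python sketch, which assumes for simplicity that all TSS positions lie on the same chromosome and belong to the same evaluation set; the function name and the example positions are illustrative.

```python
def evaluate(predicted, known, max_dist=500):
    """Return (Sn, Sp) for lists of predicted and known TSS positions."""
    # A predicted TSS is correct if it lies within max_dist bp of a known TSS.
    correct_predicted = [
        p for p in predicted if any(abs(p - k) < max_dist for k in known)
    ]
    # A known TSS is correctly predicted if some prediction lies that close.
    recovered_known = [
        k for k in known if any(abs(p - k) < max_dist for p in predicted)
    ]
    sn = len(recovered_known) / len(known) if known else 0.0
    sp = len(correct_predicted) / len(predicted) if predicted else 0.0
    return sn, sp


sn, sp = evaluate(predicted=[1100, 1400, 5600, 20000], known=[1000, 5000, 9000])
```

Note that the two numerators can differ: in the example, two predictions match the same known TSS, so two of four predictions are correct while only one of three known TSSs is recovered.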
Cold Spring Harbor Laboratory mammalian promoter database construction
We first collected all gene-related TSSs in the human, mouse and rat genomes. For genes with RefSeq mRNAs but no known or predicted promoters, the 5' ends of the RefSeq sequences were considered as their TSSs and called RefSeq END TSSs. They were also defined as transcript-supported. For two adjacent divergent genes with their 5' ends less than 2 kb apart, we defined their 5' gene boundaries as 'bidirectional TSSs' if no other type of TSS could be found in the intergenic region between them. All promoters of the orthologous genes were aligned by ClustalW to find homologous promoters in the same way as in the evaluation step. Method 3s was used to select the final promoter set. Known promoters filtered out by the post-clustering script were also included in the database after the selection to make the known promoter data as complete as possible. All these selected promoters were stored in a MySQL database. Gene features contained in the database include genome location, overlapping transcripts, UniGene ID, LocusLink ID, and gene name if available. Promoter features include TSS location, first donor and acceptor sites if available, corresponding gene, overlapping transcript for a transcript-supported promoter, and promoter type. Promoter type refers to the source type, which was also used to represent reliability in the following order: known promoters (EPD, DBTSS, GenBank annotation, promoters identified by luciferase assay or ChIP), RefSeq END promoters, promoters of divergent genes (bidirectional TSSs), transcript-supported promoters, and other gene-related predicted promoters. Homologous promoters were also marked. In addition to gene-related promoters, all other predicted promoters located in the intergenic regions were included in the database. They were regarded as predicted novel promoters and were of the lowest reliability.
Promoter set for the Affymetrix microarray
For each probe set in the gene chip, its gene index and/or chromosome location information were used to find the corresponding gene in our promoter database. The most reliable promoter of this gene was reported for this probe set. If no gene could be assigned to a probe set, the closest predicted novel promoter in its upstream region was taken if the distance between the promoter and probe set was less than 20 kb.
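The lookup can be sketched in Python as follows. The type labels, dictionary layout and function names are hypothetical, the reliability order follows the promoter types listed for the database, and 'upstream' is interpreted relative to the strand of the probe set, which is an assumption of this sketch.

```python
RELIABILITY = [
    "known",
    "refseq_end",
    "bidirectional",
    "transcript_supported",
    "predicted",
]


def most_reliable(promoters):
    """Pick the promoter whose type ranks highest in RELIABILITY."""
    rank = {t: i for i, t in enumerate(RELIABILITY)}
    return min(promoters, key=lambda p: rank[p["type"]]) if promoters else None


def promoter_for_probe_set(gene_promoters, probe_pos, strand, novel_tss,
                           max_dist=20000):
    """Report the most reliable promoter of the matched gene; if no gene was
    matched, fall back to the closest novel promoter less than 20 kb upstream."""
    if gene_promoters:
        return most_reliable(gene_promoters)
    upstream = []
    for tss in novel_tss:
        d = probe_pos - tss if strand == "+" else tss - probe_pos
        if 0 < d < max_dist:
            upstream.append((d, tss))
    if not upstream:
        return None
    _, tss = min(upstream)
    return {"tss": tss, "type": "novel"}


picked = most_reliable([
    {"tss": 1, "type": "predicted"},
    {"tss": 2, "type": "refseq_end"},
])
fallback_plus = promoter_for_probe_set([], 50000, "+", [45000, 20000, 52000])
fallback_minus = promoter_for_probe_set([], 50000, "-", [45000, 20000, 52000])
```

For a probe set without a matched gene, the fallback returns `None` when no novel promoter lies within 20 kb upstream, which corresponds to the probe sets for which no promoter could be assigned.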
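All 8,949 human genes and 13,313 human known promoters used in the test can be downloaded from our FTP site [ftp://cshl.org/pub/science/mzhanglab/PromoterSet/HumanKnownPromoter4Test]. The promoter set for Affymetrix array HG-U133A [ftp://cshl.org/pub/science/mzhanglab/PromoterSet/HG-U133A], the promoter set of all RefSeq genes [ftp://cshl.org/pub/science/mzhanglab/PromoterSet/Refseq] and all known promoters in CSHLmpd [ftp://cshl.org/pub/science/mzhanglab/PromoterSet/KnownPromoter] are also available there.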
We thank Lisa Stubbs for providing experimental testing results before publication, Ewan Birney for providing the PromoterWise software, and Lincoln Stein for providing Gbrowse. This work is supported by NIH grants HG01696, GM60513, CA88351, and HG002600.
- Cavin PR, Junier T, Bucher P: The Eukaryotic Promoter Database EPD. Nucleic Acids Res. 1998, 26: 353-357. 10.1093/nar/26.1.353.
- Suzuki Y, Yamashita R, Nakai K, Sugano S: DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res. 2002, 30: 328-331. 10.1093/nar/30.1.328.
- Bajic VB, Tan SL, Suzuki Y, Sugano S: Promoter prediction analysis on the whole human genome. Nat Biotechnol. 2004, 22: 1467-1473. 10.1038/nbt1032.
- Scherf M, Klingenhoff A, Frech K, Quandt K, Schneider R, Grote K, Frisch M, Gailus-Durner V, Seidel A, Brack-Werner R, Werner T: First pass annotation of promoters on human chromosome 22. Genome Res. 2001, 11: 333-340. 10.1101/gr.154601.
- Diehn M, Sherlock G, Binkley G, Jin H, Matese JC, Hernandez-Boussard T, Rees CA, Cherry JM, Botstein D, Brown PO, Alizadeh AA: SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res. 2003, 31: 219-223. 10.1093/nar/gkg014.
- Halees AS, Weng Z: PromoSer: improvements to the algorithm, visualization and accessibility. Nucleic Acids Res. 2004, 32: W191-W194.
- Coleman SL, Buckland PR, Hoogendoorn B, Guy C, Smith K, O'Donovan MC: Experimental analysis of the annotation of promoters in the public database. Hum Mol Genet. 2002, 11: 1817-1821. 10.1093/hmg/11.16.1817.
- Trinklein ND, Aldred SJ, Saldanha AJ, Myers RM: Identification and functional analysis of human transcriptional promoters. Genome Res. 2003, 13: 308-312. 10.1101/gr.794803.
- Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, Nikaido I, Osato N, Saito R, Suzuki H, et al: Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature. 2002, 420: 563-573. 10.1038/nature01266.
- Ota T, Suzuki Y, Nishikawa T, Otsuki T, Sugiyama T, Irie R, Wakamatsu A, Hayashi K, Sato H, Nagai K, et al: Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat Genet. 2004, 36: 40-45. 10.1038/ng1285.
- Gerhard DS, Wagner L, Feingold EA, Shenmen CM, Grouse LH, Schuler G, Klein SL, Old S, Rasooly R, Good P, et al: The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). Genome Res. 2004, 14: 2121-2127. 10.1101/gr.2596504.
- Davuluri R, Grosse I, Zhang MQ: Computational identification of promoters and first exons in the human genome. Nat Genet. 2001, 29: 412-417. 10.1038/ng780.
- Liu R, States DJ: Consensus promoter identification in the human genome utilizing expressed gene markers and gene modeling. Genome Res. 2002, 12: 462-469. 10.1101/gr.198002.
- Korf I, Flicek P, Duan D, Brent MR: Integrating genomic homology into gene structure prediction. Bioinformatics. 2001, 17 (Suppl 1): S140-S148.
- Parra G, Agarwal P, Abril JF, Wiehe T, Fickett JW, Guigó R: Comparative gene prediction in human and mouse. Genome Res. 2003, 13: 108-117. 10.1101/gr.871403.
- Solovyev VV, Shahmuradov IA: PromH: Promoters identification using orthologous genomic sequences. Nucleic Acids Res. 2003, 31: 3540-3545. 10.1093/nar/gkg525.
- Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 2001, 29: 137-140. 10.1093/nar/29.1.137.
- Brooksbank C, Camon E, Harris MA, Magrane M, Martin MJ, Mulder N, O'Donovan C, Parkinson H, Tuli MA, Apweiler R, et al: The European Bioinformatics Institute's data resources. Nucleic Acids Res. 2003, 31: 43-50. 10.1093/nar/gkg066.
- Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 2004, 14: 160-169. 10.1101/gr.1645104.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res. 2003, 31: 23-27. 10.1093/nar/gkg057.
- Kent WJ: BLAT - the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before March 2002.
- Higgins DG, Thompson JD, Gibson TJ: Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 1996, 266: 383-402.
- FirstEF. [http://rulai.cshl.org/tools/FirstEF]
- Takai D, Jones PA: Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc Natl Acad Sci USA. 2002, 99: 3740-3745. 10.1073/pnas.052410099.
- Yeh RF, Lim LP, Burge CB: Computational inference of homologous gene structures in the human genome. Genome Res. 2001, 11: 803-816. 10.1101/gr.175701.
- CSHL Mammalian Promoter Database (CSHLmpd). [http://rulai.cshl.edu/CSHLmpd2]
- Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S: The generic genome browser: a building block for a model organism system database. Genome Res. 2002, 12: 1599-1610. 10.1101/gr.403602.
- Wheeler DL, Church DM, Federhen S, Lash AE, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Tatusova TA, Wagner L: Database resources of the National Center for Biotechnology. Nucleic Acids Res. 2003, 31: 28-33. 10.1093/nar/gkg033.
- Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S, NISC Comparative Sequencing Program: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003, 13: 721-731. 10.1101/gr.926603.
- Zhao F, Xuan Z, Liu L, Zhang MQ: TRED: a Transcription Regulatory Element Database and a platform for in silico gene regulation studies. Nucleic Acids Res. 2005, 33: D103-D107. 10.1093/nar/gki004.
- Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, Kulp D, Siani-Rose MA: NetAffx: Affymetrix probesets and annotations. Nucleic Acids Res. 2003, 31: 82-86. 10.1093/nar/gkg121.
- Promoter sets. [ftp://cshl.edu/pub/science/mzhanglab/PromoterSet]
- Dike S, Balija VS, Nascimento LU, Xuan Z, Ou J, Zutavern T, Palmer LE, Hannon G, Zhang MQ, McCombie WR: The mouse genome: experimental examination of gene predictions and transcriptional start sites. Genome Res. 2004, 14: 2424-2429. 10.1101/gr.3158304.
- Suzuki Y, Yamashita R, Shirota M, Sakakibara Y, Chiba J, Mizushima-Sugano J, Nakai K, Sugano S: Sequence comparison of human and mouse genes reveals a homologous block structure in the promoter regions. Genome Res. 2004, 14: 1711-1718. 10.1101/gr.2435604.
- UCSC Genome browser. [http://genome.ucsc.edu]
- Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998, 8: 967-974.
- Human genes and promoters. [ftp://cshl.org/pub/science/mzhanglab/PromoterSet/HumanKnownPromoter4Test]
- Promoter set for Affymetrix array HG-U133A. [ftp://cshl.org/pub/science/mzhanglab/PromoterSet/HG-U133A]
- Promoter set of all RefSeq genes. [ftp://cshl.org/pub/science/mzhanglab/PromoterSet/Refseq]
- All known promoters in CSHLmpd. [ftp://cshl.org/pub/science/mzhanglab/PromoterSet/KnownPromoter]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.