PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation
- Elodie Portales-Casamar†1,
- Stefan Kirov†2, 3,
- Jonathan Lim1,
- Stuart Lithwick1,
- Magdalena I Swanson1,
- Amy Ticoll1,
- Jay Snoddy2, 4 and
- Wyeth W Wasserman1
© Portales-Casamar et al.; licensee BioMed Central Ltd. 2007
Received: 30 April 2007
Accepted: 28 September 2007
Published: 28 September 2007
PAZAR is an open-access and open-source database of transcription factor and regulatory sequence annotation with associated web interface and programming tools for data submission and extraction. Curated boutique data collections can be maintained and disseminated through the unified schema of the mall-like PAZAR repository. The Pleiades Promoter Project collection of brain-linked regulatory sequences is introduced to demonstrate the depth of annotation possible within PAZAR. PAZAR, located at http://www.pazar.info, is open for business.
The study of gene regulation has emerged as a focus of efforts to understand how genome sequences give rise to diverse and complex cells and tissues. From gene-centric dissection of promoter sequences to regulon-based analysis of cis-regulatory modules through to genome-scale chromatin probes, researchers across the subdisciplines of modern biology strive to understand how cells regulate the flow of genetic information from DNA to RNA via the process of transcription. This developing knowledge, and more critically the data produced, has unleashed a wealth of computationally driven approaches to predict the locations of regulatory sequences, as well as to discover classes of binding sites for transcription factors and models of regulatory programs [4–8]. Annotated sets of regulatory sequences, with well understood and independently confirmed function, are necessary to serve as gold standards to support the validation of new molecular techniques and computational algorithms. As confidence in regulatory annotation and prediction advances, researchers will increasingly draw on such knowledge to design sequences capable of directing targeted gene expression in molecular applications such as gene therapy.
Existing regulatory sequence data collections are generated primarily in a need-driven manner. A dedicated researcher pursuing an idea will extract from the scientific literature a sufficient set of annotations to support their own studies. For example, the widely used JASPAR collection of transcription factor binding profiles was developed initially for the study of binding pattern similarities across families of structurally related transcription factors. Similarly, the ORegAnno database was compiled initially for the study of genetic variations known to alter binding sites of transcription factors. The best of these reference collections are subsequently used by researchers within bioinformatics to improve and assess the performance and efficiency of computational methods. These boutique data collections are the backbone of the current generation of regulatory sequence analysis studies (examples include [9, 11–20]). It is our perception that boutique reference databases will likely remain the primary sources for regulatory sequence annotations for a long time to come. While large centrally curated databases have emerged for proteins (UniProt) or human genetics (OMIM), funding for large-scale curation of an open-access regulatory sequence collection appears unlikely.
The existing pool of annotated data for transcriptional regulation is not optimal. There is an unfortunate long-term problem that stems in part from the fact that database maintenance is tiresome. The operators of the boutique databases quickly move on to other tasks, motivated equally by a dearth of monetary support and the excitement of the next project. Few regulatory sequence collections have endured for long periods of time with evidence of substantial expansion. The widely used TRANSFAC collection of regulatory sequences has been a central tool for bioinformatics. However, the transfer of the collection to a commercial funding model makes it difficult for the system to build on community participation. The scientific community is less likely to add to and improve upon data annotation distributed in a for-profit tool. Limited commercial curation may tend to focus on commercially relevant annotation rather than basic science research needs.
The boutique model of database development suffers from several fundamental problems. As mentioned, collections can stagnate after the initial enthusiasm of the creator wanes. For current research, reference collections must increasingly map onto genome sequence coordinates, and thus the utility of the collections rapidly diminishes if such coordinates are not kept up to date. Furthermore, data need to be delivered in a dynamic manner accessible by web interfaces, programming interfaces and, increasingly, semantic interfaces. Flat file data models are too rigid and cannot capture data at its full granularity.
Database organization and controlled vocabularies
This flexible design enables PAZAR to represent data consistent with our current understanding of transcriptional regulation. First, the system refers to 'transcription start region' instead of 'transcription start site' as increasing evidence shows that transcription start sites are more 'fuzzy' than previously thought and often cannot be confined to unique nucleotides [25, 26]. Second, it takes into account the fact that TFs often act as complexes containing more than one subunit. For instance, members of the bZIP family of TFs, including Fos, Jun, Maf/Nrl, CREB/ATF and CEBP/NFIL-6, display subtle differences in DNA binding specificity depending on the dimers formed. PAZAR is the first system to acknowledge this fact and to allow the annotator to differentiate between different dimer compositions. Furthermore, PAZAR is the first database to capture mutation data in an efficient way, enabling the user to correlate each base pair change with a change in regulatory sequence activity. We anticipate that this 'negative' information will allow for the development of more diverse TF binding models. PAZAR captures information not only on individual TF binding sites but also on the longer cis-regulatory modules at which TFs interact. In addition, to better represent data, the PAZAR system allows for the storage of TF binding profiles in matrix format. This is important in order to accommodate external data that do not provide individual binding site information, such as JASPAR or computational motif predictions.
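To illustrate what storing a TF binding profile in matrix format involves, the following minimal Python sketch builds a position frequency matrix from a set of aligned binding sites. The function name and the example sequences are hypothetical and chosen for illustration only; they do not reflect the PAZAR schema or its API.

```python
def position_frequency_matrix(sites):
    """Count A/C/G/T occurrences at each column of equal-length aligned sites."""
    if not sites:
        raise ValueError("at least one binding site is required")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")

    # One row per base, one column per position in the alignment.
    matrix = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for pos, base in enumerate(site.upper()):
            if base not in matrix:
                raise ValueError(f"unexpected base {base!r} in site {site!r}")
            matrix[base][pos] += 1
    return matrix


# Hypothetical aligned sites, for illustration only.
example_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pfm = position_frequency_matrix(example_sites)

for base in "ACGT":
    print(base, pfm[base])
```

A matrix of this kind summarizes a set of sites without retaining the individual sequences, which is why profile-only sources can be accommodated alongside annotated binding sites.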
The aforementioned design features have been implemented using the MySQL relational database. The current database structure is developed and maintained through the DBDesigner software application, which provides an integrated graphic development interface and tools for automatic SQL script generation and data exchange.
The wide array of PAZAR hostable datasets contains a great heterogeneity of information. To overcome the challenges imposed by such data diversity, we incorporate controlled vocabularies as a means to consistently annotate regulatory sequences and expression patterns. Bio-ontologies offer common semantics for biological functional annotations. Two topics requiring controlled vocabularies in PAZAR are: cell types and tissues; and experimental methods. For the former, we chose the BRENDA Tissue Ontology as our reference and are providing updates to the BRENDA developers on a periodic basis as PAZAR users expand the vocabulary. With respect to the experiment type ontology, we are collaboratively working with the developers of the ORegAnno database.
PAZAR web interface and programming tools
As illustrated in Figure 1, the PAZAR database can be viewed as a mall bringing together independent boutiques. The CGI-based interface builds on this theme through the incorporation of a mall map that serves as the entry to the search interface. Users can search by gene ('Genes' department store), TF ('TFMART' department store) or TF binding profile ('TF PROFILES' department store). If interested in only one specific dataset hosted in PAZAR, users can also search this specific store by clicking either on the store on the map or on its name in the mall directory.
Use-case number 1
Use-case number 2
Use-case number 3
PAZAR provides a submission interface that can be accessed by clicking on 'Submit' in the left menu. This streamlined web-based user interface provides a simplified entry point to the database for non-professional curators, such as scientists who want to deposit their own experimental data in the public repository.
We have developed a Perl API (application programming interface) that hides the intrinsic complexity of the schema from database users. The object-oriented approach provides programmers with different layers of abstraction, allowing advanced users to create 'high-layer' objects and methods to suit project-specific needs.
To best serve users, PAZAR must frequently retrieve data from external sources. For example, sequence coordinates must be updated when genome assemblies are released, updated, or re-annotated. The API pazar::talk modules make this possible by delegating all external queries to an appropriate pazar::talk::database module. Currently, three modules have been developed to interact with the GeneKeyDB, JASPAR, and EnsEMBL databases. The open source nature of this project allows users to develop or adapt additional modules to work with any database of their choice.
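The delegation idea described above can be sketched in a few lines. The PAZAR API itself is written in Perl; the Python classes below (ExternalSource, InMemorySource, Talk) and the placeholder record are hypothetical names used only to show the pattern of routing every external query through one adapter per database, not the actual pazar::talk interface.

```python
class ExternalSource:
    """Base class for adapters that answer queries against one external database."""

    name = "abstract"

    def lookup(self, identifier):
        raise NotImplementedError


class InMemorySource(ExternalSource):
    """Stand-in adapter backed by a dict; a real adapter would query a remote database."""

    def __init__(self, name, records):
        self.name = name
        self._records = dict(records)

    def lookup(self, identifier):
        # Returns None when the identifier is unknown to this source.
        return self._records.get(identifier)


class Talk:
    """Dispatcher that routes each query to the adapter registered for a source."""

    def __init__(self):
        self._sources = {}

    def register(self, source):
        self._sources[source.name] = source

    def query(self, source_name, identifier):
        if source_name not in self._sources:
            raise KeyError(f"no adapter registered for {source_name!r}")
        return self._sources[source_name].lookup(identifier)


# Placeholder data, for illustration only.
talk = Talk()
talk.register(InMemorySource("genes", {"GENE_X": {"chromosome": "1", "start": 1000}}))

print(talk.query("genes", "GENE_X"))
```

Under this arrangement, supporting a new external database amounts to writing one more adapter and registering it, leaving the calling code unchanged.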
A PAZAR-specific exchange format has been implemented in XML (extensible markup language). In addition to facilitating data transfer between 'boutiques' and the central master database, the XML format can support custom stand-alone user interfaces that do not have direct database access. Some basic sequence features can also be exported in GFF (general feature format). API methods are available to parse PAZAR XML or GFF format data for importation into the database.
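As an illustration of exporting a basic sequence feature in GFF, the sketch below formats one record as a nine-column, tab-separated GFF line (seqname, source, feature, start, end, score, strand, frame, attributes). The function name, the feature label and the attribute string are hypothetical; the source does not specify which fields or GFF version PAZAR emits.

```python
def to_gff_line(seqname, source, feature, start, end, strand,
                score=".", frame=".", attributes=""):
    """Format one feature as a tab-separated GFF record with nine columns."""
    if start > end:
        raise ValueError("start must not exceed end")
    if strand not in {"+", "-", "."}:
        raise ValueError("strand must be '+', '-' or '.'")
    fields = [seqname, source, feature, start, end, score, strand, frame, attributes]
    return "\t".join(str(value) for value in fields)


# Hypothetical feature, for illustration only.
line = to_gff_line("chr1", "PAZAR", "regulatory_region", 1000, 1020, "+",
                   attributes="hypothetical_id=example1")
print(line)
```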
Each data collection within PAZAR is called a project and is identified by a project ID, a project name, a status and a list of users. The project status can be 'restricted' (only the project-specific users have read and write access), 'published' (only the project-specific users have write privileges, but everyone has read access) or 'open' (everyone has read and write privileges). For this purpose, each record in the database is linked to a project ID, allowing all projects to share the same tables within the database schema, yet retaining their project identity so that they remain independent data collections.
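The three project statuses translate into a simple access rule. The following Python sketch encodes the read and write privileges exactly as described above; the Project class and the can_read and can_write functions are hypothetical names for illustration and are not part of the PAZAR API.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    project_id: int
    name: str
    status: str  # 'restricted', 'published' or 'open'
    users: set = field(default_factory=set)


def can_read(project, user):
    """Published and open projects are readable by everyone."""
    if project.status in ("published", "open"):
        return True
    if project.status == "restricted":
        return user in project.users
    raise ValueError(f"unknown project status {project.status!r}")


def can_write(project, user):
    """Only open projects are writable by users outside the project."""
    if project.status == "open":
        return True
    if project.status in ("restricted", "published"):
        return user in project.users
    raise ValueError(f"unknown project status {project.status!r}")


# Hypothetical project and users, for illustration only.
demo = Project(project_id=1, name="demo", status="published", users={"alice"})
print(can_read(demo, "bob"), can_write(demo, "bob"), can_write(demo, "alice"))
```

Because every record carries a project ID, a check of this kind can be applied uniformly to rows in shared tables while each collection keeps its own access policy.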
[Table: PAZAR database content on 13 July 2007*. Labels: Regulatory sequence (genomic); Regulatory sequence (artificial); Transcription factor profiles; ORegAnno STAT1 lit]
PAZAR availability and distribution
Conclusion: growth and development
A large fraction of gene regulation data comes from high-throughput techniques such as gene expression and chromatin immunoprecipitation microarrays. Unfortunately, the observed data are difficult to interpret as they often reflect contributions from overlapping processes. One means to improve the interpretation of results is to incorporate prior knowledge of regulatory processes [39, 40]. The JASPAR database of TF binding profiles is widely used for such purposes, yet provides merely a fraction of the information necessary to support the research community. An excellent and extensive comparison of the existing binding site prediction tools suggests that one of the biggest hurdles in evaluating these tools objectively is the lack of an adequate reference collection. Thus, access to a larger pool of experimentally derived reference data, such as provided by PAZAR, could facilitate both improved interpretation of high-throughput data and assessment of computational methods.
Considering the future of gene regulation databases, three things are apparent. First, the motivation and expertise of individual researchers, as well as their focus on deep annotation of specific pathways and processes, make boutique operators a key resource in long-term compilation of regulatory sequences and annotations. Second, based on principles shared by the authors, any database should provide data and software in an open, unrestricted manner to all researchers in all settings. Third, the ongoing technical challenges for databases require a long-term commitment of talented technical staff. PAZAR was developed based on these observations.
While our laboratory will maintain PAZAR for the long term, as it is necessary for our ongoing research, ideally the project would expand through the engagement of a cooperative research community. Recent events suggest that the global research community is prepared to participate in regulatory sequence annotation projects. In late 2006, a group of open-access motivated scientists contributed regulatory sequence annotations to the ORegAnno database. While PAZAR and ORegAnno differ substantially in mission and approach, both address the need for open-access data collections and the developers are working together on common components such as controlled vocabularies. Contributions to a shared system could be combined synergistically to provide the research community with a valued resource.
Development of PAZAR will require ongoing effort to expand the data represented, the means to access the data and the quality of the data curation tools. At present, existing data collections are being added to PAZAR with the permission and collaboration of the boutique operators. We anticipate the boutique database creators will be strongly motivated to use the system as it eases their own work. For instance, most high-throughput datasets currently generated never become available through a database and web interface because of the limited time researchers want to put into this effort. PAZAR provides an easy way to make these data available and to maintain them. Readers of this paper are encouraged to consider opening a boutique or working with the PAZAR team to move an existing data collection into the system.
Our goal is for PAZAR to become the public repository for data and annotations pertaining to transcriptional regulation. By promoting strong integration with tools for computational analysis and prediction of cis-regulatory sequences, boutique database operators will be motivated to participate in the expansion of the system.
API: application programming interface
GFF: general feature format
XML: extensible markup language
We acknowledge Dimas Yusuf for the drawing of the PAZAR mall map and Jerome Bacconnier for the PAZAR logo and Web interface design. This project is supported by funding from the GenomeCanada Pleiades Promoter Project, the Canadian Institutes of Health Research (CIHR), Canada Foundation for Innovation, Merck and IBM. WWW is a CIHR New Investigator and a Scholar of the Michael Smith Foundation for Health Research.
- Farhadi HF, Lepage P, Forghani R, Friedman HC, Orfali W, Jasmin L, Miller W, Hudson TJ, Peterson AC: A combinatorial network of evolutionarily conserved myelin basic protein regulatory sequences confers distinct glial-specific phenotypes. J Neurosci. 2003, 23: 10214-10223.
- Kirchhamer CV, Yuh CH, Davidson EH: Modular cis-regulatory organization of developmentally expressed genes: two genes transcribed territorially in the sea urchin embryo, and additional examples. Proc Natl Acad Sci USA. 1996, 93: 9322-9328. 10.1073/pnas.93.18.9322.
- Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al: Genome-wide location and function of DNA binding proteins. Science. 2000, 290: 2306-2309. 10.1126/science.290.5500.2306.
- Kel AE, Kel-Margoulis OV, Farnham PJ, Bartley SM, Wingender E, Zhang MQ: Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors. J Mol Biol. 2001, 309: 99-120. 10.1006/jmbi.2001.4650.
- Fickett JW: Quantitative discrimination of MEF2 sites. Mol Cell Biol. 1996, 16: 437-441.
- Levy S, Hannenhalli S, Workman C: Enrichment of regulatory signals in conserved non-coding genomic sequence. Bioinformatics (Oxford, England). 2001, 17: 871-877. 10.1093/bioinformatics/17.10.871.
- Krivan W, Wasserman WW: A predictive model for regulatory sequences directing liver-specific transcription. Genome Res. 2001, 11: 1559-1566. 10.1101/gr.180601.
- Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol. 1998, 278: 167-181. 10.1006/jmbi.1998.1700.
- Vlieghe D, Sandelin A, De Bleser PJ, Vleminckx K, Wasserman WW, van Roy F, Lenhard B: A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res. 2006, 34: D95-97. 10.1093/nar/gkj115.
- Sandelin A, Wasserman WW: Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J Mol Biol. 2004, 338: 207-215. 10.1016/j.jmb.2004.02.048.
- Montgomery SB, Griffith OL, Sleumer MC, Bergman CM, Bilenky M, Pleasance ED, Prychyna Y, Zhang X, Jones SJ: ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics (Oxford, England). 2006, 22: 637-640. 10.1093/bioinformatics/btk027.
- Schmid CD, Praz V, Delorenzi M, Perier R, Bucher P: The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. Nucleic Acids Res. 2004, 32: D82-85. 10.1093/nar/gkh122.
- Bergman CM, Carlson JW, Celniker SE: Drosophila DNase I footprint database: a systematic genome annotation of transcription factor binding sites in the fruitfly, Drosophila melanogaster. Bioinformatics (Oxford, England). 2005, 21: 1747-1749. 10.1093/bioinformatics/bti173.
- Blanco E, Farre D, Alba MM, Messeguer X, Guigo R: ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Res. 2006, 34: D63-67. 10.1093/nar/gkj116.
- Sun H, Palaniswamy SK, Pohar TT, Jin VX, Huang TH, Davuluri RV: MPromDb: an integrated resource for annotation and visualization of mammalian gene promoters and ChIP-chip experimental data. Nucleic Acids Res. 2006, 34: D98-103. 10.1093/nar/gkj096.
- Grienberg I, Benayahu D: Osteo-Promoter Database (OPD) - promoter analysis in skeletal cells. BMC Genomics [computer file]. 2005, 6: 46. 10.1186/1471-2164-6-46.
- Zhu J, Zhang MQ: SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics (Oxford, England). 1999, 15: 607-611. 10.1093/bioinformatics/15.7.607.
- Kanamori M, Konno H, Osato N, Kawai J, Hayashizaki Y, Suzuki H: A genome-wide and nonredundant mouse transcription factor database. Biochem Biophys Res Comm. 2004, 322: 787-793. 10.1016/j.bbrc.2004.07.179.
- Kolchanov NA, Podkolodnaia OA, Anan'ko EA, Ignat'eva EV, Podkolodnyi NL, Merkulov VM, Stepanenko IL, Pozdniakov MA, Belova OE, Grigorovich DA, et al: Regulation of eukaryotic gene transcription: description in the TRRD database. Molekuliarnaia Biologiia. 2001, 35: 934-942.
- Gallo SM, Li L, Hu Z, Halfon MS: REDfly: a Regulatory Element Database for Drosophila. Bioinformatics (Oxford, England). 2006, 22: 381-383. 10.1093/bioinformatics/bti794.
- Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006, 34: D187-191. 10.1093/nar/gkj161.
- Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005, 33: D514-517. 10.1093/nar/gki033.
- Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006, 34: D108-110. 10.1093/nar/gkj143.
- The PAZAR Database of Transcription Factor and Regulatory Sequence Annotation. [http://www.pazar.info]
- Kawaji H, Kasukawa T, Fukuda S, Katayama S, Kai C, Kawai J, Carninci P, Hayashizaki Y: CAGE Basic/Analysis Databases: the CAGE resource for comprehensive promoter analysis. Nucleic Acids Res. 2006, 34: D632-636. 10.1093/nar/gkj034.
- Kasai Y, Hashimoto S, Yamada T, Sese J, Sugano S, Matsushima K, Morishita S: 5'SAGE: 5'-end Serial Analysis of Gene Expression database. Nucleic Acids Res. 2005, 33: D550-552. 10.1093/nar/gki085.
- Ryseck RP, Bravo R: c-JUN, JUN B, and JUN D differ in their binding affinities to AP-1 and CRE consensus sequences: effect of FOS proteins. Oncogene. 1991, 6: 533-542.
- Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature. 2005, 434: 338-345. 10.1038/nature03441.
- Bodenreider O, Stevens R: Bio-ontologies: current trends and future directions. Briefings Bioinformatics. 2006, 7: 256-274. 10.1093/bib/bbl027.
- Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, Huhn G, Schomburg D: BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res. 2004, 32: D431-433. 10.1093/nar/gkh081.
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Sys Mol Biol. 1994, 2: 28-36.
- Kirov SA, Peng X, Baker E, Schmoyer D, Zhang B, Snoddy J: GeneKeyDB: a lightweight, gene-centric, relational database to support data mining environments. BMC Bioinformatics [computer file]. 2005, 6: 72. 10.1186/1471-2105-6-72.
- Birney E, Andrews D, Caccamo M, Chen Y, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, et al: Ensembl 2006. Nucleic Acids Res. 2006, 34: D556-561. 10.1093/nar/gkj133.
- Wang H, Zhang Y, Cheng Y, Zhou Y, King DC, Taylor J, Chiaromonte F, Kasturi J, Petrykowska H, Gibb B, et al: Experimental validation of predicted mammalian erythroid cis-regulatory modules. Genome Res. 2006, 16: 1480-1492. 10.1101/gr.5353806.
- Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, et al: Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007, 4: 651-657. 10.1038/nmeth1068.
- The Pleiades Promoter Project: Genomic Resources Advancing Therapies for Brain Disorders. [http://www.pleiades.org/]
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13: 2498-2504. 10.1101/gr.1239303.
- The PAZAR Development Website. [http://sourceforge.net/projects/pazar]
- Seifert M, Scherf M, Epple A, Werner T: Multievidence microarray mining. Trends Genet. 2005, 21: 553-558. 10.1016/j.tig.2005.07.011.
- Dohr S, Klingenhoff A, Maier H, Hrabe de Angelis M, Werner T, Schneider R: Linking disease-associated genes to regulatory networks via promoter organization. Nucleic Acids Res. 2005, 33: 864-872. 10.1093/nar/gki230.
- Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005, 23: 137-144. 10.1038/nbt1053.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.